Question:

Design and implement a Perl program called ngram.pl that will learn an N-gram language model from an arbitrary number of plain text files. Your program should generate a given number of sentences based on that N-gram model. See the discussion on pages 92-94 of JM for further details.

Your program should work for any value of n, and should output m sentences. Before learning the N-gram model, convert all text to lower case, and make sure to include punctuation in the n-gram models. Separate punctuation from words before learning the N-gram model. Your program should learn a single n-gram model from any number of input files.

As a benchmark for performance, your program should be able to generate results for a trigram model (n=3) based on 1,000,000 words (tokens) of text in under five minutes.

Your program should run as follows:

ngram.pl n m input-file/s

n and m should be integer values, and input-file/s should be a list of one or more file names that contain the text you are building your n-gram model from. For example, you could run your program like this:

ngram.pl 3 10 pg2554.txt pg2600.txt pg1399.txt

This command should result in 10 randomly generated sentences based on a trigram model learned from these three files (which happen to be Crime & Punishment, War & Peace, and Anna Karenina from Project Gutenberg). You can find plain text versions of many literary works at Project Gutenberg (http://www.gutenberg.org). You may find it interesting to develop your program using works from an author you are familiar with and enjoy reading. You may use whichever files you wish from Project Gutenberg, but make certain the total number of tokens in all your files is more than 1,000,000.

Your program should be designed so that any number of input files (1 or more) can be processed by your program without modification. You should *not* hard code file names or the number of files your program is able to process.
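One way to meet this requirement is to take n and m from the front of the argument list and treat everything that remains as file names. The sketch below is only an illustration: the helper name parse_args is hypothetical, and rejecting zero or negative values is an assumption, since the assignment only says that n and m are integers.

```perl
use strict;
use warnings;

# Hypothetical helper: validate the argument list and return (n, m, files).
# Dies with a usage message if the arguments do not match "n m input-file/s".
sub parse_args {
    my @args = @_;
    die "Usage: ngram.pl n m input-file/s\n" if @args < 3;
    my ($n, $m, @files) = @args;
    for my $value ($n, $m) {
        # Assumption: only positive integers are accepted, so "2.5" is rejected.
        die "n and m must be positive integers\n"
            unless $value =~ /^[1-9][0-9]*$/;
    }
    return ($n, $m, @files);
}

# Demo with a fixed argument list; a real script would pass @ARGV instead.
my ($n, $m, @files) = parse_args(qw(3 10 pg2554.txt pg2600.txt pg1399.txt));
print "n=$n m=$m files=@files\n";
```

Because the file names are simply whatever is left after n and m, the same code handles one file or many without modification.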

Make sure that you separate punctuation marks from text and treat them as tokens. Also treat numeric data as tokens.

So, in a sentence like:

my, oh my, i wish i had 100 dollars.

You should have 12 tokens:

my , oh my , i wish i had 100 dollars .
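A single regular expression is enough to produce this tokenization. The following is a minimal sketch, not a required design: the helper name tokenize is hypothetical, and it assumes plain ASCII text, so any character other than a-z or 0-9 (including apostrophes and accented letters) becomes a token of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a string and split it into tokens.
# A token is either a run of ASCII letters and digits (so "100" stays whole)
# or one single non-space character such as "," or ".".
sub tokenize {
    my ($text) = @_;
    my @tokens = lc($text) =~ /[a-z0-9]+|[^a-z0-9\s]/g;
    return @tokens;
}

my @tokens = tokenize("My, oh my, I wish I had 100 dollars.");
print scalar(@tokens), " tokens: @tokens\n";
# prints: 12 tokens: my , oh my , i wish i had 100 dollars .
```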

Your program will need to identify sentence boundaries, and your ngrams should *not* cross these boundaries. For example, you could have input like this:

He went down the stairs

and then out the side door.

My mother and brother

followed him.

You should treat this as two sentences, as in:

He went down the stairs and then out the side door .

My mother and brother followed him .

To identify sentence boundaries, you may assume that any period, question mark, or exclamation point represents the end of a sentence. (This assumption is not correct in all cases, but it is perfectly adequate for our purposes here. It is OK if your sentence boundary detection isn't completely accurate.) When generating a sentence, keep going until you generate a terminating punctuation mark. Once you generate one, the sentence is complete.
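The boundary rule can be applied to a flat token list, which also takes care of sentences that span several lines, because line breaks are just whitespace to the tokenizer. The sketch below is one possible approach; the helper name split_sentences is hypothetical, and dropping trailing tokens that never reach a terminator is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: group a flat token list into sentences.
# Each sentence is an array reference that ends in ".", "?" or "!".
sub split_sentences {
    my @tokens = @_;
    my (@sentences, @current);
    for my $token (@tokens) {
        push @current, $token;
        if ($token =~ /^[.?!]$/) {
            push @sentences, [@current];
            @current = ();
        }
    }
    # Assumption: trailing tokens with no terminator are dropped.
    return @sentences;
}

# Line breaks are ordinary whitespace here, so a sentence may span lines.
my $text = "He went down the stairs\n\nand then out the side door.\n\n"
         . "My mother and brother\n\nfollowed him.\n";
my @tokens = lc($text) =~ /[a-z0-9]+|[^a-z0-9\s]/g;
print "@$_\n" for split_sentences(@tokens);
```

Run on the example input, this prints the two sentences, lower-cased, one per line.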

If the length of a sentence in the input text file is less than n, then you may simply discard that sentence and not use it when computing n-gram probabilities.
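A nested hash that maps each history of n-1 tokens to the counts of the tokens that followed it is one straightforward way to store the model. The sketch below is an illustration under stated assumptions: the helper name count_ngrams is hypothetical, and padding each sentence with n-1 "<s>" start markers is a design choice that the assignment does not require. Counting one sentence at a time guarantees that no n-gram crosses a sentence boundary.

```perl
use strict;
use warnings;

# Hypothetical helper: build a table mapping each (n-1)-token history to the
# counts of the tokens that followed it. Sentences are array references.
sub count_ngrams {
    my ($n, @sentences) = @_;
    my %model;
    for my $sentence (@sentences) {
        next if @$sentence < $n;    # shorter than n tokens: discard
        # Assumption: pad with n-1 "<s>" markers so that generation has a
        # well-defined starting history.
        my @padded = (('<s>') x ($n - 1), @$sentence);
        for my $i ($n - 1 .. $#padded) {
            my $history = join ' ', @padded[$i - $n + 1 .. $i - 1];
            $model{$history}{ $padded[$i] }++;
        }
    }
    return \%model;
}

my @sentences = ([qw(the cat sat .)], [qw(the dog sat .)], [qw(hi .)]);
my $model = count_ngrams(3, @sentences);
print "sentences starting with 'the': ", $model->{'<s> <s>'}{the}, "\n";
```

For n=1 the history is the empty string, so the same table holds plain unigram counts.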

Your program should output an informative message as a first line, stating what this program is and who is the author. Then it should output the value of the command line options, followed by the m sentences generated by your program.

For example:

%perl ngram.pl 2 10 pg2554.txt pg2600.txt pg1399.txt

This program generates random sentences based on an Ngram model.

Command line settings : ngram.pl 2 10

[followed by 10 random sentences]
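Producing output in this shape could look like the sketch below. It is not a definitive implementation: the helper names sample_next and generate_sentence are hypothetical, the tiny hand-built bigram table stands in for a model learned from real text, and the "<s>" start marker is an assumed convention. Each next token is drawn with probability proportional to its count, and generation stops at the first terminating punctuation mark.

```perl
use strict;
use warnings;

# Hypothetical helper: pick one key of a count hash with probability
# proportional to its count.
sub sample_next {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    my $target = rand($total);
    for my $token (sort keys %$counts) {
        $target -= $counts->{$token};
        return $token if $target < 0;
    }
    return (sort keys %$counts)[-1];    # guard against rounding edge cases
}

# Hypothetical helper: generate one sentence from a history => counts table.
sub generate_sentence {
    my ($n, $model) = @_;
    my @history = ('<s>') x ($n - 1);   # assumed start-of-sentence padding
    my @sentence;
    while (1) {
        my $key    = join ' ', @history;
        my $counts = $model->{$key} or last;    # unseen history: stop
        my $token  = sample_next($counts);
        push @sentence, $token;
        last if $token =~ /^[.?!]$/;            # terminator ends the sentence
        push @history, $token;
        shift @history while @history > $n - 1; # keep only the last n-1 tokens
    }
    return join ' ', @sentence;
}

# Tiny hand-built bigram table, for illustration only.
my %model = (
    '<s>' => { the => 2 },
    'the' => { cat => 1, dog => 1 },
    'cat' => { sat => 1 },
    'dog' => { sat => 1 },
    'sat' => { '.' => 2 },
);
print "This program generates random sentences based on an Ngram model.\n";
print "Command line settings : ngram.pl 2 3\n";
print generate_sentence(2, \%model), "\n" for 1 .. 3;
```

Sorting the keys before sampling makes the walk over the counts deterministic for a given random draw, which helps when debugging with a fixed seed set through srand.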

Please submit a hard copy of your source code file ngram.pl and a hard copy of a single script file called ngram-log.txt that you create as follows:

%script ngram-log.txt

%time perl ngram.pl 1 10 pg2554.txt pg2600.txt pg1399.txt

%time perl ngram.pl 2 10 pg2554.txt pg2600.txt pg1399.txt

%time perl ngram.pl 3 10 pg2554.txt pg2600.txt pg1399.txt

%time perl ngram.pl 4 10 pg2554.txt pg2600.txt pg1399.txt

%exit

Note that you can download Project Gutenberg files via your web browser (http://www.gutenberg.org) or the wget command. You will want to save the .txt versions of files, since the other formats (HTML, etc.) contain a lot of mark-up that would affect your models. You may use code found in perldoc, Learning Perl, or Programming Perl as part of your assignments; however, this must be attributed in your source code. You may also use modules from CPAN if they are not NLP specific.

Make sure to review the programming assignment grading rubric to see how points will be distributed on each assignment. Your program should be commented such that I can understand the overall design and detailed workings of your program.
