
Question


You will build your language model from a given set of example texts. As the model is based on trigram counts, you must count how many times triples of consecutive words appear in each example text. Words should be treated case-sensitively, meaning "she" and "She" should be considered two different words. And, although the example texts may contain punctuation, you should not treat it specially. That is, if the file contains the phrase "he, she, I", then you can consider the first word as "he,", the second as "she," and the third as "I". Said another way, process your example files as if they contained no punctuation, and consider the two words "she" and "she," as two different words.

You must write a C++ program which, when built, creates an executable file named hw7a that takes two command-line arguments. The first argument is the name of a text file containing a list of input filenames.

In order to treat the beginning and end of your example files meaningfully during Part B, the model you create in Part A must include two special start words "", "" (to indicate the start of each document) and two special end words "", "" (to indicate the end of each document). In particular, suppose your example text begins with the words a b and ends with the words c d. Then you must add the following four trigrams to your model, where each "" stands for the corresponding special word:

"", "", a

"", a, b

c, d, ""

d, "", ""

And you will need to add four similar trigrams for each example text that you process.
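One way to get the four boundary trigrams is to pad each document's word list with two start markers and two end markers before counting, as sketched below. The marker spellings `<START_1>`, `<START_2>`, `<END_1>`, and `<END_2>` are assumptions made only for this illustration, since the actual special words are not legible in this copy of the assignment. The function name `pad_with_markers` is likewise hypothetical.

```cpp
#include <string>
#include <vector>

// HYPOTHETICAL marker spellings: substitute the special words the assignment specifies.
const std::string kStart1 = "<START_1>";
const std::string kStart2 = "<START_2>";
const std::string kEnd1 = "<END_1>";
const std::string kEnd2 = "<END_2>";

// Return the document's words with two start markers in front and two end
// markers behind. Counting trigrams over the padded list then yields the
// four boundary trigrams in addition to the ordinary ones.
std::vector<std::string> pad_with_markers(const std::vector<std::string>& words) {
    std::vector<std::string> padded;
    padded.reserve(words.size() + 4);
    padded.push_back(kStart1);
    padded.push_back(kStart2);
    padded.insert(padded.end(), words.begin(), words.end());
    padded.push_back(kEnd1);
    padded.push_back(kEnd2);
    return padded;
}
```

For a text a b ... c d, the padded list begins with the two start markers followed by a b and ends with c d followed by the two end markers, so a sliding window of three words produces exactly the four extra trigrams described above.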

Each time your program is run, it should build your trigram-based language model by processing each text file specified in the input filename list. What happens after that will depend on the second argument specified at the command line. The second argument is a single letter, and should be one of "a", "r", or "c". Your program should output to the C++ standard output stream (cout) the language model you created, ordering entries as specified by the argument letter as follows:

a - forward alphabetical order. This means that trigrams are output in alphabetical order by the first word in each trigram, using the alphabetical order of the second and then third word in each trigram to break ties.

r - reverse alphabetical order. This means that trigrams are output in descending alphabetical order by the first word in each trigram, using the descending alphabetical order of the second and then third word in each trigram to break ties.

c - count order. This means that trigrams are output in ascending order by frequency, using forward alphabetical ordering of first words and then second and then third words to break ties.
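The three orderings can be expressed as comparators, as in the sketch below. The names `Entry` and `sort_entries` are hypothetical. The sketch assumes that "alphabetical" means the default `std::string` comparison (character-code order), which places uppercase letters before lowercase ones; this is consistent with the example output, where "I do not" precedes "a lot about".

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

using Trigram = std::array<std::string, 3>;

// One line of the model: three words plus how often they occurred.
struct Entry {
    Trigram words;
    int count;
};

// Sort entries in place according to the order letter: 'a', 'r', or 'c'.
// Returns false (leaving the entries untouched) for any other letter.
bool sort_entries(std::vector<Entry>& entries, char order) {
    switch (order) {
    case 'a':  // forward: word 1, then word 2, then word 3
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& x, const Entry& y) { return x.words < y.words; });
        return true;
    case 'r':  // reverse: the same key, descending
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& x, const Entry& y) { return x.words > y.words; });
        return true;
    case 'c':  // ascending count, ties broken by forward word order
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& x, const Entry& y) {
                      if (x.count != y.count) return x.count < y.count;
                      return x.words < y.words;
                  });
        return true;
    default:
        return false;
    }
}
```

`std::array` compares element by element, so `x.words < y.words` already applies the second and third words as tie-breakers.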

Your output will consist of one trigram with associated count per line. On a given line, the 4 outputs (trigramWord1, trigramWord2, trigramWord3, and count) should be separated by single spaces.

Example

Suppose the list of training texts input for your program resides in a file named tiny_ex.txt, and the contents of the file are the names of two text files containing excerpts from Dr. Seuss books, as follows:

sl.txt

ge.txt

For the command ./hw7a tiny_ex.txt a, the expected output is:

I 1

theyve 1

I do 1

theyve talked 1

Clause. 1

I do not 2

Santa Clause. 1

a lot about 1

about flaws. theyve 1

about gauze. theyve 1

about laws and 1

about old Santa 1

about paws and 1

and theyve talked 2

anywhere 1

do not like 2

flaws. theyve talked 1

gauze. theyve talked 1

here or there 1

laws and theyve 1

like them anywhere 1

like them here 1

lot about old 1

not like them 2

old Santa Clause. 1

or there I 1

paws and theyve 1

quite a lot 1

talked about flaws. 1

talked about gauze. 1

talked about laws 1

talked about paws 1

talked quite a 1

them anywhere 1

them here or 1

there I do 1

theyve talked about 4

theyve talked quite 1

For the command ./hw7a tiny_ex.txt c, the expected output is:

I 1

theyve 1

I do 1

theyve talked 1

Clause. 1

Santa Clause. 1

a lot about 1

about flaws. theyve 1

about gauze. theyve 1

about laws and 1

about old Santa 1

about paws and 1

anywhere 1

flaws. theyve talked 1

gauze. theyve talked 1

here or there 1

laws and theyve 1

like them anywhere 1

like them here 1

lot about old 1

old Santa Clause. 1

or there I 1

paws and theyve 1

quite a lot 1

talked about flaws. 1

talked about gauze. 1

talked about laws 1

talked about paws 1

talked quite a 1

them anywhere 1

them here or 1

there I do 1

theyve talked quite 1

I do not 2

and theyve talked 2

do not like 2

not like them 2

theyve talked about 4
