[Solved] Corpus files: English: https://drive.goog

Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 22, 2024

Corpus files: English: https://drive.google.com/file/d/1iYpf3brEDu8ePDFGXHuATC5YaHNrYHr9/view?usp=sharing French: https://drive.google.com/file/d/1KvCKDp8Uk1XUi_q7FLaawfvptfM5fI2F/view?usp=sharing Spanish: https://drive.google.com/file/d/1UCLzxabK4Exn_NB0ud35yO3rqgMBgE5W/view?usp=sharing Hip Hop: https://drive.google.com/file/d/1FdC8tXC0VNFnq3vjrtCj42-4mf_phqLj/view?usp=sharing Lisp: https://drive.google.com/file/d/1elKcbNZEGdQUMDCWK_bqo67Te5pxCjq-/view?usp=sharing Problem: Write a Java class named SourceModel reads a file containing a

Corpus files:

English: https://drive.google.com/file/d/1iYpf3brEDu8ePDFGXHuATC5YaHNrYHr9/view?usp=sharing

French: https://drive.google.com/file/d/1KvCKDp8Uk1XUi_q7FLaawfvptfM5fI2F/view?usp=sharing

Spanish: https://drive.google.com/file/d/1UCLzxabK4Exn_NB0ud35yO3rqgMBgE5W/view?usp=sharing

Hip Hop: https://drive.google.com/file/d/1FdC8tXC0VNFnq3vjrtCj42-4mf_phqLj/view?usp=sharing

Lisp: https://drive.google.com/file/d/1elKcbNZEGdQUMDCWK_bqo67Te5pxCjq-/view?usp=sharing

Problem:

Write a Java class named SourceModel reads a file containing a training corpus and builds a first-order Markov chain of the transition probabilities between letters in the corpus. Only alphabetic characters in the corpus should be considered and they should be normalized to upper or lower case. For simplicity (see background) only consider the 26 letters of the English alphabet.

You can assume corpus files are of the form .corpus.

Requirements:

Write a class called SourceModel with the following constructors and methods:

1. A single constructor with two String parameters, where the first parameter is the name of the source model and the second is the file name of the corpus file for the model. The constructor should create a letter-letter transition matrix using this recommended algorithm sketch:

- Initialize a 26x26 matrix for character counts

- Print Training {name} model

- Read the corpus file one character at a time, converting all characters to lower case and ignoring any non-alphabetic character.

- For each character, increment the corresponding (row, col) in your counts matrix. The row is the for the previous character, the col is for the current character. (You could also think of this in terms of bigrams.)

- After you read the entire corpus file, youll have a matrix of counts.

- From the matrix of counts, create a matrix of probabilities each row of the transition matrix is a probability distribution (A probabilities in a distribution must sum to 1. To turn counts into probabilities, divide each count by the sum of all the counts in a row.)

- Print done. followed by a newline character.

2. A getName method with no parameters which returns the name of the SourceModel.

3. A toString method which returns a String representation of the model like the one shown below under Running Your Program in jshell.

4. A probability method which takes a String and returns a double which indicates the probability that the test string was generated by the source model, using the transition probability matrix created in the constructor. Heres a recommended algorithm:

- Initialize the probability to 1.0

- For each two-character sequences of characters in the test string test, cici and ci+1ci+1 for i=0i=0 to test.length()1test.length()1, multiply the probability by the entry in the transition probability matrix for the c1c1 to c2c2 transition, which should be found in row cici an column ci+1ci+1 in the matrix. (You could also think of the indices as ci1,cici1,ci for i=1i=1 to test.length()1test.length()1.)

5. A main method that makes SourceModel runnable from the command line. You program should take 1 or more corpus file names as command line arguments followed by a quoted string as the last argument. The program should create models for all the corpora and test the string with all the corpora. Heres an algorithm sketch:

- The first n-1 arguments to the program are corpus file names to use to train models. Corpus files are of the form .corpus

- The last argument to the program is a quoted string to test.

- Create a SourceModel object for each corpus

- Use the models to compute the probability that the test text was produced by the model

- Probabilities will be very small. Normalize the probabilities of all the model predictions to a probability distribution (so they sum to 1) (closed-world assumption we only state probabilities relative to models we have).

- Print results of analysis

Sample output:

image text in transcribed

sjava SourceModel *.corpus "If you got a gun up in your waist please don't shoot up the place (why?)" Training english model done Training french model done Training hiphop model done Training lisp model done Training spanish model done Analyzing: If you got a gun up in your waist please don't shoot up the place {why? Probability that test string is english: 8.9 Probability that test string is french: 8.9 Probability that test string is hiphop: 1.98 Probability that test string is Probability that test string is spanish: 8.9 Test string is most likely hiphop lisp: 8.90 Java SourceModel *.corpus "Ou va le monde?" Training english model done Training french model .. done Training hiphop model .. done Training lisp model done Training spanish model done Analyzing: 0u va le monde? Probability that test string is english: 8.82 Probability that test string is french: 8.85 Probability that test string is hiphop: 8.91 Probability that test string is Probability that test string is spanish: 8.91 Test string is most likely french isp: 8.18 java SourceModel *.corpus "My other car is a cdr." Training english model done Training french model .. done Training hiphop model .. done Training lisp model. Training spanish model done Analyzing: My other car is a cdr Probability that test string is english: 8.39 Probability that test string is french: 8.9 Probability that test string is hiphop: 8.61 Probability that test string is Probability that test string is spanish: 8.9 Test string is most likely hiphop done lisp: 8.90 $. Java SourceModel *.corpus "defun Let there be rock" Training english model done Training french model .. done Training hiphop model .. done Training lisp model done Training spanish model done Analyzing: defun Let there be rock Probability that test string is english: 8.91 Probability that test string is french: 8.9 Probability that test string is hiphop: 8.42 Probability that test string is Probability that test string is spanish: 8.9 Test string is most likely lisp lisp: 8.57 sjava SourceModel *.corpus "If you got a gun up in your waist please don't shoot up the place (why?)" Training english model done Training french model done Training hiphop model done Training lisp model done Training spanish model done Analyzing: If you got a gun up in your waist please don't shoot up the place {why? Probability that test string is english: 8.9 Probability that test string is french: 8.9 Probability that test string is hiphop: 1.98 Probability that test string is Probability that test string is spanish: 8.9 Test string is most likely hiphop lisp: 8.90 Java SourceModel *.corpus "Ou va le monde?" Training english model done Training french model .. done Training hiphop model .. done Training lisp model done Training spanish model done Analyzing: 0u va le monde? Probability that test string is english: 8.82 Probability that test string is french: 8.85 Probability that test string is hiphop: 8.91 Probability that test string is Probability that test string is spanish: 8.91 Test string is most likely french isp: 8.18 java SourceModel *.corpus "My other car is a cdr." Training english model done Training french model .. done Training hiphop model .. done Training lisp model. Training spanish model done Analyzing: My other car is a cdr Probability that test string is english: 8.39 Probability that test string is french: 8.9 Probability that test string is hiphop: 8.61 Probability that test string is Probability that test string is spanish: 8.9 Test string is most likely hiphop done lisp: 8.90 $. Java SourceModel *.corpus "defun Let there be rock" Training english model done Training french model .. done Training hiphop model .. done Training lisp model done Training spanish model done Analyzing: defun Let there be rock Probability that test string is english: 8.91 Probability that test string is french: 8.9 Probability that test string is hiphop: 8.42 Probability that test string is Probability that test string is spanish: 8.9 Test string is most likely lisp lisp: 8.57