
Question

Task 2 Building a Markov model of a text sample (MarkovModel)

For this task, you will need to complete the MarkovModel class, which generates a Markov model from an input string, and also write a JUnit test for your model. Markov models are probabilistic models (i.e., they model the chances of particular events occurring) and are used for a broad range of natural language processing tasks (including computer speech recognition). They are widely used to model all sorts of dynamical processes in engineering, mathematics, finance and many other areas.

They can be used to estimate the probability that a symbol will appear in a source of text, given the symbols that have preceded it.

A zero-th order Markov model of a text source is one that estimates the probability that the next character in a sequence is, say, an a, based simply on how frequently it occurs in a sample. Higher-order Markov models generalize this idea. Based on a sample text, they estimate the likelihood that a particular symbol will occur in a sequence of symbols drawn from a source of text, where the probability of each symbol occurring can depend upon preceding symbols. In a first order Markov model, the probability of a symbol occurring depends only on the previous symbol. Thus, for English text, the probability of encountering a u can depend on whether the previous letter was a q. If it was indeed a q, then the probability of encountering a u should be quite high. For a second order Markov model, the probability of encountering a particular symbol can depend on the previous two symbols; and generally, the probabilities used by a k-th order Markov model can depend on the preceding k symbols.

A Markov model can be used to estimate the probability of a symbol appearing, given its k predecessors, in a simple way, as follows:

For each context of characters of length k, we estimate the probability of that context being followed by each letter c in our alphabet as the number of times the context appears followed by c, divided by the number of times the context appears in total. As with our NgramAnalyser class, we consider our input string to wrap round when analysing contexts near its end. Call this way of estimating probabilities simple estimation.

For instance, consider the string aabcabaacaac. The 2- and 3-grams in it (assuming wrap-around) are as follows:

2-gram frequencies

2-gram   frequency
"aa"     3
"ab"     2
"ac"     2
"ba"     1
"bc"     1
"ca"     3

3-gram frequencies

3-gram   frequency
"aab"    1
"aac"    2
"aba"    1
"abc"    1
"aca"    2
"baa"    1
"bca"    1
"caa"    2
"cab"    1

Given the context "aa", we can simply estimate the probability that the next character is "b" as:

P(next character is "b" | last 2 were "aa") = (number of occurrences of "aab") / (number of occurrences of "aa") = 1/3

You will need to write code in the MarkovModel class that builds a k-th order Markov model from an input string. Complete the MarkovModel class as follows. You may wish to complete sub-task (e) before sub-task (d).

Fill in the @author and @version fields in the header comments with all authors' names, student numbers and the date.

Complete the code for the constructor, which takes an integer k, and an input string s, and initializes any data structures needed for a k-th order Markov model.

Complete the double simpleEstimate(String sequence) method, which estimates the likelihood of the last character in a string appearing, given its predecessors, using the formula for simple estimation given above. Your code should catch and handle ArithmeticExceptions appropriately. The method should return 0 if the input sequence is not in the n-gram model.
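One way to trigger and catch the exception as the task requires is to attempt an integer division first, since dividing doubles by zero does not throw. This is a minimal sketch, not the assignment's required structure, and the names are illustrative:

```java
public class SimpleEstimateGuard {
    // Returns seqCount / contextCount as a double, or 0 if the context never occurs.
    static double safeRatio(int seqCount, int contextCount) {
        try {
            // Integer division throws ArithmeticException when contextCount is 0;
            // it is used only as a guard before the real floating-point division.
            int guard = seqCount / contextCount;
            return (double) seqCount / contextCount;
        } catch (ArithmeticException e) {
            return 0.0; // the sequence's context is absent from the model
        }
    }

    public static void main(String[] args) {
        System.out.println(safeRatio(1, 3)); // e.g. "aab" seen once, context "aa" seen 3 times
        System.out.println(safeRatio(0, 0)); // unseen context: prints 0.0
    }
}
```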

The simpleEstimate method will not provide useful estimates if, for instance, the number of occurrences of an n-gram we are looking for is zero. For the double laplaceEstimate(String sequence) method, we will handle such cases by applying Laplace smoothing, which ensures that frequencies of zero will not cause a problem.

Let n_{p,c} represent the number of times a context p is followed by a character c.

Let n_p represent the number of times a context p appears in total.

The Laplace smoothed estimate of the probability of c following p is

    (n_{p,c} + 1) / (n_p + S),

where S is the size of the alphabet being used.

Implement the double laplaceEstimate(String sequence) method using Laplace smoothing.

Given the value k = 2, and the input string aabcabaacaac, the following code fragment should construct a Markov model and then print estimates of particular length-3 sequences appearing:

MarkovModel model = new MarkovModel(k, s);
System.out.printf("%.4f ", model.laplaceEstimate("aac"));
System.out.printf("%.4f ", model.laplaceEstimate("aaa"));
System.out.printf("%.4f ", model.laplaceEstimate("aab"));

The expected output in this case is the text

0.5000 0.1667 0.3333 
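Those numbers follow directly from the smoothing formula: with alphabet {a, b, c} we have S = 3, so for "aac" the estimate is (2 + 1) / (3 + 3) = 0.5, for "aaa" it is (0 + 1) / 6 ≈ 0.1667, and for "aab" it is (1 + 1) / 6 ≈ 0.3333. A self-contained sketch (the helper names are illustrative, not the assignment API) reproduces this:

```java
import java.util.HashMap;
import java.util.Map;

public class LaplaceSketch {
    // Count wrapped n-grams of s.
    static Map<String, Integer> count(String s, int n) {
        Map<String, Integer> m = new HashMap<>();
        String w = s + s.substring(0, n - 1); // wrap-around
        for (int i = 0; i < s.length(); i++) {
            m.merge(w.substring(i, i + n), 1, Integer::sum);
        }
        return m;
    }

    // Laplace estimate of seq's last character following its length-k prefix:
    // (n_{p,c} + 1) / (n_p + S).
    static double laplace(String s, String seq, int alphabetSize) {
        int k = seq.length() - 1;
        int npc = count(s, k + 1).getOrDefault(seq, 0);
        int np = count(s, k).getOrDefault(seq.substring(0, k), 0);
        return (npc + 1.0) / (np + alphabetSize);
    }

    public static void main(String[] args) {
        String s = "aabcabaacaac"; // alphabet {a, b, c}, so S = 3
        System.out.printf("%.4f %.4f %.4f%n",
                laplace(s, "aac", 3), laplace(s, "aaa", 3), laplace(s, "aab", 3));
        // prints: 0.5000 0.1667 0.3333
    }
}
```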

In the testSimpleExample and testLaplaceExample methods of the ProjectTest class, implement tests which confirm that, when a Markov model is constructed with k = 2 and an input string of aabcabaacaac, the results of calling laplaceEstimate("aac"), laplaceEstimate("aaa") and laplaceEstimate("aab") on the model are within 0.0001 of the numbers given in the sample output in sub-task (d), and that the values returned by simpleEstimate are also correct.

Complete the String toString() method of the MarkovModel class. The returned string should consist of:

A line containing the parameter k used for the model

A line containing the alphabet size used for the model

The string created by concatenating the results of the toString() methods of the ngram and n1gram fields of the MarkovModel class.

Ensure you complete Javadoc comments for each method and class, including @param, @return and @throws fields as required.

MarkovModel Class

import java.util.Set;

/**
 * Construct a Markov model of order /k/ based on an input string.
 *
 * @author (your name)
 * @version (a version number or a date)
 */
public class MarkovModel
{
    /** Markov model order parameter */
    int k;
    /** ngram model of order k */
    NgramAnalyser ngram;
    /** ngram model of order k+1 */
    NgramAnalyser n1gram;

    /**
     * Construct an order-k Markov model from string s
     * @param k int order of the Markov model
     * @param s String input to be modelled
     */
    public MarkovModel(int k, String s)
    {
        //TODO replace this line with your code
    }

    /**
     * @return order of this Markov model
     */
    public int getK()
    {
        return k;
    }

    /**
     * Estimate the probability of a sequence appearing in the text
     * using simple estimate of freq seq / frequency front(seq).
     * @param sequence String of length k+1
     * @return double probability of the last letter occurring in the
     * context of the first ones, or 0 if front(seq) does not occur.
     */
    public double simpleEstimate(String sequence)
    {
        //TODO replace this line with your code
        return -1.0;
    }

    /**
     * Calculate the Laplacian probability of string obs given this Markov model
     * @param sequence String of length k+1
     */
    public double laplaceEstimate(String sequence)
    {
        //TODO replace this line with your code
        return -1.0;
    }

    /**
     * @return String representing this Markov model
     */
    public String toString()
    {
        //TODO replace this line with your code
        return null;
    }
}

Test Cases:

@Test(timeout=1000)
public void testLaplaceExample()
{
    assertEquals(0,1); //TODO replace with test code
}

@Test(timeout=1000)
public void testSimpleExample()
{
    assertEquals(0,1); //TODO replace with test code
}

Link to Task 1: https://www.chegg.com/homework-help/questions-and-answers/task-1-analysing-n-grams-sample-text-ngramanalyser-task-need-complete-ngramanalyser-class--q22251885
