Question
implement a class called MarkovModel.java that can build a k -th order Markov model from text stored in a given file. (Eventually, you will build
implement a class called MarkovModel.java that can build a k-th order Markov model from text stored in a given file. (Eventually, you will build two objects using this class, one for Bush and one for Kerry.) Constructing a k-th order Markov model from text sequences is mostly a matter of counting. For each sequence p of length k (let us call it the context), we need to estimate the probability of p being followed by each letter c in our alphabet. Given a text sample, this probability can be estimated by the number of times that p is followed by c, divided by the number of times that p appears as a context at all. That is, we can estimate the probability of c following p by N(pc) / N(p), where N(pc) denotes the number of times we observe the concatenated sequence pc, and N(p) denotes the number of times that we observe p.
Unfortunately, this estimate will be problematic if some of the counts are zero, as is certain to happen on real data. Therefore, instead, we will use a different "smoothed" estimate, namely (N(pc) + 1) / (N(p) + S), where S is the size of our alphabet. This form of smoothing, called Laplace smoothing, has theoretical justification coming from probability theory, and ensures that zero counts will not be a problem.
For example, if the input text is "aabcabaacaac", and we are using a second-order Markov model (k = 2) and the three-letter alphabet is {a, b, c}, then we can estimate the probability that "aa" is followed by 'c'; by (2 + 1) / (3 + 3) = 1/2 since N(aa) = 3, N(aac) = 2 and S = 3. Similarly, the probabilities that "aa" is followed by 'a' and 'b' is 1/6 and 1/3, respectively. Note that the probabilities sum to 1.
One thing to notice here: To handle the beginning and end of the string, we treat the string as circular. Thus, N(caa) = 2 instead of 1.
So, in constructing a Markov model, your first job is to write code that will compute the appropriate counts N() by scanning the text file and counting how many times all sequences of the required lengths appear in the file. These counts should be stored appropriately for later use. You should also compute and record the alphabet size S. Organize your program by creating a data type MarkovModel so that the first line below computes the number of occurrences of each substring from s of size k and k+1; the second line prints it to standard output; and remaining lines compute and print Laplace smoothed probability estimates.
MarkovModel model = new MarkovModel(k, s); System.out.println(model); System.out.printf("%.4f ", model.laplace("aac")); System.out.printf("%.4f ", model.laplace("aaa")); System.out.printf("%.4f ", model.laplace("aab"));
For example, when k = 2 and s = "aabcabaacaac", this code fragment should print the following.
S = 3 "aa" 3 "ab" 2 "ac" 2 "ba" 1 "bc" 1 "ca" 3 "aab" 1 "aac" 2 "aba" 1 "abc" 1 "aca" 2 "baa" 1 "bca" 1 "caa" 2 "cab" 1 0.5000 0.1667 0.3333
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started