Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

implement a class called MarkovModel.java that can build a k -th order Markov model from text stored in a given file. (Eventually, you will build

implement a class called MarkovModel.java that can build a k-th order Markov model from text stored in a given file. (Eventually, you will build two objects using this class, one for Bush and one for Kerry.) Constructing a k-th order Markov model from text sequences is mostly a matter of counting. For each sequence p of length k (let us call it the context), we need to estimate the probability of p being followed by each letter c in our alphabet. Given a text sample, this probability can be estimated by the number of times that p is followed by c, divided by the number of times that p appears as a context at all. That is, we can estimate the probability of c following p by N(pc) / N(p), where N(pc) denotes the number of times we observe the concatenated sequence pc, and N(p) denotes the number of times that we observe p.

Unfortunately, this estimate will be problematic if some of the counts are zero, as is certain to happen on real data. Therefore, instead, we will use a different "smoothed" estimate, namely (N(pc) + 1) / (N(p) + S), where S is the size of our alphabet. This form of smoothing, called Laplace smoothing, has theoretical justification coming from probability theory, and ensures that zero counts will not be a problem.

For example, if the input text is "aabcabaacaac", and we are using a second-order Markov model (k = 2) and the three-letter alphabet is {a, b, c}, then we can estimate the probability that "aa" is followed by 'c'; by (2 + 1) / (3 + 3) = 1/2 since N(aa) = 3, N(aac) = 2 and S = 3. Similarly, the probabilities that "aa" is followed by 'a' and 'b' is 1/6 and 1/3, respectively. Note that the probabilities sum to 1.

One thing to notice here: To handle the beginning and end of the string, we treat the string as circular. Thus, N(caa) = 2 instead of 1.

So, in constructing a Markov model, your first job is to write code that will compute the appropriate counts N() by scanning the text file and counting how many times all sequences of the required lengths appear in the file. These counts should be stored appropriately for later use. You should also compute and record the alphabet size S. Organize your program by creating a data type MarkovModel so that the first line below computes the number of occurrences of each substring from s of size k and k+1; the second line prints it to standard output; and remaining lines compute and print Laplace smoothed probability estimates.

MarkovModel model = new MarkovModel(k, s); System.out.println(model); System.out.printf("%.4f ", model.laplace("aac")); System.out.printf("%.4f ", model.laplace("aaa")); System.out.printf("%.4f ", model.laplace("aab")); 

For example, when k = 2 and s = "aabcabaacaac", this code fragment should print the following.

S = 3 "aa" 3 "ab" 2 "ac" 2 "ba" 1 "bc" 1 "ca" 3 "aab" 1 "aac" 2 "aba" 1 "abc" 1 "aca" 2 "baa" 1 "bca" 1 "caa" 2 "cab" 1 0.5000 0.1667 0.3333

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image_2

Step: 3

blur-text-image_3

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Objects And Databases International Symposium Sophia Antipolis France June 13 2000 Revised Papers Lncs 1944

Authors: Klaus R. Dittrich ,Giovanna Guerrini ,Isabella Merlo ,Marta Oliva ,M. Elena Rodriguez

2001st Edition

3540416641, 978-3540416647

More Books

Students also viewed these Databases questions