Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 22, 2024

Tasks The goal of the project is to complete the code for the NgramAnalyser, MarkovModel, ModelMatcher and MatcherController classes, as detailed below, and to add

Tasks

The goal of the project is to complete the code for the NgramAnalyser, MarkovModel, ModelMatcher and MatcherController classes, as detailed below, and to add test code to a new JUnit test class, ProjectTest. Although it would be a better design to split up your new test cases, we have chosen to use just one file to help with marking. The other provided JUnit test classes and the FileIO class are provided for your convenience in testing your submission and completing Task 4. They should not be altered.

First download the zip file containing outline classes for the project: project-class-outlines.zip. Save these files to a project directory on your computer.

Links to additional test class files will be posted here as required during the project.

Task 1 Analysing n-grams in a sample text (NgramAnalyser)

For this task, you will need to complete the NgramAnalyser class, and add code to the ProjectTest class. The NgramAnalyser class analyses an input string, passed to it in the constructor, and counts all the n-grams of letters that occur in the string. An n-gram is simply a (contiguous) sequence of nn items from a piece of text the items we will be considering for this class are characters. (One could also analyse n-grams of words, syllables, or even sentences.) For instance, a 2-gram (also called a bigram) is a pair of characters, a 3-gram is a triple of characters, and so on.

By way of example, consider the following string:

"the rain in Spain"

The alphabet size is 10 (unique characters including spaces) and the 2-grams in the string are:

"th", "he", "e ", " r", "ra", "ai", "in", "n ", " i", "in", "n ", " S", "Sp", "pa", "ai", "in", "nt"

If we remove duplicates, they are:

"th", "he", "e ", " r", "ra", "ai", "in", "n ", " i", " S", "Sp", "pa", "nt"

And if we count how often each 2-gram appears in the input string, we get the following results:

Two-gram frequencies
2-gram	Frequency
" S"	1
" i"	1
" r"	1
"Sp"	1
"ai"	2
"e "	1
"he"	1
"in"	3
"n "	2
"pa"	1
"ra"	1
"th"	1
"nt"	1*

The NgramAnalyser class is given a string as input to its constructor, and optionally an n-gram size nn. It should analyse the n-grams in the input string, and record their frequencies in the hash-map ngram. It should also record the total number of distinct characters that appear in the input string (i.e., the alphabet used by the input string), and store this count in the field alphabetSize.

The analyses performed by methods in this class should consider the input string to wrap around: i.e., the first character in the string should be considered to follow the last character. Otherwise, there will be a number of n-grams at the end of the string which are shorter than n, and must be padded out in some way. More precisely: when counting frequencies of distinct n-grams in the input string, the final n1n1 n-grams in the input string should have added to them a sequence of contiguous characters from the start of the string in order to ensure they are of length nn. For instance, in the example given above: for the purposes of calculating n-gram frequencies, the last 2-gram in the input text would be "nt" (i.e., it is if the text ended in Spaint). And in the string "abbc", the 3-grams would be"abb", "bbc", "bca", and "cab".

Given this background information, complete the following sub-tasks. You may wish to complete sub-task (h) before commencing sub-task (g).

Fill in the @author and @version fields in the header comments with your name, student number and the date.

Complete the code for the constructor NgramAnalyser(int n, String inp). In addition to counting frequencies of n-grams, the constructor should record the number of distinct characters encountered, and store this in the alphabetSize field.

You should not change the case (upper or lower) of the input string, and should process all characters including spaces and punctuation. The constructor should throw an unchecked IllegalArgumentException if the input fields are unsuitable. The following cases are considered unsuitable: the input string is the empty string, the input string is null, the given n-gram size is 0, or n is greater than the length of the input string.

Complete the code for the getAlphabetSize() method. The alphabet size is the number of distinct characters in the input string.

Complete the code for the getNgramFrequency(String ngram) method. This method should return the frequency with which a particular n-gram appears in the text. If it does not appear at all, return 0.

Complete the code for the getDistinctNgramCount() method. This should return the number of distinct n-grams which appear in the input text.

Complete the code for the getNgramCount() method. This should return the total number of n-grams which appear in the input text i.e., duplicate n-grams are considered distinct.

Complete the code for the toString() method. This should return a string consisting of:

a header line, containing the size n of the n-grams that are being counted; and

one line for each distinct n-gram encountered. Each line should contain the characters appearing in the n-gram, then a space character, then the number of times that the n-gram appears in the input text. The lines should appear in alphabetic order of n-gram text.

For instance, if the string "abbc" were analysed, and frequencies of 3-grams were counted, then the output would be a string containing the following lines. Your output should match the spaces and line breaks exactly as shown in this example.

3 abb 1 bbc 1 bca 1 cab 1

Complete the code for the testSensibleToStringSize test in the ProjectTest. The minimum number of distinct n-grams that could appear in an input string is equal to the alphabet size of the string (since if a character appears at all in the string, and if we consider the string to wrap around, then it follows there is at least one distinct n-gram starting with that character). Therefore, the minimum number of lines that should be contained in the string returned by the toStringmethod, when the input string was of non-zero length and nn was greater than zero, is 1 (for the header line) plus the size of the alphabet for the string.

Write code for a test which ensures the result of the toString method contains at least this number of lines.

Ensure you complete Javadoc comments for each method in both the NgramAnalyser and ProjectTest classes including @param, @returns and @throws fields as required.

You may add additional methods to the NgramAnalyser class if necessary, but should not add additional fields, nor change the type or name of the existing fields, nor change the signatures of the existing methods.

The testGetDistinctNgrams test in the ProjectTest has been left incomplete. You are not required to add code to this method, but it is strongly suggested you try and think of tests you could apply to the results of your getDistinctNgrams method. For instance, for a given input string, can you think of some number which the size of the set returned by testGetDistinctNgrams should never be below? Is there some number it should never exceed? You are also encouraged to add other tests, if you can think of them (but ensure they compile correctly do not submit code that does not compile).

Task 2 Building a Markov model of a text sample (MarkovModel)

For this task, you will need to complete the MarkovModel class, which generates a Markov model from an input string, and also write a JUnit test for your model. Markov models are probabilistic models (i.e., they model the chances of particular events occurring) and are used for a broad range of natural language processing tasks (including computer speech recognition). They are widely used to model all sorts of dynamical processes in engineering, mathematics, finance and many other areas.

They can be used to estimate the probability that a symbol will appear in a source of text, given the symbols that have preceded it.

A zero-th order Markov model of a text-source is one that estimates the probability that the next character in a sequence is, say, an a, based simply on how frequently it occurs in a sample. Higher-order Markov models generalize on this idea. Based on a sample text, they estimate the likelihood that a particular symbol will occur in a sequence of symbols drawn from a source of text, where the probability of each symbol occurring can depend upon preceding symbols. In a first order Markov model, the probability of a symbol occurring depends only on the previous symbol. Thus, for English text, the probability of encountering a u can depend on whether the previous letter was a q. If it was indeed a q, then the probability of encountering a u should be quite high. For a second order Markov model, the probability of encountering a particular symbol can depend on the previous two symbols; and generally, the probabilities used by a k-th order Markov model can depend on the preceding kk symbols.

A Markov model can be used to estimate the probability of a symbol appearing, given its k predecessors, in a simple way, as follows:

For each context of characters of length kk, we estimate the probability of that context being followed by each letter cc in our alphabet as the number of times the context appears followed by cc, divided by the number of times the context appears in total. As with our NgramAnalyser class, we consider our input string to wrap round when analysing contexts near its end. Call this way of estimating probabilities simple estimation.

For instance, consider the string aabcabaacaac. The 2- and 3-grams in it (assuming wrap-around) are as follows:

2-gram frequencies
2-gram	frequency
"aa"	3
"ab"	2
"ac"	2
"ba"	1
"bc"	1
"ca"	3

3-gram frequencies
3-gram	frequency
"aab"	1
"aac"	2
"aba"	1
"abc"	1
"aca"	2
"baa"	1
"bca"	1
"caa"	2
"cab"	1

Given the context "aa", we can simply estimate the probability that the next character is "b" as:

P(next character is a b" if last 2 were aa")=(number of occurrences of aab")(number of occurrences of aa")=13P(next character is a b" if last 2 were aa")=(number of occurrences of aab")(number of occurrences of aa")=13

You will need to write code in the MarkovModel class that builds a k-th order Markov model from an input string. Complete the MarkovModel class as follows. You may wish to complete sub-task (e) before sub-task (d).

Fill in the @author and @version fields in the header comments with all authors names, student numbers and the date.

Complete the code for the constructor, which takes an integer k, and an input string s, and initializes any data structures needed for a kk-th order Markov model.

Complete the double simpleEstimate(String sequence) method, which estimates the likelihood of the last character in a string appearing, given its predecessors, using the formula for simple estimation given above. Your code should catch and handle ArithmeticExceptions appropriately. The method should return 0 if the input sequence is not in the n-gram model.

The simpleEstimate will not provide useful estimates if, for instance, the number of occurrences of an n-gram we are looking for is zero. For thedouble laplaceEstimate(String sequence) method, we will handle such cases by applying Laplace smoothing, which ensures that frequencies of zero will not cause a problem.

Let np,cnp,c represent the number of times a context pp is followed by a character cc.

Let npnp represent the number of times a context pp appears in total.

The Laplace smoothed estimate of the probability of cc following pp is

np,c+1np+Snp,c+1np+S ,

where SS is the size of the alphabet being used.

Implement the double laplaceEstimate(String sequence) method using Laplace smoothing.

Given the value k = 2, and the input string aabcabaacaac, the following code fragment should construct a Markov model and then print estimates of particular length-3 sequences appearing:

MarkovModel model = new MarkovModel(k, s); System.out.printf("%.4f ", model.laplaceEstimate("aac")); System.out.printf("%.4f ", model.laplaceEstimate("aaa")); System.out.printf("%.4f ", model.laplaceEstimate("aab"));

The expected output in this case is the text

0.5000 0.1667 0.3333

In the testSimpleExample and testLaplaceExample methods of the ProjectTest class, implement tests which confirm that when a Markov model is constructed with k = 2 and an input string of aabcabaacaac, the results of calling laplaceEstimate("aac"), laplaceEstimate("aaa") and laplaceEstimate("aab") on the model are within 0.0001 of the numbers given in the sample output in sub-task (d) and that the values returned by simpleEstimate are also correct.

Complete the String toString() method of the MarkovModel class. The returned string should consist of:

A line containing the parameter k used for the model

A line containing the alphabet size used for the model

The string created by concatenating the results of the toString() methods of the ngram and n1gram fields of the MarkovModel class.

Ensure you complete Javadoc comments for each method and class including @param, @returns and @throws fields as required.

Task 3 Matching test strings to a model (ModelMatcher)

For this task, you will need to complete the ModelMatcher class, which determines which of two Markov models better matches a test string. Given two Markov models built from text samples taken from two different sources, we can use them to estimate which source it is more likely a test string was drawn from. Even using only zero-th order models constructed using English and Russian text, we should be able to tell, for instance, that the string

is more likely to be Russian than English.

We will computer a measure of fit of test strings against models, called the likelihood of a sequence under the model. The likelihood of a sequence under a kk-th order Markov model is calculated as follows:

For each symbol c in the sequence, compute the probability of observing c under the model, given its k-letter context p (assuming the sequence to wrap around, as described above), using the Laplace-smoothed estimate of probability we used for the MarkovModel class.

Compute the likelihood of the entire sequence as the product of the likelihoods of each character.

In mathematical notation:

Let ss be an input sequence of length nn, and let MM be a kk-th order Markov model. In order to calculate the likelihood of the sequence ss under the model MM, for each symbol cici in ss (where 1in1in), let pipi be the kk-length context of the symbol cici (assuming wraparound). The likelihood of the sequence ss under the model is

ni=1laplace(ci)i=1nlaplace(ci)

where laplacelaplace is the Laplace-smoothed probability of cici occurring (given its context) as described in the previous task.

The probability we obtain from this calculation may be very small in fact, potentially so small as to be indistinguishable from zero when using Javas built-in floating-point arithmetic. Therefore, we will calculate and express the likelihood using log probabilities, which do not suffer from this problem. (A weakness of log probabilities is that they cannot straightforwardly represent probabilities of zero, but our use of Laplace smoothing allows us to avoid this problem.) The product of two probabilities pp and qq can be calculated by adding their log probabilities, logplogp and logqlogq.

By way of example, suppose we have constructed a 2nd-order Markov model using the input string aabcabaacaac, as described in the example for Task 2. If we are then given a test string "aabbcaac", we can compute its log likelihood as follows:

For each character in the test string, obtain its length-2 context (assuming wraparound). Note that the tri-gram "caa" occurs twice in the (wrapped) test string.

Characters and contexts in test string "aabbcaac"
Context	Character	Frequency
"aa"	'b'	1
"ab"	'b'	1
"bb"	'c'	1
"bc"	'a'	1
"ca"	'a'	2
"aa"	'c'	1
"ac"	'a'	1

For each character in the test string, calculate its Laplace-smoothed probability of occurring (given its context), and then take the log of this.

Laplace estimate and log likelihoods for test string "aabbcaac"
Context	Character	Laplace estimate	log10log10 of laplace estimate
"ac"	'a'	2+12+32+12+3	-0.2218
"aa"	'c'	2+13+32+13+3	-0.3010
"bc"	'a'	1+11+31+11+3	-0.3010
"aa"	'b'	1+13+31+13+3	-0.4771
"bb"	'c'	0+10+30+10+3	-0.4771
"ca"	'a'(2x)	2+13+32+13+3	-0.6021
"ab"	'b'	0+12+30+12+3	-0.6990

Then

total log likelihood = sum of log probabilities = -3.0792, and

average log likelihood = sum of log probabilities = -0.3849

If we have two Markov models, and one test string, then we can select the best model as being the one that maximizes the likelihood of the string under the model i.e. we choose the model with the higher average log likelihood. Notice that the example above lists ngrams from that most in agreement with the model (highest log likelihood) to the least likely. This helps us to explain why a particular test string matches a model or not.

You will need to write code in the ModelMatcher class that takes as input one Markov model and a test string for which we want to identify how well it matches the given model. Complete the ModelMatcher class as follows. You may wish to complete sub-task (c) before sub-task (b).

Fill in the @author and @version fields in the header comments with all authors names, student numbers and the date.

Complete the constructor, which calculates and sets the fields in the class. Getter methods for these fields are already complete.

Complete the totalLogLikelihood method, a private helper method which sums a map containing log likelihoods for strings, and the averageLogLikelihood helper method, which calculates the average.

Complete the toString method for the ModelMatcher class. This method should give a summary of the results. You may choose how to arrange the string, but it should include at least the name, averageLogLikelihood and a table of the loglikelihood probabilities as shown in the example above.

Ensure you complete Javadoc comments for each method and class including @param, @returns and @throws fields as required.

Task 4 - User interface controller class (MatcherController)

Your final required task is to complete a class which can be used to create and modify models and measure how well a test string matches them, as part of a user application. The class will represent the Controller portion of an application contructed using the Model View Controller (MVC) software architectural pattern. Acontroller class in this pattern understands how to create and manipulate a model of some sort (a database of students, say), and generate output from the model which can be displayed in a view class that displays the model in some way. The model might be displayed using text output to a terminal, or graphs displayed on a graphical user interface (GUI), or in many other ways. The controller can be seen as acting as an intermediary between the a user and the model: it accepts requests from the user, updates the model in some way based on those requests, and uses the model to generate output which can be displayed to the user in a convenient form. In this case, the model consists of a ModelMatcher object, and any other objects needed to accurately represent the state of the application.

Note that it is generally considered poor design to allow exceptions to reach the user an end-user of your application is unlikely to know how to deal with a message written to output saying File input error, with a stack-trace (assuming there even is an output for stack traces to be written to; on some platforms, such as Android mobile phones, there may not be). You should try to forestall possible error conditions by catching and handling the errors thrown by, for example, the FileIO class.

You will need to write code in the MatcherController class that can create and manipulate an ArrayList collection of ModelMatcher objects and display their results for convenient viewing.

You will need to decide what fields the MatcherController class will require (although at a minimum, it will need a ModelMatcher field, which we have already included). You will also need to decide what helper methods (if any) are needed, and what code will need to go in the constructor for the class.

Fill in the @author and @version fields in the header comments with all authors names, student numbers and the date.

Complete the constructor and each of the stub methods in the MatcherController class, implementing the functionality described in their Javadoc comments.

Ensure you complete Javadoc comments for each method and class including @param, @returns and @throws fields as required.

Extension Task

Full marks for this project can be obtained by completing Tasks 1, 2, 3 and 4 according to the specifications above. But if you have time and wish to extend your knowledge of Java, then you are encouraged to extend your project to take it further towards a complete application. It is up to you to identify, select and complete additional functionality in your extension code. Your extension code may not alter the specifications of the the main project code. Up to 2 bonus marks will be awarded for the completion and demonstration of extension tasks.

Some possible extension ideas are:

Create a class called MatcherGame which builds two or more Markov models from training text by different authors (or languages or styles or ) and then generates a test string belonging to one of the trained categories and asks a user to guess which of the categories it belongs. Also use your system to make a prediction. Report the correct answer. Your class may include additional functionality, which is left to your imagination. For example, it may search for optimal training parameters such as ngram size to maximise the accuracy of its predictions.

Create a graphical user interface (GUI) class, called MatcherViewer, which displays sample texts and a test string, and the results of using the ModelMatcher class to determine which of the two sample texts is a better match for the test string.

It could allow a user to:

select two different files from the filesystem, each of which will be read in and its contents used to build one of the Markov models in the ModelMatcher

change the order parameter kk used by the ModelMatcher

identify visually which of the two models is a better fit

request (e.g. by clicking a button) an explanation of why a model has been selected as the best fit, and display the text returned by the getExplanation method of the MatcherController class.

Your GUI may include additional functionality, which is left to your imagination you might think of new ways to graphically display the differences between the models, for instance.

You may add additional methods and fields to the project classes from Tasks 1 to 4 but you must not change the signatures of existing methods or the names of existing fields.

Your extension class(es) should include a public static void main(String args[]) method which allows the system to be launched from the command line. It should not require any command-line arguments to be passed. A GUI should initially start with an empty MatcherController. A Game should initially start with a trained system ready to play.

(

/* * Utility function for reading a file into a list of strings * * e.g. calling FileIO.readFile("f.txt"); * where file f.txt contains "abc de gh " * will return an array list <"abc", "de", "", "gh"> * * @author Rachel Cardell-Oliver * based on Barnes and Koelling ResponseReader class * @version May 2017 */

import java.io.*; import java.util.*;

public class FileIO { /** * Read a file from the current directory into a list * @param String filename name of the file to be read * @throws FileNotFoundException, IOException * @return ArrayList of lines from the file */ public static ArrayList readFile(String filename) throws FileNotFoundException, IOException { ArrayList filelines = new ArrayList();

BufferedReader reader = new BufferedReader(new FileReader(filename)); String line = reader.readLine(); while(line != null) { filelines.add(line); line = reader.readLine(); } reader.close(); return filelines; } }

)

(

import java.util.Set; /** * Construct a Markov model of order /k/ based on an input string. * * @author (your name) * @version (a version number or a date) */ public class MarkovModel {

/** Markov model order parameter */ int k; /** ngram model of order k */ NgramAnalyser ngram; /** ngram model of order k+1 */ NgramAnalyser n1gram;

/** * Construct an order-k Markov model from string s * @param k int order of the Markov model * @param s String input to be modelled */ public MarkovModel(int k, String s) { //TODO replace this line with your code }

/** * @return order of this Markov model */ public int getK() { return k; }

/** Estimate the probability of a sequence appearing in the text * using simple estimate of freq seq / frequency front(seq). * @param sequence String of length k+1 * @return double probability of the last letter occuring in the * context of the first ones or 0 if front(seq) does not occur. */ public double simpleEstimate(String sequence) { //TODO replace this line with your code return -1.0;

} /** * Calculate the Laplacian probability of string obs given this Markov model * @input sequence String of length k+1 */ public double laplaceEstimate(String sequence) { //TODO replace this line with your code return -1.0; }

/** * @return String representing this Markov model */ public String toString() { //TODO replace this line with your code return null; }

}

)

(

import java.io.File; import java.util.ArrayList; import java.util.HashMap; import java.util.Set; import java.io.*;

/** Create and manipulate Markov models and model matchers for lists of training data * a test data String and generate output from it for convenient display. * * @author (your name) * @version (a version number or a date) * */ public class MatcherController {

/** list of training data string used to generate markov models */ ArrayList trainingDataList; /** test data to be matched with the models */ String testData; /** order of the markov models*/ int k; /** generated list of markov models for the given training data*/ ArrayList modelList; /** generated list of matchers for the given markov models and test data*/ ArrayList matcherList;

/** Generate models for analysis * @param k order of the markov models to be used * @param testData String to check against different models * @throw unchecked exceptions if the input order or data inputs are invalid */ public MatcherController(int k, ArrayList trainingDataList, String testData) { //TODO }

/** @return a string containing all lines from a file * ff file contents can be got, otherwise null * This method should process any exceptions that arise. */ private static String getFileContents(String filename) { //TODO return null; }

/** * @return the ModelMatcher object that has the highest average loglikelihood * (where all candidates are trained for the same test string */ public ModelMatcher getBestMatch(ArrayList candidates) { //TODO return null; }

/** @return String an *explanation* of * why the test string is the match from the candidate models */ public String explainBestMatch(ModelMatcher best) { //TODO return null; }

/** Display an error to the user in a manner appropriate * for the interface being used. * * @param message */ public void displayError(String message) { // LEAVE THIS METHOD EMPTY }

}

)

(

import java.util.HashMap; import java.util.Collection; import java.util.ArrayList; import java.util.Arrays;

/** * Report the average log likelihood of a test String occuring in a * given Markov model and detail the calculated values behind this statistic. * * @author (your name) * @version (a version number or a date) */ public class ModelMatcher {

/** log likelihoods for a teststring under a given model */ private HashMap logLikelihoodMap; /** summary statistic for this setting */ private double averageLogLikelihood; /** * Constructor to initialise the fields for the log likelihood map for * a test string and a given Markov model and * the average log likelihood summary statistic * @param MarkovModel model a given Markov model object * @param String teststring */ public ModelMatcher(MarkovModel model, String testString) { //TODO }

/** Helper method that calculates the average log likelihood statistic * given a HashMap of strings and their Laplace probabilities * and the total number of ngrams in the model. * * @param logs map of ngram strings and their log likelihood * @param ngramCount int number of ngrams in the original test string * @return average log likelihood: the total of loglikelihoods * divided by the ngramCount */ private double averageLogLikelihood(HashMap logs, int ngramCount) { //TODO return 0.1; } /** Helper method to calculate the total log likelihood statistic * given a HashMap of strings and their Laplace probabilities * and the total number of ngrams in the model. * * @param logs map of ngram strings and their log likelihood * @return total log likelihood: the sum of loglikelihoods in logs */ private double totalLogLikelihood(HashMap logs) { //TODO return 0.1; }

/** * @return the average log likelihood statistic */ public double getAverageLogLikelihood() { return averageLogLikelihood; } /** * @return the log likelihood value for a given ngram from the input string */ public double getLogLikelihood(String ngram) { return (logLikelihoodMap.get(ngram)); } /** * Make a String summarising the log likelihood map and its statistics * @return String of ngrams and their loglikeihood differences between the models * The likelihood table should be ordered from highest to lowest likelihood */ public String toString() { //TODO return null; }

}

)

(

import java.util.ArrayList; import java.util.HashMap; import java.util.Set;

import java.util.HashSet; import java.util.Arrays;

/** * Perform n-gram analysis of a string. * * Analyses the frequency with which distinct n-grams, of length n, * appear in an input string. For the purposes of all analyses of the input * string, the final n-1 n-grams appearing in the string should be * "filled out" to a length of n characters, by adding * a sequence of contiguous characters from the start of the string. * e.g. "abbc" includes "bca" and "cab" in its 3-grams * * @author (your name) * @version (a version number or a date) */ public class NgramAnalyser { /** dictionary of all distinct n-grams and their frequencies */ private HashMap ngram;

/** number of distinct characters in the input */ private int alphabetSize;

/** n-gram size for this object (new field) */ private int ngramSize;

/** * Analyse the frequency with which distinct n-grams, of length n, * appear in an input string. * n-grams at the end of the string wrap to the front * e.g. "abbbbc" includes "bca" and "cab" in its 3-grams * @param int n size of n-grams to create * @param String inp input string to be modelled */ public NgramAnalyser(int n, String inp) { //TODO replace this line with your code }

/** * Analyses the input text for n-grams of size 1. */ public NgramAnalyser(String inp) { this(1,inp); }

/** * @return int the size of the alphabet of a given input */ public int getAlphabetSize() { //TODO replace this line with your code return -1; }

/** * @return the total number of distinct n-grams appearing * in the input text. */ public int getDistinctNgramCount() { //TODO replace this line with your code return -1; }

/** * @return Return a set containing all the distinct n-grams * in the input string. */ public Set getDistinctNgrams() { //TODO replace this line with your code return null; }

/** * @return the total number of n-grams appearing * in the input text (not requiring them to be distinct) */ public int getNgramCount() { //TODO replace this line with your code return -1; }

/** Return the frequency with which a particular n-gram appears * in the text. If it does not appear at all, return 0. * * @param ngram The n-gram to get the frequency of * @return The frequency with which the n-gram appears. */ public int getNgramFrequency(String ngram) { //TODO replace this line with your code return -1; }

/** * Generate a summary of the ngrams for this object. * @return a string representation of the n-grams in the input text * comprising the ngram size and then each ngram and its frequency * where ngrams are presented in alphabetical order. */ public String toString() { //TODO replace this line with your code return null; }

}

)

(

import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; import java.lang.reflect.Field; import java.util.HashMap;

import org.junit.Test;

/** * Unit tests for the utility class MarkovModel. * Do not change this file. Add new tests to the ProjectTest class. * * @author Arran Stewart * @version May 2017 */ public class NgramAnalyserTest { /** tolerance for double comparisons */ static final double tolerance = 0.0001;

/** Use reflection to extract the ngram map, even though it's private. */ @SuppressWarnings("unchecked") public static HashMap extractMap(NgramAnalyser analyser) { try { Field ngramField = analyser.getClass().getDeclaredField("ngram"); ngramField.setAccessible(true); return (HashMap) ngramField.get(analyser); } catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException ex) { throw new RuntimeException(ex); } }

@Test(timeout=1000) public void testSimpleOneGram() { //default value for n should be 1 NgramAnalyser analyser = new NgramAnalyser("abc"); assertEquals(3,analyser.getAlphabetSize());

HashMap ngram = extractMap(analyser);

assertEquals(3,analyser.getDistinctNgramCount()); assertEquals(1,(int)ngram.get("a")); assertEquals(1,(int)ngram.get("b")); assertEquals(1,(int)ngram.get("c")); }

@Test(timeout=1000) public void testOneGram() { //default value for n should be 1 NgramAnalyser analyser = new NgramAnalyser("aabcabaacaac"); assertEquals(3,analyser.getAlphabetSize());

HashMap ngram = extractMap(analyser);

assertEquals(3,analyser.getDistinctNgramCount()); assertEquals(7,(int)ngram.get("a")); assertEquals(2,(int)ngram.get("b")); assertEquals(3,(int)ngram.get("c")); }

@Test(timeout=1000,expected = IllegalArgumentException.class) //TODO exception catch public void testNullString() { NgramAnalyser analyser = new NgramAnalyser(1, null); }

@Test(timeout=1000,expected = IllegalArgumentException.class) //TODO exception catch public void testEmptyString() { NgramAnalyser analyser = new NgramAnalyser(3, ""); }

@Test(timeout=1000,expected = IllegalArgumentException.class) //TODO exception catch public void testInvalidNtooLow() { NgramAnalyser analyser = new NgramAnalyser(0, "test"); }

@Test(timeout=1000,expected = IllegalArgumentException.class) //TODO exception catch public void testInvalidNtooHigh() { NgramAnalyser analyser = new NgramAnalyser(5, "test"); }

@Test(timeout=1000) public void testBiGrams() { NgramAnalyser analyser = new NgramAnalyser(2, "aabcabaacaac"); assertEquals(3,analyser.getAlphabetSize()); assertEquals(6,analyser.getDistinctNgramCount() ); HashMap ngram = extractMap(analyser);

assertEquals(3,(int) ngram.get("aa")); assertEquals(2,(int) ngram.get("ab")); assertEquals(2,(int) ngram.get("ac")); assertEquals(1,(int) ngram.get("ba")); assertEquals(1,(int) ngram.get("bc")); assertEquals(3,(int) ngram.get("ca"));

}

@Test(timeout=1000) public void testTriGrams() { NgramAnalyser analyser = new NgramAnalyser(3,"aabcabaacaac");

assertEquals(3,analyser.getAlphabetSize()); assertEquals(9,analyser.getDistinctNgramCount() );

HashMap ngram = extractMap(analyser);

assertEquals(1,(int)ngram.get("aab")); assertEquals(2,(int)ngram.get("aac")); assertEquals(1,(int)ngram.get("aba")); assertEquals(1,(int)ngram.get("abc")); assertEquals(2,(int)ngram.get("aca")); assertEquals(1,(int)ngram.get("baa")); assertEquals(1,(int)ngram.get("bca")); assertEquals(2,(int)ngram.get("caa")); assertEquals(1,(int)ngram.get("cab")); }

@Test(timeout=1000) public void testNgramCount() { NgramAnalyser analyser; String[] inputTexts = { "WXyZ", "abc", "aabcabaacaac" }; int[] ngramSizes = {1, 2, 3};

for (String inputText : inputTexts) { for (int n : ngramSizes) { analyser = new NgramAnalyser(n,inputText); assertEquals( inputText.length(), analyser.getNgramCount()); } } }

@Test(timeout=1000) public void fieldsUnaltered() { NgramAnalyser analyser = new NgramAnalyser(1,"aa"); Class clazz = analyser.getClass(); assertEquals("should be no public fields", 0, clazz.getFields().length ); int nf = clazz.getDeclaredFields().length; assertTrue("should be at least 2 fields", nf >= 2); //assertEquals("should be 2 fields total", 2, clazz.getDeclaredFields().length ); //allow more than this

try { Field ngramField = clazz.getDeclaredField("ngram"); Field alphField = clazz.getDeclaredField("alphabetSize"); } catch (NoSuchFieldException | SecurityException e) { throw new RuntimeException(e); } }

}

)

(

import static org.junit.Assert.*; import org.junit.After; import org.junit.Before; import org.junit.Test;

/** * The test class ProjectTest for student test cases. * Add all new test cases to this task. * * @author (your name) * @version (a version number or a date) */ public class ProjectTest { /** * Default constructor for test class ProjectTest */ public ProjectTest() { }

/** * Sets up the test fixture. * * Called before every test case method. */ @Before public void setUp() { }

/** * Tears down the test fixture. * * Called after every test case method. */ @After public void tearDown() { } //TODO add new test cases from here include brief documentation @Test(timeout=1000) public void testSensibleToStringSize() { assertEquals(0,1); //TODO replace with test code }

@Test(timeout=1000) public void testGetDistinctNgrams() { assertEquals(0,1); //TODO replace with test code } @Test(timeout=1000) public void testLaplaceExample() { assertEquals(0,1); //TODO replace with test code } @Test(timeout=1000) public void testSimpleExample() { assertEquals(0,1); //TODO replace with test code }

@Test public void testTask3example() { MarkovModel model = new MarkovModel(2,"aabcabaacaac"); ModelMatcher match = new ModelMatcher(model,"aabbcaac"); assertEquals(0,1); //TODO replace with test code } }

)