Question

1 Approved Answer

Posted on Sep 06, 2024

You are going to write a complete C program which implements the following functionality.

Your program will read the following files:

language_1.txt
language_2.txt
language_3.txt
language_4.txt
language_5.txt
language_x.txt

Each file contains text in a specific language. All files contain only English lowercase characters and whitespace. Text files will include the following characters:

'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' ' '

Your program will evaluate the dissimilarity scores of the following language pairs:

(language_x, language_1)
(language_x, language_2)
(language_x, language_3)
(language_x, language_4)
(language_x, language_5)

First of all, calculate the bi-gram frequencies for each language. A bi-gram is defined as follows: for a given sequence, each unique pairing of successive letters is a bi-gram. For example, for the word " adana " the bi-grams are " a", "ad", "da", "an", "na", "a ". Beware: if there is a space before or after a character, you are still dealing with bi-grams. Each bi-gram has exactly two elements, each of which is either a character or a space.

To calculate the frequency of a particular bi-gram (let's say the bi-gram "ad"), you have to count all the bi-grams in a given text and, for this bi-gram, calculate the ratio

(# of "ad") / (total # of all bi-grams)

Given all the frequencies, the dissimilarity score is calculated as follows:

$$\mathrm{dissimilarity}(language_a, language_b) = \sum_i \left| f_i^a - f_i^b \right| \qquad (1)$$

Here $f_i^a$ represents the frequency of the $i$-th bi-gram for language $a$. If $c_i^a$ is the count of the $i$-th bi-gram in language $a$, then

$$f_i^a = \frac{c_i^a}{\sum_j c_j^a} \qquad (2)$$

After evaluating the dissimilarities, your program will print all the dissimilarity values. Print:

dissimilarity(language_x, language_1)
dissimilarity(language_x, language_2)
dissimilarity(language_x, language_3)
dissimilarity(language_x, language_4)
dissimilarity(language_x, language_5)

You will print just the numbers (i.e. the dissimilarity scores).
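As a rough illustration of equations (1) and (2), the following minimal sketch counts bi-grams in an in-memory string and compares two count tables. The names `sym_index`, `count_bigrams`, and `dissimilarity`, and the choice to index the 27 symbols as 0..26 with the space last, are assumptions made for this example and are not part of the assignment. It does not deal with file input or repeated whitespace.

```c
#include <assert.h>
#include <stddef.h>

#define NSYM 27                 /* 'a'..'z' plus the space character */
#define NBIGRAM (NSYM * NSYM)   /* every possible ordered pair of symbols */

/* Hypothetical helper: map 'a'..'z' to 0..25 and ' ' to 26. */
int sym_index(char c)
{
    assert(c == ' ' || (c >= 'a' && c <= 'z'));
    return c == ' ' ? NSYM - 1 : c - 'a';
}

/* Add the bi-grams of s to counts (caller zeroes counts first).
 * Returns the number of bi-grams counted in s. */
long count_bigrams(const char *s, long counts[NBIGRAM])
{
    long total = 0;
    size_t i;

    for (i = 0; s[i] != '\0' && s[i + 1] != '\0'; i++) {
        counts[sym_index(s[i]) * NSYM + sym_index(s[i + 1])]++;
        total++;
    }
    return total;
}

/* Equation (1): sum over all bi-grams of |f_i^a - f_i^b|,
 * where each frequency is count / total as in equation (2). */
double dissimilarity(const long ca[NBIGRAM], long ta,
                     const long cb[NBIGRAM], long tb)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < NBIGRAM; i++) {
        double fa = ta > 0 ? (double)ca[i] / (double)ta : 0.0;
        double fb = tb > 0 ? (double)cb[i] / (double)tb : 0.0;
        double d = fa - fb;
        sum += d < 0.0 ? -d : d;   /* absolute value without <math.h> */
    }
    return sum;
}
```

With these definitions, `count_bigrams(" adana ", counts)` counts 6 bi-grams, matching the example in the question. Keeping raw counts and dividing only when comparing means the table for language_x can be built once and reused for all five comparisons.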
Don't print anything extra.

Efficiency of your implementation is important. Don't permanently store content if you are not going to re-use it. Don't repeatedly calculate any output if the input is the same. The frequencies of bi-grams in language_x should be calculated only once: store them in an array and re-use the calculated frequencies to calculate the dissimilarity scores. Do not read the files multiple times.

Text files can include multiple consecutive whitespace characters. Two adjacent whitespace characters do not create a bi-gram. Input files can be multi-line text files.

There isn't any limit on the size of input files. Your program should work regardless of the size of the input. Be careful with the size of any array you allocate on the program stack: large arrays may not fit (the stack size may be smaller on the test machine), in which case your program will crash.

Make sure you can read input files with or without a trailing newline at the end. (On a Windows machine a newline is CRLF; on Unix it is LF.) You can alter this using advanced editors (e.g. Visual Studio Code). Test your code for every possible combination.

Do not print anything other than the expected output. You cannot use anything which is not covered in class. Do not submit any of the files you used for testing.
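One way to meet the reading constraints above is to process each file one character at a time, so that only a fixed-size count table is kept and the file contents are never stored. The sketch below is a hedged example rather than a reference solution. The names `bigram_counter`, `counter_init`, `counter_feed`, and `counter_read_file` are hypothetical. It assumes that newline, carriage return, and tab all count as whitespace equivalent to a space, and that any other unexpected byte is skipped; the assignment does not state either point explicitly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NSYM 27                 /* 'a'..'z' plus whitespace */
#define NBIGRAM (NSYM * NSYM)
#define WS (NSYM - 1)           /* index used for any whitespace */

struct bigram_counter {
    long counts[NBIGRAM];       /* fixed size, independent of file size */
    long total;                 /* number of bi-grams seen so far */
    int prev;                   /* previous symbol index, -1 at start */
};

void counter_init(struct bigram_counter *bc)
{
    assert(bc != NULL);
    memset(bc, 0, sizeof *bc);
    bc->prev = -1;
}

/* Feed one character. LF, CR and tab are treated like a space
 * (an assumption), so CRLF and LF files give the same counts. */
void counter_feed(struct bigram_counter *bc, int ch)
{
    int cur;

    assert(bc != NULL);
    if (ch >= 'a' && ch <= 'z')
        cur = ch - 'a';
    else if (ch == ' ' || ch == '\n' || ch == '\r' || ch == '\t')
        cur = WS;
    else
        return;                 /* assumption: skip unexpected bytes */

    /* Two adjacent whitespace characters do not form a bi-gram. */
    if (bc->prev >= 0 && !(bc->prev == WS && cur == WS)) {
        bc->counts[bc->prev * NSYM + cur]++;
        bc->total++;
    }
    bc->prev = cur;
}

/* Read a file exactly once; returns 0 on success, -1 if it cannot be opened. */
int counter_read_file(struct bigram_counter *bc, const char *path)
{
    FILE *fp = fopen(path, "r");
    int ch;

    if (fp == NULL)
        return -1;
    counter_init(bc);
    while ((ch = fgetc(fp)) != EOF)
        counter_feed(bc, ch);
    fclose(fp);
    return 0;
}
```

Under these assumptions a trailing newline is counted as one final whitespace symbol, so a file ending in "a\n" contributes an "a " bi-gram that the same file without the newline does not; whether that is the intended behaviour is a judgment call. Because the table has only 729 entries, it could be declared `static` or at file scope for language_x and reused for all five comparisons, with a second table reused for language_1 through language_5.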