Question


Posted on May 15, 2024


Problem Introduction

There are many software systems for analyzing the style and sophistication of written text and even deciding if two documents were authored by the same individual. The systems analyze documents based on the sophistication of word usage, frequently used words, and words that appear closely together. In this assignment you will write a

the list of strings ['01-34', "can't", '42weather67', 'puppy,', 'and123', 'Ch73%allenge', '10ho32use..'] and this should be split into the list of (non-empty) strings ['cant', 'weather', 'puppy', 'and', 'challenge', 'house']. Note that the first string, '01-34', is completely removed because it has no letters.

All three files, stop.txt and the two document files called doc1.txt and doc2.txt above, should be parsed this way. Once this parsing is done, the list resulting from parsing the file stop.txt should be converted to a set. This set contains what are referred to in NLP as 'stop words': words that appear so frequently in text that they should be ignored.

The files doc1.txt and doc2.txt contain the text of the two documents to compare. For each, the list returned from parsing should be further modified by removing any stop words. Continuing with our example, if 'cant' and 'and' are stop words, then the word list should be
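The parsing and stop-word rules described above can be sketched in Java as follows. This is only an illustration under assumptions: the class name WordParser, its method names, and the sample input string (the example tokens joined by spaces) are hypothetical, and "letters" is taken to mean ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; the names are illustrative, not from the assignment.
class WordParser {
    // Split on whitespace, delete every non-letter, lowercase the rest,
    // and drop tokens that end up empty (such as "01-34").
    static List<String> parse(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String cleaned = token.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                words.add(cleaned);
            }
        }
        return words;
    }

    // Return a copy of the word list without any word found in the stop set.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample input built from the example tokens.
        String sample = "01-34 can't 42weather67 puppy, and123 Ch73%allenge 10ho32use..";
        List<String> words = parse(sample);
        System.out.println(words);
        System.out.println(removeStopWords(words, Set.of("cant", "and")));
    }
}
```

Storing the stop words in a Set keeps each membership check cheap, which matters when the documents are long.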

1. Calculate and output the average word length, accurate to two decimal places. The idea here is that word length is a rough indicator of sophistication.
2. Calculate and output, accurate to three decimal places, the ratio between the number of distinct words and the total number of words. This is a measure of the variety of language used (although it must be remembered that some authors use words and phrases repeatedly to strengthen their message).
3. For each word length starting at 1, find the set of words having that length. Print the length, the number of different words having that length, and at most six of these words. If, for a certain length, there are six or fewer words, then print all of them, but if there are more than six, print the first three and the last three in alphabetical order. For example, suppose our simple text
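A minimal Java sketch of the three measurements follows. The class name WordStats, the method names, the sample word list, and the print layout are all assumptions made for illustration; the assignment's exact output format is not shown here. The sketch also assumes a non-empty word list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper class; the names and output layout are illustrative only.
class WordStats {
    // Mean number of characters per word (assumes a non-empty list).
    static double averageLength(List<String> words) {
        int total = 0;
        for (String word : words) {
            total += word.length();
        }
        return (double) total / words.size();
    }

    // Number of distinct words divided by the total number of words.
    static double distinctRatio(List<String> words) {
        return (double) new HashSet<>(words).size() / words.size();
    }

    // Distinct words of exactly the given length, in alphabetical order.
    static SortedSet<String> wordsOfLength(List<String> words, int length) {
        SortedSet<String> result = new TreeSet<>();
        for (String word : words) {
            if (word.length() == length) {
                result.add(word);
            }
        }
        return result;
    }

    // All words if there are six or fewer, else the first three and last three.
    static List<String> atMostSix(SortedSet<String> sortedWords) {
        List<String> all = new ArrayList<>(sortedWords);
        if (all.size() <= 6) {
            return all;
        }
        List<String> picked = new ArrayList<>(all.subList(0, 3));
        picked.addAll(all.subList(all.size() - 3, all.size()));
        return picked;
    }

    public static void main(String[] args) {
        // Assumed sample data, not taken from the assignment files.
        List<String> words = List.of("weather", "puppy", "challenge", "house", "puppy");
        System.out.println(String.format(Locale.ROOT, "%.2f", averageLength(words)));
        System.out.println(String.format(Locale.ROOT, "%.3f", distinctRatio(words)));
        int maxLength = 0;
        for (String word : words) {
            maxLength = Math.max(maxLength, word.length());
        }
        for (int length = 1; length <= maxLength; length++) {
            SortedSet<String> set = wordsOfLength(words, length);
            System.out.println(length + ": " + set.size() + " " + String.join(" ", atMostSix(set)));
        }
    }
}
```

A TreeSet is used so that each length's words are both de-duplicated and already in alphabetical order when the first and last three are picked.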

Q10:

The uniq command-line utility has been standard to Unix-based operating systems for a long time. On GNU/Linux, uniq was written by Richard Stallman (AKA Saint IGNUcius) and David MacKenzie.

uniq (by default) prints only the unique lines in its input. uniq also assumes that its input is already sorted, such that identical lines are grouped together.
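The adjacency assumption can be illustrated with a small Java sketch. The class AdjacentUnique and its collapse method are hypothetical names; the logic drops a line only when it equals the line immediately before it, so unsorted input can still yield repeated lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of collapsing adjacent duplicate lines.
class AdjacentUnique {
    // Keep a line only if it differs from the line immediately before it.
    static List<String> collapse(List<String> lines) {
        List<String> result = new ArrayList<>();
        String previous = null;
        for (String line : lines) {
            if (!line.equals(previous)) {
                result.add(line);
            }
            previous = line;
        }
        return result;
    }

    public static void main(String[] args) {
        // Unsorted: the two "b" lines are not adjacent, so both survive.
        System.out.println(collapse(List.of("b", "a", "b")));
        // Sorted: the duplicates are adjacent and collapse to one.
        System.out.println(collapse(List.of("a", "b", "b")));
    }
}
```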

One of the common ways to run uniq is with the -c option, which adds a count of how many times each line appeared:

grep -Po '[^\s]+' /srv/datasets/shakespeare-othello.txt | \
tr '[:upper:]' '[:lower:]' | \
sed -E 's/(^[^A-Za-z0-9])|([^A-Za-z0-9]+$)//g' | \
sort | \
uniq -c

Note that this is one long command, escaped (with backslashes) to be formatted over multiple lines, and consists of multiple piped commands:

grep isolates all whitespace-delimited tokens from Shakespeare's Othello, one word per line

tr makes all uppercase letters lowercase

sed trims any non-alphanumeric characters from the ends of lines

sort sorts all lines alphanumerically

uniq summarizes the unique lines and how many times each occurs
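For comparison, the same stages can be approximated in Java. This is a rough sketch, not an exact equivalent: the class TokenCounter is a hypothetical name, the TreeMap stands in for the sort and uniq -c steps together, and its ordering is by character code, which may differ from the locale-aware order that sort produces (for example, where apostrophes fall).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of the grep | tr | sed | sort | uniq -c pipeline.
class TokenCounter {
    // Same idea as grep -Po '[^\s]+': runs of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\S+");
    // Same expression as the sed command: one leading non-alphanumeric
    // character, or a trailing run of non-alphanumeric characters.
    private static final Pattern EDGE = Pattern.compile("(^[^A-Za-z0-9])|([^A-Za-z0-9]+$)");

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // keys kept in sorted order
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            String token = matcher.group().toLowerCase(Locale.ROOT); // like tr
            token = EDGE.matcher(token).replaceAll("");              // like sed
            counts.merge(token, 1, Integer::sum);                    // like uniq -c
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample text, not taken from the dataset.
        for (Map.Entry<String, Integer> entry : count("You you, --your YOU!").entrySet()) {
            System.out.println(entry.getValue() + " " + entry.getKey());
        }
    }
}
```

Note that the sed expression removes only a single leading non-alphanumeric character, so a token such as --your keeps one hyphen; the sketch mirrors that behavior rather than changing it.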

You will find that the last 10 lines of output from this command are:

      1 yonders
      6 yong
    476 you
      2 you'l
      6 you'le
      4 young
    225 your
      2 you're
      6 yours
      5 youth

Assignment

You shall write a program in Java that replicates the behavior and output of uniq -c.

That is, your program shall:

Expect input from standard input, consisting of any number of lines of text. Any duplicate lines are assumed to be sequential.

Print each unique line of input, prefixed by the number of occurrences of that line.
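One possible shape for such a program is sketched below; it is a starting point under stated assumptions rather than a reference solution. The class name UniqCount and the run method are hypothetical. The output format, a count right-aligned in a 7-character field followed by one space, matches what GNU uniq -c prints, but it is worth confirming against the uniq on your own system.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.Locale;

// Hypothetical sketch of a uniq -c style counter for adjacent duplicate lines.
class UniqCount {
    // Read lines from 'in' and write "count line" for each run of equal lines.
    static void run(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        String previous = null; // the line whose run is currently being counted
        long count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals(previous)) {
                count++;
            } else {
                if (previous != null) {
                    print(writer, count, previous);
                }
                previous = line;
                count = 1;
            }
        }
        if (previous != null) {
            print(writer, count, previous); // flush the final run
        }
        writer.flush();
    }

    // Assumed format: count padded to width 7, one space, then the line.
    private static void print(PrintWriter writer, long count, String line) {
        writer.printf(Locale.ROOT, "%7d %s\n", count, line);
    }

    public static void main(String[] args) throws IOException {
        // Uses the platform default charset for both input and output.
        run(new InputStreamReader(System.in), new OutputStreamWriter(System.out));
    }
}
```

Only the previous line and a counter are held in memory, so the input size is not limited by available heap; this works precisely because duplicates are assumed to be sequential.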

For testing purposes, compare your program's output with uniq -c's. Try the following commands. Substituting your program in place of uniq -c should produce the same output:

# Nucleic acids in human chromosome 11:

fold -w 1 /srv/datasets/chromosome11 | sort | uniq -c

# 1 million digits of pi:

fold -w 1 /srv/datasets/pi1000000 | sort | uniq -c

# Taxonomic ranks:

cut -f 4 /srv/datasets/taxonomy.tab | sort | uniq -c

# Many years worth of baby names in the US:

cut -d , -f 2 /srv/datasets/baby_names_national.csv | sort | uniq -c

# Letter frequency histogram in the KJV

tr -dc '[:alpha:]' tr '[:upper:]' '[:lower:]'