Question

1 Approved Answer

Posted on Sep 08, 2024

The uniq command-line utility has been standard to Unix-based operating systems for a long time. On GNU/Linux, uniq was written by Richard Stallman (AKA Saint IGNUcius) and David MacKenzie.

By default, uniq prints one copy of each distinct line in its input. It compares only adjacent lines, so it assumes that its input is already sorted such that duplicate lines are grouped together.

One of the common ways to run uniq is with the -c option, which adds a count of how many times each line appeared:

grep -Po '[^\s]+' /srv/datasets/shakespeare-othello.txt | \
  tr '[:upper:]' '[:lower:]' | \
  sed -E 's/(^[^A-Za-z0-9])|([^A-Za-z0-9]+$)//g' | \
  sort | \
  uniq -c

Note that this is one long command, escaped (with backslashes) to be formatted over multiple lines, and consists of multiple piped commands:

grep isolates all whitespace-delimited tokens from Shakespeare's Othello, one word per line

tr makes all uppercase letters lowercase

sed trims any non-alphanumeric characters from the ends of lines

sort sorts all lines alphanumerically

uniq summarizes the unique lines and how many times each occurs
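
The stages before uniq can be sketched in Java, which may help when reasoning about what the pipeline feeds into uniq. This is a rough equivalent rather than a drop-in replacement: the class name WordTokens and the method normalize are hypothetical, lowercasing uses Locale.ROOT rather than tr's character classes, and the sort uses Java's natural String order, which is assumed to agree with sort only under a C-like locale.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTokens {
    // grep -Po '[^\s]+': every maximal run of non-whitespace characters
    private static final Pattern TOKEN = Pattern.compile("\\S+");
    // Same expression as the sed stage: one leading non-alphanumeric
    // character, or a trailing run of non-alphanumeric characters
    private static final Pattern EDGES =
            Pattern.compile("(^[^A-Za-z0-9])|([^A-Za-z0-9]+$)");

    // Hypothetical helper: returns the sorted, normalized tokens of `text`.
    static List<String> normalize(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT); // tr stage
            words.add(EDGES.matcher(word).replaceAll(""));    // sed stage
        }
        Collections.sort(words);                              // sort stage
        return words;
    }

    public static void main(String[] args) {
        for (String w : normalize("The cat, the CAT!")) {
            System.out.println(w);
        }
    }
}
```

Run on the sample string in main, this prints cat, cat, the, the on separate lines, so the duplicates end up adjacent, which is the form uniq expects.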

You will find that the last 10 lines of output from this command are:

      1 yonders
      6 yong
    476 you
      2 you'l
      6 you'le
      4 young
    225 your
      2 you're
      6 yours
      5 youth

Assignment

You shall write a program in Java that replicates the behavior and output of uniq -c.

That is, your program shall:

Expect input from standard input, consisting of any number of lines of text. Any duplicate lines are assumed to be sequential.

Print each unique line of input, prefixed by the number of occurrences of that line.
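
One way to approach these two requirements is a single pass that remembers the previous line and a running count, printing a record each time the line changes. The sketch below assumes GNU uniq's output layout (the count right-aligned in a 7-character field, then a space, then the line), which is worth confirming against your own uniq -c output. The class name UniqCount and its method names are hypothetical.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.util.Iterator;

class UniqCount {
    // Writes one "count line" record per run of identical adjacent lines.
    static void countRuns(Iterator<String> lines, PrintStream out) {
        if (!lines.hasNext()) {
            return; // empty input produces no output
        }
        String previous = lines.next();
        long count = 1;
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.equals(previous)) {
                count++;
            } else {
                out.print(formatRecord(count, previous));
                previous = line;
                count = 1;
            }
        }
        out.print(formatRecord(count, previous)); // flush the final run
    }

    // Assumed layout: count padded to width 7, one space, the line, newline.
    static String formatRecord(long count, String line) {
        return String.format("%7d %s\n", count, line);
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        // Buffer the output so very large inputs are not slowed by per-line flushes.
        PrintStream out = new PrintStream(new BufferedOutputStream(System.out), false);
        countRuns(in.lines().iterator(), out);
        out.flush();
    }
}
```

Because only the previous line and its count are kept, memory use stays constant however many lines arrive. Like uniq, this counts a line again if it reappears after a different line, which is why the test commands sort first.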

For testing purposes, compare your program's output with uniq -c's. Try the following commands. Substituting your program in place of uniq -c should produce the same output:

# Nucleic acids in human chromosome 11:
fold -w 1 /srv/datasets/chromosome11 | sort | uniq -c

# 1 million digits of pi:
fold -w 1 /srv/datasets/pi1000000 | sort | uniq -c

# Taxonomic ranks:
cut -f 4 /srv/datasets/taxonomy.tab | sort | uniq -c

# Many years worth of baby names in the US:
cut -d , -f 2 /srv/datasets/baby_names_national.csv | sort | uniq -c

# Letter frequency histogram in the KJV:
tr -dc '[:alpha:]' < /srv/datasets/king-james.txt | tr '[:upper:]' '[:lower:]'