Question
Part 1: Processing & Cleaning Input

Goal
To process a text file into useful semantic descriptor vectors, you need to
1) split the text into sentences (lists of words)
2) clean up the words.

Adding a command to index a file
Write code for the index FILE command. Add the command to the while loop, such that when the user enters the command followed by a file path, the console prints "Indexing FILENAME". Here is a sample interaction.
> index C:\Users\me\similarity\cleanup_test.txt
Indexing C:\Users\me\similarity\cleanup_test.txt
> help
Supported commands:
help - Print the supported commands
quit - Quit this program
index FILE - Read in and index the file given by FILE
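
One way the command loop described above could be organized is sketched below. The class name CommandLoopSketch and the helper methods menuText() and handle() are hypothetical names chosen for illustration; only printMenu(), the command names, and the "Indexing FILENAME" message come from the assignment text. Returning the response as a string is a design choice made here so the logic can be checked without a console.

```java
import java.util.Scanner;

// Hypothetical class; the assignment's own skeleton may be organized differently.
class CommandLoopSketch {

    // Help text: each entry lists the command, its arguments, and a description.
    static String menuText() {
        return "Supported commands:\n"
                + "help - Print the supported commands\n"
                + "quit - Quit this program\n"
                + "index FILE - Read in and index the file given by FILE";
    }

    // printMenu() is the method named in the assignment; here it only prints the text above.
    static void printMenu() {
        System.out.println(menuText());
    }

    // Returns the console response for one input line, or null when the user quits.
    static String handle(String line) {
        String trimmed = line.trim();
        if (trimmed.equals("quit")) {
            return null;
        }
        if (trimmed.equals("help")) {
            return menuText();
        }
        if (trimmed.startsWith("index ")) {
            // Everything after "index " is treated as the path, so paths containing spaces survive.
            String path = trimmed.substring("index ".length()).trim();
            return "Indexing " + path;
        }
        return "Unrecognized command. Type help to see the supported commands.";
    }

    public static void main(String[] args) {
        Scanner console = new Scanner(System.in);
        System.out.print("> ");
        while (console.hasNextLine()) {
            String response = handle(console.nextLine());
            if (response == null) {
                break; // quit
            }
            System.out.println(response);
            System.out.print("> ");
        }
    }
}
```

In this sketch the index branch only echoes the path. The actual reading of the file would be called from that same branch, before the next prompt is printed.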
For every command you add to your program, you also need to add a help message to the printMenu() method. The help message should include (i) the command, (ii) its arguments, and (iii) a description. The exact formatting is not important as long as it has those 3 components. You'll do more with the index command in the next steps.

Reading text files into sentences
Write code to read the text file into sentences.
You should assume that the following punctuation always separates sentences: ".", "!", "?", and that this is the only punctuation that separates sentences. You should assume that only the following punctuation is present in the texts: [,, --, :, ;, ", ""]

We recommend that you use Scanner (https://docs.oracle.com/javase/8/docs/api/java/util/Scanner.html) to read the file sentence by sentence. It allows you to specify a regular expression to use as a delimiter (see the Fish example in the documentation).

Now, the interaction looks like
1 > index C:\Users\me\similarity\cleanup_test.txt
2 Indexing C:\Users\me\similarity\cleanup_test.txt
3 >

Except that after line 2 is printed and before line 3 is printed, your program creates a list of the sentences in that file.

Cleaning up the words
Write code to clean up the words. Small differences in words will cause there to be lots of "unique" words that we wouldn't really consider unique.

First, capitalization: if we don't do any cleanup then "Man" and "man" will be considered different words. You should convert all words to lower case.

Second, root words: if we don't do any cleanup then "glass" and "glasses" will be considered different words. You should find the roots of all words by using "stemming". To do so, use the PorterStemmer (https://opennlp.apache.org/docs/1.8.3/apidocs/opennlptools/opennlp/tools/stemmer/PorterStemmer.html) from OpenNLP. See Appendix A for setting up OpenNLP.

Another problem is that common words (such as "a", "the", "is") may not add much information to our vectors. We call these words stop words. We have provided a list of stop words in stopword.txt. You should remove any word that appears in that file.

Adding a command to print the sentences
The upcoming Parts will require your program to print other information about the text. To control what information is printed, you will use commands. Add a command to print the sentences and the number of sentences. The particular formatting is not important, as long as we can tell what words are in each sentence and how many sentences there are. Here are two example interactions up to this point.

Example 1
1 > index C:\Users\me\similarity\cleanup_test.txt
2 Indexing C:\Users\me\similarity\cleanup_test.txt
3 > sentences
4 [[look, glum, night-cap], [feel, littl, breez], [ah], [whatev, mai, sai, good, aliv, dear, amd]]
5 Num sentences
6 4
7 > quit

Important: After line 2 is printed and before line 3, the program has already internally computed the sentences! The sentences command is only for printing out that list.

Example 2
> sentences
[]
Num sentences
0
>

In this example, there were no sentences because we haven't indexed any files.
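
The sentence-reading and word-cleaning steps described above could be sketched roughly as follows. This is a minimal illustration under several assumptions, not the assignment's reference solution: the class and method names are hypothetical, the stop words are a small inline stand-in for stopword.txt, and the stemmer is passed in as a plain function so the sketch runs without OpenNLP (the identity function is used as a placeholder). It also assumes that the non-sentence punctuation acts as a word separator, that stop words are removed before stemming, and that sentences left with no words are dropped. The assignment does not specify any of these three points.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical class; names are illustrative only.
class SentenceIndexSketch {

    // Splits the input on ".", "!" or "?" and returns one cleaned word list per sentence.
    static List<List<String>> readSentences(Readable source, Set<String> stopWords,
                                            UnaryOperator<String> stemmer) {
        List<List<String>> sentences = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter("[.!?]+"); // regular expression used as the sentence delimiter
            while (scanner.hasNext()) {
                List<String> words = cleanWords(scanner.next(), stopWords, stemmer);
                if (!words.isEmpty()) { // assumption: skip sentences with no remaining words
                    sentences.add(words);
                }
            }
        }
        return sentences;
    }

    // Lower-cases the sentence, strips punctuation, removes stop words, then stems each word.
    static List<String> cleanWords(String sentence, Set<String> stopWords,
                                   UnaryOperator<String> stemmer) {
        List<String> words = new ArrayList<>();
        // Assumption: "--" and , : ; " separate words; a single hyphen stays inside a word.
        String stripped = sentence.toLowerCase(Locale.ROOT).replaceAll("--|[,:;\"]", " ");
        for (String word : stripped.trim().split("\\s+")) {
            if (word.isEmpty() || stopWords.contains(word)) {
                continue;
            }
            words.add(stemmer.apply(word));
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "the", "is")); // stand-in for stopword.txt
        UnaryOperator<String> stemmer = word -> word; // placeholder: no real stemming happens here
        List<List<String>> sentences = readSentences(
                new StringReader("The Man is here. Look: a glass!"), stopWords, stemmer);
        // What a "sentences" command might print:
        System.out.println(sentences);
        System.out.println("Num sentences");
        System.out.println(sentences.size());
    }
}
```

With the placeholder stemmer, the sample text yields [[man, here], [look, glass]] and a count of 2. In the real program, the Readable would come from the indexed file rather than a string, and the placeholder would be replaced by a call into OpenNLP's PorterStemmer once it is set up as described in Appendix A.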