Question
I just need help with number 1: writing the program, in any language.
Tasks:

1. Write a program that preprocesses the collection. This preprocessing stage should specifically include a function that tokenizes the text. In doing so, tokenize on whitespace and remove punctuation. For this task, please use your own implementation of a tokenizer.

2. Determine the frequency of occurrence for all the words in the collection. Answer the following questions:
   a. What is the total number of words in the collection?
   b. What is the vocabulary size (i.e., the number of unique terms)?
   c. What are the top 20 words in the ranking (i.e., the words with the highest frequencies)?
   d. From these top 20 words, which ones are stop-words?
   e. What is the minimum number of unique words accounting for 15% of the total number of words in the collection?

   Example: if the total number of words in the collection is 100, and we have the following word-frequency pairs:

      Word      Frequency
      the       20
      of        10
      a         10
      data      8
      mining    7

   the answer to this question will be 1 (1 word accounts for 15% of the total 100 words).

3. Integrate the Porter stemmer and a stop word eliminator into your code. Answer again questions a.-e. from the previous point. (See below a link to a Java Porter stemmer implementation and to a stopwords list.)
   https://www.dropbox.com/s/rexuzz3j56vi4bt/Porter.java
   https://www.dropbox.com/s/5789sj8v07j2ido/stopwords.txt
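For task 1, a minimal sketch in Python might look like the following. It is one possible reading of the assignment, not a reference solution. The function names (tokenize, word_frequencies, min_words_for_coverage) are hypothetical, and two choices are assumptions the task does not state: "punctuation" is taken to mean the ASCII characters in string.punctuation, and tokens are lowercased so that "The" and "the" count as the same word. The tokenizer is hand-written (no str.split or regex), since the task asks for your own implementation.

```python
import string
from collections import Counter

# Assumption: "punctuation" means the ASCII punctuation in string.punctuation.
PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Split text on whitespace and drop punctuation characters.

    Tokens are lowercased (an assumption, not required by the task).
    Tokens that consist only of punctuation disappear entirely.
    """
    tokens = []
    current = []  # characters of the token being built
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        elif ch not in PUNCTUATION:
            current.append(ch.lower())
    if current:  # flush the last token
        tokens.append("".join(current))
    return tokens


def word_frequencies(documents):
    """Count token occurrences over an iterable of document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def min_words_for_coverage(counts, fraction=0.15):
    """Smallest number of top-ranked words whose combined frequency
    reaches the given fraction of all word occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= fraction * total:
            return rank
    return 0  # empty collection


if __name__ == "__main__":
    sample = ["Hello, world!  It's data-mining.", "The world of data."]
    freqs = word_frequencies(sample)
    print("total words:", sum(freqs.values()))
    print("vocabulary size:", len(freqs))
    print("top words:", freqs.most_common(3))
```

Note that removing punctuation inside a token joins its parts, so "data-mining" becomes "datamining" and "It's" becomes "its"; if hyphens should split words instead, treat them as whitespace in tokenize. How the collection is read from disk is left out because the question does not describe its format.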