Question
The CiteSeer UMD collection is a standard text document collection, consisting of abstracts of research articles from Computer Science, which are sampled from the CiteSeer digital library.
Tasks:
- Write a program that preprocesses the collection. This preprocessing stage should specifically include a function that tokenizes the text. In doing so, tokenize on whitespace and remove punctuation. For this task, please use your own implementation of a tokenizer.
- Determine the frequency of occurrence for all the words in the collection. Answer the following questions:
  - What is the total number of words in the collection?
  - What is the vocabulary size (i.e., the number of unique terms)?
  - What are the top 20 words in the ranking (i.e., the words with the highest frequencies)?
  - From these top 20 words, which ones are stop-words?
  - What is the minimum number of unique words accounting for 15% of the total number of words in the collection? Example: if the total number of words in the collection is 100, and we have the following word-frequency pairs: The 20, of 10, a 10, date 8
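One possible way to approach these tasks is sketched below in Python. It is a minimal sketch under several assumptions that the question does not settle. The collection is taken to be already loaded as a list of abstract strings, because the file layout is not described. Tokens are lowercased so that "The" and "the" count as one term. Punctuation means the ASCII characters in `string.punctuation`, so a hyphenated word collapses into a single token. `STOP_WORDS` is a small illustrative list and not an official one. The function names `tokenize`, `word_frequencies`, `min_words_for_coverage`, and `summarize` are made up for this example.

```python
import string
from collections import Counter

# Assumption: a small illustrative stop-word list; the exercise does not name one.
STOP_WORDS = {
    "the", "of", "a", "an", "and", "in", "to", "is", "for", "that",
    "on", "with", "we", "this", "are", "as", "by", "be",
}


def tokenize(text):
    """Split on whitespace, then drop ASCII punctuation from each token."""
    tokens = []
    for raw in text.split():
        # Assumption: lowercase so that "The" and "the" are the same term.
        cleaned = "".join(ch for ch in raw.lower() if ch not in string.punctuation)
        if cleaned:  # skip tokens that consisted only of punctuation
            tokens.append(cleaned)
    return tokens


def word_frequencies(documents):
    """Count token occurrences over an iterable of document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def min_words_for_coverage(counts, fraction=0.15):
    """Smallest number of distinct words whose counts reach `fraction` of all tokens.

    Taking words in descending frequency order is what makes the count minimal.
    """
    total = sum(counts.values())
    covered = 0
    for n, freq in enumerate(sorted(counts.values(), reverse=True), start=1):
        covered += freq
        if covered >= fraction * total:
            return n
    return 0  # empty collection


def summarize(documents, top_n=20):
    """Collect the quantities the questions ask for into one dictionary."""
    counts = word_frequencies(documents)
    top = counts.most_common(top_n)
    return {
        "total_words": sum(counts.values()),
        "vocabulary_size": len(counts),
        "top_words": top,
        "top_stop_words": [word for word, _ in top if word in STOP_WORDS],
        "min_words_for_15_percent": min_words_for_coverage(counts, 0.15),
    }
```

Calling `summarize(abstracts)` on the list of abstracts would return the total word count, the vocabulary size, the top 20 words with their frequencies, the stop-words among them, and the number of words needed to reach 15% coverage. The actual figures depend on the collection and on the assumptions above, in particular the case folding and the stop-word list.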