Question
Using Spark and nothing else, use the file docword.nytimes.txt to answer the following questions. The first few lines and the last line of the file are shown below; the full file has over 1,000,000 lines like these.
1 10 1
1 120 5
1 542 1
1 7821 1
2 576 3
2 746 1
2 7821 5
3 512 1
.....
10000 1243 2
The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---
The format of the vocab.*.txt file is: line n contains the word with wordID=n.
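One possible way to drop the three header lines (D, W, NNZ) and parse the triples is sketched below in PySpark. It assumes an existing `SparkContext` passed in as `sc`; the names `HEADER_LINES`, `is_data_line`, `parse_triple` and `load_triples` are hypothetical helpers introduced for illustration, not part of the assignment.

```python
HEADER_LINES = 3  # D, W and NNZ, per the format description above


def is_data_line(indexed_line):
    """Keep a (line, index) pair only if it comes after the header lines."""
    _, index = indexed_line
    return index >= HEADER_LINES


def parse_triple(line):
    """Turn 'docID wordID count' into a tuple of three ints."""
    doc_id, word_id, count = line.split()
    return int(doc_id), int(word_id), int(count)


def load_triples(sc, path):
    """Return an RDD of (docID, wordID, count) tuples.

    sc   -- an existing pyspark SparkContext (assumed to be created elsewhere)
    path -- location of the docword file, e.g. "docword.nytimes.txt"
    """
    return (
        sc.textFile(path)
        .zipWithIndex()        # pair every line with its 0-based line number
        .filter(is_data_line)  # drop lines 0, 1 and 2 (the header)
        .map(lambda pair: parse_triple(pair[0]))
    )
```

Calling `load_triples(sc, "docword.nytimes.txt")` would then give an RDD that the three questions can be answered from. Filtering on the line index is a literal reading of "remove the first 3 lines"; filtering on the number of fields per line would be an alternative.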
Please provide the list of commands that were used to solve the problem. Indicate the answers to the questions at the top of the file. You will have to remove the first 3 lines (the header) of the txt file.
a) Which document has the most total words?
b) Which document has the most unique words?
c) We will define the lexical richness of a document to be the total number of distinct words in a document divided by the total number of words in the document. What is the average lexical richness of the documents?
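A minimal PySpark sketch of how the three questions might be approached is given below. It is not a verified solution: it assumes an existing `SparkContext` passed in as `sc`, that each (docID, wordID) pair appears at most once in the file (so the number of lines per document equals its number of distinct words), and that header lines can be recognised because they hold a single number rather than three. The function names are hypothetical.

```python
def to_doc_stats(line):
    """Map 'docID wordID count' to (docID, (count, 1)).

    The trailing 1 counts one distinct word, assuming each
    (docID, wordID) pair occurs only once in the file.
    """
    doc_id, _word_id, count = line.split()
    return int(doc_id), (int(count), 1)


def add_stats(a, b):
    """Add two (total_words, distinct_words) pairs component-wise."""
    return a[0] + b[0], a[1] + b[1]


def richness(stats):
    """Lexical richness: distinct words divided by total words."""
    total, distinct = stats
    return distinct / total


def answer_questions(sc, path):
    """Return (answer_a, answer_b, answer_c) for the docword file at path."""
    # Header lines D, W, NNZ contain one number each, so they are filtered out.
    lines = sc.textFile(path).filter(lambda l: len(l.split()) == 3)

    # (docID, (total_words, distinct_words)) for every document.
    per_doc = lines.map(to_doc_stats).reduceByKey(add_stats).cache()

    most_total = per_doc.max(key=lambda kv: kv[1][0])    # a) most total words
    most_unique = per_doc.max(key=lambda kv: kv[1][1])   # b) most unique words
    avg_richness = per_doc.mapValues(richness).values().mean()  # c)
    return most_total, most_unique, avg_richness
```

For the four sample lines of document 1 shown in the question, these helpers give 8 total words and 4 distinct words, so a lexical richness of 0.5. A single `reduceByKey` pass is used so that both per-document totals are computed together and the cached result can be reused for all three answers.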