Question

Using Spark and nothing else, use the file docword.nytimes.txt to answer the following questions.

The file has over 1,000,000 lines. Its first few data lines and its last line are shown below.

1 10 1
1 120 5
1 542 1
1 7821 1
2 576 3
2 746 1
2 7821 5
3 512 1
.....
10000 1243 2

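The format of the docword.*.txt file is 3 header lines, followed by NNZ triples, where D is the number of documents, W is the number of words in the vocabulary, and NNZ is the number of nonzero counts: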

---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---

The format of the vocab.*.txt file is: line n contains the word with wordID=n.

Please provide the list of commands that were used to solve the problem. Indicate the answers to the questions at the top of the file. You will have to remove the first 3 lines of the txt file (the D, W, and NNZ header lines).
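One possible way to drop the three header lines inside Spark, rather than editing the file by hand, is to filter on the number of fields: each header line holds a single number, while every data line holds three. The following is a minimal PySpark sketch, not a definitive solution. The helper names (`is_triple`, `parse_triple`, `load_triples`) and the default path are assumptions made for illustration, and `sc` is assumed to be an existing SparkContext such as the one the `pyspark` shell provides.

```python
def is_triple(line):
    """True for 'docID wordID count' lines; header lines hold a single number."""
    return len(line.split()) == 3


def parse_triple(line):
    """Turn a 'docID wordID count' line into a tuple of three ints."""
    doc_id, word_id, count = line.split()
    return int(doc_id), int(word_id), int(count)


def load_triples(sc, path="docword.nytimes.txt"):
    """Return an RDD of (docID, wordID, count) tuples.

    `sc` is an existing SparkContext (assumed); `path` is a hypothetical
    location for the data file.
    """
    return sc.textFile(path).filter(is_triple).map(parse_triple)
```

In the `pyspark` shell this would be used as `triples = load_triples(sc)`. Filtering by field count also skips any blank lines, which is why it is used here instead of dropping lines by position.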

(a) Which document has the most total words?

(b) Which document has the most unique words?

(c) We will define the lexical richness of a document to be the total number of distinct words in a document divided by the total number of words in the document. What is the average lexical richness of the documents?
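The three questions can be approached with a single per-document aggregation. The sketch below is one hedged way to do it in PySpark, not the approved answer. All function names are hypothetical, `sc` is assumed to be an existing SparkContext, and the code assumes each (docID, wordID) pair appears at most once in the file, so counting a document's rows gives its number of distinct words.

```python
def parse_triple(line):
    """Turn a 'docID wordID count' line into a tuple of three ints."""
    doc_id, word_id, count = line.split()
    return int(doc_id), int(word_id), int(count)


def to_pair(triple):
    """(docID, wordID, count) -> (docID, (count, 1)); the 1 marks one distinct word."""
    doc_id, _word_id, count = triple
    return doc_id, (count, 1)


def merge(a, b):
    """Add two (total_words, unique_words) pairs component-wise."""
    return a[0] + b[0], a[1] + b[1]


def richness(stat):
    """Lexical richness: distinct words divided by total words."""
    total, unique = stat
    return unique / total


def answer_questions(sc, path="docword.nytimes.txt"):
    """Return ((doc, stats) for (a), (doc, stats) for (b), average richness for (c)).

    `path` is a hypothetical location; header lines are skipped because
    they do not have three fields.
    """
    stats = (sc.textFile(path)
               .filter(lambda line: len(line.split()) == 3)
               .map(parse_triple)
               .map(to_pair)
               .reduceByKey(merge)   # docID -> (total_words, unique_words)
               .cache())
    most_total = stats.max(key=lambda kv: kv[1][0])     # question (a)
    most_unique = stats.max(key=lambda kv: kv[1][1])    # question (b)
    avg_richness = stats.mapValues(richness).values().mean()  # question (c)
    return most_total, most_unique, avg_richness
```

As a check on the logic, the four sample lines shown for document 1 have counts 1, 5, 1, and 1, giving 8 total words, 4 distinct words, and a lexical richness of 0.5 for those lines alone.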
