Question

Try to use only basic concepts, nothing more advanced than a class (classes can't be used).

Analyze Each Document's Word List

Once you have produced the word list with stop words removed, you are ready to analyze the word list. There are many ways to do this, but here are the ones required for this assignment:

1. Calculate and output the average word length, accurate to two decimal places. The idea here is that word length is a rough indicator of sophistication.

2. Calculate and output, accurate to three decimal places, the ratio between the number of distinct words and the total number of words. This is a measure of the variety of language used (although it must be remembered that some authors use words and phrases repeatedly to strengthen their message).

3. For each word length starting at 1, find the set of words having that length. Print the length, the number of different words having that length, and at most six of these words. If, for a certain length, there are six or fewer words, then print all of them, but if there are more than six, print the first three and the last three in alphabetical order. For example, suppose our simple text example above were expanded to the list

['weather', 'puppy', 'challenge', 'house', 'whistle', 'nation', 'vest', 'safety', 'house', 'puppy', 'card', 'weather', 'card', 'bike', 'equality', 'justice', 'pride', 'orange', 'track', 'truck', 'basket', 'bakery', 'apples', 'bike', 'truck', 'horse', 'house', 'scratch', 'matter', 'trash']

Then the output should be

1: 0:
2: 0:
3: 0:
4: 3: bike card vest
5: 7: horse house pride ... track trash truck
6: 7: apples bakery basket ... nation orange safety
7: 4: justice scratch weather whistle
8: 1: equality
9: 1: challenge

4. Find the distinct word pairs for this document. A word pair is a two-tuple of words that appear max_sep or fewer positions apart in the document list.
For example, if the user input resulted in max_sep == 2, then the first six word pairs generated will be:

('puppy', 'weather'), ('challenge', 'weather'), ('challenge', 'puppy'), ('house', 'puppy'), ('challenge', 'house'), ('challenge', 'whistle')

Your program should output the total number of distinct word pairs. (Note that ('puppy', 'weather') and ('weather', 'puppy') should be considered the same word pair.) It should also output the first 5 word pairs in alphabetical order (as opposed to the order they are formed, which is what is written above) and the last 5 word pairs. You may assume, without checking, that there are enough words to generate these pairs. Here is the output for the longer example above (assuming that the name of the file they are read from is ex2.txt):

Word pairs for document ex2.txt
54 distinct pairs
apples bakery
apples basket
apples bike
apples truck
bakery basket
puppy weather
safety vest
scratch trash
track truck
vest whistle

5. Finally, as a measure of how distinct the word pairs are, calculate and output, accurate to three decimal places, the ratio of the number of distinct word pairs to the total number of word pairs.

Compare Documents

The last step is to compare the documents for complexity and similarity. There are many possible measures, so we will implement just a few. Before we do this we need to define a measure of similarity between two sets. A very common one, and the one we use here, is called Jaccard Similarity. This is a sophisticated-sounding name for a very simple concept (something that happens a lot in computer science and other STEM disciplines). If A and B are two sets, then the Jaccard similarity is just

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

In plain English, it is the size of the intersection between two sets divided by the size of their union. As examples, if A and B are equal, $J(A, B) = 1$, and if A and B are disjoint, $J(A, B) = 0$. As a special case, if one or both of the sets is empty, the measure is 0.
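The five per-document analysis steps listed above can be sketched with plain functions, lists, and sets (no classes). This is a minimal sketch, not the assignment's reference solution: the function names, the returned data shapes, and the exact spacing of the printed lines are assumptions, and reading the file and removing stop words are assumed to have happened already.

```python
def average_word_length(words):
    """Return the mean word length, or 0.0 for an empty list."""
    if len(words) == 0:
        return 0.0
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def distinct_ratio(words):
    """Return (number of distinct words) / (total number of words)."""
    if len(words) == 0:
        return 0.0
    return len(set(words)) / len(words)


def word_sets_by_length(words):
    """Return a list whose entry i is the set of distinct words of length i + 1."""
    longest = 0
    for word in words:
        longest = max(longest, len(word))
    sets = []
    for _ in range(longest):
        sets.append(set())
    for word in words:
        if len(word) > 0:
            sets[len(word) - 1].add(word)
    return sets


def format_length_line(length, word_set):
    """Build one output line: length, count, and at most six words."""
    ordered = sorted(word_set)
    count = len(ordered)
    shown = ordered
    if count > 6:
        # More than six words: first three and last three alphabetically.
        shown = ordered[:3] + ['...'] + ordered[-3:]
    return '{}: {}: {}'.format(length, count, ' '.join(shown)).strip()


def word_pairs(words, max_sep):
    """Return (set of distinct pairs, total number of pairs formed).

    Each pair is stored in alphabetical order so that ('puppy', 'weather')
    and ('weather', 'puppy') count as the same pair.
    """
    pairs = set()
    total = 0
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_sep + 1, len(words))):
            first, second = words[i], words[j]
            if second < first:
                first, second = second, first
            pairs.add((first, second))
            total += 1
    return pairs, total


# The expanded example list from the assignment text.
example = ['weather', 'puppy', 'challenge', 'house', 'whistle', 'nation',
           'vest', 'safety', 'house', 'puppy', 'card', 'weather', 'card',
           'bike', 'equality', 'justice', 'pride', 'orange', 'track',
           'truck', 'basket', 'bakery', 'apples', 'bike', 'truck', 'horse',
           'house', 'scratch', 'matter', 'trash']

print('Average word length: {:.2f}'.format(average_word_length(example)))
print('Distinct ratio: {:.3f}'.format(distinct_ratio(example)))
sets = word_sets_by_length(example)
for i in range(len(sets)):
    print(format_length_line(i + 1, sets[i]))
pairs, total = word_pairs(example, 2)
print(len(pairs), 'distinct pairs')
ordered_pairs = sorted(pairs)
for pair in ordered_pairs[:5] + ordered_pairs[-5:]:
    print(pair[0], pair[1])
print('Pair ratio: {:.3f}'.format(len(pairs) / total))
```

Storing each pair as an alphabetically ordered tuple lets the set do the de-duplication, and sorting the set once gives both the first five and the last five pairs by slicing.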
The Jaccard measure is quite easy to calculate using Python set operations.

Here are the comparison measures between documents:

1. Decide which has a greater average word length. This is a rough measure of which uses more sophisticated language.

2. Calculate the Jaccard similarity in the overall word use in the two documents. This should be accurate to three decimal places.

3. Calculate the Jaccard similarity of word use for each word length. Each output should also be accurate to three decimal places.

4. Calculate the Jaccard similarity between the word pair sets. The output should be accurate to four decimal places. The documents we study here will not have substantial similarity of pairs, but in other cases this is a useful comparison measure.

See the example outputs for details.

Enter the first file to analyze and compare ==> cat_in_the_hat.txt
Enter the second file to analyze and compare ==> pulse_morning.txt
Enter the maximum separation between words in a pair ==> 2

Evaluating document cat_in_the_hat.txt
1. Average word length: 3.89
2. Ratio of distinct words to total words: 0.254
3. Word sets for document cat_in_the_hat.txt:
1: 0:
2: 4: dr go oh us
3: 52: bad bed bet ... wet yes yet
4: 25: away back ball ... will wish wood
5: 24: asked books bumps ... thump trick white
6: 19: always little looked ... thumps tricks upupup
7: 4: another mothers nothing strings
8: 0:
9: 2: funinabox something
10: 1: playthings
4. Word pairs for document cat_in_the_hat.txt
942 distinct pairs
always cat
always hat
always pick
always playthings
another game
want will
wet wet
wet wish
will will
will yes
5. Ratio of distinct word pairs to total: B.E9?

Evaluating document pulse_morning.txt
1. Average word length: 5.42
2. Ratio of distinct words to total words: 0.242
3. Word sets for document pulse_morning.txt:
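The Jaccard similarity and the comparison measures described before the sample run can be sketched as follows, again with plain functions and sets only. The function names and the two tiny word lists are hypothetical illustrations, not the assignment's files, and the printed labels are assumptions rather than the required output format. Measure 4 would apply the same `jaccard` function to two sets of word-pair tuples.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if len(a) == 0 or len(b) == 0:
        return 0.0
    return len(a & b) / len(a | b)


def average_length(words):
    """Mean word length of a non-empty word list."""
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def words_of_length(words, n):
    """Return the set of distinct words that have exactly n characters."""
    result = set()
    for word in words:
        if len(word) == n:
            result.add(word)
    return result


def jaccard_by_length(words1, words2):
    """Return a list whose entry i is the similarity for word length i + 1."""
    longest = 0
    for word in words1 + words2:
        longest = max(longest, len(word))
    results = []
    for n in range(1, longest + 1):
        results.append(jaccard(words_of_length(words1, n),
                               words_of_length(words2, n)))
    return results


# Hypothetical word lists standing in for two processed documents.
doc1 = ['cat', 'hat', 'sat', 'ball', 'fish']
doc2 = ['cat', 'dog', 'ball', 'morning']

# Measure 1: which document has the greater average word length.
if average_length(doc1) > average_length(doc2):
    print('First document has the greater average word length')
else:
    print('Second document has the greater average word length')

# Measure 2: overall word-use similarity, to three decimal places.
print('Overall word use similarity: {:.3f}'.format(jaccard(set(doc1), set(doc2))))

# Measure 3: similarity for each word length, to three decimal places.
by_length = jaccard_by_length(doc1, doc2)
for i in range(len(by_length)):
    print('{}: {:.3f}'.format(i + 1, by_length[i]))
```

Checking for an empty set before dividing covers the special case in the definition and also avoids dividing by zero when both sets are empty.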
