
Question


What are the 50 most common words and their frequencies on the CDM website? Write Python code to answer this question. Write the result to an output file.

Specifications:
1. Start crawling from 'http://www.cdm.depaul.edu/'.
2. Never visit the same page more than once.
3. Visit only pages WITHIN the CDM domain, that is, URLs whose absolute form begins with "http://www.cdm.depaul.edu/". Do not visit external sites.
4. When you process the 'data' (passed to the 'handle_data(data)' method defined in the Python HTMLParser class, which your 'Collector' class inherits, assuming you used the code shown in the lecture PPT), convert all data to lower case.
5. Do NOT include stopwords (e.g. 'the', 'a') in the 50 most common words. For the purpose of this assignment, stopwords are defined in the file "M6_stopwords.txt", newly posted on D2L under the Module 06 Assignments.

Some Hints:

After creating an absolute URL (in Collector), if the final URL contains 'mailto', 'img', or 'course-evaluations', do NOT traverse the link. If you do, your code will raise an error.

In Python HTMLParser, when feed() is called, the order of the tag/data detection sequence is: 1st handle_starttag(), 2nd handle_data(), 3rd handle_endtag().

A major difficulty is that the data passed to handle_data(), which you will have to override in your Collector class, may come from irrelevant or unwanted sections, such as a section started by one of the unwanted tags listed below. You do NOT want to process data from those sections. To that end, first store the tag that was detected in handle_starttag(). Then, when handle_data() is invoked (automatically), check the tag you stored for the data section, and if the tag is one of the unwanted tags, ignore the data extracted from that (tagged) section. You can use this list of unwanted tags: ['script', 'noscript', 'input', 'meta', 'title', 'style', 'form'].

Be sure to remove punctuation, such as ',', '?', '.', and '!', from the tokens in data. Note that a word may carry any number of punctuation marks (not just one), such as "okay?!!" and "
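
The hints above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the lecture's Collector code (which is not shown here): the attribute names current_tag, links, and words, the constants UNWANTED_TAGS, SKIP_MARKERS, and DOMAIN, and the helper top_words are all illustrative names chosen for this example. It uses only the standard library.

```python
from collections import Counter
from html.parser import HTMLParser
from string import punctuation
from urllib.parse import urljoin

# Values taken from the assignment text; the constant names are assumptions.
UNWANTED_TAGS = {'script', 'noscript', 'input', 'meta', 'title', 'style', 'form'}
SKIP_MARKERS = ('mailto', 'img', 'course-evaluations')
DOMAIN = 'http://www.cdm.depaul.edu/'


class Collector(HTMLParser):
    """Collects in-domain links and cleaned, lower-cased words from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_tag = None  # most recent start tag seen by the parser
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Remember the tag so handle_data() can decide whether to keep the text.
        self.current_tag = tag
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                absolute = urljoin(self.base_url, href)
                in_domain = absolute.startswith(DOMAIN)
                blocked = any(marker in absolute for marker in SKIP_MARKERS)
                if in_domain and not blocked:
                    self.links.append(absolute)

    def handle_data(self, data):
        # Ignore text that follows an unwanted start tag.
        if self.current_tag in UNWANTED_TAGS:
            return
        for token in data.lower().split():
            # strip() removes any number of leading/trailing ASCII punctuation marks.
            word = token.strip(punctuation)
            if word:
                self.words.append(word)


def top_words(words, stopwords, n=50):
    """Return the n most common (word, count) pairs, excluding stopwords."""
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(n)
```

Because only the most recent start tag is stored, text that appears after an unwanted element but before the next start tag is also skipped. That follows the hint as written; a stricter version would need to track nesting.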

Step by Step Solution


There are 3 Steps involved in it

Step: 1

Here is the Python code to find the 50 most common words and their frequencies on the CDM website:

PYTHON
import requests
from bs4 import BeautifulSoup
...
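
The rest of the answer's code is not available here. Independently of it, the crawl loop that the question's specifications describe (start at one URL, never visit a page twice, tally words, write the top 50 to a file) might look like the sketch below. It is an assumption-laden illustration, not the expert's solution: it uses urllib from the standard library rather than requests, and the names crawl, fetch, parse_page, load_stopwords, and write_top_words are hypothetical. parse_page is a caller-supplied function that returns the in-domain links and cleaned words for one page, for example by feeding the HTML to a Collector subclass of HTMLParser. The stopword file is assumed to hold whitespace-separated words.

```python
from collections import Counter, deque
from urllib.request import urlopen

START_URL = 'http://www.cdm.depaul.edu/'


def fetch(url):
    """Download a page and return its HTML as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode('utf-8', errors='ignore')


def crawl(start_url, parse_page, fetch_page=fetch, max_pages=None):
    """Breadth-first crawl that visits each URL at most once.

    parse_page(url, html) is a hypothetical callback returning (links, words).
    Returns the combined word counts and the set of visited URLs.
    """
    visited = set()
    queue = deque([start_url])
    counts = Counter()
    while queue and (max_pages is None or len(visited) < max_pages):
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        links, words = parse_page(url, html)
        counts.update(words)
        queue.extend(link for link in links if link not in visited)
    return counts, visited


def load_stopwords(path):
    """Read stopwords from a file; assumes whitespace-separated words."""
    with open(path, encoding='utf-8') as handle:
        return set(handle.read().lower().split())


def write_top_words(counts, stopwords, path, n=50):
    """Write the n most common non-stopword words as 'word<TAB>count' lines."""
    kept = Counter({w: c for w, c in counts.items() if w not in stopwords})
    with open(path, 'w', encoding='utf-8') as out:
        for word, count in kept.most_common(n):
            out.write(f'{word}\t{count}\n')
```

A run would call crawl(START_URL, parse_page), load the stopwords from "M6_stopwords.txt", and pass both to write_top_words with an output path of your choosing. The max_pages argument is an optional safety cap added for this sketch and is not part of the assignment.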

