Question
What are the 50 most common words and their frequencies on the CDM website? Write Python code to answer this question. Write the result to an output file.

Specifications:
1. Start crawling from 'http://www.cdm.depaul.edu/'.
2. Never visit the same page more than once.
3. Visit only pages WITHIN the CDM domain, that is, URLs whose absolute form begins with "http://www.cdm.depaul.edu/". Do not visit external sites.
4. When you process the 'data' (handled by the 'handle_data(data)' method defined in the Python HTMLParser class, which your 'Collector' class inherits, assuming you used the code shown in the lecture PPT), convert all data to lower case.
5. In the 50 most common words, DO NOT include stopwords (e.g. 'the', 'a'). Stopwords, for the purpose of our assignment, are defined in the file "M6_stopwords.txt" (newly) posted on D2L, under the Module 06 Assignments.

Some Hints:
After creating an absolute URL (in Collector), if the final URL contains 'mailto', 'img', or 'course-evaluations', do NOT traverse the link. If you do, your code will raise an error.
In Python HTMLParser, when feed() is called, the order of the tag/data detection sequence is: 1st handle_starttag(), 2nd handle_data(), 3rd handle_endtag().
A big annoying difficulty is that the data passed to handle_data(), which you will have to override in your Collector class, may come from irrelevant/unwanted sections, such as a section started by one of the unwanted tags listed below. You do NOT want to process data from those sections. To that end, first store the tag that was detected in handle_starttag(). Then, when handle_data() is invoked (automatically), check the tag you stored for the data section, and if the tag is one of the unwanted tags, ignore the data extracted from that (tagged) section. You can use this list of unwanted tags: ['script', 'noscript', 'input', 'meta', 'title', 'style', 'form'].
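The tag-tracking hint can be sketched roughly as follows. This is only an illustration under assumptions, not the lecture's Collector: the attribute names (`links`, `text`, `current_tag`) and the constants are hypothetical, and resetting the stored tag in handle_endtag() is an addition beyond the hint, made so that text following a closed unwanted section is not dropped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# List of unwanted tags taken from the assignment hints.
UNWANTED_TAGS = ['script', 'noscript', 'input', 'meta', 'title', 'style', 'form']
# Substrings that mark links the hints say not to traverse.
SKIP_MARKERS = ('mailto', 'img', 'course-evaluations')
DOMAIN_PREFIX = 'http://www.cdm.depaul.edu/'


class Collector(HTMLParser):
    """Collects in-domain links and lower-cased text from one page."""

    def __init__(self, url):
        super().__init__()
        self.url = url            # page URL, used to resolve relative links
        self.links = []           # absolute in-domain links found on the page
        self.text = []            # lower-cased data from wanted sections
        self.current_tag = None   # most recent start tag (hypothetical name)

    def handle_starttag(self, tag, attrs):
        # Remember the tag so handle_data() can decide whether to keep data.
        self.current_tag = tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    absolute = urljoin(self.url, value)
                    if (absolute.startswith(DOMAIN_PREFIX)
                            and not any(m in absolute for m in SKIP_MARKERS)):
                        self.links.append(absolute)

    def handle_endtag(self, tag):
        # Assumption beyond the hint: forget the stored tag once it closes.
        self.current_tag = None

    def handle_data(self, data):
        # Ignore data whose section was opened by an unwanted tag.
        if self.current_tag not in UNWANTED_TAGS:
            self.text.append(data.lower())
```

Because only the most recent start tag is stored, text nested deeper inside an unwanted section (for example a label inside a form) is still collected; tracking a stack of open tags would be a stricter alternative.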
Be sure to remove punctuation, such as ',', '?', '.', and '!', from the tokens in data. Note that a word may carry any number of punctuation marks (not just one), such as "okay?!!" and "
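One possible way to handle the punctuation and stopword requirements is sketched below. The function names are hypothetical, the stopword file is assumed to hold whitespace-separated words, and the output format (one "word count" pair per line) is a guess, since the assignment text does not specify it. Note that `str.strip` removes punctuation only from the two ends of a token, so "okay?!!" becomes "okay" while an interior apostrophe is left alone.

```python
import string
from collections import Counter


def load_stopwords(path):
    """Read stopwords from a file (assumed whitespace-separated)."""
    with open(path, encoding='utf-8') as f:
        return set(f.read().lower().split())


def clean_tokens(data, stopwords):
    """Lower-case, split on whitespace, strip punctuation, drop stopwords."""
    tokens = []
    for raw in data.lower().split():
        # Removes any number of leading/trailing punctuation characters.
        word = raw.strip(string.punctuation)
        if word and word not in stopwords:
            tokens.append(word)
    return tokens


def top_words(chunks, stopwords, n=50):
    """Count cleaned tokens across all text chunks; return the n most common."""
    counts = Counter()
    for chunk in chunks:
        counts.update(clean_tokens(chunk, stopwords))
    return counts.most_common(n)


def write_results(pairs, path):
    """Write one 'word count' pair per line (output format is an assumption)."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, count in pairs:
            f.write(f'{word} {count}\n')
```

A typical call would pass the text gathered by the parser, for example `top_words(collected_text, load_stopwords('M6_stopwords.txt'))`, where `collected_text` is whatever list of strings your Collector accumulated.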
Step by Step Solution
Step: 1
Here is the Python code to find the 50 most common words and their frequencies on the CDM website:

import requests
from bs4 import BeautifulSoup
...
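The rest of the answer's code is not shown, so the following is not a reconstruction of it. It is a minimal sketch of the crawl loop the specifications call for (visit each page once, stay inside the domain, skip the problem links), written with the standard library only rather than requests and BeautifulSoup. The names `LinkParser`, `fetch`, `crawl`, and the `fetch_page` and `max_pages` parameters are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

START_URL = 'http://www.cdm.depaul.edu/'
SKIP_MARKERS = ('mailto', 'img', 'course-evaluations')


class LinkParser(HTMLParser):
    """Collects absolute href targets found in <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Drop any #fragment so the same page is not queued twice.
                self.links.append(urldefrag(urljoin(self.base_url, href)).url)


def fetch(url):
    """Download a page as text (needs network access; errors propagate)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode('utf-8', errors='replace')


def crawl(start_url, fetch_page=fetch, max_pages=None):
    """Breadth-first crawl; returns {url: html} with each page fetched once."""
    visited = set()
    pages = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        if max_pages is not None and len(visited) >= max_pages:
            break
        visited.add(url)
        html = fetch_page(url)
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            # Stay inside the domain and skip links the hints warn about.
            if (link.startswith(start_url)
                    and not any(m in link for m in SKIP_MARKERS)
                    and link not in visited):
                queue.append(link)
    return pages
```

Passing the fetch function as a parameter lets the loop be tried against a small in-memory dictionary of pages before pointing it at the real site; `max_pages` is an optional safety cap while experimenting.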