Question

Posted on Aug 29, 2023

What are the 50 most common words and their frequencies on the CDM website? Write Python code to answer this question. Write the result to an output file.

Specifications:
1. Start crawling from 'http://www.cdm.depaul.edu/'.
2. Never visit the same page more than once.
3. Visit only pages WITHIN the cdm domain, that is, URLs whose absolute form begins with "http://www.cdm.depaul.edu/". Do not visit external sites.
4. When you process the 'data' (passed to the 'handle_data(data)' method defined in the Python HTMLParser class, which your 'Collector' class inherits, assuming you used the code shown in the lecture PPT), convert all data to lower case.
5. In the 50 most common words, DO NOT include stopwords (e.g. 'the', 'a'). Stopwords, for the purpose of our assignment, are defined in the file "M6_stopwords.txt" (newly) posted on D2L, under the Module 06 Assignments.

Some Hints:
After creating an absolute URL (in Collector), if the final URL contains 'mailto', 'img', or 'course-evaluations', do NOT traverse the link. If you do, your code will error.

In Python HTMLParser, when feed() is called, the order of the tag/data detection sequence is:
1st handle_starttag()
2nd handle_data()
3rd handle_endtag()

A big annoying difficulty is that the data passed to handle_data(), which you will have to override in your Collector class, may come from irrelevant/unwanted sections, such as a section started by one of the unwanted tags listed below. You do NOT want to process data from those sections. To that end, first store the tag that was detected in handle_starttag(). Then, when handle_data() is invoked (automatically), check the tag you stored, and if it is one of the unwanted tags, ignore the data extracted from that (tagged) section. You can use this list of unwanted tags: ['script', 'noscript', 'input', 'meta', 'title', 'style', 'form'].
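The tag-tracking hint and the crawling rules can be sketched roughly as follows. This is only an illustration under stated assumptions, not the lecture code: the attribute names `current_tag`, `links`, and `texts`, the helper `fetch_html`, and the `crawl` function with its `fetch` parameter are all hypothetical. Only `HTMLParser`, `urljoin`, and `urlopen` are real standard-library calls.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

UNWANTED_TAGS = {'script', 'noscript', 'input', 'meta', 'title', 'style', 'form'}
SKIP_MARKERS = ('mailto', 'img', 'course-evaluations')
DOMAIN_PREFIX = 'http://www.cdm.depaul.edu/'


class Collector(HTMLParser):
    """Collects in-domain links and lower-cased text from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_tag = None  # most recent start tag seen (hypothetical name)
        self.links = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag  # remember the tag for handle_data()
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                url = urljoin(self.base_url, value)  # make the URL absolute
                in_domain = url.startswith(DOMAIN_PREFIX)
                if in_domain and not any(m in url for m in SKIP_MARKERS):
                    self.links.append(url)

    def handle_data(self, data):
        if self.current_tag in UNWANTED_TAGS:
            return  # ignore text that follows an unwanted start tag
        text = data.strip().lower()
        if text:
            self.texts.append(text)


def fetch_html(url):
    """Download a page as text (assumed helper; errors are left to the caller)."""
    with urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='ignore')


def crawl(start_url, fetch=fetch_html):
    """Breadth-first crawl that never fetches the same URL twice."""
    visited, texts = set(), []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        collector = Collector(url)
        collector.feed(html)
        collector.close()  # flush any buffered trailing text
        texts.extend(collector.texts)
        queue.extend(link for link in collector.links if link not in visited)
    return visited, texts
```

Because the stored tag is only updated in handle_starttag(), text that appears after an unwanted section's end tag but before the next start tag is also skipped. That follows the hint as written, but it is a limitation to keep in mind.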
Be sure to remove punctuation, such as ',', '.', '?', and '!', from the tokens in data. Note that a word may carry any number of punctuation marks (not just one), such as "okay?!!" and "
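One possible way to handle the punctuation, stopword, and counting requirements is sketched below. It assumes the stopword file holds one word per line and that only leading and trailing punctuation needs removing (internal characters such as the apostrophe in "don't" are kept). The function names `tokenize`, `load_stopwords`, and `write_top_words` are hypothetical, and the output format is an arbitrary choice.

```python
import string
from collections import Counter


def tokenize(text, stopwords):
    """Lower-case text, split on whitespace, strip punctuation, drop stopwords."""
    words = []
    for token in text.lower().split():
        # str.strip removes any run of punctuation at both ends, e.g. "okay?!!"
        word = token.strip(string.punctuation)
        if word and word not in stopwords:
            words.append(word)
    return words


def load_stopwords(path):
    """Read stopwords, assuming one word per line (format is an assumption)."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}


def write_top_words(counter, path, n=50):
    """Write the n most common words and their counts, tab-separated."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, count in counter.most_common(n):
            f.write(f'{word}\t{count}\n')
```

A typical use would build one `Counter`, update it with `tokenize(...)` for each chunk of collected text, and then call `write_top_words` once at the end.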

Step by Step Solution

Step: 1

Here is the Python code to find the 50 most common words and their frequencies on the CDM website:

    import requests
    from bs4 import BeautifulSoup
    ...
