
Question


Please use Python for the following

Retrieve a collection of documents from a remote source (http://shakespeare.mit.edu), organize the data into appropriate document units, and perform initial tokenization and normalization of the document content.

Overview:

Using Python 3 (and supporting libraries), you will scrape the various documents from this website and then segment them as you see fit into appropriate document units. Once you have all of the documents broken out, you will perform an initial pass at tokenization and normalization to generate a rudimentary vocabulary.

Supporting Python libraries/packages:

1. Python 3

2. Requests (for HTTP requests)

3. BeautifulSoup4 (for HTML processing)

4. NLTK (for tokenization and text processing)

5. json (for JSON manipulation)
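If any of these are missing, they can usually be installed with pip (e.g. `pip install requests beautifulsoup4 nltk`). Note that NLTK's word tokenizer also needs its tokenizer models downloaded once; a minimal setup sketch, assuming a standard NLTK installation:

```python
# One-time setup sketch. Assumes the packages listed above are already
# installed, e.g. via `pip install requests beautifulsoup4 nltk`.
import nltk

# nltk.word_tokenize relies on the "punkt" tokenizer models, which are
# downloaded separately from the library itself.
nltk.download("punkt")
```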

You need to produce:

1. A directory of your resulting document units. Note: documents should be extracted from the original source and stored locally.

2. Output of your initial vocabulary in JSON format with the corresponding document postings list (saved as a JSON file); an illustrative example of the expected shape follows this list.

3. Finished code used to obtain and process the documents
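For deliverable 2, one reasonable shape (not mandated by the assignment) is a JSON object that maps each normalized term to its postings list, i.e. the sorted list of documents it appears in. The excerpt below is purely illustrative; the terms and document identifiers are hypothetical:

```python
import json

# Hypothetical excerpt of the index: each vocabulary term maps to the
# sorted postings list of document ids. These entries are illustrative only.
example_index = {
    "crown": ["hamlet", "macbeth", "richardiii"],
    "daggers": ["hamlet", "macbeth"],
    "yorick": ["hamlet"],
}

# Serialize the index to disk as the JSON deliverable.
with open("index.json", "w", encoding="utf-8") as fh:
    json.dump(example_index, fh, indent=2)
```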

Hints: The basic steps

  1. Generate local copies of the documents we want to process from the source location. This means we need an algorithm to loop through the available anchor (<a>) tags to reach our target documents and then save them locally. Keep in mind that the subsequent links are relative paths, so you will need to track your base URL and do some concatenation or parsing of the paths to build the full URLs you need.
  2. Once we have the documents, we want to strip out the HTML markup (using BeautifulSoup) so that we have the raw text, and then tokenize it using the NLTK library.
  3. Once we have our base tokenization, we need to perform our normalization steps (such as stripping out tokens or special characters we don't want to include, case folding, sorting, etc.).
  4. Finally, we need to add our resulting normalized tokens to our index along with their corresponding postings list and output the resulting index as a JSON file. A rough end-to-end sketch of these steps is given below.
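Putting the four hints together, here is a rough end-to-end sketch, not a reference solution. It assumes the site's front page links each work through a relative "<work>/index.html" anchor and that the complete text sits at "<work>/full.html" next to it; verify that against the live site before relying on it. Names such as DOCS_DIR, fetch_document_urls, and build_index are illustrative choices, not part of the assignment.

```python
import json
import os
from urllib.parse import urljoin

import nltk
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://shakespeare.mit.edu/"
DOCS_DIR = "documents"          # local copies of each document unit
INDEX_FILE = "index.json"       # vocabulary + postings list output


def fetch_document_urls(base_url):
    """Step 1a: read the landing page and turn its relative anchors
    into absolute URLs pointing at the full-text page of each work."""
    html = requests.get(base_url).text
    soup = BeautifulSoup(html, "html.parser")
    urls = {}
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.endswith("index.html"):                     # assumed site layout
            doc_id = href.split("/")[0]                     # e.g. "hamlet"
            full = href.replace("index.html", "full.html")  # assumed site layout
            urls[doc_id] = urljoin(base_url, full)
    return urls


def save_documents(urls, docs_dir):
    """Step 1b: save a local copy of every document unit."""
    os.makedirs(docs_dir, exist_ok=True)
    for doc_id, url in urls.items():
        resp = requests.get(url)
        resp.raise_for_status()
        path = os.path.join(docs_dir, doc_id + ".html")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(resp.text)


def tokenize_and_normalize(html):
    """Steps 2-3: strip markup, tokenize with NLTK, then case-fold and
    drop tokens that are not purely alphabetic."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    tokens = nltk.word_tokenize(text)
    return {tok.lower() for tok in tokens if tok.isalpha()}


def build_index(docs_dir):
    """Step 4: map each normalized token to the sorted list of
    document ids it appears in (its postings list)."""
    index = {}
    for name in sorted(os.listdir(docs_dir)):
        doc_id = os.path.splitext(name)[0]
        with open(os.path.join(docs_dir, name), encoding="utf-8") as fh:
            for token in tokenize_and_normalize(fh.read()):
                index.setdefault(token, []).append(doc_id)
    return {tok: sorted(postings) for tok, postings in sorted(index.items())}


if __name__ == "__main__":
    doc_urls = fetch_document_urls(BASE_URL)
    save_documents(doc_urls, DOCS_DIR)
    index = build_index(DOCS_DIR)
    with open(INDEX_FILE, "w", encoding="utf-8") as fh:
        json.dump(index, fh, indent=2)
```

If the layout assumptions hold, running the script leaves a local documents/ directory of saved pages (deliverable 1) and an index.json file mapping each normalized token to its postings list (deliverable 2).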
