
Question


Please use Python for the following

Retrieve a collection of documents from a remote source (http://shakespeare.mit.edu), organize the data into appropriate document units, and perform initial tokenization and normalization of the document content.

Overview:

Using Python 3 (and supporting libraries), you will scrape the various documents from this website and then segment them as you see fit into appropriate document units. Once you have all of the documents broken out, you will perform an initial pass at tokenization and normalization to generate a rudimentary vocabulary.

Supporting Python libraries/packages:

1. Python 3

2. Requests (for HTTP requests)

3. BeautifulSoup4 (for HTML processing)

4. NLTK (for tokenization and text processing)

5. json (for JSON manipulation)
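If any of these are missing, they can usually be installed with pip (e.g. `pip install requests beautifulsoup4 nltk`). Note that NLTK's word tokenizer also needs its tokenizer models downloaded once; a minimal setup sketch, assuming a standard NLTK installation:

```python
# One-time setup sketch. Assumes the packages listed above are already
# installed, e.g. via `pip install requests beautifulsoup4 nltk`.
import nltk

# nltk.word_tokenize relies on the "punkt" tokenizer models, which are
# downloaded separately from the library itself.
nltk.download("punkt")
```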

You need to produce:

1. A directory of your resulting document units. Note: documents should be extracted from the original source and stored locally.

2. Output of your initial vocabulary in JSON format with the corresponding document postings list (saved as a JSON file); an illustrative example of the expected shape follows this list.

3. Finished code used to obtain and process the documents
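For deliverable 2, one reasonable shape (not mandated by the assignment) is a JSON object that maps each normalized term to its postings list, i.e. the sorted list of documents it appears in. The excerpt below is purely illustrative; the terms and document identifiers are hypothetical:

```python
import json

# Hypothetical excerpt of the index: each vocabulary term maps to the
# sorted postings list of document ids. These entries are illustrative only.
example_index = {
    "crown": ["hamlet", "macbeth", "richardiii"],
    "daggers": ["hamlet", "macbeth"],
    "yorick": ["hamlet"],
}

# Serialize the index to disk as the JSON deliverable.
with open("index.json", "w", encoding="utf-8") as fh:
    json.dump(example_index, fh, indent=2)
```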

Hints: The basic steps

  1. Generate local copies of the documents we want to process from the source location. This means we need an algorithm to loop through the available anchor (<a>) tags to reach our target documents and then save them locally. Keep in mind that the subsequent links are relative paths, so you will need to track your base URL and do some concatenation or parsing of the paths to build the full URLs you need.
  2. Once we have the documents, we want to strip out the HTML markup (using BeautifulSoup) so that we have the raw text, and then tokenize it using the NLTK library.
  3. Once we have our base tokenization, we need to perform our normalization steps (such as stripping out tokens or special characters we don't want to include, case folding, sorting, etc.).
  4. Finally, we need to add our resulting normalized tokens to our index along with their corresponding postings list and output the resulting index as a JSON file. A rough end-to-end sketch of these steps is given below.
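Putting the four hints together, here is a rough end-to-end sketch, not a reference solution. It assumes the site's front page links each work through a relative "<work>/index.html" anchor and that the complete text sits at "<work>/full.html" next to it; verify that against the live site before relying on it. Names such as DOCS_DIR, fetch_document_urls, and build_index are illustrative choices, not part of the assignment.

```python
import json
import os
from urllib.parse import urljoin

import nltk
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://shakespeare.mit.edu/"
DOCS_DIR = "documents"          # local copies of each document unit
INDEX_FILE = "index.json"       # vocabulary + postings list output


def fetch_document_urls(base_url):
    """Step 1a: read the landing page and turn its relative anchors
    into absolute URLs pointing at the full-text page of each work."""
    html = requests.get(base_url).text
    soup = BeautifulSoup(html, "html.parser")
    urls = {}
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.endswith("index.html"):                     # assumed site layout
            doc_id = href.split("/")[0]                     # e.g. "hamlet"
            full = href.replace("index.html", "full.html")  # assumed site layout
            urls[doc_id] = urljoin(base_url, full)
    return urls


def save_documents(urls, docs_dir):
    """Step 1b: save a local copy of every document unit."""
    os.makedirs(docs_dir, exist_ok=True)
    for doc_id, url in urls.items():
        resp = requests.get(url)
        resp.raise_for_status()
        path = os.path.join(docs_dir, doc_id + ".html")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(resp.text)


def tokenize_and_normalize(html):
    """Steps 2-3: strip markup, tokenize with NLTK, then case-fold and
    drop tokens that are not purely alphabetic."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    tokens = nltk.word_tokenize(text)
    return {tok.lower() for tok in tokens if tok.isalpha()}


def build_index(docs_dir):
    """Step 4: map each normalized token to the sorted list of
    document ids it appears in (its postings list)."""
    index = {}
    for name in sorted(os.listdir(docs_dir)):
        doc_id = os.path.splitext(name)[0]
        with open(os.path.join(docs_dir, name), encoding="utf-8") as fh:
            for token in tokenize_and_normalize(fh.read()):
                index.setdefault(token, []).append(doc_id)
    return {tok: sorted(postings) for tok, postings in sorted(index.items())}


if __name__ == "__main__":
    doc_urls = fetch_document_urls(BASE_URL)
    save_documents(doc_urls, DOCS_DIR)
    index = build_index(DOCS_DIR)
    with open(INDEX_FILE, "w", encoding="utf-8") as fh:
        json.dump(index, fh, indent=2)
```

If the layout assumptions hold, running the script leaves a local documents/ directory of saved pages (deliverable 1) and an index.json file mapping each normalized token to its postings list (deliverable 2).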
