Question
Please use Python for the following:
Retrieve a collection of documents from a remote source (http://shakespeare.mit.edu), organize the data into appropriate document units, and perform initial tokenization and normalization of the document content.
Overview:
Using Python 3 (and supporting libraries), you will scrape the various documents from this website and then segment them as you see fit into appropriate document units. Once you have all of the documents broken out, you will perform an initial pass at tokenization and normalization to generate a rudimentary vocabulary.
Supporting Python libraries/packages:
1. Python 3
2. Requests // for HTTP requests
3. BeautifulSoup4 // for HTML processing
4. NLTK // for tokenization and text processing
5. json // for JSON manipulation
You need:
1. A directory of your resulting document units. Note: documents should be extracted from the original source and stored locally.
2. Your initial vocabulary in JSON format with the corresponding document postings lists (saved as a JSON file).
3. The finished code used to obtain and process the documents.
Hints: The basic steps
- Generate local copies of the documents we want to process from the source location, meaning we need an algorithm to loop through the available anchor (<a>) tags to get to our target documents and then save them locally. Keep in mind that the subsequent links use relative paths: we need to track our base URL and do some concatenation or parsing of the paths to resolve them (see the sketch under Step 1 below).
- Once we have the documents, we want to strip out the HTML markup (using BeautifulSoup) so that we have the raw text, then tokenize it using the NLTK library.
- Once we have our base tokenization, we need to perform our normalization steps (such as stripping out tokens or special characters we don't want to include, case folding, and sorting); see the sketch under Step 2 below.
- Finally, we need to add our resulting normalized tokens to our index along with their corresponding postings lists and output the resulting index as a JSON file (see the sketch under Step 3 below).
Step by Step Solution
There are three steps involved:
Step 1: Retrieve the documents and save local copies
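A minimal sketch of this step, assuming the landing page at http://shakespeare.mit.edu links to each work through relative anchor paths and that each linked page is worth saving as-is. The directory name documents, the file-naming scheme, and the link filter are illustrative choices, not requirements from the assignment.

```python
# Step 1 sketch: loop over the anchor tags on the landing page, resolve
# each relative href against the base URL, and save a local copy.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://shakespeare.mit.edu/"
OUT_DIR = "documents"  # illustrative output directory

def fetch_documents():
    os.makedirs(OUT_DIR, exist_ok=True)
    resp = requests.get(BASE_URL)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Keep only relative paths into the collection; skip external links.
        if href.startswith(("http://", "https://", "mailto:", "#")):
            continue
        url = urljoin(BASE_URL, href)  # resolve the relative path
        page = requests.get(url)
        if page.status_code != 200:
            continue
        # Flatten the relative path into a safe local file name.
        filename = href.strip("/").replace("/", "_") or "index.html"
        with open(os.path.join(OUT_DIR, filename), "w", encoding="utf-8") as f:
            f.write(page.text)

if __name__ == "__main__":
    fetch_documents()
```

Note that the landing page may link to per-work index pages rather than the texts themselves; if so, the same loop would need to be repeated one level deeper. How far to crawl is part of the document-unit decision left to you.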
Step 2: Strip the markup, tokenize, and normalize
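A sketch of the markup-stripping and tokenization pass, assuming Step 1 saved HTML files into a documents directory. NLTK's punkt tokenizer data must be downloaded once (nltk.download("punkt")). The normalization shown, case folding plus keeping only purely alphabetic tokens, is one reasonable choice, not the only valid one.

```python
# Step 2 sketch: strip HTML with BeautifulSoup, tokenize with NLTK,
# then normalize (case folding, dropping punctuation/special tokens).
import os

import nltk
from bs4 import BeautifulSoup

DOC_DIR = "documents"  # where Step 1 saved the files (assumption)

def tokenize_document(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    raw_text = soup.get_text(separator=" ")  # markup stripped away
    tokens = nltk.word_tokenize(raw_text)    # base tokenization
    # Normalization: lowercase everything, keep alphabetic tokens only.
    return [t.lower() for t in tokens if t.isalpha()]

def tokenize_all():
    """Map each saved document name to its normalized token list."""
    return {name: tokenize_document(os.path.join(DOC_DIR, name))
            for name in sorted(os.listdir(DOC_DIR))}
```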
Step 3: Build the index and output it as JSON
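A sketch of the final step: build_index maps each vocabulary term to a sorted postings list of the documents containing it, and the result is dumped as JSON. The sample input and the index.json file name are illustrative; in the full pipeline the input would come from the Step 2 tokenizer.

```python
# Step 3 sketch: build the term -> postings-list index and write JSON.
import json

def build_index(doc_tokens):
    """doc_tokens maps a document name to its list of normalized tokens."""
    index = {}
    for doc_name, tokens in doc_tokens.items():
        for token in set(tokens):  # one posting per document
            index.setdefault(token, []).append(doc_name)
    # Sorted vocabulary and sorted postings lists give deterministic output.
    return {term: sorted(docs) for term, docs in sorted(index.items())}

if __name__ == "__main__":
    # Tiny illustrative input; replace with the Step 2 output in practice.
    sample = {
        "hamlet.html": ["to", "be", "or", "not", "to", "be"],
        "macbeth.html": ["out", "damned", "spot"],
    }
    with open("index.json", "w", encoding="utf-8") as f:
        json.dump(build_index(sample), f, indent=2)
```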