Question
In Python:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize  # requires NLTK's 'punkt' data: nltk.download('punkt')
import re
import json

def fetchFromURL(url):
    """Fetch content from a URL via an HTTP GET request."""
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                print("Error retrieving information")
    except RequestException as e:
        log_error('Error during request to {0}: {1}'.format(url, str(e)))

def is_good_response(resp):
    """Return True if the response looks like HTML."""
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type

def log_error(e):
    """Log the errors or you'll regret it later..."""
    print(e)

def main():
    url = 'http://shakespeare.mit.edu'
    rawHTML = fetchFromURL(url)
    if rawHTML is None:
        return  # nothing was fetched, so there is nothing to parse

    # Cache the raw HTML to disk, then read it back for parsing.
    with open('main.html', 'wb') as f:
        f.write(rawHTML)
    with open('main.html', 'r') as f:
        data = f.read()

    # Strip the markup and tokenize the remaining raw text.
    soup = BeautifulSoup(data, 'html.parser')
    extractedText = soup.get_text()
    tokenizedText = word_tokenize(extractedText)
    print(tokenizedText)

    # Keep only alphabetic characters (a first pass at normalization).
    reg = re.compile('[^a-zA-Z]')
    term = reg.sub('', extractedText)

if __name__ == '__main__':
    main()
Add the following:
- Once we have the documents, we want to strip out the HTML markup (using Beautiful Soup) so that we have the raw text, and tokenize it using the NLTK library.
- Once we have our base tokenization, we need to perform our normalization steps (such as stripping out tokens or special characters we don't want to include, case folding, and sorting).
- Finally, we need to add our resulting normalized tokens to our index along with their corresponding postings lists and output the resulting index as a JSON file.
Step by Step Solution
There are 3 steps involved, one per requirement above, each with a short illustrative sketch.
Step 1: Strip out the HTML markup with Beautiful Soup and tokenize the raw text with NLTK, as shown in the sketch below.
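A minimal sketch of this step, assuming the raw HTML has already been fetched; the helper name extract_tokens is illustrative, not part of the original code:

from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize  # requires NLTK's 'punkt' data: nltk.download('punkt')

def extract_tokens(raw_html):
    """Strip HTML markup and return the base NLTK word tokens."""
    soup = BeautifulSoup(raw_html, 'html.parser')
    text = soup.get_text()       # raw text with all tags removed
    return word_tokenize(text)   # base tokenization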
Step 2: Normalize the base tokens: case-fold, strip out unwanted characters and tokens, and sort, as shown in the sketch below.
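A sketch of the normalization pass, reusing the [^a-zA-Z] pattern from the question's code; normalize_tokens is an illustrative name:

import re

def normalize_tokens(tokens):
    """Case-fold, strip non-alphabetic characters, drop empty tokens, and sort."""
    non_alpha = re.compile('[^a-zA-Z]')
    normalized = []
    for tok in tokens:
        cleaned = non_alpha.sub('', tok.lower())  # case folding + character stripping
        if cleaned:                               # discard punctuation-only tokens
            normalized.append(cleaned)
    return sorted(normalized)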
Step 3: Add the normalized tokens to the index along with their postings lists and write the index out as a JSON file, as shown in the sketch below.
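A sketch of the index construction, assuming each document is keyed by an ID and a postings list is the sorted list of document IDs containing a term; build_index, write_index, and the file name index.json are illustrative, not from the original:

import json

def build_index(docs):
    """docs maps doc_id -> normalized tokens; returns term -> sorted postings list."""
    index = {}
    for doc_id, tokens in docs.items():
        for term in tokens:
            index.setdefault(term, set()).add(doc_id)
    # JSON has no set type, so convert each postings set to a sorted list
    return {term: sorted(postings) for term, postings in index.items()}

def write_index(index, path='index.json'):
    """Serialize the inverted index to a JSON file."""
    with open(path, 'w') as f:
        json.dump(index, f, indent=2, sort_keys=True)

# Example: index two tiny documents and write the result
tokens1 = ['all', 'the', 'worlds', 'a', 'stage']
tokens2 = ['to', 'be', 'or', 'not', 'to', 'be']
write_index(build_index({'doc1': tokens1, 'doc2': tokens2}))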