
Question

1 Approved Answer


In Python:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
import re
import json


def fetchFromURL(url):
    """Fetch content from a URL via an HTTP GET request."""
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            print("Error retrieving information")
            return None
    except RequestException as e:
        log_error('Error during request to {0}: {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """Return True if the response looks like HTML."""
    # Use .get() so a missing Content-Type header does not raise a KeyError.
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


def log_error(e):
    """Log the errors or you'll regret it later..."""
    print(e)


def main():
    url = 'http://shakespeare.mit.edu'
    rawHTML = fetchFromURL(url)
    if rawHTML is None:
        return

    # Cache the raw page locally, then parse the saved copy.
    with open('main.html', 'wb') as f:
        f.write(rawHTML)
    with open('main.html', 'r') as f:
        data = f.read()

    soup = BeautifulSoup(data, 'html.parser')
    extractedText = soup.get_text()
    tokenizedText = word_tokenize(extractedText)
    print(tokenizedText)

    # Keep only letters and spaces (a first normalization pass); the original
    # pattern '[^a-zA-Z]' also stripped whitespace, mashing all words together.
    reg = re.compile('[^a-zA-Z ]')
    term = reg.sub('', extractedText)


if __name__ == '__main__':
    main()

Add the following:

  • Once we have the documents, we want to strip out the HTML markup (using Beautiful Soup) so that we have the raw text, and then tokenize it using the NLTK library.
  • Once we have our base tokenization, we need to perform our normalization steps (such as stripping out tokens or special characters we don't want to include, case folding, and sorting).
  • Finally, we need to add our resulting normalized tokens to our index along with their corresponding postings lists, and output the resulting index as a JSON file.
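The three steps above can be sketched as follows. This is a minimal, hypothetical extension of the script, not part of the original answer: a plain str.split() stands in for NLTK's word_tokenize so the sketch is self-contained (the tokenizer from the script can be swapped in), and the document IDs and sample texts are invented for illustration.

```python
import json
import re
from collections import defaultdict


def normalize(tokens):
    """Case-fold, strip non-alphabetic characters, and drop empty tokens."""
    cleaned = []
    for tok in tokens:
        tok = re.sub('[^a-zA-Z]', '', tok.lower())
        if tok:
            cleaned.append(tok)
    return cleaned


def build_index(documents):
    """Map each normalized term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # word_tokenize(text) could replace text.split() here.
        for term in normalize(text.split()):
            index[term].add(doc_id)
    # Sort terms and postings for a stable, readable JSON index.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Illustrative documents; in the script these would be fetched pages.
documents = {
    1: "To be, or not to be",
    2: "Now is the winter of our discontent",
}
index = build_index(documents)

# Output the resulting index as a JSON file.
with open('index.json', 'w') as f:
    json.dump(index, f, indent=2)
```

Sets make postings de-duplication automatic while building; converting them to sorted lists at the end gives the JSON-serializable postings lists the last bullet asks for.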

