
Question

1 Approved Answer


In Python:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
import re
import json


def fetchFromURL(url):
    """Fetch content from a URL via an HTTP GET request."""
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            print("Error retrieving information")
            return None
    except RequestException as e:
        log_error('Error during request to {0}: {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """Return True if the response looks like HTML."""
    # Use .get() so a missing Content-Type header does not raise a KeyError.
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


def log_error(e):
    """Log the errors or you'll regret it later..."""
    print(e)


def main():
    url = 'http://shakespeare.mit.edu'
    rawHTML = fetchFromURL(url)
    if rawHTML is None:
        return

    # Cache the raw page locally, then parse the saved copy.
    with open('main.html', 'wb') as f:
        f.write(rawHTML)
    with open('main.html', 'r') as f:
        data = f.read()

    soup = BeautifulSoup(data, 'html.parser')
    extractedText = soup.get_text()
    tokenizedText = word_tokenize(extractedText)
    print(tokenizedText)

    # Keep only letters and spaces (a first normalization pass); the original
    # pattern '[^a-zA-Z]' also stripped whitespace, mashing all words together.
    reg = re.compile('[^a-zA-Z ]')
    term = reg.sub('', extractedText)


if __name__ == '__main__':
    main()

Add the following:

  • Once we have the documents, we want to strip out the HTML markup (using Beautiful Soup) so that we have the raw text, and then tokenize it using the NLTK library.
  • Once we have our base tokenization, we need to perform our normalization steps (such as stripping out tokens or special characters we don't want to include, case folding, and sorting).
  • Finally, we need to add our resulting normalized tokens to our index along with their corresponding postings lists, and output the resulting index as a JSON file.
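The three steps above can be sketched as follows. This is a minimal, hypothetical extension of the script, not part of the original answer: a plain str.split() stands in for NLTK's word_tokenize so the sketch is self-contained (the tokenizer from the script can be swapped in), and the document IDs and sample texts are invented for illustration.

```python
import json
import re
from collections import defaultdict


def normalize(tokens):
    """Case-fold, strip non-alphabetic characters, and drop empty tokens."""
    cleaned = []
    for tok in tokens:
        tok = re.sub('[^a-zA-Z]', '', tok.lower())
        if tok:
            cleaned.append(tok)
    return cleaned


def build_index(documents):
    """Map each normalized term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # word_tokenize(text) could replace text.split() here.
        for term in normalize(text.split()):
            index[term].add(doc_id)
    # Sort terms and postings for a stable, readable JSON index.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Illustrative documents; in the script these would be fetched pages.
documents = {
    1: "To be, or not to be",
    2: "Now is the winter of our discontent",
}
index = build_index(documents)

# Output the resulting index as a JSON file.
with open('index.json', 'w') as f:
    json.dump(index, f, indent=2)
```

Sets make postings de-duplication automatic while building; converting them to sorted lists at the end gives the JSON-serializable postings lists the last bullet asks for.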

