
Question

You are supposed to use the same dataset as used in Assignment 1 for this question.

Download the stories dataset from the given link: http://archives.textfiles.com/stories.zip 

The dataset consists of 467 files and has a size of about 15 MB (including SRE and remaining files). The Farnon folder is excluded from the dataset. Ignore index.html in the stories folder.

a. Carry out the following preprocessing steps on the given dataset:

i. Convert the text to lower case

ii. Perform word tokenization

iii. Remove stopwords from tokens

iv. Remove punctuation marks from tokens

v. Remove blank space tokens

b. Implement the positional index data structure

c. Provide support for the searching of phrase queries. You may assume query length to be less than or equal to 5.
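
One possible way to approach the three parts is sketched below in Python. It is a minimal illustration, not the expected solution. Several things in it are assumptions: the tiny STOPWORDS set stands in for a full stopword list, tokenization is a plain whitespace split, the two in-memory documents stand in for the files of the stories dataset, and the names preprocess, build_positional_index and phrase_search are made up for this example. Positions are counted after stopword removal, so the query has to go through the same preprocessing as the documents.

```python
import string
from collections import defaultdict

# Assumption: a tiny illustrative stopword set; a full list could be swapped in.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "was"}
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Apply steps i-v in the order given in the question."""
    tokens = text.lower().split()                          # i, ii (whitespace tokenization)
    tokens = [t for t in tokens if t not in STOPWORDS]     # iii
    tokens = [t.translate(_PUNCT_TABLE) for t in tokens]   # iv
    return [t for t in tokens if t.strip()]                # v


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for docs given as {doc_id: raw text}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(preprocess(text)):
            index[token][doc_id].append(position)
    return {term: dict(postings) for term, postings in index.items()}


def phrase_search(index, query):
    """Return sorted ids of documents containing the query terms at consecutive positions."""
    terms = preprocess(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Only documents that contain every query term can match.
    candidates = set(postings[0]).intersection(*postings[1:])
    matches = []
    for doc_id in candidates:
        later = [set(p[doc_id]) for p in postings[1:]]
        for start in postings[0][doc_id]:
            # Term k of the phrase must occur exactly k positions after the first term.
            if all(start + k in pos for k, pos in enumerate(later, start=1)):
                matches.append(doc_id)
                break
    return sorted(matches)


# Hypothetical stand-in documents, not taken from the dataset.
docs = {
    "d1": "The quick brown fox.",
    "d2": "A brown fox, and the quick brown dog!",
}
index = build_positional_index(docs)
print(index["brown"])                             # {'d1': [1], 'd2': [0, 3]}
print(phrase_search(index, "brown fox"))          # ['d1', 'd2']
print(phrase_search(index, "quick brown fox"))    # ['d1']
print(phrase_search(index, "fox and the quick"))  # ['d2']
```

Because stopwords are dropped before positions are assigned, "fox and the quick" matches d2 even though the original text has a comma and two stopwords between the words. Note also that with this step order a stopword with punctuation attached, such as "the,", would not be filtered out. Since queries are at most 5 terms long, the position check per candidate start is bounded by a small constant.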
