Question

Minor Assignment #2: Regular Expression

Submission Instructions:

1. Submit two files: 1) your program, and 2) a screenshot in jpg format showing that your program really works.

2. Submit individual files. DO NOT SUBMIT A ZIP FILE.

Problem:

Write a batch (shell) script that combines a few Linux tools to perform a big-data processing task: finding the most frequently used words on Wikipedia pages.

Running the script generates a list of the distinct words used in the Wikipedia pages, together with the number of occurrences of each word on those pages. The words are sorted by number of occurrences in ascending order. The following is a sample of the output generated for 4 Wikipedia pages.

126 that
128 by
133 as
149 or
160 for
164 is
189 on
191 from
345 to
375 advertising
443 a
473 and
480 in
677 of
1080 the

Since Wikipedia has a huge number of pages, it is not realistic to analyze all of them in a short time on one machine. In this project, you need to analyze the pages for all Wikipedia entries whose names consist of two capital letters. For example, the Wikipedia page for entry "AC" is https://en.wikipedia.org/wiki/AC. You can use the following command to download the page and save it as AC.html:

wget https://en.wikipedia.org/wiki/AC -O AC.html
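
The single-page command above can be repeated for every two-capital-letter entry (AA through ZZ, 676 names in total). The following is a minimal sketch, not the required solution. The helper names `list_entries` and `download_all` are hypothetical, and the cleanup of failed downloads assumes that some of the 676 names have no page, in which case wget exits with an error and leaves an empty output file.

```shell
#!/bin/sh
# Sketch: download every two-capital-letter Wikipedia entry (AA ... ZZ).
letters="A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"

# list_entries (hypothetical helper): print the 676 entry names, one per line.
list_entries() {
    for first in $letters; do
        for second in $letters; do
            echo "${first}${second}"
        done
    done
}

# download_all (hypothetical helper): save each page as <entry>.html.
# If wget fails (for example, the entry does not exist), remove the
# empty file that -O leaves behind.
download_all() {
    list_entries | while read -r entry; do
        wget -q "https://en.wikipedia.org/wiki/${entry}" -O "${entry}.html" \
            || rm -f "${entry}.html"
    done
}
```

Calling `download_all` at the end of the script starts the downloads in the current directory.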

An HTML page contains HTML tags, which should be removed before the analysis. (Open a .html file in vi and in a web browser, and you will see the difference.) You can use lynx to extract the text content into a text file. For example, the following command extracts the content for entry "AC" into AC.txt:

lynx -dump -nolist AC.html > AC.txt
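
To apply the same extraction to every downloaded page, the command can be wrapped in a loop. This is a sketch under the assumption that all the .html files sit in the current directory; the function name `extract_all` is hypothetical.

```shell
#!/bin/sh
# extract_all (hypothetical helper): convert every .html file in the
# current directory to plain text with lynx, writing <name>.txt.
extract_all() {
    for page in *.html; do
        # Skip the unexpanded pattern when no .html files exist.
        [ -e "$page" ] || continue
        # -dump prints the rendered text; -nolist omits the link list.
        lynx -dump -nolist "$page" > "${page%.html}.txt"
    done
}
```

The expansion `${page%.html}` strips the .html suffix, so AC.html produces AC.txt.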

After the contents of all the required entries have been extracted, you need to find all the words using grep, with a regular expression to guide the search. All the words found by grep should be saved into a single file, which is then used to find the most frequently used words. Note that you need to find the distinct words and count the number of times each distinct word appears in the file. You may need sort, cut and uniq in this step. Read the man pages of sort, cut and uniq to understand how this can be achieved.
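
One possible shape for this step is sketched below. It rests on several assumptions that the assignment does not state: a "word" is taken to be a run of ASCII letters, words are lowercased before counting (the sample output is all lowercase), and the intermediate file name `all_words.lst` and the function name `count_words` are hypothetical. The `-o` option of grep (print only the matched text) is available in GNU and BSD grep but is not part of POSIX.

```shell
#!/bin/sh
# count_words (hypothetical helper): print "<count> <word>" lines for the
# given text files, sorted by count in ascending order.
count_words() {
    # -o prints each match on its own line, -h suppresses file names,
    # -E enables extended regular expressions.
    grep -ohE '[A-Za-z]+' "$@" > all_words.lst

    # Lowercase (an assumption), group identical words with sort,
    # count each group with uniq -c, then order numerically by count.
    tr '[:upper:]' '[:lower:]' < all_words.lst | sort | uniq -c | sort -n
}
```

For example, `count_words *.txt > result.txt` would write the counts for all extracted pages. Note that `uniq -c` pads the counts with leading spaces, so the columns line up rather than starting at the left margin as in the sample output.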
