Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 07, 2024

Minor Assignment #2: Regular Expression Submission Instructions: 1. Submit two files: 1) your program, and 2) a screenshot in jpg format showing that your program

Minor Assignment #2: Regular Expression

Submission Instructions:

1. Submit two files: 1) your program, and 2) a screenshot in jpg format showing that your program really works.

2. Submit individual files. DO NOT SUBMIT A ZIP FILE.

Problem:

Write a batch script, which combines a few tools in Linux to finish a big-data processing task --- finding out most frequently used words on Wikipedia pages.

The execution of the script generates a list of distinct words used in the wikipedia pages and the number of occurrences of each word on these web pages. The words are sorted by the number of occurrences in ascending order. The following is a sample of output generated for 4 Wikipedia pages.

126 that 128 by 133 as 149 or 160 for 164 is 189 on 191 from 345 to 375 advertising 443 a 473 and 480 in 677 of 1080 the

Since there are a huge number of pages in Wikipedia, it is not realistic to analyze all of them in short time on one machine. In the project, you need to analyze all the pages for the Wikipedia entries with two capital letters. For example, the Wikipedia page for entry "AC" is https://en.wikipedia.org/wiki/AC . You can use the following command to download and save the page in AC.html:

wget https://en.wikipedia.org/wiki/AC -O AC.html

A HTML page has HTML tags, which should be removed before the analysis. (Open a .html file using vi and a web browser, and you will find the differences.) You can use lynx to extract the text content into a text file. For example, the following command extract the content for entry "AC" into AC.txt

lynx -dump nolist AC.html > AC.txt

After the contents for all the required entries have been extracted, you need to find all the words using grep. You need to use a regular expression to guide grep to do the search. All the words found by grep should be saved into the same file, which is then used to find the most frequently used words. Note that you need to find distinct words and count the number of times that each distinct word appears in file. You may need sort, cut and uniq in this step. Read the man pages of sort, cut and uniq to understand how this can be achieved.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image_2

Step: 3

blur-text-image_3

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Database Concepts

Database Concepts

Authors: David Kroenke, David J. Auer

3rd Edition

0131986252, 978-0131986251

More Books

Students also viewed these Databases questions

Question

★★★★★

Dewi Hartono is an assistant controller at Giant Engineering, which contracts with the Defense Department to build and maintain roads on military bases. Dewi recently determined that the company was...

Answered: 1 week ago

Question

★★★★★

Calculate sufficient ratios for both 2008 and 2007 to demonstrate the changes in profitability, liquidity, efficiency, gearing and shareholder return of Tamalan plc and comment on the most important...

Answered: 1 week ago

Question

★★★★★

3. As a sport psychologist, how would you suggest working with an athlete who suffers from self-attention in front of the home crowd?

Answered: 1 week ago

Question

★★★★★

Herman and Sons Law Offices opened on January 1, 2015. During the first year of business, the company had the following transactions: January 2: The owners invested $ 250,000 (the par value of the...

Answered: 1 week ago

Question

★★★★★

Waterway Industries produced 224000 units in 134000 direct labor hours. Production for the period was estimated at 202000 units and 111000 direct labor hours. A flexible budget would compare budgeted...

Answered: 1 week ago

Question

★★★★★

Marginal Incorporated ( MI ) has determined that its before - tax cost of debt is 6 . 0 % for the first $ 1 3 4 million in bonds it issues, and 1 0 . 0 % for any bonds issued above $ 1 3 4 million....

Answered: 1 week ago

Question

★★★★★

a. Draw a triangle similar to the one drawn opposite. (i) Measure the length of side AB and let this be the base of the triangle. (ii) Draw on your diagram a line which represents the height of the...

Answered: 1 week ago

Question

★★★★★

The following data show the education and employment status of women aged 2029 (from the 1991 General Household Survey): (a) Draw a bar chart of the numbers in work in each education category. Can...

Answered: 1 week ago

Question

★★★★★

In promiscuous species of mammals, such as deer mouse, the breeding success of males can be predicted from their testes size and the number of sperm they produce. Fisher et al. (2018) investigated...

Answered: 1 week ago

Question

★★★★★

Implicit in the stop-and-wait scenarios of Figure 2.14 is the notion that the receiver will retransmit its ACK immediately on receipt of the duplicate data frame. Suppose instead that the receiver...

Answered: 1 week ago

Question

★★★★★

The following information pertains to Bay Cabinet Company's sales on account and accounts receivable: After several collection attempts, Bay Cabinet Company wrote off \(\$ 2,800\) of accounts that...

Answered: 1 week ago

Question

★★★★★

Apply your own composing style to personalize your messages.

Answered: 1 week ago

Question

★★★★★

Format memos and e-mail properly.

Answered: 1 week ago

Question

★★★★★

Discuss the characteristics of appropriate stationery for letters,memos, and envelopes.

Answered: 1 week ago

Previous Question Next Question