Question
Minor Assignment #2: Regular Expression
Submission Instructions:
1. Submit two files: 1) your program, and 2) a screenshot in jpg format showing that your program really works.
2. Submit individual files. DO NOT SUBMIT A ZIP FILE.
Problem:
Write a batch script that combines a few Linux tools to complete a big-data processing task: finding the most frequently used words on Wikipedia pages.
Running the script generates a list of the distinct words used in the Wikipedia pages, together with the number of occurrences of each word on those pages. The words are sorted by number of occurrences in ascending order. The following is a sample of the output generated for 4 Wikipedia pages.
126 that
128 by
133 as
149 or
160 for
164 is
189 on
191 from
345 to
375 advertising
443 a
473 and
480 in
677 of
1080 the
Since Wikipedia has a huge number of pages, it is not realistic to analyze all of them in a short time on one machine. In this project, you need to analyze all the pages for the Wikipedia entries whose names consist of two capital letters. For example, the Wikipedia page for entry "AC" is https://en.wikipedia.org/wiki/AC . You can use the following command to download the page and save it as AC.html:
wget https://en.wikipedia.org/wiki/AC -O AC.html
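The single download above can be extended to every two-capital-letter entry with a pair of nested loops. The following is a minimal sketch, not the required solution: the function names entries and download_all and the DRY_RUN switch are hypothetical, and it assumes that an entry without a Wikipedia page should simply be skipped.

```shell
#!/bin/sh
# The 26 capital letters, spelled out so the loops work in any POSIX shell.
letters="A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"

# Print every two-capital-letter entry name: AA, AB, ..., ZZ.
entries() {
  for a in $letters; do
    for b in $letters; do
      echo "$a$b"
    done
  done
}

# Download one page per entry into <entry>.html.
# With DRY_RUN=1 (a hypothetical switch) the commands are only printed.
download_all() {
  for e in $(entries); do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "wget https://en.wikipedia.org/wiki/$e -O $e.html"
    else
      # Assumption: if the download fails, drop the empty output file.
      wget -q "https://en.wikipedia.org/wiki/$e" -O "$e.html" || rm -f "$e.html"
    fi
  done
}
```

Calling download_all fetches the pages; running DRY_RUN=1 download_all first shows the commands without touching the network.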
An HTML page contains HTML tags, which should be removed before the analysis. (Open a .html file in vi and in a web browser, and you will see the difference.) You can use lynx to extract the text content into a text file. For example, the following command extracts the content for entry "AC" into AC.txt:
lynx -dump -nolist AC.html > AC.txt
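The same conversion can be applied to every downloaded page in a loop. This is a minimal sketch using lynx's -dump and -nolist options; the function name html_to_text is hypothetical, and it assumes the .html files are in the current directory.

```shell
#!/bin/sh
# Convert every .html file in the current directory to a .txt file
# with the same base name (AC.html becomes AC.txt).
html_to_text() {
  for page in *.html; do
    # If no .html file exists, the pattern stays literal; skip it.
    [ -e "$page" ] || continue
    lynx -dump -nolist "$page" > "${page%.html}.txt"
  done
}
```

Calling html_to_text after the downloads have finished leaves one text file per entry, ready for grep.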
After the contents of all the required entries have been extracted, you need to find all the words using grep. Use a regular expression to tell grep what to search for. All the words found by grep should be saved into a single file, which is then used to find the most frequently used words. Note that you need to find the distinct words and count the number of times each distinct word appears in the file. You may need sort, cut and uniq in this step. Read the man pages of sort, cut and uniq to understand how this can be achieved.
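One possible shape for this last step is sketched below. It is only an illustration under stated assumptions: a "word" is taken to be a run of ASCII letters, matching is case-sensitive (so "The" and "the" are counted separately), and the function names extract_words and count_words are hypothetical. The -o and -h options of grep print only the matched text, one match per line, without file names.

```shell
#!/bin/sh
# Save every word found in the given text files into one output file.
# Usage: extract_words OUTPUT_FILE INPUT_FILE...
extract_words() {
  out=$1
  shift
  # -o: print only the matches, -h: no file names, -E: extended regex.
  grep -ohE '[A-Za-z]+' "$@" > "$out"
}

# Count each distinct word in the word file.
# sort groups equal words, uniq -c counts each group,
# and sort -n orders the result by count in ascending order.
count_words() {
  sort "$1" | uniq -c | sort -n
}
```

For example, extract_words all_words.out AC.txt AD.txt followed by count_words all_words.out prints one "count word" line per distinct word, with the most frequent words at the end.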