Question

Posted on Sep 22, 2024

Python Project Files: https://drive.google.com/drive/folders/1n_B7qjez_fGbf6xOq9841v1NdbxyNqVN?usp=sharing

Your task is to implement a simplified general purpose aggregator. An aggregator is a program that systematically compiles a specific type of information from multiple online sources.

The aggregator will take a filename and a topic as command line arguments. The file will contain a possibly very long list of online sources (urls). The topic will be a string such as flu, football, etc...

The aggregator will issue an error message if the command line arguments provided are too few or too many.
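The argument check could be sketched as follows. The messages are the ones shown in test cases 5 through 7; the names `USAGE`, `parse_args`, and `main` are illustrative assumptions, not part of the provided template aggregator.py.

```python
import sys

# Messages copied from the expected output of test cases 5 through 7.
# USAGE, parse_args and main are hypothetical names chosen for this sketch.
USAGE = "Error: invalid number of arguments\nUsage: aggregator.py filename topic"


def parse_args(argv):
    """Return (filename, topic), or None when the argument count is wrong."""
    # argv[0] is the script name, so exactly two user arguments means length 3.
    if len(argv) != 3:
        return None
    return argv[1], argv[2]


def main(argv):
    """Validate the arguments and return an exit status instead of raising."""
    args = parse_args(argv)
    if args is None:
        print(USAGE)
        return 1  # non-zero status, but no exception and no traceback
    filename, topic = args
    # ... the rest of the aggregator would use filename and topic here ...
    return 0
```

A script would typically end with `sys.exit(main(sys.argv))` under an `if __name__ == "__main__":` guard, so the program terminates cleanly in all three failing cases.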

Our aggregator will open and read the urls contained in the file, and it will report back on the subset of urls that contain a reference to the specified topic. The program will put the output in a text file. That output will contain both the urls and the text containing the reference.

The output file will be created in the working directory. Its name will consist of the topic followed by summary.txt. So when the topic is flu, the output filename will be flusummary.txt, when the topic is football, the output filename will be footballsummary.txt.
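A minimal sketch of the output naming rule. The function names and the layout written inside the file (url, then each captured text, then a blank line) are assumptions; the expected layout is defined by the sample summary files, which are not reproduced here.

```python
def summary_filename(topic):
    """Build the output name: the topic followed by summary.txt."""
    return topic + "summary.txt"


def write_summary(topic, entries):
    """Write (url, references) pairs to the summary file in the working directory.

    The line layout below is an assumption made for illustration only.
    """
    # Mode "w" creates the file even when entries is empty, which gives the
    # empty footballsummary.txt described in the football test case.
    with open(summary_filename(topic), "w", encoding="utf-8") as out:
        for url, references in entries:
            out.write(url + "\n")
            for text in references:
                out.write(text + "\n")
            out.write("\n")
```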

Since our program is reading html documents and we are interested in actual text (not in matches found inside html tags), the text containing the reference will be delimited by the innermost angle brackets: >text containing the topic to capture<. The angle brackets should of course not be included in the captured text.
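One way to express this rule as a regular expression is sketched below. The function name `find_references` is hypothetical, and this pattern is one possible reading of the requirement, not necessarily the one the course expects.

```python
import re


def find_references(html, topic):
    """Return the text runs between '>' and '<' that mention topic as a whole word."""
    # [^<>]* cannot cross an angle bracket, so the captured group is always
    # delimited by the innermost pair and never reaches into a tag.
    pattern = re.compile(
        r">([^<>]*\b" + re.escape(topic) + r"\b[^<>]*)<",
        re.IGNORECASE,
    )
    return pattern.findall(html)


page = '<a href="/art/index.html">Visit the Art department</a><p>No match here</p>'
print(find_references(page, "art"))  # ['Visit the Art department']
```

The `art` inside the `href` attribute is skipped because it sits between `<` and `>`, not between `>` and `<`.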

Since our aggregator will be reading urls on the open web, it may encounter errors: it must handle both URLError and DecodeError (UnicodeDecodeError in Python) and generate the appropriate message for each.
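A sketch of the fetch step with both handlers, assuming the pages are decoded as UTF-8 (which matches the 'utf-8' codec message in the expected output). The function name `fetch_text` and its `opener` parameter are assumptions added so the function can be exercised without network access.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_text(url, opener=urlopen):
    """Return the decoded page text, or None after printing an error message.

    opener is a hypothetical hook for testing; by default it is urlopen.
    """
    try:
        with opener(url) as response:
            text = response.read().decode("utf-8")
    except URLError:
        print("Error opening url:", url)
    except UnicodeDecodeError as err:
        # str(err) supplies the "'utf-8' codec can't decode byte ..." detail.
        print("Error decoding url:", url, err)
    else:
        # Runs only when the try block finished without an exception.
        return text
    return None
```

Keeping the success path in the `else` clause means an unrelated error raised while processing the text is not mistaken for a fetch or decode failure.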

Start with the template file aggregator.py.

You must be able to invoke your aggregator from the terminal by typing:

python aggregator.py some_filename some_topic

Testing:

Make sure you test your module before submitting it!

While your aggregator must work with any file containing source urls, I have archived some urls and included them in the file sources.txt to make it easier for you to verify the output. In addition to some archived urls (from California college websites), I have included two invalid urls so that you can test error handling in your program. I have also included the expected output for the test cases below. Make sure your output files match the expected output and make sure the error messages shown below are printed to the terminal.

Test case 1:

python aggregator.py sources.txt art

The following error messages should be generated:

Error opening url: http://invalidurlurlcs21a.com/

Error decoding url: http://www.deanza.edu/counseling/documents/Substitution%20Petition.pdf 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte

The output file (artsummary.txt) should match the file artsummary.txt.

Make sure you pick up references to Art and art and make sure you do NOT pick up the reference to arts.

Make sure you pick up the reference to Art when it is followed by punctuation as in: Recent acquisitions by the Art, Design...
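The whole-word requirement can be illustrated with the `\b` word-boundary anchor and a case-insensitive match. This is a small sketch on plain strings; the helper name `mentions_art` is hypothetical.

```python
import re

# \b matches between a word character and a non-word character (or the string
# edge), so "Art," qualifies while "arts" and "department" do not.
ART = re.compile(r"\bart\b", re.IGNORECASE)


def mentions_art(text):
    """Return True when text contains art as a standalone word, in any case."""
    return ART.search(text) is not None


print(mentions_art("Recent acquisitions by the Art, Design"))  # True
print(mentions_art("School of arts and letters"))              # False
```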

Test case 2:

python aggregator.py sources.txt climate

The following error messages should be generated:

Error opening url: http://invalidurlurlcs21a.com/

Error decoding url: http://www.deanza.edu/counseling/documents/Substitution%20Petition.pdf 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte

The output file (climatesummary.txt) should match the file climatesummary.txt.

Make sure you pick up the reference to Climate when it immediately follows the angle bracket as in: >Climate change will be an economic disaster for rich and poor, new study says

Make sure you do not pick up references to the topic inside html tags as in:

Test case 3:

python aggregator.py sources.txt security

The following error messages should be generated:

Error opening url: http://invalidurlurlcs21a.com/

Error decoding url: http://www.deanza.edu/counseling/documents/Substitution%20Petition.pdf 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte

The output file (securitysummary.txt) should match the file securitysummary.txt.

Make sure you pick up the reference to security when it immediately precedes the angle bracket as in: At Stanford, Susan Rice talks about climate change and national security<

Make sure you pick up the reference to Security when the angle bracket is on a different line as in: >Awareness and diligence are keys to cyber security in the digital age, say UC Santa Barbara researchers
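The edge cases named in test cases 2 and 3 can be checked against short hand-written snippets, as sketched below. The snippets are made up for illustration (they are not the archived pages), the helper name `capture` is hypothetical, and stripping surrounding whitespace is an assumption; whether the expected output keeps it depends on the sample summary files.

```python
import re


def capture(html, topic):
    """Return stripped text runs between '>' and '<' that contain topic."""
    pattern = r">([^<>]*\b" + re.escape(topic) + r"\b[^<>]*)<"
    # A negated class such as [^<>] also matches newlines, so no DOTALL flag
    # is needed for text whose closing bracket is on a later line.
    return [text.strip() for text in re.findall(pattern, html, re.IGNORECASE)]


after_bracket = "<h3>Climate change will be an economic disaster</h3>"
before_bracket = "<a>Susan Rice talks about climate change and national security</a>"
split_lines = "<h2>\n    Awareness and diligence are keys to cyber security\n</h2>"
inside_tag = '<img alt="climate map">'

print(capture(after_bracket, "climate"))
print(capture(before_bracket, "security"))
print(capture(split_lines, "security"))
print(capture(inside_tag, "climate"))  # [] because the topic is inside a tag
```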

Test case 4:

python aggregator.py sources.txt football

The following error messages should be generated:

Error opening url: http://invalidurlurlcs21a.com/

Error decoding url: http://www.deanza.edu/counseling/documents/Substitution%20Petition.pdf 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte

The output file footballsummary.txt should be empty since none of the source urls include any reference to football.

Test case 5:

The program issues an error message and terminates without an exception if no command line argument is provided.

python aggregator.py

Error: invalid number of arguments

Usage: aggregator.py filename topic

Test case 6:

The program issues an error message if only one command line argument is provided.

python aggregator.py sources.txt

Error: invalid number of arguments

Usage: aggregator.py filename topic

Test case 7:

The program issues an error message if the command line arguments provided are too many.

python aggregator.py sources.txt flu climate

Error: invalid number of arguments

Usage: aggregator.py filename topic

Important:

For test cases 1 through 3, it is OK to get a slightly different error message for the error opening the url. The following messages are also valid:

Error opening url: http://invalidurlurlcs21a.com/

or:

Error opening url: http://invalidurlurlcs21a.com/

Hints:

Each line in sources.txt contains one source url. There is no need to use a regular expression to extract the urls. If you read sources.txt one line at a time, you get one url at a time. You can then fetch and process one url at a time.

Since the source file may contain a very long list of urls, it is best to read and process one url at a time, instead of storing the urls and then processing them.
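Reading one url at a time can be sketched with a generator, which never holds more than one line in memory. The name `iter_urls` is hypothetical, and skipping blank lines is an assumption about how sources.txt may be formatted.

```python
def iter_urls(filename):
    """Yield one url per line of the source file, lazily."""
    with open(filename, encoding="utf-8") as source:
        # Iterating over the file object reads a single line at a time.
        for line in source:
            url = line.strip()  # drop the trailing newline
            if url:             # assumption: ignore blank lines
                yield url
```

A caller would then loop with `for url in iter_urls(filename):` and fetch and process each url before the next line is read.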

When handling exceptions, it is best to use the else clause for code that must be executed if the try clause succeeds.

Section 20.2 contains an example on how to separate html content from html tags.

The example in section 19.13 goes through the process of writing a regular expression and a function that solve a problem very similar to the one we have in this assignment. Just keep in mind that parentheses in that example are escaped - '\(' is one character. Since in this assignment we are looking for referenced text outside angle brackets (and not inside parentheses), an escaped opening parenthesis - \( - would correspond to a closing angle bracket >. An escaped closing parenthesis - \) - would correspond to an opening angle bracket <. Unescaped parentheses in that example are used for grouping and should not be converted to angle brackets.