Question

This assignment is to be done individually. You cannot use code written by your classmates. Use code found over the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. Concealing the origin of a piece of code is plagiarism. Use EdDiscussion for general questions whose answers can benefit you and everyone.
General Specifications
You must write this assignment in Python (preferably 3.6+, as these versions are installed on the openlab.ics.uci.edu machines; make sure your program works on openlab, as that environment might be used to test your code).
Make sure to break down your program into classes/methods/functions corresponding to the parts in this specification. They will be tested separately.
The function signatures in this specification are informal; their purpose is to explain the inputs and outputs of the methods.
When you are developing, use git locally to keep track of your development process (if you don't know how to use git, see the end of this document). You must submit your local git repository together with the assignment; otherwise, you might not get a grade.
Very important: At certain points, the assignment may be underspecified - this is by design. In those cases, make your own choices and assumptions and be prepared to defend them in case you are questioned about them (but keep in mind that you cannot go against what is specified here; i.e., you can do more than what was specified, but you cannot do less).
Part A: Word Frequencies (40 points)
Method/Function: List tokenize(TextFilePath)
Write a method/function that reads in a text file and returns a list of the tokens in that file. For the purposes of this project, a token is a sequence of alphanumeric characters, independent of capitalization (so Apple, apple, aPpLe are the same token). You are allowed to use regular expressions if you wish to (and you can use some regexp engine, no need to write it from scratch), but you are not allowed to import a tokenizer (e.g. from NLTK), since you are being asked to write a tokenizer.
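One possible shape for such a tokenizer is sketched below. It is an illustration under stated assumptions, not a reference solution: it treats only ASCII letters and digits as alphanumeric (the specification leaves this open), decodes the file as UTF-8 and replaces undecodable bytes instead of failing, and reads line by line so that only the token list, not the raw text, is held in memory.

```python
import re

# Assumption: a token is a run of ASCII letters or digits. Any other
# character, including non-English letters, acts as a separator.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text_file_path):
    # Runs in O(n) time, where n is the number of characters in the file:
    # each character is scanned once by the regular expression.
    tokens = []
    # errors="replace" turns undecodable bytes into U+FFFD, which the
    # pattern never matches, so bad bytes simply split tokens.
    with open(text_file_path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            for match in _TOKEN_RE.finditer(line):
                # Lowercase after matching so Apple, apple and aPpLe
                # all become the same token.
                tokens.append(match.group().lower())
    return tokens
```

Lowercasing after matching, instead of lowercasing the whole line first, avoids non-ASCII characters whose lowercase form happens to contain an ASCII letter.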
Method/Function: Map computeWordFrequencies(List)
Write another method/function that counts the number of occurrences of each token in the token list. Remember that you should write this assignment yourself from scratch, so you are not allowed to import a counter when the assignment asks you to write that method.
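A minimal sketch of the counting step, using only a plain dictionary so that no ready-made counter is imported. The snake_case name compute_word_frequencies is a choice made here for illustration; the specification's signature is informal.

```python
def compute_word_frequencies(tokens):
    # Runs in O(n) time for n tokens: one pass, with an average
    # constant-time dictionary lookup and update per token.
    frequencies = {}
    for token in tokens:
        frequencies[token] = frequencies.get(token, 0) + 1
    return frequencies
```

For example, compute_word_frequencies(["a", "b", "a"]) returns {"a": 2, "b": 1}.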
Method/Function: void print(Frequencies)
Finally, write a method/function that prints the word frequency counts to the screen. The printout should be ordered by decreasing frequency (highest-frequency words first), with ties broken alphabetically.
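The ordering rule can be expressed as a single sort key, as in the sketch below. Two assumptions are made here: the function is named print_frequencies so that it does not shadow Python's built-in print, and a tab is used as the separator between token and count.

```python
def print_frequencies(frequencies):
    # Runs in O(k log k) time for k distinct tokens, dominated by the sort.
    # Sorting on (-count, token) gives decreasing frequency first and
    # alphabetical order among tokens with equal counts.
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    for token, count in ordered:
        print(f"{token}\t{count}")
```

Negating the count keeps the whole sort ascending, which is what makes the alphabetical tie-break fall out of the same key.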
The TA will use their own test text files. For this part, it is expected that your program will read this text file, tokenize it, count the tokens, and print out the token (word) frequencies. Your program must run from the command line: write a program that takes one text file as an argument and outputs the token frequencies.
Please use one of the following separators between each token and its frequency when you print the result:
\t
-
=
>
->
=>
Part B: Intersection of two files (60 points)
Write a program that takes two text files from the command line as arguments and outputs the number of tokens they have in common. Here is an example of input/output:
[Image: Example Input.png]
You can reuse the code you wrote for part A (remember that you can import files, thus avoiding code duplication!).
The TA will use their own text files. Note that some of the text files may be VERY LARGE, so make sure your program does not depend on reading entire files into RAM.
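One way to keep memory use independent of file size is to stream tokens with a generator and store only the distinct tokens. The sketch below makes several assumptions: "tokens in common" is read as the number of distinct tokens that appear in both files, tokens are runs of ASCII letters and digits, and the helper names iter_tokens and count_common_tokens are hypothetical. It also assumes the input contains line breaks, since a file consisting of one enormous line would still be read in a single piece.

```python
import re

# Assumption: same token definition as in Part A (ASCII letters/digits).
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def iter_tokens(path):
    # Yields lowercase tokens one at a time; only the current line is
    # held in memory, never the whole file.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            for match in _TOKEN_RE.finditer(line):
                yield match.group().lower()


def count_common_tokens(path_a, path_b):
    # Runs in O(n + m) time for files of n and m characters. Memory grows
    # with the number of distinct tokens, not with the file sizes.
    seen_in_a = set(iter_tokens(path_a))
    common = set()
    for token in iter_tokens(path_b):
        if token in seen_in_a:
            common.add(token)
    return len(common)
```

Building the set from the first file and probing it while streaming the second avoids materializing a second full set before intersecting.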
For this part, programs that perform better will be given more credit than those that perform poorly.
Common Tasks and Important Notes
For both part A and part B, please add a brief runtime complexity explanation for your code as a comment on top of each method or function (does it run in linear time relative to the size of the input? Polynomial time? Exponential time?). This explanation, the comments written throughout your code, and your code's actual conformance with this explanation will be the basis for evaluating the performance of your program.
You should get the file names from command line arguments. Do not hard-code the input file names in your code or read them from system standard input (stdin). As the assignment will be graded using an automatic grader, not doing this may result in losing all credit for the assignment (this means getting a zero for the entire assignment).
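A minimal sketch of reading the file name from the command line follows. The script name PartA.py in the usage message, the helper parse_input_path, and the exit codes are all assumptions made for illustration, and the line count stands in for the real tokenizing and counting work.

```python
import sys


def parse_input_path(argv):
    # Expects exactly one argument after the script name.
    if len(argv) != 2:
        # "PartA.py" is a hypothetical script name used for illustration.
        raise ValueError("usage: python PartA.py <text_file>")
    return argv[1]


def main(argv):
    # Returns a process exit code: 0 on success, 2 for bad usage,
    # 1 when the file cannot be opened or read.
    try:
        path = parse_input_path(argv)
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            # Stand-in for the real work (tokenize, count, print).
            line_count = sum(1 for _ in f)
    except ValueError as error:
        print(error, file=sys.stderr)
        return 2
    except OSError as error:
        print(f"cannot read input file: {error}", file=sys.stderr)
        return 1
    return 0
```

In a script, this would typically be invoked as sys.exit(main(sys.argv)) under the usual check that __name__ equals "__main__", so the file name never comes from stdin or a hard-coded string.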
Exception handling is required for bad inputs. An example of bad input would be a character in a non-English language. Your code should be able to tokenize the whole input file, even though there may
