Question
This assignment is to be done individually. You cannot use code written by your classmates. Use code found over the Internet at your own peril; it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. Concealing the origin of a piece of code is plagiarism. Use EdDiscussion for general questions whose answers can benefit you and everyone.
General Specifications
You must write this assignment in Python, preferably in a version installed on the openlab.ics.uci.edu machines. Make sure your program works on openlab, as that environment might be used to test your code.
Make sure to break down your program into classes/methods/functions corresponding to the parts in this specification. They will be tested separately.
The function signatures in this specification are informal; their purpose is to explain the inputs and outputs of the methods.
When you are developing, use git locally to keep track of your development process (if you don't know how to use git, see the end of this document). You must submit your local git repository together with the assignment; otherwise, you might not get a grade.
Very important: At certain points, the assignment may be underspecified (this is by design). In those cases, make your own choices and assumptions, and be prepared to defend them in case you are questioned about them. Keep in mind, however, that you cannot go against what is specified here; i.e., you can do more than what was specified, but you cannot do less.
Part A: Word Frequencies points
Method/Function: List tokenize(TextFilePath)
Write a method/function that reads in a text file and returns a list of the tokens in that file. For the purposes of this project, a token is a sequence of alphanumeric characters, independent of capitalization (so Apple, apple, aPpLe are the same token). You are allowed to use regular expressions if you wish (and you can use an existing regexp engine; there is no need to write one from scratch), but you are not allowed to import a tokenizer (e.g., from NLTK), since you are being asked to write a tokenizer.
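A minimal sketch of such a tokenizer follows. It assumes that "alphanumeric" means ASCII letters and digits only, that the input is decoded as UTF-8, and that undecodable bytes should act as separators rather than raise; none of these choices is stated in the specification. The name `tokenize` and the constant `TOKEN_PATTERN` are illustrative.

```python
import re

# Assumption: a token is a maximal run of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text_file_path):
    """Return the lowercase alphanumeric tokens of a text file, in order.

    Runtime: O(n) in the number of characters in the file, since each
    character is scanned once by the regular expression.
    """
    tokens = []
    # errors="replace" turns undecodable bytes into U+FFFD, which the
    # pattern does not match, so bad input splits tokens instead of crashing.
    with open(text_file_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Match first, then lowercase, so only ASCII text is case-folded.
            tokens.extend(match.lower() for match in TOKEN_PATTERN.findall(line))
    return tokens
```

Under these assumptions, an apostrophe or an accented letter ends the current token, so `it's` yields `it` and `s`. Whether that is the desired behavior is one of the choices the specification leaves open.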
Method/Function: Map computeWordFrequencies(List)
Write another method/function that counts the number of occurrences of each token in the token list. Remember that you should write this assignment yourself from scratch, so you are not allowed to import a counter when the assignment asks you to write that method.
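One possible way to count occurrences without importing a counter is a plain dictionary. The sketch below is illustrative only; the name `compute_word_frequencies` is a Python-style rendering of the informal signature, not something the specification prescribes.

```python
def compute_word_frequencies(tokens):
    """Map each token to the number of times it appears in `tokens`.

    Runtime: O(n) expected for n tokens, since each dictionary lookup
    and update takes O(1) time on average.
    """
    frequencies = {}
    for token in tokens:
        # dict.get supplies 0 for tokens that have not been seen yet.
        frequencies[token] = frequencies.get(token, 0) + 1
    return frequencies
```

For example, `compute_word_frequencies(["a", "b", "a"])` returns `{"a": 2, "b": 1}`.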
Method/Function: void print(Frequencies)
Finally, write a method/function that prints out the word frequency count onto the screen. The printout should be ordered by decreasing frequency (so the highest-frequency words come first); if necessary, order ties alphabetically.
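The required ordering can be sketched with a single sort on a composite key. The function name `print_frequencies` and the tab-separated output are assumptions made for illustration; the accepted output formats are not recoverable from this text.

```python
def print_frequencies(frequencies):
    """Print tokens by decreasing count, breaking ties alphabetically.

    Runtime: O(k log k) for k distinct tokens, dominated by the sort.
    """
    # Negating the count sorts it in descending order, while the token
    # itself is compared in ascending (alphabetical) order on ties.
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    for token, count in ordered:
        # Assumption: one "token<TAB>count" pair per line.
        print(f"{token}\t{count}")
```

Calling `print_frequencies({"b": 2, "a": 2, "c": 5})` prints `c` first, then `a`, then `b`, because `a` and `b` are tied and fall back to alphabetical order.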
The TA will use their own test text files. For this part, it is expected that your program will read the text file, tokenize it, count the tokens, and print out the token (word) frequencies. Your program must run from the command line: write a program that takes one text file as an argument and outputs the token frequencies.
Please, use one of the following output format examples when you print out the result:
t
Part B: Intersection of two files points
Write a program that takes two text files from the command line as arguments and outputs the number of tokens they have in common. Here is an example of input/output:
Example Input.png
You can reuse the code you wrote for Part A (remember that you can import files, thus avoiding code duplication).
The TA will use their own text files. Note that some of the text files may be VERY LARGE, so make sure your program does not depend on reading entire files into RAM.
For this part, programs that perform better will be given more credit than those that perform poorly.
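A rough sketch of one memory-conscious approach is shown below. It assumes that "tokens in common" means distinct tokens appearing in both files, that tokens are ASCII alphanumeric runs, and that individual lines are short enough to hold in memory; these are interpretations, not requirements stated above. The helper names `iter_tokens` and `count_common_tokens` are hypothetical, and in a real submission the tokenizing logic would be imported from Part A rather than repeated.

```python
import re

# Assumption: same token definition as Part A (ASCII letters and digits).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def iter_tokens(path):
    """Yield lowercase tokens one at a time, reading the file line by line.

    Runtime: O(n) in the file size; only one line is held in memory.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            for match in TOKEN_PATTERN.finditer(line):
                yield match.group().lower()


def count_common_tokens(path_a, path_b):
    """Count the distinct tokens that occur in both files.

    Runtime: O(n + m) expected for files of size n and m. Memory grows
    with the number of distinct tokens, not with the file sizes.
    """
    seen_in_a = set(iter_tokens(path_a))
    common = set()
    for token in iter_tokens(path_b):
        if token in seen_in_a:
            common.add(token)
    return len(common)
```

The design choice here is to keep only the vocabulary of the first file in a set and stream the second file past it, so neither file is ever loaded whole.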
Common Tasks and Important Notes
For both Part A and Part B, please add a brief runtime complexity explanation for your code as a comment on top of each method or function (does it run in linear time relative to the size of the input? Polynomial time? Exponential time?). This explanation, the comments written throughout your code, and your code's actual conformance with this explanation will be the basis for evaluating the performance of your program.
You should get the file names from command line arguments. Do not hardcode the input file names in your code or read them from system standard input (stdin). As the assignment will be graded using an automatic grader, not doing this may result in losing the whole credit for the assignment (this means getting a zero on the entire assignment).
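The argument handling described above might look like the following sketch. The function name `main`, the exit codes, and the usage message are all assumptions, and the line count merely stands in for the real Part A or Part B work.

```python
import sys


def main(argv):
    """Validate the command line, open the file, and return an exit status.

    Runtime: O(n) in the file size for the placeholder line count.
    """
    if len(argv) != 2:
        program = argv[0] if argv else "program"
        print(f"usage: {program} <text-file>", file=sys.stderr)
        return 2
    try:
        # Placeholder work: count lines. A real program would tokenize here.
        with open(argv[1], "r", encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
    except OSError as error:
        # Covers missing files, directories, and permission problems.
        print(f"error: cannot read {argv[1]}: {error}", file=sys.stderr)
        return 1
    print(line_count)
    return 0


# In a script, the entry point would typically be:
#     if __name__ == "__main__":
#         sys.exit(main(sys.argv))
```

Taking `argv` as a parameter rather than reading `sys.argv` directly keeps the function easy to call from tests with different argument lists.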
Exception handling is required for bad inputs. An example of bad input would be a character in a non-English language. Your code should be able to tokenize the whole input file, even though there may