Question

Task

For this project, you will be working with datasets, some small, some large. A dataset consists of records that contain fields. For example, a simple dataset of student data may look like this.

Last Name   First Name   Year       Credits Earned   GPA    Transfer Student
Baldwin     Alec         junior     70               3.21   no
McFerrin    Bobby        senior     115              2.85   no
Perry       Katy         junior     65               2.76   yes
Glover      Savion       freshman   12               3.50   no

Each row in the table is a record, and each column (e.g. Year) is a field.

You will be creating a program that reads, displays, cleans, and saves .csv (comma-separated value) dataset files. As we discussed in class, cleaning (or cleansing) data is the process of modifying data so that it is in a form that is valid for the purpose for which it will be used (e.g. statistical analysis, plotting). Some forms of data cleaning are handling empty or partially-empty records, records with numeric fields that are out of range, and string fields that are misspelled. Notice the word "handling" rather than "deleting". Depending on the data's purpose, one might deal with rather than delete a malformed record. For example, if a particular field will not be used for one's goal, the record would not necessarily have to be deleted from the dataset.

Additional Specifications

As you know, you will need to modularize your code into functions. The number of functions is up to you. Remember that a function should perform only one task. For example, a function whose purpose is to perform a computation should do just that and then return the result. It should not also display the result of the computation.

You are required to include the following function in your program.

def displayMenu(choices)
- Prints out any of the program menus. Here is an example menu that this function might generate:
    1) Delete totally empty records
    2) Delete partially empty records
    3) Delete duplicate records
- Takes in a list, which contains the choices for the menu options
- Returns nothing, as it is a print function

Input Validation

You are required to validate input from the user. You can assume that the user will enter the right type of input (e.g. an integer), but not that they will enter a correct value. For example, a user will always give an integer when you expect one, but it may be out of the allowable range. You will need to validate the following items:
- Menu option numbers
- The values of Y, y, N, and n for yes/no questions

You do NOT need to validate that the user has entered a q for the program quit value.

Disease and state names should be case insensitive. For example, if the user wants to see all records for the state of Maryland, they may enter maryland, MARYLAND, MaRyLaNd, etc. See the sample output for examples of how this should work in your program.

Details

For this project, you will be designing and implementing a program to display and/or clean a user-specified dataset. The program will read in a dataset from a .csv file, and then give the user the options of displaying, cleaning, or saving the dataset. The program will operate on only one dataset at a time.

The particular data used in this project deals with the occurrence of common diseases in the United States by state and year between the years 1928 and 2011, inclusive. Dataset records contain the following fields in the order given.
- Disease (a string)
- State by full name (a string)
- The number of occurrences of the disease (an integer >= 0)
- The state's population (an integer >= 0)
- The year for which the data was collected (an integer >= 0)

The possible diseases are: measles, polio, smallpox, pertussis, rubella, mumps, and hepatitis A.

Here is a sample of a dataset with five records.
POLIO,IOWA,442,2617000,1951
SMALLPOX,NEW YORK,422,128480000,1931
POLIO,MAINE,20,806000,1943
MEASLES,ARKANSAS,8899,1847000,1928
POLIO,IOWA,1282,2625000,1950

Notice that the data is not ordered by any particular field (e.g. year). Also notice that the disease and state names are stored in all capital letters.

Your program must also handle malformed datasets. There will be three possible cases that must be handled.

1) All of the fields of a record are empty.
   - The record will be a succession of four commas: ,,,,
2) One or more fields of a record are empty, but not all of the fields. Examples:
   - POLIO,,442,2617000,1951
   - MEASLES,,8899,1847000,
   - ,,1282,,
3) There are duplicate records. A duplicate record is one that is identical to one or more other records in disease, state, and year. For example, the following three records are duplicates.
   POLIO,IOWA,300,2617000,1951
   POLIO,IOWA,442,2617000,1951
   POLIO,IOWA,,2617000,1951
   Your program should handle duplicate records by keeping the first one encountered and deleting the rest. So, for the above three records, only the first (POLIO,IOWA,300,2617000,1951) would remain after removing duplicates.

You are provided with two functions that Dr. Mitchell has written:
- make2DList(), which reads a .csv file and returns a 2-dimensional list of the file data
- saveDataToFile(), which saves a 2-dimensional list of data to a .csv file

These functions can be found in Dr. Mitchell's public directory in a single file called proj2.py. Do NOT change them; they work fine! You can download the file to your proj2 directory using the following command. Don't forget the dot (.) at the end of the command! (It specifies to Linux to copy the file to your current directory.)

cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/proj2.py .

Hints and Advice

1) This would be a very good time to use incremental programming! Incremental development is when you are only working on a small piece of the code at a time, and testing that the piece of code works before moving on to the next piece. This makes it a lot easier to fix any mistakes.
2) Make the displaying of each of your menus into their own functions.
3) It would be a good idea to implement the following two functions.
   - recordToString() converts a dataset record to a string for easy display of the record's information
   - getValidInput(prompt, minimum, maximum) validates that a number is between minimum and maximum values, inclusive, or that it is q to quit the program. (You did this for Project 1.)
4) Think your constants through carefully. Most of them will be related to your menus, but there should be others.

Test Data

You are provided with test data files that exercise various aspects of your program. These files can be downloaded from Dr. Mitchell's directory, along with a file (proj2.py) containing the functions that read them in and write them out (save them to a user-specified file). Here are the Linux commands to copy the files from Dr. Mitchell's directory to your proj2 directory. Don't forget the dot (.) at the end of each command! (It specifies to Linux to copy the file to your current directory.)

cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/proj2.py .
cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/all.csv .
cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/short.csv .
cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/empty.csv .
cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/partially_empty.csv .
cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/dups.csv .
cp /afs/umbc.edu/users/s/m/smitchel/pub/cs201/empty_part_dups.csv .

Note that the file all.csv contains the full dataset of 14,265 records. There are no empty, partially-empty, or duplicate records in this dataset. However, these menu options should still work, always deleting zero records. (Note: your program will probably take 1 to 2 minutes when attempting to find duplicate records; there is a lot of data!) The file has been given to you for two reasons: 1) your program should work with a very large dataset such as this, and 2) you might find the data interesting. For example, take a look at how the numbers for polio drastically decreased in the 1960s due to the availability of a vaccine.

You may also want to create your own test data files, as your program will be tested with other data files in addition to the ones that you download. You can create a test data file simply by using the emacs editor, as you would a Python program (.py) file.

CMSC 201 Computer Science I for Majors
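
The three malformed-data cases described in the task (totally empty records, partially empty records, and duplicates) could be handled along the following lines. This is only a minimal sketch, not the course's reference solution. It assumes that make2DList() returns a list of records where each record is a list of five strings and an empty field is the empty string ""; the question does not show that function, so this is a guess. All function and constant names below are hypothetical.

```python
# Positions of the fields used to detect duplicates, following the field
# order given in the assignment (disease, state, occurrences, population, year).
DISEASE = 0
STATE = 1
YEAR = 4


def isTotallyEmpty(record):
    """Return True if every field of the record is empty."""
    return all(field.strip() == "" for field in record)


def isPartiallyEmpty(record):
    """Return True if at least one field, but not every field, is empty."""
    emptyCount = sum(1 for field in record if field.strip() == "")
    return 0 < emptyCount < len(record)


def deleteTotallyEmpty(data):
    """Return a new list without the totally empty records."""
    return [record for record in data if not isTotallyEmpty(record)]


def deletePartiallyEmpty(data):
    """Return a new list without the partially empty records."""
    return [record for record in data if not isPartiallyEmpty(record)]


def deleteDuplicates(data):
    """Keep the first record seen for each (disease, state, year) key.

    Assumption: every record has at least five fields. Totally empty
    records all share the same key, so this sketch treats them as
    duplicates of one another.
    """
    seen = set()
    kept = []
    for record in data:
        key = (record[DISEASE], record[STATE], record[YEAR])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


if __name__ == "__main__":
    sample = [
        ["POLIO", "IOWA", "300", "2617000", "1951"],
        ["POLIO", "IOWA", "442", "2617000", "1951"],
        ["POLIO", "IOWA", "", "2617000", "1951"],
    ]
    print(deleteDuplicates(sample))
```

Each function returns a new list rather than printing, which matches the requirement that a function perform one task; the number of deleted records can be computed by the caller as the difference in lengths. The set lookup keeps the duplicate check fast on all.csv. If sets have not been covered in the course, comparing each record against the list of kept records gives the same result more slowly.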

1) Here is a sample of how the project runs: https://s3.us-east-1.amazonaws.com/blackboard.learn.xythos.prod/5954eb74c7df4/3910244?response-content-disposition=inline%3B%20filename%2A%3DUTF-8%27%27sampleRun_Project2_F18.pdf&response-content-type=application%2Fpdf&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20181103T230010Z&X-Amz-SignedHeaders=host&X-Amz-Expires=21600&X-Amz-Credential=AKIAIL7WQYDOOHAZJGWQ%2F20181103%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=1c9e8d4f846bfebba9735b3146ebd7f96a1b9b1433346ad8b92089cf76536f9f.

2) This is my design of the project: https://pastebin.com/raw/K5ELaCAj

3) Initial part of the main project now: https://pastebin.com/raw/CCcqB599
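
For the menu, input-validation, and display requirements in the task description, a starting point might look like the sketch below. It is not taken from the linked design or code. displayMenu(choices), getValidInput(prompt, minimum, maximum), and recordToString() are named in the assignment, but their bodies here are assumptions; getYesNo() and recordsForState() are hypothetical helpers, and the exact output format should be checked against the sample run.

```python
QUIT = "q"          # value the user types to quit the program
STATE = 1           # position of the state field within a record


def displayMenu(choices):
    """Print a numbered menu, one choice per line. Returns nothing."""
    for number, choice in enumerate(choices, start=1):
        print(str(number) + ") " + choice)


def getValidInput(prompt, minimum, maximum):
    """Ask until the user enters 'q' or an integer in [minimum, maximum].

    Returns QUIT or the chosen integer. As the assignment allows, this
    assumes any answer other than 'q' is an integer.
    """
    answer = input(prompt)
    while answer != QUIT and not (minimum <= int(answer) <= maximum):
        print("Please enter a number from", minimum, "to", maximum)
        answer = input(prompt)
    if answer == QUIT:
        return QUIT
    return int(answer)


def getYesNo(prompt):
    """Ask until the user enters Y, y, N, or n. Returns True for yes."""
    answer = input(prompt)
    while answer not in ("Y", "y", "N", "n"):
        print("Please enter Y or N")
        answer = input(prompt)
    return answer in ("Y", "y")


def recordToString(record):
    """Join a record's fields into one line for display (format assumed)."""
    return ", ".join(record)


def recordsForState(data, stateName):
    """Return the records whose state matches stateName, ignoring case.

    The dataset stores names in capital letters, so both sides are
    upper-cased before comparing.
    """
    wanted = stateName.strip().upper()
    return [record for record in data if record[STATE].upper() == wanted]


if __name__ == "__main__":
    displayMenu(["Delete totally empty records",
                 "Delete partially empty records",
                 "Delete duplicate records"])
```

The same upper-casing comparison used in recordsForState() would apply to disease names, with the disease field's position in place of STATE.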
