Question

This must be in Python!!! Read everything carefully!!! Please if you would, put the code on the paste.ee site for clarity.

Dictionaries and File I/O

Background:

The purpose of this assignment is to explore dictionaries and file reading and writing. We will be reading in some data about baby names, creating a dictionary to organize all the data, and then checking for name statistics in various ways.

What can I use?

You may only use the following things. We are quite unlikely to add to the list, but note there are some useful things that weren't available earlier in the semester. If you use something disallowed, it's not an honor code violation, you just won't get points for some or all of that function (still not fun!).

Restrictions

No modules may be imported. (This definitely includes the csv module; do the work directly!)

Allowed

all basic expressions and statements

file reading: open(), close(), read(), readline(), readlines(), with syntax

dictionaries: len, get, clear, copy, keys, values, items, pop, popitem, update

all sets and set operations; all exception handling blocks/raises.

functions: len(), range(), min(), max(), enumerate(), int(), and any functions you write

list methods: remove(), insert(), append(), extend(), pop(), index()

string methods: split(), endswith(), startswith(), join(), insert(), index()

more: sorted(), sort(), reversed(), reverse()

calling other functions of the project (and your own defined helper functions). Please do this! :)

Note: Put comments for these functions.

Example databases: http://cs.gmu.edu/~marks/112/projects/p5files/

Scenario

We've got some data (about registered baby names in various years) stored in a comma-separated-values file; only read_file needs to interact with files, and all others will use our required structure to describe the names, genders, and counts per year.

Definitions

CSV file: This is our name for a file containing ASCII text where each line in the file represents one record of information; each piece of info is surrounded by double quotes, and each of these quoted items is separated by a single comma. That's the exact formatting; nothing extra, nothing less. The very first line is the "header" row, which names the columns but is not part of the data. Here is a very small sample file that can be used in our project.

Note: a file's extension has no actual effect on its contents.

These are ASCII files, so you can/should edit them with your code editor just as easily as a .txt or .py file.

We recommend not using Excel to open the files! It saves in a slightly different format, causing trouble.

"YEAR","GENDER","NAME","COUNT"

"2009","MALE","DANIEL","3423"

"2009","MALE","ANTHONY","3106"

"2009","MALE","ANGEL","3058"

"2010","MALE","JACOB","3368"

"2010","MALE","DANIEL","3175"

"2010","MALE","ANTHONY","2882"
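The quoting rule above means one line can be taken apart with split() and slicing alone. The following is a minimal sketch, not a required design: the helper name parse_line is hypothetical, and it assumes that no field ever contains a comma or a double quote.

```python
def parse_line(line):
    """Split one CSV line such as '"2009","MALE","DANIEL","3423"' into fields."""
    # Drop the trailing newline, if any, using only slicing.
    if line.endswith('\n'):
        line = line[:-1]
    fields = []
    for piece in line.split(','):
        # Each piece is wrapped in double quotes; slice off the first and last character.
        fields.append(piece[1:-1])
    return fields

header = parse_line('"YEAR","GENDER","NAME","COUNT"\n')
year, gender, name, count = parse_line('"2009","MALE","DANIEL","3423"\n')
# Year and count are stored as integers in the database.
record = (int(year), gender, name, int(count))
```

Only endswith(), split(), slicing, and int() are used, so the sketch stays within the allowed list.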

Database: A database allows us to store multiple names from multiple years in an organized fashion. Our database is a dictionary whose keys are tuples of (name, gender), and whose values are lists of length-3 popularity tuples in the form (year, count, rank). When there are multiple years of data for a given (name, gender), the list should have multiple tuples sorted by year. An example dictionary corresponding to the CSV file above would be:

sample_db = {
    ('DANIEL', 'MALE'): [(2009, 3423, 1), (2010, 3175, 2)],
    ('ANTHONY', 'MALE'): [(2009, 3106, 2), (2010, 2882, 3)],
    ('ANGEL', 'MALE'): [(2009, 3058, 3)],
    ('JACOB', 'MALE'): [(2010, 3368, 1)]
}

This indicates that in 2009, Daniel was used as a male name 3423 times, and was the most popular male name that year in our records; also, Daniel was used as a male name 3175 times in 2010, and was the second-most popular male name that year in our records (second to Jacob). Similarly so for the rest of the entries.
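To make the structure concrete, here is a small illustration of how such a dictionary can be read back. It only restates the sample data; the variable names daniel_records and top_2010 are made up for this example.

```python
sample_db = {
    ('DANIEL', 'MALE'): [(2009, 3423, 1), (2010, 3175, 2)],
    ('ANTHONY', 'MALE'): [(2009, 3106, 2), (2010, 2882, 3)],
    ('ANGEL', 'MALE'): [(2009, 3058, 3)],
    ('JACOB', 'MALE'): [(2010, 3368, 1)],
}

# All records for one (name, gender) key, already sorted by year.
daniel_records = sample_db[('DANIEL', 'MALE')]

# Find the rank-1 male name in 2010 by scanning every entry.
top_2010 = None
for (name, gender), records in sample_db.items():
    for (year, count, rank) in records:
        if gender == 'MALE' and year == 2010 and rank == 1:
            top_2010 = name
```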

Two kinds of Database: Ranked and Unranked

We either call a database ranked, where all ranks have been correctly filled in, or unranked, where ranks are either None or no longer correct due to an addition. It is common to begin filling in a database with None as the rank value, creating an unranked database, and then we will go back and recalculate/fix all the rankings. When naming function arguments, we use db and rdb accordingly to remind us what we've got.

We will use a few different csv files as our examples and in testing. They come from the shared files linked at the start of this document.

Function dealing with File Reading

This is the only function that deals with file reading; you can attempt it separately from all the other functions, because the other functions accept database dictionaries, not file names, so they don't rely upon this function's output at all. If this function is taking up significant time, please use your time wisely and keep working through the other required functions to maximize your score.

read_file(filename): This will accept the file name as a string, and assumes it is a CSV file as described above (with our name data in the same format as the example, but with any number of rows after the header row). It will open the file, read all the name entries, and correctly create the unranked database.

Return the resulting unranked database.

Set all rankings to None.

You can assume that for any given name/gender, there will be at most one entry for each year.

Sort all [(year,count,rank)] lists by year.

Hints:

o How can you break this task down into multiple phases, each one taking a pass over the data and making something slightly more useful towards getting the result?

o What functions in this project can you call to make read_file much easier to implement?
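One possible shape for read_file, offered as a sketch under stated assumptions rather than the expected solution: fields never contain commas or quotes, every data row has exactly four fields, and blank lines can be skipped. The helper year_of is a hypothetical name introduced here to sort by year without comparing the None ranks.

```python
def year_of(record):
    """Sort key: the year stored in a (year, count, rank) tuple."""
    return record[0]

def read_file(filename):
    """Read a quoted CSV file and return an unranked database."""
    with open(filename) as f:
        lines = f.readlines()
    db = {}
    for line in lines[1:]:              # skip the header row
        if line.endswith('\n'):
            line = line[:-1]
        if line == '':                  # assumption: ignore blank lines
            continue
        fields = []
        for piece in line.split(','):
            fields.append(piece[1:-1])  # drop the surrounding double quotes
        year, gender, name, count = fields
        key = (name, gender)
        if key not in db:
            db[key] = []
        # Rank is left as None; ranking is a separate step.
        db[key].append((int(year), int(count), None))
    for key in db:
        db[key].sort(key=year_of)       # keep each list ordered by year
    return db
```

This version fills the dictionary in a single pass and sorts afterwards; splitting the work into more passes, as the first hint suggests, is equally valid.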

Functions dealing with ranked databases

These functions rely upon rankings. You can assume the incoming database is always correctly ranked.

get_rank_for_name_year (rdb, name, gender, year): accepts a ranked database rdb, name, gender and year. It finds and returns the rank of that name/gender in the specified year. It returns None if there is no relevant record in the database.
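A minimal sketch of this lookup, assuming the database layout defined earlier. Using get() with a default of [] covers the case where the (name, gender) key is missing.

```python
def get_rank_for_name_year(rdb, name, gender, year):
    """Return the rank of (name, gender) in the given year, or None if absent."""
    for (rec_year, count, rank) in rdb.get((name, gender), []):
        if rec_year == year:
            return rank
    return None

sample_db = {
    ('DANIEL', 'MALE'): [(2009, 3423, 1), (2010, 3175, 2)],
    ('JACOB', 'MALE'): [(2010, 3368, 1)],
}
daniel_2010 = get_rank_for_name_year(sample_db, 'DANIEL', 'MALE', 2010)  # 2
jacob_2009 = get_rank_for_name_year(sample_db, 'JACOB', 'MALE', 2009)    # None
```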

popularity_by_name(rdb, name, gender): accepts a ranked database rdb, name, and gender. It finds the ranks for all years included in rdb for name, assembles them in a list of pairs [(year,rank)], and returns the list. If rdb has no records for name, it returns []. Sort the pairs by year when there are records for multiple years.
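A short sketch of this function under the same database layout. Because a name/gender has at most one entry per year, sorting the (year, rank) pairs orders them by year.

```python
def popularity_by_name(rdb, name, gender):
    """Return [(year, rank)] for (name, gender), sorted by year."""
    pairs = []
    for (year, count, rank) in rdb.get((name, gender), []):
        pairs.append((year, rank))
    return sorted(pairs)

sample_db = {
    ('ANTHONY', 'MALE'): [(2009, 3106, 2), (2010, 2882, 3)],
    ('ANGEL', 'MALE'): [(2009, 3058, 3)],
}
anthony = popularity_by_name(sample_db, 'ANTHONY', 'MALE')  # [(2009, 2), (2010, 3)]
missing = popularity_by_name(sample_db, 'ANGEL', 'FEMALE')  # []
```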

popularity_by_year(rdb, gender, year, top=10): accepts a ranked database rdb, gender, year, and top. It finds the most popular names of that gender for the specified year and returns them in a list of pairs [(rank,name)]. Sort the list of pairs. You can assume top is always a positive integer. If top is not provided, use the default value and report the top 10 popular names. If top is larger than the number of stored names for that year, report all names in the right order.
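The description leaves one detail open: whether top counts names or bounds the rank. The sketch below assumes top is a rank threshold (keep every name with rank <= top), which matches the wording used for always_popular_names; if the intended meaning is "the first top names", slicing the sorted list would be the alternative.

```python
def popularity_by_year(rdb, gender, year, top=10):
    """Return sorted [(rank, name)] for names ranked within top that year."""
    pairs = []
    for (name, rec_gender), records in rdb.items():
        if rec_gender != gender:
            continue
        for (rec_year, count, rank) in records:
            # Assumption: top is a rank threshold, so ties at the cutoff are all kept.
            if rec_year == year and rank <= top:
                pairs.append((rank, name))
    pairs.sort()  # orders by rank, then alphabetically among equal ranks
    return pairs

sample_db = {
    ('DANIEL', 'MALE'): [(2009, 3423, 1), (2010, 3175, 2)],
    ('ANTHONY', 'MALE'): [(2009, 3106, 2), (2010, 2882, 3)],
    ('ANGEL', 'MALE'): [(2009, 3058, 3)],
    ('JACOB', 'MALE'): [(2010, 3368, 1)],
}
top2_2010 = popularity_by_year(sample_db, 'MALE', 2010, 2)
```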

always_popular_names(rdb, gender, years=None, top=10): accepts a ranked database rdb, gender, a list of years, and a top threshold. It searches the database for (name, gender) records that are present in every indicated year and always ranked within (and including) top, and returns them as a list of names, alphabetically sorted. Their rankings across the years are not part of the returned answer.

If years is not provided, use all years present anywhere in the database.

If top is not provided, default to 10.

Hint: call previous functions!
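Following the hint, the sketch below builds on a rank lookup. To keep the example self-contained it repeats a compact get_rank_for_name_year; everything else is one possible reading of the description, not a reference solution.

```python
def get_rank_for_name_year(rdb, name, gender, year):
    """Compact copy of the lookup, repeated so this block runs on its own."""
    for (rec_year, count, rank) in rdb.get((name, gender), []):
        if rec_year == year:
            return rank
    return None

def always_popular_names(rdb, gender, years=None, top=10):
    """Names of this gender ranked within top in every requested year."""
    if years is None:
        # Default: every year that appears anywhere in the database.
        year_set = set()
        for records in rdb.values():
            for (year, count, rank) in records:
                year_set.add(year)
        years = sorted(year_set)
    names = []
    for (name, rec_gender) in rdb:
        if rec_gender != gender:
            continue
        always = True
        for year in years:
            rank = get_rank_for_name_year(rdb, name, gender, year)
            if rank is None or rank > top:
                always = False
        if always:
            names.append(name)
    return sorted(names)

sample_db = {
    ('DANIEL', 'MALE'): [(2009, 3423, 1), (2010, 3175, 2)],
    ('ANTHONY', 'MALE'): [(2009, 3106, 2), (2010, 2882, 3)],
    ('ANGEL', 'MALE'): [(2009, 3058, 3)],
    ('JACOB', 'MALE'): [(2010, 3368, 1)],
}
both_years = always_popular_names(sample_db, 'MALE')        # ['ANTHONY', 'DANIEL']
top_two = always_popular_names(sample_db, 'MALE', top=2)    # ['DANIEL']
```

ANGEL and JACOB are excluded in both calls because each is missing from one of the two years.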

Functions creating/updating ranked databases

These functions start from a database that has either incomplete or incorrect rankings. They will need to create correct rankings for the resulting database.

rank_names_by_year_gender (db, year, gender) : This function accepts an existing (unranked) database db, a year and a gender. It calculates the ranking of names according to their counts and updates that information into the database. Rank male and female names separately. The most popular name for each gender (with the highest count) gets a rank value of 1.

Assign all tied-count names with the same rank and make sure the next rank is adjusted accordingly. Given counts of A:10, B:5, C:5, D:5, E:1, they'd get rankings of A=1, B=2, C=2, D=2, E=5.

This function updates the database in-place and returns None.
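The tie rule above says a name's rank is one more than the number of names with a strictly higher count. One way to get that, sketched here as an illustration, is to sort the counts in descending order and use the position of a count's first occurrence.

```python
def rank_names_by_year_gender(db, year, gender):
    """Fill in ranks for one year and gender, in place; ties share a rank."""
    # Pass 1: gather every count recorded for this year and gender.
    counts = []
    for (name, rec_gender), records in db.items():
        if rec_gender == gender:
            for (rec_year, count, rank) in records:
                if rec_year == year:
                    counts.append(count)
    counts.sort()
    counts.reverse()  # highest count first
    # Pass 2: replace each matching tuple with a ranked copy (tuples are immutable).
    for (name, rec_gender), records in db.items():
        if rec_gender == gender:
            for i in range(len(records)):
                rec_year, count, rank = records[i]
                if rec_year == year:
                    # The first position of this count gives the shared rank.
                    records[i] = (rec_year, count, counts.index(count) + 1)
    return None

db = {
    ('A', 'FEMALE'): [(2000, 10, None)],
    ('B', 'FEMALE'): [(2000, 5, None)],
    ('C', 'FEMALE'): [(2000, 5, None)],
    ('D', 'FEMALE'): [(2000, 5, None)],
    ('E', 'FEMALE'): [(2000, 1, None)],
}
rank_names_by_year_gender(db, 2000, 'FEMALE')
```

With the counts from the tie example, the sorted list is [10, 5, 5, 5, 1], so the ranks come out as 1, 2, 2, 2, 5.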

rank_names (db) : This function accepts an existing database and ranks all names for all years of data present, making the database become ranked.

This function should return None.

Rank male and female names separately.

Hint: use previous functions!
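Taking the hint literally, rank_names can collect every (year, gender) pair in the database and rank each one in turn. The sketch repeats a compact rank_names_by_year_gender so that it runs on its own.

```python
def rank_names_by_year_gender(db, year, gender):
    """Compact copy of the per-year ranking, repeated for self-containment."""
    counts = []
    for (name, g), records in db.items():
        for (y, count, rank) in records:
            if g == gender and y == year:
                counts.append(count)
    counts.sort()
    counts.reverse()
    for (name, g), records in db.items():
        if g == gender:
            for i in range(len(records)):
                y, count, rank = records[i]
                if y == year:
                    records[i] = (y, count, counts.index(count) + 1)

def rank_names(db):
    """Rank every year and gender present in db, in place."""
    pairs = set()
    for (name, gender), records in db.items():
        for (year, count, rank) in records:
            pairs.add((year, gender))
    for (year, gender) in pairs:
        rank_names_by_year_gender(db, year, gender)
    return None

db = {
    ('DANIEL', 'MALE'): [(2009, 3423, None), (2010, 3175, None)],
    ('ANTHONY', 'MALE'): [(2009, 3106, None), (2010, 2882, None)],
    ('ANGEL', 'MALE'): [(2009, 3058, None)],
    ('JACOB', 'MALE'): [(2010, 3368, None)],
}
rank_names(db)
```

Run on the unranked form of the sample data, this reproduces the ranks shown in sample_db.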

Extra Credit

merge_databases(db1, db2): accepts two databases. This function creates and returns a new database containing the entries from both sources. db1 and db2 must not be modified. When the same name-gender-year is encountered in both databases, we must add the counts of each together for the new database (pretend we are merging data from each state in the USA). The result must be re-ranked before returning it.
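A sketch of one way to merge: total the counts per (name, gender) and year in fresh dictionaries, rebuild the record lists with rank None, then re-rank. The rank_names included here is a compact stand-in so the block runs on its own; neither input is modified because only new containers are written to.

```python
def rank_names(db):
    """Compact stand-in: rank all years and genders in place, ties sharing a rank."""
    groups = {}
    for (name, gender), records in db.items():
        for (year, count, rank) in records:
            if (year, gender) not in groups:
                groups[(year, gender)] = []
            groups[(year, gender)].append(count)
    for counts in groups.values():
        counts.sort()
        counts.reverse()
    for (name, gender), records in db.items():
        for i in range(len(records)):
            year, count, rank = records[i]
            records[i] = (year, count, groups[(year, gender)].index(count) + 1)

def merge_databases(db1, db2):
    """Return a new ranked database with counts summed across db1 and db2."""
    totals = {}  # (name, gender) -> {year: summed count}
    for db in (db1, db2):
        for key, records in db.items():
            by_year = totals.get(key, {})
            for (year, count, rank) in records:
                by_year[year] = by_year.get(year, 0) + count
            totals[key] = by_year
    merged = {}
    for key, by_year in totals.items():
        merged[key] = []
        for year in sorted(by_year):
            merged[key].append((year, by_year[year], None))
    rank_names(merged)
    return merged

db1 = {('A', 'FEMALE'): [(2000, 5, 1)]}
db2 = {('A', 'FEMALE'): [(2000, 3, 2), (2001, 1, 1)], ('B', 'FEMALE'): [(2000, 9, 1)]}
merged = merge_databases(db1, db2)
```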
