Question

Posted on Sep 21, 2024

Please provide Python code for questions 4-10 if possible, starting at 4. I have provided the code I wrote for parts 1 through 3 below the questions. Thanks.

Challenge 1

Open up a new IPython notebook

Download a few MTA turnstile data files

Open up a file, use csv reader to read it, make a python dict where there is a key for each (C/A, UNIT, SCP, STATION). These are the first four columns. The value for this key should be a list of lists. Each list in the list is the rest of the columns in a row. For example, one key-value pair should look like

{ ('A002','R051','02-00-00','LEXINGTON AVE'):
    [
      ['NQR456', 'BMT', '01/03/2015', '03:00:00', 'REGULAR', '0004945474', '0001675324'],
      ['NQR456', 'BMT', '01/03/2015', '07:00:00', 'REGULAR', '0004945478', '0001675333'],
      ['NQR456', 'BMT', '01/03/2015', '11:00:00', 'REGULAR', '0004945515', '0001675364'],
      ...
    ]
}
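A minimal sketch of Challenge 1 using only the standard library. The inline `SAMPLE` text stands in for a downloaded file, its header names are an assumption about the file layout, and `read_turnstile_rows` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a downloaded turnstile file. The header names are assumed;
# the two data rows are copied from the example above.
SAMPLE = """C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,01/03/2015,03:00:00,REGULAR,0004945474,0001675324
A002,R051,02-00-00,LEXINGTON AVE,NQR456,BMT,01/03/2015,07:00:00,REGULAR,0004945478,0001675333
"""


def read_turnstile_rows(file_obj):
    """Map (C/A, UNIT, SCP, STATION) to a list of the remaining columns per row."""
    reader = csv.reader(file_obj)
    next(reader)  # skip the header row
    rows_by_turnstile = defaultdict(list)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        row = [cell.strip() for cell in row]  # some fields carry trailing spaces
        rows_by_turnstile[tuple(row[:4])].append(row[4:])
    return dict(rows_by_turnstile)


raw_readings = read_turnstile_rows(io.StringIO(SAMPLE))
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` object.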

Challenge 2

Let's turn this into a time series.

For each key (basically the control area, unit, device address and station of a specific turnstile), have a list again, but let the list be comprised of just the point in time and the count of entries.

This basically means keeping only the date, time, and entries fields in each list. You can convert the date and time into datetime objects -- That is a python class that represents a point in time. You can combine the date and time fields into a string and use the dateutil module to convert it into a datetime object.

Your new dict should look something like

{ ('A002','R051','02-00-00','LEXINGTON AVE'):
    [
      [datetime.datetime(2013, 3, 2, 3, 0), 3788],
      [datetime.datetime(2013, 3, 2, 7, 0), 2585],
      [datetime.datetime(2013, 3, 2, 12, 0), 10653],
      [datetime.datetime(2013, 3, 2, 17, 0), 11016],
      [datetime.datetime(2013, 3, 2, 23, 0), 10666],
      [datetime.datetime(2013, 3, 3, 3, 0), 10814],
      [datetime.datetime(2013, 3, 3, 7, 0), 10229],
      ...
    ],
  ....
}
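A possible sketch of Challenge 2. It swaps in `datetime.strptime` from the standard library in place of the dateutil module the challenge mentions; `dateutil.parser.parse` would also work on the combined string. The name `to_time_series` is hypothetical, and the column positions assume the Challenge 1 row layout.

```python
from datetime import datetime

# Two rows in the Challenge 1 layout, copied from the example above.
raw_readings = {
    ("A002", "R051", "02-00-00", "LEXINGTON AVE"): [
        ["NQR456", "BMT", "01/03/2015", "03:00:00", "REGULAR", "0004945474", "0001675324"],
        ["NQR456", "BMT", "01/03/2015", "07:00:00", "REGULAR", "0004945478", "0001675333"],
    ]
}


def to_time_series(readings):
    """Keep only [datetime, entries] for every row of every turnstile."""
    series = {}
    for key, rows in readings.items():
        series[key] = [
            # row[2] is the date, row[3] the time, row[5] the entries counter
            [datetime.strptime(row[2] + " " + row[3], "%m/%d/%Y %H:%M:%S"), int(row[5])]
            for row in rows
        ]
    return series


entry_series = to_time_series(raw_readings)
```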

Challenge 3

These counts are for every n hours. (What is n?) We want total daily entries.

Now build a dict with the same keys, but with a single value per day: the total number of passengers that entered through this turnstile on that day.
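One way to sketch Challenge 3 with plain dicts, assuming the entries field is a cumulative counter (the code for parts 1-3 below treats it that way by differencing ENTRIES). Each difference between consecutive readings is credited to the date of the earlier reading, and negative differences are skipped as likely counter resets. Both choices are assumptions, and `daily_entries` is a hypothetical name. In the sample rows the readings are 4 hours apart.

```python
from collections import defaultdict
from datetime import datetime

# First three counts come from the Challenge 1 example; the 01/04 readings are made up.
entry_series = {
    ("A002", "R051", "02-00-00", "LEXINGTON AVE"): [
        [datetime(2015, 1, 3, 3, 0), 4945474],
        [datetime(2015, 1, 3, 7, 0), 4945478],
        [datetime(2015, 1, 3, 11, 0), 4945515],
        [datetime(2015, 1, 4, 3, 0), 4945600],
        [datetime(2015, 1, 4, 7, 0), 4945610],
    ]
}


def daily_entries(series):
    """Turn cumulative counter readings into [(date, entries_that_day), ...]."""
    daily = {}
    for key, readings in series.items():
        readings = sorted(readings)  # chronological order
        totals = defaultdict(int)
        for (prev_time, prev_count), (_, count) in zip(readings, readings[1:]):
            delta = count - prev_count
            if delta >= 0:  # skip apparent counter resets
                totals[prev_time.date()] += delta
        daily[key] = sorted(totals.items())
    return daily


daily_series = daily_entries(entry_series)
```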

Challenge 4

We will plot the daily time series for a turnstile.

In ipython notebook, add this to the beginning of your next cell:

%matplotlib inline

This will make your matplotlib graphs integrate nicely with the notebook. To plot the time series, import matplotlib with

import matplotlib.pyplot as plt

Take the list of [(date1, count1), (date2, count2), ...] for the turnstile and turn it into two lists: dates and counts. This should plot it:

plt.figure(figsize=(10,3))
plt.plot(dates,counts)
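A short sketch of the split-and-plot step for Challenge 4. The `daily_counts` values are made up for illustration; in a notebook, `%matplotlib inline` would go at the top of the cell.

```python
import datetime

import matplotlib.pyplot as plt

# Made-up daily totals for one turnstile.
daily_counts = [
    (datetime.date(2016, 8, 28), 1200),
    (datetime.date(2016, 8, 29), 3400),
    (datetime.date(2016, 8, 30), 3100),
]

# zip(*pairs) transposes the list of pairs into one tuple of dates and one of counts.
dates, counts = zip(*daily_counts)

plt.figure(figsize=(10, 3))
plt.plot(dates, counts)
```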

Challenge 5

So far we've been operating on a single turnstile level. Let's combine turnstiles in the same ControlArea/Unit/Station combo. There are some ControlArea/Unit/Station groups that have a single turnstile, but most have multiple turnstiles -- same value for the C/A, UNIT and STATION columns, different values for the SCP column.

We want to combine the numbers together -- for each ControlArea/UNIT/STATION combo, for each day, add the counts from each turnstile belonging to that combo.
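A hedged pandas sketch of Challenge 5. It assumes a frame shaped like the `turnstiles_daily` built in the code for parts 1-3, with a `DAILY_ENTRIES` column; the toy numbers and the name `ca_unit_station_daily` are made up.

```python
import pandas as pd

# Toy stand-in for turnstiles_daily: two turnstiles (SCP values) in one C/A-UNIT-STATION combo.
turnstiles_daily = pd.DataFrame({
    "C/A": ["A002"] * 4,
    "UNIT": ["R051"] * 4,
    "SCP": ["02-00-00", "02-00-01", "02-00-00", "02-00-01"],
    "STATION": ["59 ST"] * 4,
    "DATE": ["08/28/2016", "08/28/2016", "08/29/2016", "08/29/2016"],
    "DAILY_ENTRIES": [800, 600, 1500, 1300],
})

# Leaving SCP out of the grouping keys adds the turnstiles of each combo together per day.
ca_unit_station_daily = (
    turnstiles_daily
    .groupby(["C/A", "UNIT", "STATION", "DATE"], as_index=False)["DAILY_ENTRIES"]
    .sum()
)
```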

Challenge 6

Similarly, combine everything in each station, and come up with a time series of [(date1, count1),(date2,count2),...] type of time series for each STATION, by adding up all the turnstiles in a station.

Challenge 7

Plot the time series for a station.
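Challenges 6 and 7 together might look like the sketch below, again assuming a frame shaped like `turnstiles_daily` with a `DAILY_ENTRIES` column. The second control area for 59 ST (`R244`/`R050`) and all counts are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for per-control-area daily totals; numbers are made up.
turnstiles_daily = pd.DataFrame({
    "C/A": ["A002", "A002", "R244", "R244", "R626", "R626"],
    "UNIT": ["R051", "R051", "R050", "R050", "R062", "R062"],
    "STATION": ["59 ST"] * 4 + ["CROWN HTS-UTICA"] * 2,
    "DATE": ["08/28/2016", "08/29/2016"] * 3,
    "DAILY_ENTRIES": [1400, 2800, 900, 1700, 500, 1100],
})

# Challenge 6: group on STATION and DATE only, so every turnstile in a station is summed.
station_daily = (
    turnstiles_daily
    .groupby(["STATION", "DATE"], as_index=False)["DAILY_ENTRIES"]
    .sum()
)
station_daily["DATE"] = pd.to_datetime(station_daily["DATE"], format="%m/%d/%Y")

# Challenge 7: pick one station and plot its daily series.
one_station = station_daily[station_daily["STATION"] == "59 ST"].sort_values("DATE")
plt.figure(figsize=(10, 3))
plt.plot(one_station["DATE"], one_station["DAILY_ENTRIES"])
plt.title("59 ST daily entries")
```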

Challenge 8

Make one list of counts for one week for one station. Monday's count, Tuesday's count, etc. so it's a list of 7 counts. Make the same list for another week, and another week, and another week. plt.plot(week_count_list) for every week_count_list you created this way. You should get a rainbow plot of weekly commute numbers on top of each other.
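A sketch of the weekly "rainbow" plot for Challenge 8 on made-up data. It groups a station's daily series by ISO week, places each count at its weekday position (Monday first), and drops incomplete weeks; `station_series` and its values are illustrative only.

```python
import datetime
from collections import defaultdict

import matplotlib.pyplot as plt

# Three made-up weeks for one station, starting on Monday 2016-08-29.
start = datetime.date(2016, 8, 29)
base = [100, 110, 120, 115, 105, 60, 50]
station_series = [
    (start + datetime.timedelta(days=i), base[i % 7] + 10 * (i // 7))
    for i in range(21)
]

# Bucket by (ISO year, ISO week); weekday() is 0 for Monday through 6 for Sunday.
weeks = defaultdict(lambda: [None] * 7)
for day, count in station_series:
    weeks[day.isocalendar()[:2]][day.weekday()] = count

# Keep only full weeks, in chronological order.
week_count_lists = [counts for _, counts in sorted(weeks.items()) if None not in counts]

plt.figure(figsize=(10, 3))
for week_count_list in week_count_lists:
    plt.plot(week_count_list)
plt.xticks(range(7), ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
```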

Challenge 9

Over multiple weeks, sum total ridership for each station and sort the totals, so you can find the stations with the highest traffic during the period you investigate.
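A minimal pandas sketch of Challenge 9, assuming a per-station daily frame with `STATION`, `DATE` and `DAILY_ENTRIES` columns. The station names appear in the code for parts 1-3 below; the counts are made up.

```python
import pandas as pd

# Toy per-station daily totals.
station_daily = pd.DataFrame({
    "STATION": ["59 ST", "59 ST", "CROWN HTS-UTICA", "CROWN HTS-UTICA",
                "VERNON-JACKSON", "VERNON-JACKSON"],
    "DATE": ["08/28/2016", "08/29/2016"] * 3,
    "DAILY_ENTRIES": [2300, 4500, 500, 1100, 700, 1500],
})

# Sum every day in the period per station, then sort busiest first.
station_totals = (
    station_daily
    .groupby("STATION")["DAILY_ENTRIES"]
    .sum()
    .sort_values(ascending=False)
)
```

`station_totals.head(10)` would then list the ten busiest stations in the period.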

Challenge 10

Make a single list of these total ridership values and plot it with

plt.hist(total_ridership_counts)

to get an idea about the distribution of total ridership among different stations.

This should show you that most stations have little traffic, and that the histogram bins for large traffic volumes have small bars.

Additional Hint:

If you want to see which stations take the bulk of the traffic, you can sort the total ridership counts and make a bar chart with plt.bar. For this, you want to have two lists: the indices of each bar, and the values. The indices can just be 0,1,2,3,..., so you can do

indices = range(len(total_ridership_values))
plt.bar(indices, total_ridership_values)
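A sketch covering the Challenge 10 histogram and the sorted bar chart from the hint. The totals are made up; with real data they could come from the per-station sums of Challenge 9.

```python
import matplotlib.pyplot as plt

# Made-up total ridership per station over the period.
total_ridership_counts = [1600, 6800, 250, 2200, 900, 400, 750, 300]

# Histogram: how many stations fall into each ridership range.
plt.figure(figsize=(10, 3))
plt.hist(total_ridership_counts)

# Bar chart: one bar per station, busiest first.
total_ridership_values = sorted(total_ridership_counts, reverse=True)
indices = range(len(total_ridership_values))
plt.figure(figsize=(10, 3))
plt.bar(indices, total_ridership_values)
```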

Code for 1-3

#Problem 1.1

from __future__ import print_function, division

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

import datetime

# Source: http://web.mta.info/developers/turnstile.html
def get_data(week_nums):
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    return pd.concat(dfs)

week_nums = [160903, 160910, 160917]
turnstiles_df = get_data(week_nums)

turnstiles_df.head()

turnstiles_df.columns = [column.strip() for column in turnstiles_df.columns]

turnstiles_df.columns

turnstiles_df.info()

turnstiles_df.head()

turnstiles_df.tail()

# Three weeks of Data
turnstiles_df.DATE.value_counts().sort_index()

#Problem 1.2

turnstiles_df.columns

from datetime import datetime as dt

mask = ((turnstiles_df["C/A"] == "A002") &
        (turnstiles_df["UNIT"] == "R051") &
        (turnstiles_df["SCP"] == "02-00-00") &
        (turnstiles_df["STATION"] == "59 ST"))
turnstiles_df[mask].head()

# Take the date and time fields into a single datetime column
turnstiles_df["DATE_TIME"] = pd.to_datetime(turnstiles_df.DATE + " " + turnstiles_df.TIME,
                                            format="%m/%d/%Y %H:%M:%S")

mask = ((turnstiles_df["C/A"] == "R626") &
        (turnstiles_df["UNIT"] == "R062") &
        (turnstiles_df["SCP"] == "00-00-00") &
        (turnstiles_df["STATION"] == "CROWN HTS-UTICA"))
turnstiles_df[mask].head()

# turnstiles_df = .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"]).ENTRIES.count().reset_index().sort_values("ENTRIES", ascending=False)

# Sanity Check to verify that "C/A", "UNIT", "SCP", "STATION", "DATE_TIME" is unique
(turnstiles_df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()
 .sort_values("ENTRIES", ascending=False)).head(5)

# On 9/16, we seem to have two entries for same time. Let's take a look
mask = ((turnstiles_df["C/A"] == "R504") &
        (turnstiles_df["UNIT"] == "R276") &
        (turnstiles_df["SCP"] == "00-00-01") &
        (turnstiles_df["STATION"] == "VERNON-JACKSON") &
        (turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 9, 16).date()))
turnstiles_df[mask].head()

turnstiles_df.DESC.value_counts()

# Get rid of the duplicate entry
turnstiles_df.sort_values(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True, ascending=False)
turnstiles_df.drop_duplicates(subset=["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=True)

# Drop Exits and Desc Column. To prevent errors in multiple run of cell, errors on drop is ignored
turnstiles_df = turnstiles_df.drop(["EXITS", "DESC"], axis=1, errors="ignore")

#Problem 1.3

turnstiles_daily = turnstiles_df.groupby(["C/A", "UNIT", "SCP", "STATION", "DATE"]).ENTRIES.first().reset_index()

turnstiles_daily.head()

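# Select the columns with a list; the tuple-style ["DATE", "ENTRIES"] is rejected by recent pandas versions
turnstiles_daily[["PREV_DATE", "PREV_ENTRIES"]] = (turnstiles_daily
    .groupby(["C/A", "UNIT", "SCP", "STATION"])[["DATE", "ENTRIES"]]
    .transform(lambda grp: grp.shift(1)))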

turnstiles_daily.head()

turnstiles_daily.tail()

# Drop each turnstile's earliest date, which has no previous reading to diff against
turnstiles_daily.dropna(subset=["PREV_DATE"], axis=0, inplace=True)

turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]].head()

# What's the deal with counter being in reverse
mask = ((turnstiles_df["C/A"] == "A011") &
        (turnstiles_df["UNIT"] == "R080") &
        (turnstiles_df["SCP"] == "01-00-00") &
        (turnstiles_df["STATION"] == "57 ST-7 AV") &
        (turnstiles_df["DATE_TIME"].dt.date == datetime.datetime(2016, 8, 27).date()))
turnstiles_df[mask].head()

# Let's see how many stations have this problem

(turnstiles_daily[turnstiles_daily["ENTRIES"] < turnstiles_daily["PREV_ENTRIES"]]
    .groupby(["C/A", "UNIT", "SCP", "STATION"])
    .size())

def get_daily_counts(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        counter = -counter
    if counter > max_counter:
        print(row["ENTRIES"], row["PREV_ENTRIES"])
        return 0
    return counter

# If counter is > 1Million, then the counter might have been reset.
# Just set it to zero as different counters have different cycle limits
_ = turnstiles_daily.apply(get_daily_counts, axis=1, max_counter=1000000)

def get_daily_counts(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        # Maybe the counter is reversed?
        counter = -counter
    if counter > max_counter:
        print(row["ENTRIES"], row["PREV_ENTRIES"])
        counter = min(row["ENTRIES"], row["PREV_ENTRIES"])
    if counter > max_counter:
        # Check it again to make sure we are not giving a counter that's too big
        return 0
    return counter

# If counter is > 1Million, then the counter might have been reset.
# Just set it to zero as different counters have different cycle limits
turnstiles_daily["DAILY_ENTRIES"] = turnstiles_daily.apply(get_daily_counts, axis=1, max_counter=1000000)

turnstiles_daily.head()

# End of #3. Please follow the questions above to write code for parts 4-10.