
Question

1. To load the data with read_csv from the attached data files (tips.txt or tips.csv), save the
files to a directory of your own and include that directory path when loading them, as in the
example paths below:
In: import pandas as pd
In: tips = pd.read_csv('c:/Users/mowaf/Data/tips.txt', sep=',')
In: tips = pd.read_csv('c:/Users/mowaf/Data/tips.csv')
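The loading step can be tried without the attached file. The sketch below is only an illustration: it inlines the five rows listed in step 2 of this exercise as CSV text and reads them through io.StringIO, which is an assumption made so the example is self-contained. With the real file, pass its path instead.

```python
import io

import pandas as pd

# Five rows copied from the tips.head() listing in step 2, inlined as CSV
# text so the sketch runs without the attached file.
csv_text = """obs,totbill,tip,sex,smoker,day,time,size
1,16.99,1.01,F,No,Sun,Night,2
2,10.34,1.66,M,No,Sun,Night,3
3,21.01,3.50,M,No,Sun,Night,3
4,23.68,3.31,M,No,Sun,Night,2
5,24.59,3.61,F,No,Sun,Night,4
"""

# read_csv accepts a path or a file-like object; sep=',' is already the default,
# which is why the .txt and .csv calls in step 1 behave the same way.
tips = pd.read_csv(io.StringIO(csv_text), sep=',')

print(tips.shape)
print(tips.dtypes)
```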
2. To view the first 5 rows of the dataset:
In: tips.head()
   obs  totbill   tip sex smoker  day   time  size
0    1    16.99  1.01   F     No  Sun  Night     2
1    2    10.34  1.66   M     No  Sun  Night     3
2    3    21.01  3.50   M     No  Sun  Night     3
3    4    23.68  3.31   M     No  Sun  Night     2
4    5    24.59  3.61   F     No  Sun  Night     4
3. To make a stacked bar plot showing the percentage of data points for each party size on each
day, first make a cross-tabulation by day and party size and show it:
In: party_count = pd.crosstab(tips['day'], tips['size'])
In: party_count
Out:
size  1   2   3   4  5  6
day
Fri   1  16   1   1  0  0
Sat   2  53  18  13  1  0
Sun   0  39  15  18  3  1
Thu   1  48   4   5  1  3
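To see what the cross-tabulation does on a small scale, here is a minimal sketch. The demo frame is hypothetical (six made-up parties, not the real tips data); only the pd.crosstab call mirrors step 3.

```python
import pandas as pd

# Hypothetical mini dataset: one row per party, NOT the real tips file.
demo = pd.DataFrame({
    "day":  ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "size": [2, 2, 2, 3, 4, 2],
})

# Rows are the distinct days, columns the distinct party sizes, and each
# cell counts how many rows have that (day, size) combination.
counts = pd.crosstab(demo["day"], demo["size"])
print(counts)
```

Combinations that never occur (such as a party of 4 on Fri in this demo) appear as 0 rather than being dropped.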
4. Before normalizing the data so that each row sums to 1 and making the plot, keep only party sizes 2 through 5:
In: party_counts = party_count.loc[:,2:5]
In: party_counts
size   2   3   4  5
day
Fri   16   1   1  0
Sat   53  18  13  1
Sun   39  15  18  3
Thu   48   4   5  1
5. Normalize so that each row sums to 1 to make a bar plot of the party size (counts) over the
weekdays:
# Normalize to sum to 1
In: party_pcts = party_counts.div(party_counts.sum(axis=1), axis=0)
In: party_pcts
size         2         3         4         5
day
Fri   0.888889  0.055556  0.055556  0.000000
Sat   0.623529  0.211765  0.152941  0.011765
Sun   0.520000  0.200000  0.240000  0.040000
Thu   0.827586  0.068966  0.086207  0.017241
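The row normalization can be checked by hand. This sketch rebuilds the party_counts table from step 4 as a literal DataFrame (typed in from the listing, so it needs no data file) and divides each row by its own total.

```python
import pandas as pd

# Counts copied from the party_counts listing in step 4.
party_counts = pd.DataFrame(
    {2: [16, 53, 39, 48], 3: [1, 18, 15, 4], 4: [1, 13, 18, 5], 5: [0, 1, 3, 1]},
    index=pd.Index(["Fri", "Sat", "Sun", "Thu"], name="day"),
)
party_counts.columns.name = "size"

# sum(axis=1) gives one total per day; div(..., axis=0) aligns those totals
# with the row index, so every row is divided by its own total.
row_totals = party_counts.sum(axis=1)
party_pcts = party_counts.div(row_totals, axis=0)

print(party_pcts.round(6))
print(party_pcts.sum(axis=1))  # each row should now sum to 1
```

For example, Fri has 16 + 1 + 1 + 0 = 18 parties, so the size-2 share is 16/18.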
6. Draw plot(kind='bar') of the party_pcts from step 5.
Show your plot output.
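A possible way to draw the plot is sketched below. It assumes matplotlib is installed, selects the non-interactive "Agg" backend so it runs without a display (an assumption, not part of the exercise), and types in the party_pcts values from step 5. stacked=True follows the stacked bar plot named in step 3; drop it for side-by-side bars.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; omit in a notebook

import pandas as pd

# Percentages copied from the party_pcts listing in step 5.
party_pcts = pd.DataFrame(
    {
        2: [0.888889, 0.623529, 0.520000, 0.827586],
        3: [0.055556, 0.211765, 0.200000, 0.068966],
        4: [0.055556, 0.152941, 0.240000, 0.086207],
        5: [0.000000, 0.011765, 0.040000, 0.017241],
    },
    index=pd.Index(["Fri", "Sat", "Sun", "Thu"], name="day"),
)
party_pcts.columns.name = "size"

# One bar per day; each bar is split into segments, one per party size.
ax = party_pcts.plot(kind="bar", stacked=True)
ax.set_ylabel("share of parties")
```

In an interactive session, matplotlib.pyplot.show() displays the figure.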
7. To find the percentage of the tip for each bill, create a column tip_pct:
In: tips['tip_pct']= tips['tip']/ tips['totbill']
tips.head(5)
Similarly, to list the first five rows of the dataset by slicing:
tips[:5]
Show your output.
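As a quick check of the arithmetic, the sketch below applies the same column division to the totbill and tip values of the five rows listed in step 2. Note that tip_pct as defined here is a fraction of the bill (about 0.06 for the first row), not a value multiplied by 100.

```python
import pandas as pd

# totbill and tip values copied from the first five rows shown in step 2.
tips = pd.DataFrame({
    "totbill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip":     [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Element-wise division: each row's tip divided by that row's total bill.
tips["tip_pct"] = tips["tip"] / tips["totbill"]

print(tips.head(5))
# A positional slice of the first five rows selects the same rows as head(5).
print(tips[:5].equals(tips.head(5)))
```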
8. Draw a histogram (the frequency of the data points split into discrete, evenly spaced bins, with
the number of data points in each bin) of the tip percentages of the total bill, tips['tip_pct'],
from step 7:
tips['tip_pct'].plot.hist(bins=50)
Show your output.
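To make the binning concrete without drawing anything, the sketch below computes the bin counts directly with numpy.histogram. This is a stand-in chosen for illustration, not the plotting call from the exercise, and it uses only the five tip_pct values derivable from the rows in step 2.

```python
import numpy as np
import pandas as pd

# tip / totbill for the five rows listed in step 2.
tip_pct = pd.Series(
    [1.01 / 16.99, 1.66 / 10.34, 3.50 / 21.01, 3.31 / 23.68, 3.61 / 24.59],
    name="tip_pct",
)

# bins=50 splits the range [min, max] into 50 equal-width intervals and
# counts how many values fall into each one.
counts, edges = np.histogram(tip_pct, bins=50)

print(counts.sum())  # every value lands in exactly one bin
print(len(edges))    # 50 bins are delimited by 51 edges
```

With only five values most of the 50 bins are empty; on the full dataset the bars fill in.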
9. To show summary statistics of the tip percentage tip_pct grouped by the day, smoker, sex, or time
column, call describe on a groupby object:
tips.groupby('smoker')['tip_pct'].describe()
Show your output.
Similarly, show the tip_pct summary statistics grouped by sex, day, and time.
Show your output.
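The shape of the describe output is easiest to see on a tiny example. The demo frame below is hypothetical (four made-up rows, not the tips data); the groupby and describe calls have the same form as in step 9, with sex as the grouping column.

```python
import pandas as pd

# Hypothetical values, chosen only so the statistics are easy to verify.
demo = pd.DataFrame({
    "sex":     ["F", "F", "M", "M"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25],
})

# One row per group; the columns are count, mean, std, min, the three
# quartiles, and max of tip_pct within that group.
summary = demo.groupby("sex")["tip_pct"].describe()
print(summary)
```

Replacing "sex" with "day" or "time" on the real data gives the other summaries the step asks for.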
10. Aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the
desired function or calling a method like mean or std. However, we may want to aggregate using
a different function depending on the column, or multiple functions at once.
To group the tips by day and smoker:
grouped = tips.groupby(['day', 'smoker'])
grouped
Out[130]:
To check and view summary statistics for the grouped data:
grouped.describe()
Show your output.
11. To calculate the average tip_pct for each day and smoker group, select the tip_pct column of
grouped as grouped_pct and aggregate it:
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day  smoker
Fri  No        0.151650
     Yes       0.174783
Sat  No        0.158048
     Yes       0.147906
Sun  No        0.160113
     Yes       0.187250
Thu  No        0.160298
     Yes       0.163863
Name: tip_pct, dtype: float64
12. Alternatively, we can pass a list of functions or function names instead. The result is a
DataFrame with column names taken from the functions:
grouped_pct.agg(['mean','std'])
                 mean       std
day smoker
Fri No       0.151650  0.028123
    Yes      0.174783  0.051293
Sat No       0.158048  0.039767
    Yes      0.147906  0.061375
Sun No       0.160113  0.042347
    Yes      0.187250  0.154134
Thu No       0.160298  0.038774
    Yes      0.163863  0.039389
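The difference between passing one function name and passing a list can be sketched on a small frame. The data below is hypothetical (six made-up rows), and the names means and stats are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical rows; NOT the real tips data.
demo = pd.DataFrame({
    "day":     ["Fri", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker":  ["No", "Yes", "Yes", "No", "No", "Yes"],
    "tip_pct": [0.10, 0.20, 0.30, 0.10, 0.20, 0.15],
})

grouped_pct = demo.groupby(["day", "smoker"])["tip_pct"]

# A single function name returns a Series indexed by (day, smoker).
means = grouped_pct.agg("mean")

# A list of names returns a DataFrame with one column per function.
stats = grouped_pct.agg(["mean", "std"])

print(means)
print(stats)
```

Groups with a single row, such as (Fri, No) here, have an undefined sample standard deviation, so their std shows as NaN.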
13. Similarly, calculate the average of tip_pct grouped by day and sex.
Show your code and output.
14. With a DataFrame, we can also specify a list of functions to apply to all of the columns or
different functions per column. To compute the same three statistics for the tip_pct and totbill
columns, select both columns with a list (double brackets):
grouped[['tip_pct', 'totbill']].agg(['count', 'mean', 'max'])
           tip_pct                     totbill
             count      mean       max   count       mean    max
day smoker
Fri No           4  0.151650  0.187735       4  18.420000  22.75
    Yes         15  0.174783  0.263480      15  16.813333  40.17
Sat No          45  0.158048  0.291990      45  19.661778  48.33
    Yes         42  0.147906  0.325733      42  21.276667  50.81
Sun No          57  0.160113  0.252672      57  20.506667  48.17
    Yes         19  0.187250  0.710345      19  24.120000  45.35
Thu No          45  0.160298  0.266312      45  17.113111  41.19
    Yes         17  0.163863  0.241255      17  19.190588  43.11
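Step 14 also mentions applying different functions per column, which is done by passing a dict to agg. The sketch below shows both forms on hypothetical data (four made-up rows); the particular functions chosen in the dict are arbitrary examples.

```python
import pandas as pd

# Hypothetical rows; NOT the real tips data.
demo = pd.DataFrame({
    "day":     ["Fri", "Fri", "Sat", "Sat"],
    "smoker":  ["No", "No", "Yes", "Yes"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25],
    "totbill": [10.0, 20.0, 30.0, 50.0],
})
grouped = demo.groupby(["day", "smoker"])

# Same list of functions for every selected column: the result has
# two-level columns, (column name, function name).
same = grouped[["tip_pct", "totbill"]].agg(["count", "mean", "max"])

# A dict maps each column to its own function or list of functions.
per_col = grouped.agg({"tip_pct": ["min", "max"], "totbill": "sum"})

print(same)
print(per_col)
```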
15. Suppose we want to select the top five tip_pct values by group. First, write a function that
selects the rows with the largest values in a particular column:
In: def top(df, n=5, column='tip_pct'):
        return df.sort_values(by=column)[-n:]
In: top(tips, n=5)
Show your output.
Then group by smoker and call apply with the top() function:
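A minimal sketch of that groupby-apply step follows, on hypothetical data (six made-up rows) and with n=2 so the result stays short. Selecting the value columns before apply is a choice made here so the function only receives non-grouping columns; sort_values is used because it sorts by the named column.

```python
import pandas as pd


def top(df, n=5, column="tip_pct"):
    """Return the n rows of df with the largest values in column."""
    return df.sort_values(by=column)[-n:]


# Hypothetical rows; NOT the real tips data.
demo = pd.DataFrame({
    "smoker":  ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.30, 0.20, 0.25, 0.05, 0.15],
    "totbill": [20.0, 10.0, 15.0, 12.0, 40.0, 30.0],
})

# apply calls top() once per smoker group and stacks the pieces; extra
# keyword arguments such as n=2 are forwarded to top().
result = demo.groupby("smoker")[["tip_pct", "totbill"]].apply(top, n=2)
print(result)
```

The result is indexed by the group key (smoker) plus the original row label, and within each group the rows are in ascending tip_pct order.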
