Question
Project One: Data Visualization, Descriptive Statistics, Confidence Intervals
This notebook contains the step-by-step directions for Project One. It is very important to run through the steps in order. Some steps depend on the outputs of earlier steps. Once you have completed the steps in this notebook, be sure to write your summary report.
You are a data analyst for a basketball team and have access to a large set of historical data that you can use to analyze performance patterns. The coach of the team and your management have requested that you use descriptive statistics and data visualization techniques to study distributions of key performance metrics that are included in the data set. These data-driven analytics will help make key decisions to improve the performance of the team. You will use the Python programming language to perform the statistical analyses and then prepare a report of your findings to present for the team's management. Since the managers are not data analysts, you will need to interpret your findings and describe their practical implications.
There are four important variables in the data set that you will study in Project One.
| Variable | What does it represent? |
|---|---|
| pts | Points scored by the team in a game |
| elo_n | A measure of the relative skill level of the team in the league |
| year_id | Year when the team played the games |
| fran_id | Name of the NBA team |
The ELO rating, represented by the variable elo_n, is used as a measure of the relative skill of a team. This measure is inferred based on the final score of a game, the game location, and the outcome of the game relative to the probability of that outcome. The higher the number, the higher the relative skill of a team.
In addition to studying data on your own team, your management has assigned you a second team so that you can compare its performance with your own team's.
| Team | What does it represent? |
|---|---|
| Your Team | This is the team that has hired you as an analyst. This is the team that you will pick below. See Step 2. |
| Assigned Team | This is the team that the management has assigned to you to compare against your team. See Step 1. |
Reminder: It may be beneficial to review the summary report template for Project One prior to starting this Python script. That will give you an idea of the questions you will need to answer with the outputs of this script.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Step 1: Data Preparation & the Assigned Team
This step uploads the data set from a CSV file. It also selects the assigned team for this analysis. Do not make any changes to the code block below.
- The assigned team is the Chicago Bulls from the years 1996-1998.
Click the block of code below and hit the Run button above.
In[1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
from IPython.display import display, HTML

nba_orig_df = pd.read_csv('nbaallelo.csv')
nba_orig_df = nba_orig_df[(nba_orig_df['lg_id']=='NBA') & (nba_orig_df['is_playoffs']==0)]
columns_to_keep = ['game_id','year_id','fran_id','pts','opp_pts','elo_n','opp_elo_n', 'game_location', 'game_result']
nba_orig_df = nba_orig_df[columns_to_keep]

# The dataframe for the assigned team is called assigned_team_df.
# The assigned team is the Chicago Bulls from 1996-1998.
assigned_years_league_df = nba_orig_df[(nba_orig_df['year_id'].between(1996, 1998))]
assigned_team_df = assigned_years_league_df[(assigned_years_league_df['fran_id']=='Bulls')]
assigned_team_df = assigned_team_df.reset_index(drop=True)

display(HTML(assigned_team_df.head().to_html()))
print("printed only the first five observations...")
print("Number of rows in the data set =", len(assigned_team_df))
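|   | game_id | year_id | fran_id | pts | opp_pts | elo_n | opp_elo_n | game_location | game_result |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 199511030CHI | 1996 | Bulls | 105 | 91 | 1598.2924 | 1531.7449 | H | W |
| 1 | 199511040CHI | 1996 | Bulls | 107 | 85 | 1604.3940 | 1458.6415 | H | W |
| 2 | 199511070CHI | 1996 | Bulls | 117 | 108 | 1605.7983 | 1310.9349 | H | W |
| 3 | 199511090CLE | 1996 | Bulls | 106 | 88 | 1618.8701 | 1452.8268 | A | W |
| 4 | 199511110CHI | 1996 | Bulls | 110 | 106 | 1621.1591 | 1490.2861 | H | W |

printed only the first five observations...
Number of rows in the data set = 246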
Step 2: Pick Your Team
In this step, you will pick your team. The range of years that you will study for your team is 2013-2015. Make the following edits to the code block below:
- Replace ??TEAM?? with your choice of team from one of the following team names.
- Bucks, Bulls, Cavaliers, Celtics, Clippers, Grizzlies, Hawks, Heat, Jazz, Kings, Knicks, Lakers, Magic, Mavericks, Nets, Nuggets, Pacers, Pelicans, Pistons, Raptors, Rockets, Sixers, Spurs, Suns, Thunder, Timberwolves, Trailblazers, Warriors, Wizards
- Remember to enter the team name within single quotes. For example, if you picked the Suns, then ??TEAM?? should be replaced with 'Suns'.
After you are done with your edits, click the block of code below and hit the Run button above.
In[17]:
# Range of years: 2013-2015 (Note: The line below selects ALL teams within the three-year period 2013-2015. This is not your team's dataframe.)
your_years_leagues_df = nba_orig_df[(nba_orig_df['year_id'].between(2013, 2015))]

# The dataframe for your team is called your_team_df.
# ---- TODO: make your edits here ----
# ??TEAM?? has been replaced with 'Wizards'; only the team name changes on this line.
your_team_df = your_years_leagues_df[(your_years_leagues_df['fran_id']=='Wizards')]
your_team_df = your_team_df.reset_index(drop=True)

display(HTML(your_team_df.head().to_html()))
print("printed only the first five observations...")
print("Number of rows in the data set =", len(your_team_df))
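The two-stage filter in this step (years first, then franchise) can be tried on a small table before running it on the full data set. The sketch below is only an illustration: the dataframe `toy_df` and all of its rows are invented, and only the column names `year_id`, `fran_id`, and `pts` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data: invented rows, real column names from the notebook.
toy_df = pd.DataFrame({
    'year_id': [2012, 2013, 2014, 2015, 2015, 2016],
    'fran_id': ['Wizards', 'Wizards', 'Suns', 'Wizards', 'Suns', 'Wizards'],
    'pts':     [95, 101, 99, 110, 104, 98],
})

# Stage 1: keep every team's games in the range 2013-2015.
# Series.between is inclusive on both ends by default.
toy_years_df = toy_df[toy_df['year_id'].between(2013, 2015)]

# Stage 2: keep only one franchise, then renumber the rows from 0.
toy_team_df = toy_years_df[toy_years_df['fran_id'] == 'Wizards'].reset_index(drop=True)

print(toy_team_df)
print("Number of rows in the toy data set =", len(toy_team_df))
```

Note that `len()` is applied to the dataframe, not to the team name; `len('Wizards')` would count the letters in the string.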
Step 3: Data Visualization: Points Scored by Your Team
The coach has requested that you provide a visual that shows the distribution of points scored by your team in the years 2013-2015. The code below provides two possible options. Pick ONE of these two plots to include in your summary report. Choose the plot that you think provides the best visual for the distribution of points scored by your team. In your summary report, you must explain why you think your visual is the best choice.
Click the block of code below and hit the Run button above.
NOTE: If the plots are not created, click the code section and hit the Run button again.
In[21]:
import seaborn as sns

# Histogram
fig, ax = plt.subplots()
# The points column is named 'pts' (see the variable table above).
plt.hist(your_team_df['pts'], bins=20)
plt.title('Histogram of points scored by Your Team in 2013 to 2015', fontsize=18)
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
plt.show()
print("")

# Scatterplot
plt.title('Scatterplot of points scored by Your Team in 2013 to 2015', fontsize=18)
# Keyword arguments x= and y= work across seaborn versions.
sns.regplot(x=your_team_df['year_id'], y=your_team_df['pts'], ci=None)
plt.show()
Step 4: Data Visualization: Points Scored by the Assigned Team
The coach has also requested that you provide a visual that shows a distribution of points scored by the Bulls from years 1996-1998. The code below provides two possible options. Pick ONE of these two plots to include in your summary report. Choose the plot that you think provides the best visual for the distribution of points scored by the assigned team. In your summary report, you will explain why you think your visual is the best choice.
Click the block of code below and hit the Run button above.
NOTE: If the plots are not created, click the code section and hit the Run button again.
In[20]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt  # needed if Step 1 has not been run in this session

# Histogram
fig, ax = plt.subplots()
# The points column is named 'pts'.
plt.hist(assigned_team_df['pts'], bins=20)
plt.title('Histogram of points scored by the Bulls in 1996 to 1998', fontsize=18)
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
plt.show()

# Scatterplot
plt.title('Scatterplot of points scored by the Bulls in 1996 to 1998', fontsize=18)
sns.regplot(x=assigned_team_df['year_id'], y=assigned_team_df['pts'], ci=None)
plt.show()
Step 5: Data Visualization: Comparing the Two Teams
Now the coach wants you to prepare one plot that provides a visual of the differences in the distribution of points scored by the assigned team and your team. The code below provides two possible visuals. Choose the plot that allows for the best comparison of the data distributions.
Click the block of code below and hit the Run button above.
NOTE: If the plots are not created, click the code section and hit the Run button again.
In[10]:
import seaborn as sns
import pandas as pd              # needed if Step 1 has not been run in this session
import matplotlib.pyplot as plt

# Side-by-side boxplots
both_teams_df = pd.concat((assigned_team_df, your_team_df))
plt.title('Boxplot to compare points distribution', fontsize=18)
sns.boxplot(x='fran_id',y='pts',data=both_teams_df)
plt.show()
print("")

# Histograms
fig, ax = plt.subplots()
# The points column is named 'pts'.
plt.hist(assigned_team_df['pts'], 20, alpha=0.5, label='Assigned Team')
plt.hist(your_team_df['pts'], 20, alpha=0.5, label='Your Team')
plt.title('Histogram to compare points distribution', fontsize=18)
plt.xlabel('Points')
plt.legend(loc='upper right')
plt.show()
Step 6: Descriptive Statistics: Relative Skill of Your Team
The management of your team wants you to run descriptive statistics on the relative skill of your team from 2013-2015. In this project, you will use the variable 'elo_n' to represent the relative skill of the teams. Calculate descriptive statistics including the mean, median, variance, and standard deviation for the relative skill of your team. Make the following edits to the code block below:
- Replace ??MEAN_FUNCTION?? with the name of the Python function that calculates the mean.
- Replace ??MEDIAN_FUNCTION?? with the name of the Python function that calculates the median.
- Replace ??VAR_FUNCTION?? with the name of the Python function that calculates the variance.
- Replace ??STD_FUNCTION?? with the name of the Python function that calculates the standard deviation.
After you are done with your edits, click the block of code below and hit the Run button above.
In[14]:
print("Your Team's Relative Skill in 2013 to 2015")
print("-------------------------------------------------------")

# ---- TODO: make your edits here ----
# This step describes YOUR team, so the dataframe is your_team_df.
# pandas method names are lowercase.
mean = your_team_df['elo_n'].mean()          # ??MEAN_FUNCTION?? -> mean
median = your_team_df['elo_n'].median()      # ??MEDIAN_FUNCTION?? -> median
variance = your_team_df['elo_n'].var()       # ??VAR_FUNCTION?? -> var
stdeviation = your_team_df['elo_n'].std()    # ??STD_FUNCTION?? -> std

print('Mean =', round(mean,2))
print('Median =', round(median,2))
print('Variance =', round(variance,2))
print('Standard Deviation =', round(stdeviation,2))
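One detail that may matter when interpreting these numbers: by default, the pandas methods `var()` and `std()` compute the sample versions, dividing by n - 1 rather than n. The sketch below checks this by hand on a handful of invented ratings; the series `elo` is hypothetical and not taken from the data set.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration only.
elo = pd.Series([1500.0, 1520.0, 1480.0, 1510.0, 1490.0])

n = len(elo)
mean = elo.mean()

# Sum of squared deviations from the mean.
ss = ((elo - mean) ** 2).sum()

sample_var = ss / (n - 1)    # matches elo.var(), whose default is ddof=1
population_var = ss / n      # matches elo.var(ddof=0)

print('Mean =', round(mean, 2))
print('Median =', round(elo.median(), 2))
print('Sample variance =', round(sample_var, 2))
print('Population variance =', round(population_var, 2))
print('Sample standard deviation =', round(sample_var ** 0.5, 2))
```

With several hundred games per team the two versions differ very little, but the distinction is worth stating in a report.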
Step 7: Descriptive Statistics: Relative Skill of the Assigned Team
The management also wants you to run descriptive statistics for the relative skill of the Bulls from 1996-1998. Calculate descriptive statistics including the mean, median, variance, and standard deviation for the relative skill of the assigned team.
You are to write this code block yourself.
Use Step 6 to help you write this code block. Here is some information that will help you write this code block.
- The dataframe for the assigned team is called assigned_team_df.
- The variable 'elo_n' represents the relative skill of the teams.
- Your statistics should be rounded to two decimal places.
Write your code in the code block section below. After you are done, click this block of code and hit the Run button above. Reach out to your instructor if you need more help with this step.
In[]:
# Write your code in this code block.
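One possible shape for this step, offered as a sketch rather than the expected answer: the helper `describe_elo` is a hypothetical name, and `toy_assigned_df` holds only the five `elo_n` values shown in the Step 1 output, not the full 246 rows. In the notebook itself, the helper would be called with assigned_team_df instead.

```python
import pandas as pd

def describe_elo(team_df):
    """Return mean, median, variance, and standard deviation of 'elo_n',
    each rounded to two decimal places. (Hypothetical helper.)"""
    elo = team_df['elo_n']
    return {
        'Mean': round(elo.mean(), 2),
        'Median': round(elo.median(), 2),
        'Variance': round(elo.var(), 2),            # sample variance (ddof=1)
        'Standard Deviation': round(elo.std(), 2),  # sample standard deviation
    }

# Stand-in data: the first five elo_n values from the Step 1 output only.
toy_assigned_df = pd.DataFrame(
    {'elo_n': [1598.2924, 1604.3940, 1605.7983, 1618.8701, 1621.1591]}
)

stats = describe_elo(toy_assigned_df)

print("Assigned Team's Relative Skill (first five games only)")
print("-------------------------------------------------------")
for name, value in stats.items():
    print(name, '=', value)
```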
Step 8: Confidence Intervals for the Average Relative Skill of All Teams in Your Team's Years
The management wants you to calculate a 95% confidence interval for the average relative skill of all teams in 2013-2015. To construct a confidence interval, you will need the mean and standard error of the relative skill level in these years. The code block below calculates the mean and the standard deviation. Your edits will calculate the standard error and the confidence interval. Make the following edits to the code block below:
- Replace ??SD_VARIABLE?? with the variable name representing the standard deviation of relative skill of all teams from your years. (Hint: the standard deviation variable is in the code block below)
- Replace ??CL?? with the confidence level of the confidence interval.
- Replace ??MEAN_VARIABLE?? with the variable name representing the mean relative skill of all teams from your years. (Hint: the mean variable is in the code block below)
- Replace ??SE_VARIABLE?? with the variable name representing the standard error. (Hint: the standard error variable is in the code block below)
The management also wants you to calculate the probability that a team in the league has a relative skill level less than that of the team that you picked. Assuming that the relative skill of teams is Normally distributed, Python methods for a Normal distribution can be used to answer this question. The code block below uses two of these Python methods. Your task is to identify the correct Python method and report the probability.
After you are done with your edits, click the block of code below and hit the Run button above.
In[15]:
print("Confidence Interval for Average Relative Skill in the years 2013 to 2015")
print("------------------------------------------------------------------------------------------------------------")

# Mean relative skill of all teams from the years 2013-2015
mean = your_years_leagues_df['elo_n'].mean()

# Standard deviation of the relative skill of all teams from the years 2013-2015
stdev = your_years_leagues_df['elo_n'].std()

n = len(your_years_leagues_df)

# Confidence interval
# ---- TODO: make your edits here ----
stderr = stdev/(n ** 0.5)                             # ??SD_VARIABLE?? -> stdev
# ??CL?? -> 0.95, ??MEAN_VARIABLE?? -> mean, ??SE_VARIABLE?? -> stderr
conf_int_95 = st.norm.interval(0.95, mean, stderr)

print("95% confidence interval (unrounded) for Average Relative Skill (ELO) in the years 2013 to 2015 =", conf_int_95)
print("95% confidence interval (rounded) for Average Relative Skill (ELO) in the years 2013 to 2015 = (", round(conf_int_95[0], 2),",", round(conf_int_95[1], 2),")")

print(" ")
print("Probability a team has Average Relative Skill LESS than the Average Relative Skill (ELO) of your team in the years 2013 to 2015")
print("----------------------------------------------------------------------------------------------------------------------------------------------------------")

mean_elo_your_team = your_team_df['elo_n'].mean()

choice1 = st.norm.sf(mean_elo_your_team, mean, stdev)
choice2 = st.norm.cdf(mean_elo_your_team, mean, stdev)

# Pick the correct answer.
print("Which of the two choices is correct?")
print("Choice 1 =", round(choice1,4))
print("Choice 2 =", round(choice2,4))
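To make the two SciPy calls in this step less opaque, the sketch below reproduces `st.norm.interval` by hand and shows how `cdf` and `sf` relate. All numbers (`mean`, `stdev`, `n`, and `x`) are invented for illustration and are not results from the data set.

```python
import scipy.stats as st

# Hypothetical summary values, invented for illustration.
mean, stdev, n = 1500.0, 100.0, 400

# Standard error of the mean: standard deviation over the square root of n.
stderr = stdev / (n ** 0.5)

# Critical value for a two-sided 95% interval: the 97.5th percentile
# of the standard Normal distribution, roughly 1.96.
z_star = st.norm.ppf(0.975)

# Interval built by hand versus the library call used in the notebook.
manual = (mean - z_star * stderr, mean + z_star * stderr)
builtin = st.norm.interval(0.95, mean, stderr)

# For a Normal distribution with this mean and standard deviation:
#   cdf(x) is the probability of a value LESS than x,
#   sf(x)  is the probability of a value GREATER than x (sf = 1 - cdf).
x = 1600.0
p_less = st.norm.cdf(x, mean, stdev)
p_greater = st.norm.sf(x, mean, stdev)

print("Standard error =", stderr)
print("Manual interval  =", manual)
print("Library interval =", builtin)
print("P(value < x) =", round(p_less, 4))
print("P(value > x) =", round(p_greater, 4))
```

Note that the interval uses the standard error, because it describes uncertainty about the average, while the probability for a single team uses the standard deviation, because it describes the spread of individual ratings.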
Step 9: Confidence Intervals for the Average Relative Skill of All Teams in the Assigned Team's Years
The management also wants you to calculate a 95% confidence interval for the average relative skill of all teams in the years 1996-1998. Calculate this confidence interval.
You are to write this code block yourself.
Use Step 8 to help you write this code block. Here is some information that will help you write this code block. Reach out to your instructor if you need help.
- The dataframe for the years 1996-1998 is called assigned_years_league_df
- The variable 'elo_n' represents the relative skill of teams.
- Start by calculating the mean and the standard deviation of relative skill (ELO) in years 1996-1998.
- Calculate n that represents the sample size.
- Calculate the standard error which is equal to the standard deviation of Relative Skill (ELO) divided by the square root of the sample size n.
- Assuming that the population standard deviation is known, use Python methods for the Normal distribution to calculate the confidence interval.
- Your statistics should be rounded to two decimal places.
The management also wants you to calculate the probability that a team had a relative skill level less than the Bulls in years 1996-1998. Assuming that the relative skill of teams is Normally distributed, calculate this probability.
You are to write this code block yourself.
Use Step 8 to help you write this code block. Here is some information that will help you write this code block.
- Calculate the mean relative skill of the Bulls. Note that the dataframe for the Bulls is called assigned_team_df. The variable 'elo_n' represents the relative skill.
- Use Python methods for a Normal distribution to calculate this probability.
- The probability value should be rounded to four decimal places.
Write your code in the code block section below. After you are done, click this block of code and hit the Run button above. Reach out to your instructor if you need more help with this step.
In[]:
# Write your code in this code block section
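The bullet points for this step can be strung together roughly as follows. This is a sketch under stated assumptions, not the expected answer: the helper `league_ci_and_prob` is a hypothetical name, and `toy_league_df` contains invented ratings (including the placeholder team names A to D). In the notebook, the helper would be called with assigned_years_league_df and assigned_team_df.

```python
import pandas as pd
import scipy.stats as st

def league_ci_and_prob(league_df, team_df, level=0.95):
    """Return ((low, high), prob_less) for the 'elo_n' column.
    (Hypothetical helper following the bullet points of this step.)"""
    elo = league_df['elo_n']
    mean = elo.mean()
    stdev = elo.std()
    n = len(league_df)                 # sample size
    stderr = stdev / (n ** 0.5)        # standard error of the mean

    # Confidence interval for the league-wide average, Normal distribution.
    low, high = st.norm.interval(level, mean, stderr)

    # Probability that a team's rating falls below this team's mean rating.
    team_mean = team_df['elo_n'].mean()
    prob_less = st.norm.cdf(team_mean, mean, stdev)

    return (round(low, 2), round(high, 2)), round(prob_less, 4)

# Invented stand-in data; the real dataframes come from Step 1.
toy_league_df = pd.DataFrame({
    'fran_id': ['A', 'B', 'C', 'D', 'Bulls', 'Bulls'],
    'elo_n':   [1400.0, 1450.0, 1500.0, 1550.0, 1600.0, 1650.0],
})
toy_team_df = toy_league_df[toy_league_df['fran_id'] == 'Bulls']

ci, prob_less = league_ci_and_prob(toy_league_df, toy_team_df)
print("95% confidence interval (rounded) =", ci)
print("Probability of a relative skill below the team's mean =", prob_less)
```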
End of Project One
Download the HTML output and submit it with your summary report for Project One. The HTML output can be downloaded by clicking File, then Download as, then HTML. Do not include the Python code within your summary report.