Question

Posted on Jun 11, 2024

Python Script: To complete the tasks listed below, open the Project Three Jupyter Notebook link in the Assignment Information module. This notebook contains your data set and the Python scripts for your project. In the notebook, you will find step-by-step instructions and code blocks that will help you complete the following tasks:

Simple Linear Regression
  Create scatterplots
  Compute the correlation coefficient
  Conduct a linear regression
Multiple Regression
  Create scatterplots
  Compute the correlation matrix
  Conduct a multiple regression analysis

Summary Report: Once you have completed all the steps in your Python script, you will create a summary report to present your findings. Use the provided template to create your report. You must complete each of the following sections:

Introduction: Set the context for your scenario and the analyses you will be performing.
Scatterplots and Correlation: Discuss relationships between variables using scatterplots and correlation coefficients.
Simple Linear Regression: Create a simple linear regression model to predict the response variable.
Multiple Regression: Create a multiple regression model to predict the response variable.
Conclusion: Summarize your findings and explain their practical implications.

Project Three: Simple Linear Regression and Multiple Regression

This notebook contains step-by-step directions for Project Three. It is very important to run through the steps in order. Some steps depend on the outputs of earlier steps. Once you have completed the steps in this notebook, be sure to write your summary report.

You are a data analyst for a basketball team and have access to a large set of historical data that you can use to analyze performance patterns. The coach of the team and your management have requested that you come up with regression models that predict the total number of wins for a team in the regular season based on key performance metrics. Although the data set is the same one that you used in the previous projects, it has been aggregated here to study the total number of wins in a regular season based on the performance metrics shown in the table below. These regression models will help make key decisions to improve the performance of the team. You will use the Python programming language to perform the statistical analyses and then prepare a report of your findings to present to the team's management. Since the managers are not data analysts, you will need to interpret your findings and describe their practical implications.

There are four important variables in the data set that you will utilize in Project Three.

| Variable | What does it represent |
| --- | --- |
| total_wins | Total number of wins in a regular season |
| avg_pts | Average points scored in a regular season |
| avg_elo_n | Average relative skill of each team in a regular season |
| avg_pts_differential | Average point differential between the team and their opponents in a regular season |

The average relative skill (represented by the variable avg_elo_n in the data set) is simply the average of a team's relative skill in a regular season. Relative skill is measured using the Elo rating. This measure is inferred based on the final score of a game, the game location, and the outcome of the game relative to the probability of that outcome. The higher the number, the higher the relative skill of a team.
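For intuition about how a rating of this kind reacts to results, here is a minimal sketch of the classic Elo update. It is only an assumption that the rating behind avg_elo_n resembles this form; the description above says the data set's rating also uses the final score and game location, which this sketch leaves out. The function names and the K-factor of 20 are illustrative choices, not taken from the notebook.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Classic Elo win probability for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating: float, expected: float, actual: float, k: float = 20.0) -> float:
    """Move the rating toward the observed result (1 = win, 0 = loss)."""
    return rating + k * (actual - expected)


# Two evenly matched teams: each is expected to win half the time.
p_even = expected_score(1500, 1500)

# A 1600-rated team is favored over a 1400-rated team.
p_favorite = expected_score(1600, 1400)

# Winning a game at even odds with K = 20 raises the rating by 10 points.
new_rating = update_rating(1500, p_even, actual=1.0)

print(p_even, round(p_favorite, 3), new_rating)
```

The key idea is that a win against a strong opponent (low expected score) moves the rating more than a win that was already likely.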

Reminder: It may be beneficial to review the summary report document for Project Three prior to starting this Python script. That will give you an idea of the questions you will need to answer with the outputs of this script.

-------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 1: Data Preparation

This step uploads the data set from a CSV file and transforms the data into a form that will be used to create regression models. The data will be aggregated to calculate the number of wins for teams in a basketball regular season between the years 1995 and 2015.

Click the block of code below and hit the Run button above.

In[1]:

import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# dataframe for this project
nba_wins_df = pd.read_csv('nba_wins_data.csv')

display(HTML(nba_wins_df.head().to_html()))
print("printed only the first five observations...")
print("Number of rows in the dataset =", len(nba_wins_df))

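|   | year_id | fran_id | avg_pts | avg_opp_pts | avg_elo_n | avg_opp_elo_n | avg_pts_differential | avg_elo_differential | total_wins |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1995 | Bucks | 99.341463 | 103.707317 | 1368.604789 | 1497.311587 | -4.365854 | -128.706798 | 34 |
| 1 | 1995 | Bulls | 101.524390 | 96.695122 | 1569.892129 | 1488.199352 | 4.829268 | 81.692777 | 47 |
| 2 | 1995 | Cavaliers | 90.451220 | 89.829268 | 1542.433391 | 1498.848261 | 0.621951 | 43.585130 | 43 |
| 3 | 1995 | Celtics | 102.780488 | 104.658537 | 1431.307532 | 1495.936224 | -1.878049 | -64.628693 | 35 |
| 4 | 1995 | Clippers | 96.670732 | 105.829268 | 1309.053701 | 1517.260260 | -9.158537 | -208.206558 | 17 |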

printed only the first five observations...
Number of rows in the dataset = 618
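Step 1 mentions that the data were aggregated to one row per team and season. As a rough illustration of what such an aggregation could look like, the sketch below collapses game-level rows into season-level rows with pandas. The game-level column names (pts, opp_pts, elo_n, game_result) and all values are hypothetical; the CSV loaded above is already aggregated, so this is not the notebook's own preparation code.

```python
import pandas as pd

# Hypothetical game-level data: one row per game played by a team.
games = pd.DataFrame({
    "year_id": [1995, 1995, 1995, 1995],
    "fran_id": ["Bulls", "Bulls", "Bucks", "Bucks"],
    "pts": [102, 98, 95, 101],
    "opp_pts": [96, 99, 104, 97],
    "elo_n": [1570.0, 1568.0, 1369.0, 1372.0],
    "game_result": ["W", "L", "L", "W"],
})

# Per-game helper columns used by the aggregation below.
games["pts_differential"] = games["pts"] - games["opp_pts"]
games["win"] = (games["game_result"] == "W").astype(int)

# One row per (season, team): averages of the metrics and a count of wins.
season = (
    games.groupby(["year_id", "fran_id"])
    .agg(
        avg_pts=("pts", "mean"),
        avg_elo_n=("elo_n", "mean"),
        avg_pts_differential=("pts_differential", "mean"),
        total_wins=("win", "sum"),
    )
    .reset_index()
)
print(season)
```

Averaging within each group is what turns per-game metrics into the avg_ columns, while summing the win indicator yields total_wins.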

Step 2: Scatterplot and Correlation for the Total Number of Wins and Average Points Scored

Your coach expects teams to win more games in a regular season if they have a high average number of points compared to their opponents. This is because the chances of winning are higher if a team scores high in its games. Therefore, it is expected that the total number of wins and the average points scored are correlated. Calculate the Pearson correlation coefficient and its P-value. Make the following edits to the code block below:

Replace ??DATAFRAME_NAME?? with the name of the dataframe used in this project. See Step 1 for the name of the dataframe used in this project.
Replace ??POINTS?? with the name of the variable for average points scored in a regular season. See the table included in the Project Three instructions above to pick the variable name. Enclose this variable in single quotes. For example, if the variable name is var1 then replace ??POINTS?? with 'var1'.
Replace ??WINS?? with the name of the variable for the total number of wins in a regular season. See the table included in the Project Three instructions above to pick the variable name. Enclose this variable in single quotes. For example, if the variable name is var2 then replace ??WINS?? with 'var2'.

The code block below will print a scatterplot of the total number of wins against the average points scored in a regular season.

After you are done with your edits, click the block of code below and hit the Run button above.

In[2]:

import scipy.stats as st

# ---- TODO: make your edits here ----
plt.plot(nba_wins_df['avg_pts'], nba_wins_df['total_wins'], 'o')

plt.title('Total Number of Wins by Average Points Scored', fontsize=20)
plt.xlabel('Average Points Scored')
plt.ylabel('Total Number of Wins')
plt.show()

# ---- TODO: make your edits here ----
correlation_coefficient, p_value = st.pearsonr(nba_wins_df['avg_pts'], nba_wins_df['total_wins'])

print("Correlation between Average Points Scored and the Total Number of Wins ")
print("Pearson Correlation Coefficient =", round(correlation_coefficient,4))
print("P-value =", round(p_value,4))

Correlation between Average Points Scored and the Total Number of Wins
Pearson Correlation Coefficient = 0.4777
P-value = 0.0
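To make the output above more concrete, the following sketch computes Pearson's r from its definition and shows why a very small P-value prints as 0.0 after rounding to four decimals. The helper name pearson_r is illustrative. The five data rows are the ones displayed in Step 1, far too few to reproduce the full-sample value of 0.4777. The value 1.52e-36 is the Prob (F-statistic) from the Step 3 summary, which for a single predictor tests the same hypothesis as the correlation test.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# avg_pts and total_wins for the five 1995 rows displayed in Step 1.
avg_pts = [99.341463, 101.524390, 90.451220, 102.780488, 96.670732]
total_wins = [34, 47, 43, 35, 17]
r_sample = pearson_r(avg_pts, total_wins)
print("r on five rows =", round(r_sample, 4))

# t statistic implied by the full-sample r reported above (n = 618 rows).
r, n = 0.4777, 618
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print("t statistic =", round(t_stat, 2))

# Rounding to four decimals hides tiny P-values; scientific notation does not.
p_value = 1.52e-36
print(round(p_value, 4), f"{p_value:.2e}")
```

A printed P-value of 0.0 should therefore be reported as "less than 0.0001" rather than as exactly zero.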

Step 3: Simple Linear Regression: Predicting the Total Number of Wins using Average Points Scored

The coach of your team suggests a simple linear regression model with the total number of wins as the response variable and the average points scored as the predictor variable. He expects a team to win more games in a season if it averages more points during that season. This regression model will help your coach predict how many games your team might win in a regular season if they maintain a certain average score. Create this simple linear regression model. Make the following edits to the code block below:

Replace ??RESPONSE_VARIABLE?? with the variable name that is being predicted. See the table included in the Project Three instructions above to pick the variable name. Do not enclose this variable in quotes. For example, if the variable name is var1 then replace ??RESPONSE_VARIABLE?? with var1.
Replace ??PREDICTOR_VARIABLE?? with the variable name that is the predictor. See the table included in the Project Three instructions above to pick the variable name. Do not enclose this variable in quotes. For example, if the variable name is var2 then replace ??PREDICTOR_VARIABLE?? with var2.

For example, if the variable names are var1 for the response variable and var2 for the predictor variable, then the expression in the code block below should be: model = smf.ols('var1 ~ var2', nba_wins_df).fit()

After you are done with your edits, click the block of code below and hit the Run button above.

In[3]:

import statsmodels.formula.api as smf

# Simple Linear Regression
# ---- TODO: make your edits here ---
model1 = smf.ols('total_wins ~ avg_pts', nba_wins_df).fit()
print(model1.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:             total_wins   R-squared:                       0.228
Model:                            OLS   Adj. R-squared:                  0.227
Method:                 Least Squares   F-statistic:                     182.1
Date:                Wed, 12 Aug 2020   Prob (F-statistic):           1.52e-36
Time:                        04:05:53   Log-Likelihood:                -2385.4
No. Observations:                 618   AIC:                             4775.
Df Residuals:                     616   BIC:                             4784.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -85.5476      9.305     -9.194      0.000    -103.820     -67.275
avg_pts        1.2849      0.095     13.495      0.000       1.098       1.472
==============================================================================
Omnibus:                       24.401   Durbin-Watson:                   1.768
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               11.089
Skew:                          -0.033   Prob(JB):                      0.00391
Kurtosis:                       2.347   Cond. No.                     1.97e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.97e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
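As a small worked example of reading the summary above, the sketch below plugs the reported intercept and slope into the fitted line and checks that R-squared matches the squared correlation from Step 2. The function name predict_wins and the three scoring averages are illustrative choices, not part of the notebook.

```python
# Coefficients copied from the Step 3 summary output.
INTERCEPT = -85.5476
SLOPE_AVG_PTS = 1.2849


def predict_wins(avg_pts: float) -> float:
    """Predicted regular-season wins for a given average points scored."""
    return INTERCEPT + SLOPE_AVG_PTS * avg_pts


# Hypothetical scoring averages, chosen only for illustration.
for pts in (95, 100, 105):
    print(pts, "->", round(predict_wins(pts), 1))

# In simple linear regression, R-squared equals the squared Pearson correlation.
r = 0.4777  # Pearson correlation coefficient from Step 2
print("r squared =", round(r ** 2, 3))
```

Under this model, each additional point of scoring average corresponds to about 1.28 more predicted wins, and average points alone accounts for roughly 23% of the variation in total wins.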

Step 4: Scatterplot and Correlation for the Total Number of Wins and Average Relative Skill

Your management expects the team to win more games in the regular season if it maintains a high average relative skill compared to other teams. Therefore, it is expected that the total number of wins and the average relative skill are correlated. Calculate the Pearson correlation coefficient and its P-value. Make the following edits to the code block below:

Replace ??DATAFRAME_NAME?? with the name of the dataframe used in this project. See Step 1 for the name of the dataframe used in this project.
Replace ??RELATIVE_SKILL?? with the name of the variable for average relative skill in a regular season. See the table included in the Project Three instructions above to pick the variable name. Enclose this variable in single quotes. For example, if the variable name is var2 then replace ??RELATIVE_SKILL?? with 'var2'.
Replace ??WINS?? with the name of the variable for total number of wins in a regular season. See the table included in the Project Three instructions above to pick the variable name. Enclose this variable in single quotes. For example, if the variable name is var3 then replace ??WINS?? with 'var3'.

The code block below will print a scatterplot of the total number of wins against the average relative skill in a regular season.

After you are done with your edits, click the block of code below and hit the Run button above.

In[4]:

# ---- TODO: make your edits here ---
plt.plot(nba_wins_df['avg_elo_n'], nba_wins_df['total_wins'], 'o')

plt.title('Wins by Average Relative Skill', fontsize=20)
plt.xlabel('Average Relative Skill')
plt.ylabel('Total Number of Wins')
plt.show()

# ---- TODO: make your edits here ---
correlation_coefficient, p_value = st.pearsonr(nba_wins_df['avg_elo_n'], nba_wins_df['total_wins'])

print("Correlation between Average Relative Skill and Total Number of Wins ")
print("Pearson Correlation Coefficient =", round(correlation_coefficient,4))
print("P-value =", round(p_value,4))

Correlation between Average Relative Skill and Total Number of Wins
Pearson Correlation Coefficient = 0.9072
P-value = 0.0
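Steps 2 and 4 compute one correlation at a time. A correlation matrix shows every pairwise correlation at once, which can be useful before combining predictors in a multiple regression. The sketch below builds one with NumPy from the five rows displayed in Step 1; with so few rows the values will not match the full-sample coefficients reported above, and the variable names are illustrative.

```python
import numpy as np

# Five 1995 rows displayed in Step 1 (columns: avg_pts, avg_elo_n, total_wins).
rows = np.array([
    [99.341463, 1368.604789, 34],
    [101.524390, 1569.892129, 47],
    [90.451220, 1542.433391, 43],
    [102.780488, 1431.307532, 35],
    [96.670732, 1309.053701, 17],
])
names = ["avg_pts", "avg_elo_n", "total_wins"]

# rowvar=False treats each column as a variable and each row as an observation.
corr = np.corrcoef(rows, rowvar=False)

for name, line in zip(names, corr):
    print(f"{name:>10}", np.round(line, 3))
```

On the full data set, nba_wins_df[['avg_pts', 'avg_elo_n', 'total_wins']].corr() should give the equivalent matrix directly from the dataframe.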

Step 5: Multiple Regression: Predicting the Total Number of Wins using Average Points Scored and Average Relative Skill

Instead of presenting a simple linear regression model to the coach, you can suggest a multiple regression model with the total number of wins as the response variable and the average points scored and the average relative skill as predictor variables. This regression model will help your coach predict how many games your team might win in a regular season based on metrics like the average points scored and average relative skill. This model is more practical because you expect more than one performance metric to determine the total number of wins in a regular season. Create this multiple regression model. Make the following edits to the code block below:

Replace ??RESPONSE_VARIABLE?? with the variable name that is being predicted. See the table included in the Project Three instructions above. Do not enclose this variable in quotes. For example, if the variable name is var0 then replace ??RESPONSE_VARIABLE?? with var0.
Replace ??PREDICTOR_VARIABLE_1?? with the variable name for average points scored. Hint: See the table included in the Project Three instructions above. Do not enclose this variable in quotes. For example, if the variable name is var1 then replace ??PREDICTOR_VARIABLE_1?? with var1.
Replace ??PREDICTOR_VARIABLE_2?? with the variable name for average relative skill. Hint: See the table included in the Project Three instructions above. Do not enclose this variable in quotes. For example, if the variable name is var2 then replace ??PREDICTOR_VARIABLE_2?? with var2.

For example, if the variable names are var0 for the response variable and var1, var2 for the predictor variables, then the expression in the code block below should be: model = smf.ols('var0 ~ var1 + var2', nba_wins_df).fit()

After you are done with your edits, click the block of code below and hit the Run button above.

In[5]:

import statsmodels.formula.api as smf

# Multiple Regression
# ---- TODO: make your edits here ---
model2 = smf.ols('total_wins ~ avg_pts + avg_elo_n', nba_wins_df).fit()
print(model2.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:             total_wins   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.837
Method:                 Least Squares   F-statistic:                     1580.
Date:                Wed, 12 Aug 2020   Prob (F-statistic):          4.41e-243
Time:                        04:10:22   Log-Likelihood:                -1904.6
No. Observations:                 618   AIC:                             3815.
Df Residuals:                     615   BIC:                             3829.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -152.5736      4.500    -33.903      0.000    -161.411    -143.736
avg_pts        0.3497      0.048      7.297      0.000       0.256       0.444
avg_elo_n      0.1055      0.002     47.952      0.000       0.101       0.110
==============================================================================
Omnibus:                       89.087   Durbin-Watson:                   1.203
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              160.540
Skew:                          -0.869   Prob(JB):                     1.38e-35
Kurtosis:                       4.793   Cond. No.                     3.19e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.19e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
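To illustrate how the coefficients in the summary above combine, the sketch below evaluates the fitted equation for a hypothetical team and then changes one predictor at a time. The function name predict_wins and the input values (100 points per game, average Elo of 1500) are assumptions made for illustration.

```python
# Coefficients copied from the Step 5 summary output.
COEF = {"Intercept": -152.5736, "avg_pts": 0.3497, "avg_elo_n": 0.1055}


def predict_wins(avg_pts: float, avg_elo_n: float) -> float:
    """Predicted wins from the two-predictor model."""
    return (
        COEF["Intercept"]
        + COEF["avg_pts"] * avg_pts
        + COEF["avg_elo_n"] * avg_elo_n
    )


# Hypothetical team: 100 points per game, average Elo of 1500.
baseline = predict_wins(100, 1500)

# Holding Elo fixed, one extra point per game adds the avg_pts coefficient.
one_more_point = predict_wins(101, 1500) - baseline

# Holding scoring fixed, 10 extra Elo points add ten times the avg_elo_n coefficient.
ten_more_elo = predict_wins(100, 1510) - baseline

print(round(baseline, 1), round(one_more_point, 4), round(ten_more_elo, 3))
```

Note that the avg_pts coefficient drops from 1.2849 in Step 3 to 0.3497 here: once relative skill is held constant, an extra point of scoring average is associated with far fewer additional wins.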

Step 6: Multiple Regression: Predicting the Total Number of Wins using Average Points Scored, Average Relative Skill, and Average Points Differential

The coach also wants you to consider the average points differential as a predictor variable in the multiple regression model. Create a multiple regression model with the total number of wins as the response variable, and average points scored, average relative skill, and average points differential as predictor variables. This regression model will help your coach predict how many games your team might win in a regular season based on metrics like the average score, average relative skill, and the average points differential between the team and their opponents.

You are to write this code block yourself.

Use Step 5 to help you write this code block. Here is some information that will help you write this code block. Reach out to your instructor if you need help.

The dataframe used in this project is called nba_wins_df.
The variable avg_pts represents average points scored by each team in a regular season.
The variable avg_elo_n represents average relative skill of each team in a regular season.
The variable avg_pts_differential represents average points differential between each team and their opponents in a regular season.
Print the model summary.

Write your code in the code block section below. After you are done, click this block of code and hit the Run button above. Reach out to your instructor if you need more help with this step.

In[8]:

# Write your code in this code block section
import statsmodels.formula.api as smf

# Multiple Regression with three predictors
model3 = smf.ols('total_wins ~ avg_pts + avg_elo_n + avg_pts_differential', nba_wins_df).fit()
print(model3.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:             total_wins   R-squared:                       0.876
Model:                            OLS   Adj. R-squared:                  0.876
Method:                 Least Squares   F-statistic:                     1449.
Date:                Wed, 12 Aug 2020   Prob (F-statistic):          5.03e-278
Time:                        04:21:04   Log-Likelihood:                -1819.8
No. Observations:                 618   AIC:                             3648.
Df Residuals:                     614   BIC:                             3665.
Df Model:                           3
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept              -35.8921      9.252     -3.879      0.000     -54.062     -17.723
avg_pts                  0.2406      0.043      5.657      0.000       0.157       0.324
avg_elo_n                0.0348      0.005      6.421      0.000       0.024       0.045
avg_pts_differential     1.7621      0.127     13.928      0.000       1.514       2.011
==============================================================================
Omnibus:                      181.805   Durbin-Watson:                   0.975
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              506.551
Skew:                          -1.452   Prob(JB):                    1.01e-110
Kurtosis:                       6.352   Cond. No.                     7.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
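One way to compare the three models is through the AIC and BIC values in their summaries. The sketch below recomputes both from the reported log-likelihoods, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k counts the estimated coefficients including the intercept. The dictionary labels are illustrative, and small differences from the printed values come from the log-likelihoods being rounded to one decimal.

```python
import math

N_OBS = 618  # "No. Observations" in every summary

# (log-likelihood, number of estimated coefficients including the intercept)
models = {
    "Step 3: avg_pts": (-2385.4, 2),
    "Step 5: avg_pts + avg_elo_n": (-1904.6, 3),
    "Step 6: avg_pts + avg_elo_n + avg_pts_differential": (-1819.8, 4),
}


def aic(log_lik: float, k: int) -> float:
    """Akaike information criterion."""
    return 2 * k - 2 * log_lik


def bic(log_lik: float, k: int, n: int = N_OBS) -> float:
    """Bayesian information criterion; penalizes extra coefficients by ln(n)."""
    return k * math.log(n) - 2 * log_lik


scores = {name: (aic(ll, k), bic(ll, k)) for name, (ll, k) in models.items()}
for name, (a, b) in scores.items():
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")

# Lower is better for both criteria.
best = min(scores, key=lambda name: scores[name][0])
print("Lowest AIC:", best)
```

Both criteria fall with each added predictor, which agrees with the rise in adjusted R-squared from 0.227 to 0.837 to 0.876 across the three summaries.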