Question
Final Project
Analysis technique: ANOVA to analyze whether there is a difference in means among 3 groups. Regression to determine which statistics are most influential in calculating On Base Percentage (OBP).
Dataset: mlb_players_18.csv
Variable | Description |
name | Name of baseball player |
team | Team of baseball player |
position | Position of baseball player. This variable will be used to develop the 3 groups |
games | Games played (can be used for sample size) |
AB | At Bats: number of times the player batted (can be used for sample size) |
R | Runs |
H | Hits |
doubles | Doubles |
triples | Triples |
HR | Home Runs |
RBI | Runs Batted In |
walks | Walks |
strike_outs | Strike Outs |
stolen_bases | Stolen Bases |
caught_stealing_base | Caught Stealing a Base |
AVG | Average (Hits divided by At Bats) |
OBP | On Base Percentage. This variable will be used in the analysis to describe batter performance. |
SLG | Slugging percentage |
OPS | On Base plus Slugging percentage |
Scenarios:
- You are an analyst for a baseball organization and the team management wants to know if there are real differences between the batting performance of baseball players according to these 3 positions: outfielder, infielder, and catcher (C). For this study, batting performance is equivalent to the variable OBP in the dataset.
- Management wants to understand which variables are statistically significant to the calculation of OBP. The variables to consider are average (AVG), slugging percentage (SLG), and on base plus slugging (OPS).
Important Information for Scenario 1: ANOVA
- Variable of Interest: Batting Performance, represented by on base percentage shown in variable OBP
- Groups: Using the position variable, develop a new variable with 3 groups according to the following definitions:
  - Outfield: If position variable = any one of the following: CF, LF, RF
  - Infield: If position variable = any one of the following: 1B, 2B, 3B, SS
  - Catcher: If position variable = C
- Note that any other position values can be ignored for this analysis
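The group definitions and the manual ANOVA calculation can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `position_group` and `one_way_anova` and the sample rows are hypothetical, not part of the assignment or the dataset. The sketch stops at the F-statistic; the p-value would still have to come from the F distribution with the reported degrees of freedom (for example from statistical software or an F table).

```python
from statistics import mean

# Position codes per group, copied from the definitions above.
GROUPS = {
    "Outfield": {"CF", "LF", "RF"},
    "Infield": {"1B", "2B", "3B", "SS"},
    "Catcher": {"C"},
}


def position_group(position):
    """Return the group name for a position code, or None if it is ignored."""
    for group, codes in GROUPS.items():
        if position in codes:
            return group
    return None


def one_way_anova(samples):
    """Compute one-way ANOVA table entries for a dict of group -> list of values."""
    all_values = [v for values in samples.values() for v in values]
    grand_mean = mean(all_values)
    group_means = {g: mean(values) for g, values in samples.items()}

    # Between-group sum of squares: group size times squared distance
    # of the group mean from the grand mean.
    ss_between = sum(
        len(values) * (group_means[g] - grand_mean) ** 2
        for g, values in samples.items()
    )
    # Within-group sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum(
        (v - group_means[g]) ** 2
        for g, values in samples.items()
        for v in values
    )
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical (position, OBP) rows, made up for illustration only.
players = [
    ("CF", 0.340), ("LF", 0.320), ("1B", 0.350), ("SS", 0.310),
    ("C", 0.300), ("C", 0.290), ("DH", 0.360), ("P", 0.100),
]

samples = {}
for position, obp in players:
    group = position_group(position)
    if group is not None:  # other positions (here DH and P) are ignored
        samples.setdefault(group, []).append(obp)

print(one_way_anova(samples))
```

With the real data, the `players` list would be replaced by the `position` and `OBP` columns read from mlb_players_18.csv.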
Important Information for Scenario 2: Regression
- Variable of Interest: Batting Performance, represented by on base percentage shown in variable OBP
- Pitchers (position = P) should be ignored because their on base percentage is not relevant
- The explanatory variables to consider are average (AVG), slugging percentage (SLG), and on base plus slugging (OPS).
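The data preparation for the regression can be sketched as below, using only the Python standard library. The function names `load_regression_rows` and `max_ops_identity_gap` and the sample rows are hypothetical. The second helper is an optional sanity check, not a requirement of the assignment: because the variable table describes OPS as on base plus slugging, OBP may be nearly equal to OPS minus SLG in the data, which would be worth knowing before interpreting the p-values.

```python
import csv
import io

# Hypothetical rows shaped like mlb_players_18.csv; the values are made up.
SAMPLE_CSV = """name,position,AVG,OBP,SLG,OPS
Player A,CF,0.300,0.380,0.500,0.880
Player B,P,0.100,0.120,0.130,0.250
Player C,C,0.250,0.310,0.400,0.710
Player D,1B,0.280,0.350,0.470,0.820
"""

COLUMNS = ("AVG", "SLG", "OPS", "OBP")


def load_regression_rows(csv_file):
    """Read the response and explanatory columns, skipping pitchers."""
    rows = []
    for record in csv.DictReader(csv_file):
        if record["position"] == "P":
            continue  # pitchers are excluded from the regression
        rows.append({col: float(record[col]) for col in COLUMNS})
    return rows


def max_ops_identity_gap(rows):
    """Largest absolute difference between OBP and (OPS - SLG) across rows."""
    return max(abs(r["OBP"] - (r["OPS"] - r["SLG"])) for r in rows)


rows = load_regression_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "non-pitcher rows")
print("max |OBP - (OPS - SLG)|:", max_ops_identity_gap(rows))
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle. The regression itself could then be fitted in whatever statistical software is used for the report; in Python, for instance, statsmodels' OLS results expose `pvalues` and `rsquared_adj`.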
Instructions
Prepare a report to answer the management's two questions identified in the scenarios above. The report should include the following:
- Summary: A summary paragraph that succinctly answers the management's questions. This summary can be thought of as a conclusion that is presented either first or last. For ANOVA, your conclusion should be based on the evaluation of the null hypothesis in relation to your p-value and your chosen significance level. For Regression, your conclusion should be based on the p-values of each of the explanatory variables. (20 points, 10 for ANOVA conclusion and 10 for Regression conclusion)
- Problem Statements: Make sure to state the problems you were asked to solve. (10 points)
- Assumptions: Clearly define any assumptions you made. For instance, did you remove any outliers in your data due to small sample size (sample size can be determined by either the games or AB [at bats] variables). (10 points, 5 points for ANOVA assumptions and 5 points for Regression assumptions)
- Key charts: Develop at least one chart from your dataset for each of the ANOVA and Regression questions. Each chart should be placed in the report next to the insights it supports. Make sure to give each chart a clear and descriptive title. (Possible idea for ANOVA: a box and whisker plot of OBP for each of the 3 groups to aid in outlier detection.) (20 points, 10 for ANOVA chart and 10 for Regression chart)
- Analysis Technique:
- ANOVA (50 points): results should include the null hypothesis under consideration, the F-statistic, p-value, and degrees of freedom. Include your entire ANOVA table from your statistical software results (Excel, R, Python, etc.) or show your manual calculations if you did not use the ANOVA functions in any statistical software. (10 points for stating the null hypothesis, 10 points for the F-statistic, 10 points for the p-value, 10 points for the degrees of freedom, and 10 points for either the ANOVA table or your manual calculations)
- Regression (40 points): results should include the p-value for each of the 3 explanatory variables AVG, SLG, and OPS (10 points for each p-value), and the adjusted R² (10 points)
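For reference, the adjusted R² requested above is usually derived from the ordinary R² as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. A minimal sketch of that standard formula follows; the function name `adjusted_r2` and the example inputs are hypothetical, and statistical software normally reports this value directly.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared from R-squared, sample size, and predictor count."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs: R-squared of 0.90 from 103 players and 3 predictors
# (AVG, SLG, OPS). These numbers are made up, not results from the dataset.
print(adjusted_r2(0.90, 103, 3))
```

The adjustment penalizes additional explanatory variables, so the adjusted value is never larger than the unadjusted R² when at least one predictor is used.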