Question
In Chapter 24 and Lectures #17 and #18, we worked through a logistic model to predict scoring attempts based on a single independent variable, shot_distance, as well as a second model that used multiple independent variables, ['shot_distance', 'minute', 'action_type', 'shot_type', 'opponent']. It was noted that the prediction accuracy increased from 0.6 using just shot_distance to 0.725 using the entire list. Are all of those additional variables necessary to get the increased accuracy?
For this program, write a function that identifies which variable increases the accuracy of the original model the most. As the examples below show, the function returns the highest accuracy together with the name of the column that produced it.
- def bestForPredict(df, columns, x_col = "shot_distance", y_col = "shot_made", test_size = 40, random_state = 42): This function has six inputs:
- df: a DataFrame that includes the specified columns.
- columns: a list of column names of the specified DataFrame.
- x_col: a column name of the specified DataFrame. It is one of the independent variables for the model (the other comes from the list columns). The default value is "shot_distance".
- y_col: a column name of the specified DataFrame. This is the dependent variable (what is being predicted) in the model. The default value is "shot_made".
- test_size: the size of the test set created when the data is divided into test and training sets with train_test_split. The default value is 40.
- random_state: the random seed used when the data is divided into test and training sets with train_test_split. The default value is 42.
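The test_size and random_state inputs are passed through to train_test_split. As a minimal sketch on made-up arrays (not the lebron.csv data), an integer test_size is treated by scikit-learn as an absolute number of test rows rather than a fraction, and a fixed random_state makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, one feature (not the lebron.csv data).
X = np.arange(200).reshape(-1, 1)
y = np.arange(200) % 2

# An integer test_size is an absolute row count, not a fraction.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=40, random_state=42
)
print(len(X_train), len(X_test))  # 160 40

# Re-running with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=40, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```

With the default test_size of 40, every reported accuracy is therefore a multiple of 1/40.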
For example, assuming your function bestForPredict() is in the file p37.py and the data is in lebron.csv, the code:
import pandas as pd
import p37
df = pd.read_csv('lebron.csv')
columns = ['minute', 'action_type', 'shot_type', 'opponent']
acc,col_name = p37.bestForPredict(df,columns)
print(f'The highest accuracy, {acc}, was obtained by including column, {col_name}.')
would print:
The highest accuracy, 0.725, was obtained by including column, action_type.
Another example with the same DataFrame:
columns = ['minute', 'opponent']
acc,col_name = p37.bestForPredict(df,columns, test_size = 100, random_state = 17)
print(f'The highest accuracy, {acc}, was obtained by including column, {col_name}.')
would print:
The highest accuracy, 0.6, was obtained by including column, opponent.
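The accuracies in these examples are consistent with the fraction of test rows predicted correctly: 0.725 is 29 correct out of 40, and 0.6 is 60 correct out of 100. A small sketch of that arithmetic, assuming accuracy is measured this way and using hypothetical labels and predictions rather than real model output:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for a 40-row test set: 1 = made, 0 = missed.
y_true = [1] * 20 + [0] * 20
# Hypothetical predictions that disagree with y_true on 11 of the 40 rows.
y_pred = [1] * 14 + [0] * 6 + [0] * 15 + [1] * 5

acc = accuracy_score(y_true, y_pred)  # 29 correct / 40 rows
print(acc)  # 0.725
```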
Note: you should submit a file with only the standard comments at the top, this function, and any helper functions you have written. The grading scripts will then import the file for testing.
Hints:
- Some of the code in the textbook is deprecated. In particular, as_matrix and orient='row' do not work in newer versions of pandas. Omit the former and replace the latter with:
rows = df[[x_col,c]].to_dict('records')
onehot = DictVectorizer(sparse=False).fit(rows)
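Putting the pieces together, one possible shape for the function is sketched below. It is not the official solution: the LogisticRegression settings (max_iter=1000), the tie-breaking rule, and the demo DataFrame are all assumptions made for illustration, so it may not reproduce the exact accuracies quoted above on lebron.csv.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def bestForPredict(df, columns, x_col="shot_distance", y_col="shot_made",
                   test_size=40, random_state=42):
    """Return (accuracy, column) for the best column to pair with x_col."""
    best_acc, best_col = -1.0, None
    for c in columns:
        # One dict per row; string values are one-hot encoded,
        # numeric values pass through unchanged.
        rows = df[[x_col, c]].to_dict('records')
        X = DictVectorizer(sparse=False).fit_transform(rows)
        y = df[y_col]

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )

        # max_iter=1000 is an assumption to avoid convergence warnings.
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y_train)
        acc = model.score(X_test, y_test)

        # Strict > keeps the earliest column on ties (an assumption).
        if acc > best_acc:
            best_acc, best_col = acc, c
    return best_acc, best_col


# Synthetic stand-in for lebron.csv; values are made up for illustration.
demo = pd.DataFrame({
    "shot_distance": [1, 25, 3, 22, 2, 27, 4, 24] * 10,
    "minute": [5, 17, 30, 41, 8, 22, 36, 47] * 10,
    "shot_type": ["2PT", "3PT"] * 40,
    "shot_made": [1, 0, 1, 0, 1, 0, 1, 1] * 10,
})
acc, col = bestForPredict(demo, ["minute", "shot_type"], test_size=20)
print(acc, col)
```

Fitting a separate two-variable model per candidate column keeps the comparison fair: each candidate is scored on the same train/test split, because the same random_state is reused inside the loop.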