

Question


Posted on Sep 20, 2024


In Chapter 24 and Lectures #17 and #18, we worked through a logistic model that predicts scoring attempts from a single independent variable, shot_distance, as well as a second model that used multiple independent variables: ['shot_distance', 'minute', 'action_type', 'shot_type', 'opponent']. The prediction accuracy increased from 0.6 using just shot_distance to 0.725 using the entire list. Are all of those additional variables necessary to get the increased accuracy?

For this program, write a function that identifies which variable increases the accuracy of the original model the most.

def bestForPredict(df, columns, x_col = "shot_distance", y_col = "shot_made", test_size = 40, random_state = 42): This function has six inputs:
- df: a DataFrame that includes the specified columns.
- columns: a list of column names of the specified DataFrame.
- x_col: a column name of the specified DataFrame. It is one of the two independent variables for the model (the other is taken from the list columns). The default value is "shot_distance".
- y_col: a column name of the specified DataFrame. This is the dependent variable (what's being predicted) in the model. The default value is "shot_made".
- test_size: the size of the test set created when the data is divided into test and training sets with train_test_split. The default value is 40.
- random_state: the random seed used when the data is divided into test and training sets with train_test_split. The default value is 42.

The function returns the highest prediction accuracy found among the given columns, as well as the name of the column that increases prediction accuracy the most.
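One way to read this specification is sketched below. It is only an illustration under stated assumptions, not the textbook's or the course's reference solution: it assumes each candidate model is a scikit-learn `LogisticRegression` fit on `x_col` plus one column from `columns`, that accuracy means `model.score` on the held-out test set, and that ties keep the earlier column. The name `best_for_predict_sketch` and the `max_iter=1000` setting are choices made for this sketch, so the numbers it produces may differ from the 0.725 and 0.6 quoted in the examples. Note that an integer `test_size` is interpreted by `train_test_split` as an absolute number of test rows, not a fraction.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def best_for_predict_sketch(df, columns, x_col="shot_distance",
                            y_col="shot_made", test_size=40, random_state=42):
    """Return (best_accuracy, best_column) over two-variable logistic models.

    Assumption: one LogisticRegression per candidate column, scored on the
    held-out test set. This is a hypothetical helper, not the graded function.
    """
    best_acc, best_col = None, None
    for c in columns:
        # One dict per row; DictVectorizer one-hot encodes string values
        # and passes numeric values through unchanged.
        rows = df[[x_col, c]].to_dict("records")
        X = DictVectorizer(sparse=False).fit_transform(rows)
        y = df[y_col]

        # Same split parameters for every candidate, so scores are comparable.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state)

        # max_iter is raised as a precaution against convergence warnings
        # (an assumption of this sketch, not part of the assignment).
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        acc = model.score(X_test, y_test)  # fraction of correct predictions

        if best_acc is None or acc > best_acc:
            best_acc, best_col = acc, c
    return best_acc, best_col
```

Because the split uses the same `test_size` and `random_state` for every candidate column, each model is evaluated on the same test rows, which keeps the comparison between columns fair.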

For example, assuming your function bestForPredict() is in the file p37.py, then for the file lebron.csv, the code:

df = pd.read_csv('lebron.csv')
columns = ['minute', 'action_type', 'shot_type', 'opponent']
acc, col_name = p37.bestForPredict(df, columns)
print(f'The highest accuracy, {acc}, was obtained by including column, {col_name}.')

would print:

The highest accuracy, 0.725, was obtained by including column, action_type.

Another example with the same DataFrame:

columns = ['minute', 'opponent']
acc, col_name = p37.bestForPredict(df, columns, test_size = 100, random_state = 17)
print(f'The highest accuracy, {acc}, was obtained by including column, {col_name}.')

would print:

The highest accuracy, 0.6, was obtained by including column, opponent.

Note: you should submit a file with only the standard comments at the top, this function, and any helper functions you have written. The grading scripts will then import the file for testing.

Hints:

Some of the code in the textbook is deprecated. In particular, as_matrix and orient='row' do not work in newer versions. Omit the former and replace the latter with:
```
rows = df[[x_col, c]].to_dict('records')
onehot = DictVectorizer(sparse=False).fit(rows)
```
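To see what the hint produces, here is a small self-contained illustration. The three-row DataFrame `toy` is made up for this example and is not taken from lebron.csv. It shows that `to_dict('records')` yields one dictionary per row, and that `DictVectorizer` leaves the numeric column as is while expanding the string column into one indicator column per category.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Tiny made-up frame, for illustration only (not real shot data).
toy = pd.DataFrame({
    "shot_distance": [2, 25, 14],
    "shot_type": ["2PT", "3PT", "2PT"],
})

# One dict per row, e.g. {'shot_distance': 2, 'shot_type': '2PT'}.
rows = toy[["shot_distance", "shot_type"]].to_dict("records")

# Fit learns the feature names; transform builds the numeric matrix.
onehot = DictVectorizer(sparse=False).fit(rows)
X = onehot.transform(rows)

print(onehot.feature_names_)
# ['shot_distance', 'shot_type=2PT', 'shot_type=3PT']
print(X)  # one row per shot: distance, then the two indicator columns
```

The resulting matrix `X` can be passed directly to `train_test_split` and then to a model's `fit` method, with no call to `as_matrix` needed.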