
Question

Posted on Jun 21, 2024

AA1.csv

Download the csv file. Write a script in Python (you can use any IDE) to load the csv file AA1 into a Pandas data frame; name the frame df_firstname (where firstname is your first name). In your script carry out the following, and then answer the last set of questions in point 5, Analysis, in the html box:

(Note: Once your script is ready, please attach the Python script and the required screenshot(s) to this question by clicking the "Add file" button, and then follow the notes to upload your script.)

Explore the data

Print the names of columns
Print the types of columns
Print the unique values in each column.
Print the statistics (count, min, mean, standard deviation, 1st quartile, median, 3rd quartile, max) of all the numeric columns (use one command).
Print the first four records.
Print a summary of all missing values in all columns (use one command).
Print the total number (count) of each unique value in the following categorical columns:
1. Model
2. Color
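
The exploration steps above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the small data frame is a hypothetical stand-in for `pd.read_csv("AA1.csv")`, and only the column names `Model`, `Color`, `millage` and `value` come from the assignment text; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for: df_firstname = pd.read_csv("AA1.csv")
# The real file's columns and values may differ.
df_firstname = pd.DataFrame({
    "Model": ["A", "B", "A", "C", "B", "A"],
    "Color": ["red", "blue", "red", None, "blue", "green"],
    "millage": [12000.0, np.nan, 30500.0, 45000.0, 8000.0, np.nan],
    "value": [21000, 18500, 15000, 9000, 23000, 17000],
})

print(df_firstname.columns.tolist())   # names of columns
print(df_firstname.dtypes)             # types of columns

for col in df_firstname.columns:       # unique values in each column
    print(col, df_firstname[col].unique())

# One command: count, mean, std, min, quartiles and max of numeric columns
print(df_firstname.describe())

print(df_firstname.head(4))            # first four records

# One command: number of missing values per column
missing_summary = df_firstname.isnull().sum()
print(missing_summary)

# Count of each unique value in the categorical columns
model_counts = df_firstname["Model"].value_counts()
color_counts = df_firstname["Color"].value_counts()
print(model_counts)
print(color_counts)
```

Note that `value_counts()` leaves out missing values by default; passing `dropna=False` would include them in the counts.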
Visualize the data
1. Plot a histogram for the millage; use 10 bins, name the x and y axes appropriately, and give the plot the title "firstname_millage".
2. Create a scatterplot showing "millage" versus "value"; name the x and y axes appropriately, and give the plot the title "firstname_millage_scatter".
3. Plot a "scatter matrix" showing the relationship between all columns of the dataset; on the diagonal of the matrix, plot the kernel density function.
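
One possible way to produce the three plots is sketched below, assuming matplotlib and `pandas.plotting.scatter_matrix` are acceptable tools for the assignment. The data is randomly generated as a stand-in for the real file, and the axis labels are only suggestions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical stand-in data; in the assignment this comes from AA1.csv
rng = np.random.default_rng(0)
df_firstname = pd.DataFrame({
    "millage": rng.uniform(5000, 90000, 60),
    "value": rng.uniform(5000, 30000, 60),
})

# 1. Histogram of millage with 10 bins
fig1, ax1 = plt.subplots()
ax1.hist(df_firstname["millage"].dropna(), bins=10)
ax1.set_xlabel("millage")
ax1.set_ylabel("frequency")
ax1.set_title("firstname_millage")

# 2. Scatterplot of millage versus value
fig2, ax2 = plt.subplots()
ax2.scatter(df_firstname["millage"], df_firstname["value"])
ax2.set_xlabel("millage")
ax2.set_ylabel("value")
ax2.set_title("firstname_millage_scatter")

# 3. Scatter matrix with kernel density estimates on the diagonal
#    (diagonal="kde" relies on SciPy being installed)
axes = scatter_matrix(df_firstname, diagonal="kde", figsize=(6, 6))

fig1.savefig("firstname_millage.png")
fig2.savefig("firstname_millage_scatter.png")
```

`scatter_matrix` only plots numeric columns, so categorical columns such as Model and Color would not appear until they are encoded.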
Pre-process the data
1. Properly remove (drop) the column with the most missing values. (hint: make sure you review and set the right arguments)
2. Replace the missing values in the "millage" column with the mean of that column.
3. Check that there are no missing values.
4. Convert all the categorical columns into numeric values and drop/delete the original columns. (hint: use get dummies)
5. Make sure your new data frame is completely numeric; name it df_firstname_numeric.
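
The pre-processing steps might look something like the sketch below. The `notes` column is a hypothetical sparse column added purely so there is something to drop; which column actually has the most missing values in AA1.csv is not stated in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded AA1.csv
df_firstname = pd.DataFrame({
    "Model": ["A", "B", "A", "C", "B", "A"],
    "Color": ["red", "blue", "red", "green", "blue", "green"],
    "millage": [12000.0, np.nan, 30500.0, 45000.0, 8000.0, np.nan],
    "notes": [np.nan, np.nan, "x", np.nan, np.nan, np.nan],  # made-up sparse column
    "value": [21000, 18500, 15000, 9000, 23000, 17000],
})

# 1. Drop the column with the most missing values
worst_col = df_firstname.isnull().sum().idxmax()
df_firstname = df_firstname.drop(columns=[worst_col])

# 2. Replace missing millage values with the column mean
millage_mean = df_firstname["millage"].mean()
df_firstname["millage"] = df_firstname["millage"].fillna(millage_mean)

# 3. Check that no missing values remain
total_missing = df_firstname.isnull().sum().sum()
print(total_missing)

# 4. One-hot encode the non-numeric columns; passing columns= makes
#    get_dummies drop the originals automatically
categorical_cols = df_firstname.select_dtypes(exclude="number").columns.tolist()
df_firstname_numeric = pd.get_dummies(
    df_firstname, columns=categorical_cols, dtype=int
)

# 5. Confirm every column is numeric
print(df_firstname_numeric.dtypes)
```

Using `drop(columns=...)` and assigning the result back avoids the common mistake of calling `drop` without `axis=1` or without keeping the returned frame.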
Build a model and validate
Build a predictive model, namely a tree classifier using sklearn, taking into consideration the following:
1. Name the model dt_firstname, where firstname is your first name.
2. Split your data: 70% for training and 30% for testing.
3. Use entropy for the decisions.
4. Maximum depth of the tree is 6.
5. Split a node only when it reaches 15 observations.
6. For validation, use 8-fold cross-validation and print the mean accuracy of the validation.
7. Use the model you created using the training data to test the 30% testing data, and print:
  1. The accuracy of the test
  2. The confusion matrix
8. Take a screenshot illustrating the accuracy of the test and the confusion matrix; name it firstname_screenshotAA1.
9. Prune the tree: vary the maximum depth of your predictive model from 1 to 8 and print the mean k-fold accuracy of each run on the training data.
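
A rough sketch of the modelling steps with scikit-learn follows. Several things here are assumptions rather than facts from the question: the data is synthetic, the label column name `target` is hypothetical (the question does not say which column is the class label), and "15 observations per node" is interpreted as `min_samples_split=15`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for df_firstname_numeric; "target" is a made-up label
rng = np.random.default_rng(42)
n = 400
df_firstname_numeric = pd.DataFrame({
    "millage": rng.uniform(5000, 90000, n),
    "value": rng.uniform(5000, 30000, n),
    "Model_A": rng.integers(0, 2, n),
})
df_firstname_numeric["target"] = (df_firstname_numeric["millage"] > 45000).astype(int)

X = df_firstname_numeric.drop(columns=["target"])
y = df_firstname_numeric["target"]

# 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Entropy criterion, depth 6, split only nodes with at least 15 samples
dt_firstname = DecisionTreeClassifier(
    criterion="entropy", max_depth=6, min_samples_split=15, random_state=42
)

# 8-fold cross-validation on the training data
cv_scores = cross_val_score(dt_firstname, X_train, y_train, cv=8, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Fit on the training data, evaluate on the held-out 30%
dt_firstname.fit(X_train, y_train)
y_pred = dt_firstname.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Test accuracy:", test_accuracy)
print(cm)

# Pruning: vary max_depth from 1 to 8, report mean 8-fold accuracy
depth_scores = {}
for depth in range(1, 9):
    model = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, min_samples_split=15, random_state=42
    )
    depth_scores[depth] = cross_val_score(model, X_train, y_train, cv=8).mean()
    print(depth, depth_scores[depth])
```

When choosing a depth from the loop output, a common heuristic is to pick the smallest depth after which the mean accuracy stops improving noticeably, since deeper trees add complexity without a matching gain.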

5. Analysis

In the box below, answer the following three questions; number your responses based on the question numbers:

What are the key highlights of the original dataset you loaded?
Based on the results of pruning the tree, recommend the maximum depth and explain why you are recommending it.
Looking at the confusion matrix you generated in point 4.7, what are the key findings? (Hint: think in terms of precision, recall, true negatives, ...)
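
For the confusion-matrix question, the hinted metrics can be read off a binary matrix as in the sketch below. The numbers are invented for illustration, and the layout assumed is scikit-learn's (rows are actual classes, columns are predicted classes); if the real label has more than two classes, these quantities would be computed per class instead.

```python
import numpy as np

# Hypothetical binary confusion matrix in sklearn's layout:
# [[TN, FP],
#  [FN, TP]]
cm = np.array([[50, 10],
               [5, 35]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # of predicted positives, how many are correct
recall = tp / (tp + fn)           # of actual positives, how many were found
specificity = tn / (tn + fp)      # of actual negatives, how many were found
accuracy = (tp + tn) / cm.sum()   # overall share of correct predictions

print(precision, recall, specificity, accuracy)
```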