Question
AA1.csv
Download the CSV file. Write a script in Python (you can use any IDE) to load the CSV file AA1 into a pandas data frame; name the frame df_firstname (where firstname is your first name). In your script, carry out the following, and then answer the last set of questions in point 5, Analysis, in the HTML box:
(Note: Once your script is ready, please attach the Python script and the required screenshot(s) to this question by clicking the "Add file" button, and then follow the notes to upload your script.)
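A minimal sketch of the loading step follows. The question does not show the contents of AA1.csv, so the three rows below are an invented stand-in that only reuses the column names mentioned later in the question ("Model", "Color", "millage", "value"). The helper name `load_frame` is hypothetical.

```python
import io

import pandas as pd

# Invented stand-in for AA1.csv; the real file's rows and full column list
# are not shown in the question.
SAMPLE_CSV = """Model,Color,millage,value
A,red,12000,9500
B,blue,,6200
A,white,30000,7100
"""


def load_frame(source):
    """Read a CSV path or file-like object into a pandas DataFrame."""
    return pd.read_csv(source)


# With the downloaded file in the working directory this would be:
#   df_firstname = load_frame("AA1.csv")
df_firstname = load_frame(io.StringIO(SAMPLE_CSV))
print(df_firstname.shape)
```

Empty fields in the CSV (such as the missing millage in the second row) are read as NaN, which is what the later missing-value steps rely on.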
1. Explore the data
- Print the names of columns
- Print the types of columns
- Print the unique values in each column.
- Print the statistics (count, min, mean, standard deviation, 1st quartile, median, 3rd quartile, max) of all the numeric columns (use one command).
- Print the first four records.
- Print a summary of all missing values in all columns (use one command).
- Print the total number (count) of each unique value in the following categorical columns:
- Model
- Color
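The exploration steps above could be sketched as follows. The small data frame is invented so the snippet runs on its own; with the real data, `df_firstname` would come from `pd.read_csv`. Only the column names "Model", "Color", "millage" and "value" are taken from the question.

```python
import numpy as np
import pandas as pd

# Invented sample rows; replace with the frame loaded from AA1.csv.
df_firstname = pd.DataFrame({
    "Model": ["A", "B", "A", "C", "B", "A"],
    "Color": ["red", "blue", "red", "white", "blue", "white"],
    "millage": [12000.0, 45000.0, np.nan, 30000.0, 8000.0, 61000.0],
    "value": [9500, 6200, 7100, 6900, 11000, 4300],
})

# Column names and column types
print(df_firstname.columns.tolist())
print(df_firstname.dtypes)

# Unique values in each column
for col in df_firstname.columns:
    print(col, df_firstname[col].unique())

# count, mean, std, min, 25%, 50%, 75%, max of numeric columns in one command
summary = df_firstname.describe()
print(summary)

# First four records
print(df_firstname.head(4))

# Missing values per column in one command
missing = df_firstname.isnull().sum()
print(missing)

# Count of each unique value in the categorical columns
model_counts = df_firstname["Model"].value_counts()
color_counts = df_firstname["Color"].value_counts()
print(model_counts)
print(color_counts)
```

`describe()` reports the quartiles as 25%, 50% and 75%, so the median appears under the 50% row.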
2. Visualize the data
- Plot a histogram of millage using 10 bins; label the x and y axes appropriately and give the plot the title "firstname_millage".
- Create a scatterplot showing "millage" versus "value"; label the x and y axes appropriately and give the plot the title "firstname_millage_scatter".
- Plot a "scatter matrix" showing the relationships between all columns of the dataset; on the diagonal of the matrix, plot the kernel density function.
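One possible way to produce the three plots with matplotlib and `pandas.plotting.scatter_matrix` is sketched below. The data are randomly generated placeholders, and placing millage on the x-axis of the scatterplot is an assumption about how "millage versus value" is meant to be read. The kernel density diagonal requires SciPy to be installed.

```python
import matplotlib

# Agg renders off-screen so the script also runs without a display;
# remove this line and call plt.show() to open plot windows instead.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Randomly generated placeholder data, not the real AA1.csv values.
rng = np.random.default_rng(0)
df_firstname = pd.DataFrame({"millage": rng.uniform(5000, 90000, 50)})
df_firstname["value"] = 12000 - 0.1 * df_firstname["millage"] + rng.normal(0, 500, 50)

# Histogram of millage with 10 bins
fig_hist, ax_hist = plt.subplots()
ax_hist.hist(df_firstname["millage"].dropna(), bins=10)
ax_hist.set_xlabel("millage")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("firstname_millage")

# Scatterplot of millage against value (axis assignment is an assumption)
fig_scatter, ax_scatter = plt.subplots()
ax_scatter.scatter(df_firstname["millage"], df_firstname["value"])
ax_scatter.set_xlabel("millage")
ax_scatter.set_ylabel("value")
ax_scatter.set_title("firstname_millage_scatter")

# Scatter matrix with kernel density estimates on the diagonal
axes = scatter_matrix(df_firstname, diagonal="kde", figsize=(8, 8))
```

`scatter_matrix` only draws numeric columns, so categorical columns such as "Model" and "Color" would not appear until they are encoded.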
3. Pre-process the data
- Properly remove (drop) the column with the most missing values. (Hint: make sure you review and set the right arguments.)
- Replace the missing values in the "millage" column with the mean of that column.
- Check that there are no missing values.
- Convert all the categorical columns into numeric values and drop/delete the original columns. (Hint: use get_dummies.)
- Make sure your new data frame is completely numeric; name it df_firstname_numeric.
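The pre-processing steps might look like the sketch below. The rows are invented, and the sparse column "notes" is a hypothetical placeholder for whichever column of the real file has the most missing values. Treating every non-numeric column as categorical is also an assumption.

```python
import numpy as np
import pandas as pd

# Invented rows; "notes" is a hypothetical mostly-empty column.
df_firstname = pd.DataFrame({
    "Model": ["A", "B", "A", "C"],
    "Color": ["red", "blue", "white", "red"],
    "millage": [12000.0, np.nan, 30000.0, 8000.0],
    "notes": [np.nan, np.nan, np.nan, "ok"],
    "value": [9500, 6200, 7100, 11000],
})

# Find and drop the column with the most missing values.
# drop() returns a new frame unless inplace=True is set, so reassign it.
most_missing = df_firstname.isnull().sum().idxmax()
df_firstname = df_firstname.drop(columns=[most_missing])

# Fill missing millage values with the column mean
millage_mean = df_firstname["millage"].mean()
df_firstname["millage"] = df_firstname["millage"].fillna(millage_mean)

# Check that no missing values remain
total_missing = df_firstname.isnull().sum().sum()
print("Missing values left:", total_missing)

# One-hot encode the non-numeric columns; get_dummies removes the originals
categorical_cols = df_firstname.select_dtypes(exclude="number").columns.tolist()
df_firstname_numeric = pd.get_dummies(df_firstname, columns=categorical_cols, dtype=int)
print(df_firstname_numeric.dtypes)
```

Passing `columns=` to `get_dummies` replaces each listed column with its indicator columns, which covers the "drop/delete the original columns" requirement in one call.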
4. Build a model and validate
- Build a predictive model, namely a decision tree classifier using sklearn, taking the following into consideration:
- Name the model dt_firstname, where firstname is your first name.
- Split your data 70% for training and 30% for testing
- Use entropy for the decisions
- Maximum depth of the tree is 6
- Split the node only when you reach 15 observations per node.
- For validation, use 8-fold cross-validation and print the mean accuracy of the validation.
- Use the model you built on the training data to predict the 30% testing data, and print:
- The accuracy of the test
- The confusion matrix
- Take a screenshot illustrating the accuracy of the test and the confusion matrix; name it firstname_screenshotAA1.
- Prune the tree: vary the maximum depth of your predictive model from 1 to 8 and print the mean k-fold accuracy of each run on the training data.
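The modelling and pruning steps above could be sketched with scikit-learn as follows. The question does not say which column is the class label, so the binary `target` here is hypothetical, and the feature values are randomly generated. Mapping "split the node only when you reach 15 observations" to `min_samples_split=15` is one reading of the requirement.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Randomly generated stand-in for df_firstname_numeric.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "millage": rng.uniform(5000, 90000, n),
    "value": rng.uniform(2000, 15000, n),
    "Color_red": rng.integers(0, 2, n),
})
# Hypothetical binary target with about 10% label noise
noise = rng.random(n) < 0.1
y = ((X["value"] > 8000) ^ noise).astype(int)

# 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Entropy criterion, depth at most 6, at least 15 samples to split a node
dt_firstname = DecisionTreeClassifier(
    criterion="entropy", max_depth=6, min_samples_split=15, random_state=42
)

# 8-fold cross-validation on the training data
cv_scores = cross_val_score(dt_firstname, X_train, y_train, cv=8, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Fit on the training data, then evaluate on the held-out 30%
dt_firstname.fit(X_train, y_train)
y_pred = dt_firstname.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Test accuracy:", test_accuracy)
print(cm)

# Pruning: mean 8-fold accuracy on the training data for depths 1 to 8
depth_scores = {}
for depth in range(1, 9):
    candidate = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, min_samples_split=15, random_state=42
    )
    scores = cross_val_score(candidate, X_train, y_train, cv=8, scoring="accuracy")
    depth_scores[depth] = scores.mean()
    print(depth, round(depth_scores[depth], 4))
```

When several depths score about the same, a common choice is the shallowest of them, since a smaller tree is easier to interpret and less likely to overfit.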
5. Analysis
In the box below, answer the following three questions, numbering your responses to match the question numbers:
- What are the key highlights of the original dataset you loaded?
- Based on the results of pruning the tree, recommend a maximum depth and explain why you recommend it.
- Looking at the confusion matrix you generated in point 4.7, what are the key findings? (Hint: think in terms of precision, recall, true negatives, and so on.)
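For the confusion-matrix question, the metrics named in the hint can be computed directly from the four cell counts. The sketch below assumes a binary target and uses invented counts; scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, so `ravel()` yields the cells in the order TN, FP, FN, TP.

```python
import numpy as np

# Invented counts for illustration; rows = actual, columns = predicted.
cm = np.array([[50, 10],
               [5, 35]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that were found
specificity = tn / (tn + fp)      # share of actual negatives that were found
accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```

Comparing precision with recall shows whether the model errs more through false positives or through false negatives, which is usually the main finding to report.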