Question

1 Approved Answer

Posted on Sep 12, 2024

Please help code the following in Python. Thank you.

All code must be executable from scratch. Code should be written properly (i.e. put code in functions as needed, declare variables as needed, and don't repeat yourself.) Only use functions from the packages loaded in the first block of code below for this problem set.

Make sure all plots contain appropriately labelled axes and are easy to read and interpret.

Link to the Iris Flower Dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

Understanding the problem

For this problem set we will be working with a popular data set used in biological modeling called the Iris flower dataset. This dataset is typically used for classification, where we take different measurements from distinct flower species and try to correctly categorize the species from those measurements. A variety of machine learning methods may be used on this dataset, but for this problem set we will focus on multivariate linear regression. Though linear regression is not explicitly designed for predicting species (classification), the linear regression framework is flexible enough to do it.

Let us first import some packages so we can easily work with the data. Fortunately, the data can be downloaded directly into our notebooks from the scikit-learn Python package.

    # Importing dataset from scikit-learn and other useful packages:
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris
    import matplotlib.pyplot as plt
    import numpy as np

    # we will fix a random seed for reproducibility:
    seed = 11
    np.random.seed(seed)

The next cell will load in the data. Here, I have written a vector that will tell you each of the measurements made from each flower species. Those measurements are, in the following order:

- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

Measurements of the above features are stored in the variable x_dat. This will be the input data or predictor variable. The variable y_dat, also declared below, will be the output data or response variable.
These data will contain one of 3 integers corresponding to the three iris species in the dataset, specifically:

- Iris Setosa
- Iris Versicolour
- Iris Virginica

    iris_data = load_iris()
    feature_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    x_dat = iris_data['data']
    y_dat = iris_data['target']

Part a: Make the appropriate "design matrix" for multivariate linear regression for this question by inserting a column of ones at the beginning of x_dat.

    #### Write your code below (don't forget to label any graphs)

Part b: Split the data into training and testing pieces. You may use the included scikit-learn train_test_split function for this step. Please split using an 80% train / 20% test split. Use the training data to solve for the estimated regression weights using the formula $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. Be mindful of matrix dimensions!

    np.random.seed(seed)
    #### Write your code below (don't forget to label any graphs).
    #### I put pass in the loop as a placeholder,
    #### please remove it when you write your code
    kfolds = 5
    for _ in range(kfolds):
        pass

Part c: Calculate the mean squared error on the held-out data using the weights from above.

    np.random.seed(seed)
    #### Write your code below (don't forget to label any graphs)

Part d: Now, organize the above code in a sensible, functional way so you can perform many-fold cross-validation. Calculate the average test mean squared error on held-out data over 5-fold cross-validation. Do this several times: once for when we use one feature (or observation) of X, another for 2, another for 3, and another for 4 features. What do you notice about the cross-validated mean squared error as we include more and more X observations?

Write your answer below.

Answer here:

    #### Write your code below (don't forget to label any graphs).
    #### I put pass in the loop as a placeholder,
    #### please remove it when you write your code
    n_feats = [1, 2, 3, 4]
    for _ in n_feats:
        pass

Part e: Finally, because this dataset should really be using a classification model instead of a regression model, let's slightly modify our above example. Instead of calculating mean squared error on the held-out data, use np.round to make an estimated species from the predicted y_hat. Based on this nearest integer of y_hat, identify the number of errors made on the 20% held-out data. Create the same plot as above, still averaged over 5-fold cross-validation, that shows the average number of errors made in species prediction on held-out data as you increase the number of features in X.

    #### Write your code below (don't forget to label any graphs)

Challenge: Generate a similar plot as you did in part 2e, where each datapoint is a location in the 2-dimensional scatter plot. This time, however, indicate the classification performance on the plot (say, a '+' indicating correct classification, and a '-' indicating incorrect classification). Do you notice regions of the feature space where errors are more likely to occur? Write a sentence interpreting your answer.

    #### Write your code below (don't forget to label any graphs)

Answer here
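
For anyone working through this, here is a rough sketch of how the numerical pieces of parts a through e could fit together. It is not the course's reference solution. All function names (add_intercept, fit_ols, count_errors, cross_validate, run_iris_demo) are made up for illustration, and the folds are built by hand with NumPy rather than with train_test_split, which is an assumption about what the assignment allows. Each of the 5 folds holds out 20% of the rows, matching the 80/20 split. Plotting is left out; the printed averages are what you would put on the y-axis against the number of features.

```python
import numpy as np


def add_intercept(x):
    """Part a: prepend a column of ones to x to form the design matrix."""
    return np.column_stack([np.ones(x.shape[0]), x])


def fit_ols(x_design, y):
    """Part b: solve the normal equations (X^T X) beta = X^T y for beta."""
    # np.linalg.solve gives the same weights as (X^T X)^{-1} X^T y when
    # X^T X is invertible, without forming the inverse explicitly.
    return np.linalg.solve(x_design.T @ x_design, x_design.T @ y)


def mean_squared_error(y_true, y_pred):
    """Part c: average squared difference between truth and prediction."""
    return np.mean((y_true - y_pred) ** 2)


def count_errors(y_true, y_pred):
    """Part e: number of wrong labels after rounding to the nearest integer."""
    return int(np.sum(np.round(y_pred) != y_true))


def cross_validate(x, y, kfolds=5, seed=11):
    """Part d: return (mean test MSE, mean test error count) over k folds."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once, then cut them into k disjoint test folds.
    folds = np.array_split(rng.permutation(len(y)), kfolds)
    x_design = add_intercept(x)
    mses, errors = [], []
    for k in range(kfolds):
        test_idx = folds[k]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        beta = fit_ols(x_design[train_idx], y[train_idx])
        y_hat = x_design[test_idx] @ beta
        mses.append(mean_squared_error(y[test_idx], y_hat))
        errors.append(count_errors(y[test_idx], y_hat))
    return float(np.mean(mses)), float(np.mean(errors))


def run_iris_demo():
    """Print cross-validated scores using the first 1, 2, 3, 4 iris features."""
    from sklearn.datasets import load_iris  # imported here so the helpers need only NumPy
    iris_data = load_iris()
    x_dat, y_dat = iris_data['data'], iris_data['target']
    for n in [1, 2, 3, 4]:
        cv_mse, cv_err = cross_validate(x_dat[:, :n], y_dat)
        print(f"{n} feature(s): MSE = {cv_mse:.3f}, errors = {cv_err:.1f}")
```

Calling run_iris_demo() prints one line per feature count. Taking the first n columns is just one possible choice of which features to include; the assignment does not say which subsets to use.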