Digit Classification with KNN and Naive Bayes

# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Import a bunch of libraries.
import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.pipeline import Pipeline
# fetch_mldata was removed in scikit-learn 0.22; fetch_openml is its replacement.
from sklearn.datasets import fetch_mldata
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

# Set the randomizer seed so results are the same each time.
np.random.seed(0)

# Load the digit data either from mldata.org, or once downloaded to data_home, from disk.
# The data is about 53MB so this cell should take a while the first time you run it.
mnist = fetch_mldata('MNIST original', data_home='~/datasets/mnist')
X, Y = mnist.data, mnist.target

# Rescale grayscale values to [0,1].
X = X / 255.0

# Shuffle the input: create a random permutation of the integers between 0 and the
# number of data points and apply this permutation to X and Y.
# NOTE: Each time you run this cell, you'll re-shuffle the data, resulting in a different ordering.
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

print('data shape: ', X.shape)
print('label shape:', Y.shape)

# Set some variables to hold test, dev, and training data.
test_data, test_labels = X[61000:], Y[61000:]
dev_data, dev_labels = X[60000:61000], Y[60000:61000]
train_data, train_labels = X[:60000], Y[:60000]
mini_train_data, mini_train_labels = X[:1000], Y[:1000]

(1) Create a 10x10 grid to visualize 10 examples of each digit. Python hints:
- plt.rc() for setting the colormap, for example to black and white
- plt.subplot() for creating subplots
- plt.imshow() for rendering a matrix
- np.array.reshape() for reshaping a 1D feature vector into a 2D matrix (for rendering)

def P1(num_examples=10):

(2) Evaluate a K-Nearest-Neighbors model with k in [1, 3, 5, 7, 9] using the mini training set. Report accuracy on the dev set. For k=1, show precision, recall, and F1 for each label. Which is the most difficult digit?
- KNeighborsClassifier() for fitting and predicting
- classification_report() for producing precision, recall, F1 results

(3) Using k=1, report dev set accuracy for the training set sizes below. Also, measure the amount of time needed for prediction with each training size.
- time.time() gives a wall clock value you can use for timing operations

(4) Fit a regression model that predicts accuracy from training size. What does it predict for n=60000? What's wrong with using regression here? Can you apply a transformation that makes the predictions more reasonable?
- Remember that the sklearn fit() functions take an input matrix X and output vector Y. So each input example in X is a vector, even if it contains only a single value.

(5) Fit a 1-NN and output a confusion matrix for the dev data. Use the confusion matrix to identify the most confused pair of digits, and display a few example mistakes.
- confusion_matrix() produces a confusion matrix

(6) A common image processing technique is to smooth an image by blurring. The idea is that the value of a particular pixel is estimated as the weighted combination of the original value and the values around it. Typically, the blurring is Gaussian; that is, the weight of a pixel's influence is determined by a Gaussian function over the distance to the relevant pixel.

Implement a simplified Gaussian blur by just using the 8 neighboring pixels: the smoothed value of a pixel is a weighted combination of the original value and the 8 neighboring values. Try applying your blur filter in 3 ways:
- preprocess the training data but not the dev data
- preprocess the dev data but not the training data
- preprocess both training and dev data

Note that there are Gaussian blur filters available, for example in scipy.ndimage.filters. You're welcome to experiment with those, but you are likely to get the best results with the simplified version I described above.

(7) Fit a Naive Bayes classifier and report accuracy on the dev data. Remember that Naive Bayes estimates P(feature|label). While sklearn can handle real-valued features, let's start by mapping the pixel values to either 0 or 1. You can do this as a preprocessing step, or with the binarize argument. With binary-valued features, you can use BernoulliNB. Next try mapping the pixel values to 0, 1, or 2, representing white, grey, or black. This mapping requires MultinomialNB. Does the multi-class version improve the results? Why or why not?

(8) Use GridSearchCV to perform a search over values of alpha (the Laplace smoothing parameter) in a Bernoulli NB model. What is the best value for alpha? What is the accuracy when alpha=0? Is this what you'd expect?
- Note that GridSearchCV partitions the training data, so the results will be a bit different than if you used the dev data for evaluation.

(9) Try training a model using GaussianNB, which is intended for real-valued features, and evaluate on the dev data. You'll notice that it doesn't work so well. Try to diagnose the problem. You should be able to find a simple fix that returns the accuracy to around the same rate as BernoulliNB. Explain your solution.
Hint: examine the parameters estimated by the fit() method, theta_ and sigma_.

(10) Because Naive Bayes is a generative model, we can use the trained model to generate digits. Train a BernoulliNB model and then generate a 10x20 grid with 20 examples of each digit. Because you're using a Bernoulli model, each pixel output will be either 0 or 1. How do the generated digits compare to the training digits?
- You can use np.random.rand() to generate random numbers from a uniform distribution
- The estimated probability of each pixel is stored in feature_log_prob_. You'll need to use np.exp() to convert a log probability back to a probability.

(11) Remember that a strongly calibrated classifier is roughly 90% accurate when the posterior probability of the predicted class is 0.9. A weakly calibrated classifier is more accurate when the posterior is 90% than when it is 80%. A poorly calibrated classifier has no positive correlation between posterior and accuracy.

Train a BernoulliNB model with a reasonable alpha value. For each posterior bucket (think of a bin in a histogram), you want to estimate the classifier's accuracy. So for each prediction, find the bucket the maximum posterior belongs to and update the "correct" and "total" counters. How would you characterize the calibration for the Naive Bayes model?

(12) EXTRA CREDIT. Try designing extra features to see if you can improve the performance of Naive Bayes on the dev set. Here are a few ideas to get you started:
- Try summing the pixel values in each row and each column.
- Try counting the number of enclosed regions; 8 usually has 2 enclosed regions, 9 usually has 1, and 7 usually has 0.
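
The simplified blur described in (6) could be sketched roughly as follows. This is only one possible reading of the task, not the assignment's reference solution: the function names `blur_image` and `blur_dataset`, the weights (4 for the center pixel, 1 for each neighbor), the edge-replicating border handling, and the 28x28 image size are all assumptions made for illustration.

```python
import numpy as np

def blur_image(image, center_weight=4.0, neighbor_weight=1.0):
    """Smooth a 2D image with a 3x3 weighted average.

    The weights are hypothetical; the assignment does not fix them.
    Border pixels reuse the nearest edge value (an assumption).
    """
    padded = np.pad(image, 1, mode="edge")
    rows, cols = image.shape
    # Start from the weighted original value of every pixel.
    total = center_weight * image.astype(float)
    # Add the 8 shifted copies of the image, one per neighbor offset.
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            total += neighbor_weight * padded[1 + dr:1 + dr + rows,
                                              1 + dc:1 + dc + cols]
    # Normalize so the weights sum to 1.
    return total / (center_weight + 8 * neighbor_weight)

def blur_dataset(data, side=28):
    """Blur every row of data, where each row is a flattened side x side image."""
    images = data.reshape(-1, side, side)
    blurred = np.stack([blur_image(img) for img in images])
    return blurred.reshape(data.shape)
```

Under these assumptions, `blur_dataset(mini_train_data)` and `blur_dataset(dev_data)` would produce the preprocessed arrays for the three train/dev combinations listed in the task.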

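
The bucket counting described in (11) might look something like the sketch below. It is an illustration under stated assumptions rather than the expected answer: the helper name `calibration_by_bucket` and the example bucket boundaries are hypothetical, and the posteriors are assumed to come from something like the model's `predict_proba()` output.

```python
import numpy as np

def calibration_by_bucket(posteriors, predictions, labels, buckets):
    """Count correct and total predictions per posterior bucket.

    posteriors:  (n_samples, n_classes) array of class probabilities
    predictions: (n_samples,) predicted labels
    labels:      (n_samples,) true labels
    buckets:     increasing upper bounds; bucket i covers
                 (buckets[i-1], buckets[i]] (hypothetical layout)
    """
    max_post = posteriors.max(axis=1)
    # Index of the first bucket whose upper bound is >= the max posterior.
    idx = np.searchsorted(buckets, max_post, side="left")
    # Guard against values slightly above the last bound.
    idx = np.minimum(idx, len(buckets) - 1)
    correct = np.zeros(len(buckets), dtype=int)
    total = np.zeros(len(buckets), dtype=int)
    # np.add.at accumulates correctly when bucket indices repeat.
    np.add.at(total, idx, 1)
    np.add.at(correct, idx, (predictions == labels).astype(int))
    return correct, total

# Tiny made-up example with two classes and three buckets.
buckets = np.array([0.5, 0.9, 1.0])
posteriors = np.array([[0.6, 0.4], [0.95, 0.05], [0.5, 0.5], [0.1, 0.9]])
predictions = posteriors.argmax(axis=1)
labels = np.array([0, 1, 0, 1])
correct, total = calibration_by_bucket(posteriors, predictions, labels, buckets)
```

Dividing `correct` by `total` per bucket (skipping empty buckets) gives the accuracy estimate the task asks about.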

