Question
This question uses the MNIST 784 dataset which is readily available online.
TO DO: Modify the class above to implement a KNN classifier. There are three methods that you need to complete:
predict: Given a matrix of validation data with m examples, each with p features, return a length-m vector of predicted labels by calling the classify function on each example.
classify: Given a single query example with p features, return its predicted class label as an integer using KNN by calling the majority function.
majority: Given an array of indices into the training set corresponding to the K training examples that are nearest to the query point, return the majority label as an integer. If there is a tie for the majority label using K nearest neighbors, reduce K by 1 and try again. Continue reducing K until there is a winning label.
Notes:
Don't even think about implementing nearest-neighbor search or any distance metrics yourself. Instead, go read the documentation for Scikit-Learn's BallTree object. You will find that its query method can do most of the heavy lifting for you.
Do not use Scikit-Learn's KNeighborsClassifier in this problem. We're implementing this ourselves.
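To see what the notes are pointing at, here is a minimal sketch of how `BallTree.query` behaves on a toy 1-D array (toy data, not MNIST):

```python
# Minimal illustration of sklearn's BallTree.query (toy data, not MNIST)
import numpy as np
from sklearn.neighbors import BallTree

x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
tree = BallTree(x_train)

# query takes a 2D array of query points and k, and returns two arrays:
# the distances and the indices of the k nearest training points,
# sorted from nearest to farthest
dist, ind = tree.query(np.array([[1.2]]), k=2)
print(ind[0])   # indices of the two nearest training rows
print(dist[0])  # their distances to the query point
```

Because the returned indices are already sorted by distance, "reduce K by 1" in the tie-breaking rule amounts to dropping the last (farthest) index.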
## Given code
```python
import math
import pickle
import gzip
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
```
```python
# importing all the required libraries
from math import exp
import numpy as np
import pandas as pd
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
```
```python
# This cell sets up the MNIST dataset
class MNIST_import:
    """
    Sets up MNIST dataset from OpenML
    """
    def __init__(self):
        df = pd.read_csv("data/mnist_784.csv")

        # Create arrays for the features and the response variable
        # and store them for use later
        y = df['class'].values
        X = df.drop('class', axis=1).values

        # Convert the labels to numeric labels
        y = np.array(pd.to_numeric(y))

        # Create training and validation sets
        self.train_x, self.train_y = X[:5000, :], y[:5000]
        self.val_x, self.val_y = X[5000:6000, :], y[5000:6000]

data = MNIST_import()
```
```python
class KNN:
    """
    Class to store data for classification problems
    """
    def __init__(self, x_train, y_train, K=5):
        """
        Creates a kNN instance

        :param x_train: numpy array with shape (n_rows, p) - e.g. [[1,2],[3,4]]
        :param y_train: numpy array with shape (n_rows,) - e.g. [1,-1]
        :param K: The number of nearest points to consider in classification
        """
        # Import and build the BallTree on training features
        from sklearn.neighbors import BallTree
        self.balltree = BallTree(x_train)

        # Cache training labels and parameter K
        self.y_train = y_train
        self.K = K

    def majority(self, neighbor_indices, neighbor_distances=None):
        """
        Given indices of nearest neighbors in the training set, return the
        majority label. Break ties by considering 1 fewer neighbor until a
        clear winner is found.

        :param neighbor_indices: The indices of the K nearest neighbors in the training set
        :param neighbor_distances: Corresponding distances from query point to K nearest neighbors.
        """
        # complete your code here

    def classify(self, x):
        """
        Given a query point, return the predicted label

        :param x: a query point stored as an ndarray
        """
        # complete your code here

    def predict(self, X):
        """
        Given an ndarray of query points, return yhat, an ndarray of predictions

        :param X: an (m x p) dimension ndarray of points to predict labels for
        """
        # complete your code here
```
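One possible completion of the three methods is sketched below. This is not the official solution, just a sketch under the stated rules: neighbor search is delegated to `BallTree.query` as the notes require, and majority voting uses `np.bincount` (which assumes nonnegative integer labels, true for MNIST digits 0-9).

```python
# A possible completion of the KNN class (a sketch, not the official solution)
import numpy as np
from sklearn.neighbors import BallTree

class KNN:
    def __init__(self, x_train, y_train, K=5):
        self.balltree = BallTree(x_train)
        self.y_train = y_train
        self.K = K

    def majority(self, neighbor_indices, neighbor_distances=None):
        k = len(neighbor_indices)
        while k > 0:
            # BallTree.query returns indices sorted by distance, so slicing
            # to the first k keeps the k nearest neighbors
            labels = self.y_train[neighbor_indices[:k]]
            # np.bincount assumes nonnegative integer labels (OK for MNIST)
            counts = np.bincount(labels)
            winners = np.flatnonzero(counts == counts.max())
            if len(winners) == 1:
                return int(winners[0])
            k -= 1  # tie: drop the farthest neighbor and revote
        raise ValueError("empty neighbor list")

    def classify(self, x):
        # query expects a 2D array, so reshape the single example to (1, p)
        dist, ind = self.balltree.query(x.reshape(1, -1), k=self.K)
        return self.majority(ind[0], dist[0])

    def predict(self, X):
        # classify each row of X and collect the labels into a vector
        return np.array([self.classify(x) for x in X])
```

With the dataset above loaded, usage would look like `model = KNN(data.train_x, data.train_y, K=5)` followed by `yhat = model.predict(data.val_x)`, which can then be compared against `data.val_y`. Note the `while` loop always terminates: with a single neighbor there can be no tie.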