Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Aug 28, 2024

I did this assignment to complete a KNN classifier, but I'm having trouble to identify what's wrong with my code, specifically when trying to use

I did this assignment to complete a KNN classifier, but I'm having trouble to identify what's wrong with my code, specifically when trying to use the model with 3 neighbors.

This is the title of the assignment:

This question uses the MNIST 784 dataset which is readily available online.

Modify the class above to implement a KNN classifier. There are three methods that you need to complete:

predict: Given an matrix of validation data with examples each with features, return a length- vector of predicted labels by calling the classify function on each example.

classify: Given a single query example with features, return its predicted class label as an integer using KNN by calling the majority function.

majority: Given an array of indices into the training set corresponding to the training examples that are nearest to the query point, return the majority label as an integer. If there is a tie for the majority label using nearest neighbors, reduce by 1 and try again. Continue reducing until there is a winning label.

Notes:

Don't even think about implementing nearest-neighbor search or any distance metrics yourself. Instead, go read the documentation for Scikit-Learn's BallTree object. You will find that its implemented query method can do most of the heavy lifting for you.

Do not use Scikit-Learn's KNeighborsClassifier in this problem. We're implementing this ourselves.

This is my first attempt code:

class KNN: """ Class to store data for regression problems """ def __init__(self, x_train, y_train, K=5): """ Creates a kNN instance

:param x_train: numpy array with shape (n_rows,1)- e.g. [[1,2],[3,4]] :param y_train: numpy array with shape (n_rows,)- e.g. [1,-1] :param K: The number of nearest points to consider in classification """ # Import and build the BallTree on training features from sklearn.neighbors import BallTree self.balltree = BallTree(x_train) # Cache training labels and parameter K self.y_train = y_train self.K = K def majority(self, neighbor_indices, neighbor_distances=None): """ Given indices of nearest neighbors in training set, return the majority label. Break ties by considering 1 fewer neighbor until a clear winner is found.

:param neighbor_indices: The indices of the K nearest neighbors in self.X_train :param neighbor_distances: Corresponding distances from query point to K nearest neighbors. """ # your code here while self.K >= 1: # Count the frequency of each label in the neighbors labels, count = np.unique(self.y_train[neighbor_indices[:,:self.K].astype(int)], return_counts=True)

# Find the label with the highest frequency majority_index = np.argmax(count) majority_label = labels[majority_index]

# If there's a clear winner, return it if count[majority_index] > self.K/2: return majority_label

# Reduce K and try again self.K -= 1 # If there's still a tie after reducing K to 1, return the label of the single nearest neighbor return self.y_train[neighbor_indices[0,0]] def classify(self, x): """ Given a query point, return the predicted label :param x: a query point stored as an ndarray """ # your code here # Find the K nearest neighbors in the training set neighbor_indices, neighbor_distances = self.balltree.query(x.reshape(1, -1), k=self.K) # Predict the label using the majority of the labels in the nearest neighbors return self.majority(neighbor_indices, neighbor_distances) def predict(self, X): """ Given an ndarray of query points, return yhat, an ndarray of predictions

:param X: an (m x p) dimension ndarray of points to predict labels for """ # your code here # Predict the label for each query point return np.array([self.classify(x) for x in X])

This is my second attempt code:

class KNN: """ Class to store data for regression problems """ def __init__(self, x_train, y_train, K=5): """ Creates a kNN instance

:param neighbor_indices: The indices of the K nearest neighbors in self.X_train :param neighbor_distances: Corresponding distances from query point to K nearest neighbors. """

# your code here max_neighbors = self.K * 2 num_neighbors = self.K while num_neighbors <= max_neighbors: # Count the frequency of each label in the neighbors labels, count = np.unique(self.y_train[neighbor_indices[:,:num_neighbors].astype(int)], return_counts=True)

majority_index = np.argmax(count) majority_label = labels[majority_index]

# If there's a clear winner, return it if count[majority_index] > num_neighbors/2: return majority_label

# Increase the number of neighbors and try again num_neighbors += 1 return self.y_train[neighbor_indices[0,0]]

def classify(self, x): """ Given a query point, return the predicted label :param x: a query point stored as an ndarray """ # your code here # Find the K nearest neighbors in the training set neighbor_indices, neighbor_distances = self.balltree.query(x.reshape(1, -1), k=self.K) # Predict the label using the majority of the labels in the nearest neighbors return self.majority(neighbor_indices, neighbor_distances) def predict(self, X): """ Given an ndarray of query points, return yhat, an ndarray of predictions

:param X: an (m x p) dimension ndarray of points to predict labels for """ # your code here # Predict the label for each query point return np.array([self.classify(x) for x in X])

This is my third attempt code:

class KNN: """ Class to store data for KNN classification """ def __init__(self, X_train, y_train, K=5): """ Creates a KNN instance

:param X_train: numpy array with shape (n_rows, n_features) :param y_train: numpy array with shape (n_rows,) :param K: The number of nearest points to consider in classification """

# Import and build the BallTree on training features self.balltree = BallTree(X_train)

# Cache training labels and parameter K self.y_train = y_train self.K = K

def majority(self, neighbor_indices, neighbor_distances=None): """ Given indices of nearest neighbors in training set, return the majority label. Break ties by considering more neighbors until a clear winner is found.

:param neighbor_indices: The indices of the K nearest neighbors in self.X_train :param neighbor_distances: Corresponding distances from query point to K nearest neighbors. """

max_neighbors = len(neighbor_indices.astype(int))

for num_neighbors in range(self.K, max_neighbors + 1): # Count the frequency of each label in the neighbors labels, count = np.unique(self.y_train[neighbor_indices[:num_neighbors].astype(int)], return_counts=True)

# Find the label with the highest frequency majority_index = np.argmax(count) majority_label = labels[majority_index]

# If there's a clear winner, return it if count[majority_index] > num_neighbors / 2: return majority_label

# If there's still a tie after considering all neighbors, return the label of the single nearest neighbor return self.y_train[neighbor_indices[0].astype(int)]

def classify(self, x): """ Given a query point, return the predicted label

:param x: a query point stored as an ndarray """ # Find the K nearest neighbors in the training set neighbor_indices, neighbor_distances = self.balltree.query(x.reshape(1, -1), k=self.K)

# Predict the label using the majority of the labels in the nearest neighbors return self.majority(neighbor_indices.astype(int), neighbor_distances)

def predict(self, X): """ Given an ndarray of query points, return yhat, an ndarray of predictions

:param X: an (m x p) dimension ndarray of points to predict labels for """ # Predict the label for each query point return np.array([self.classify(x) for x in X])

The error i keep getting in all attempts:

AssertionError: Look at the KNN class. Does it perform as expected with three neighbors?

TO DO: What's wrong with my code? How can I fix this issue? Can somebody help me to make my model work with 3 neighbors?

I understand there's probably something wrong with my majority function in the KNN class. The function is not correctly handling cases where there is a tie for the majority label, but I don't know how to fix it. It seems like my code is having a problem when trying to use the model with 3 neighbors. It's an index problem? it's the vector? It's K value? I believe there's some tricky thing about 3 neighbors that I'm not seeing, cause the next task would be: Checking how well your classifier does Use your KNN class to perform KNN on the validation data with =3.

This is my import data:

class MNIST_import: """ sets up MNIST dataset from OpenML """ def __init__(self): df = pd.read_csv("data/mnist_784.csv") # Create arrays for the features and the response variable # store for use later y = df['class'].values X = df.drop('class', axis=1).values # Convert the labels to numeric labels y = np.array(pd.to_numeric(y)) # create training and validation sets self.train_x, self.train_y = X[:5000,:], y[:5000] self.val_x, self.val_y = X[5000:6000,:], y[5000:6000] data = MNIST_import()

Step by Step Solution

There are 3 Steps involved in it

Step: 1

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

Previous Question Next Question