Question
I did this assignment to complete a KNN classifier, but I'm having trouble identifying what's wrong with my code, specifically when trying to use the model with 3 neighbors.
This is the assignment prompt:
This question uses the MNIST 784 dataset which is readily available online.
Modify the class above to implement a KNN classifier. There are three methods that you need to complete:
predict: Given an (m x p) matrix of validation data with m examples, each with p features, return a length-m vector of predicted labels by calling the classify function on each example.
classify: Given a single query example with p features, return its predicted class label as an integer using KNN by calling the majority function.
majority: Given an array of indices into the training set corresponding to the K training examples that are nearest to the query point, return the majority label as an integer. If there is a tie for the majority label using K nearest neighbors, reduce K by 1 and try again. Continue reducing K until there is a winning label.
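For reference, the reduce-K tie-breaking described for majority can be sketched with a plain function that uses a local counter rather than instance state (the helper name majority_vote and its arguments are placeholders, not part of the assignment):

```python
import numpy as np

def majority_vote(neighbor_labels, K):
    """Majority label among the first K neighbors; on a tie, retry with K - 1."""
    while K >= 1:
        values, counts = np.unique(neighbor_labels[:K], return_counts=True)
        top = np.argmax(counts)
        # A tie means the best count occurs for more than one label
        if np.sum(counts == counts[top]) == 1:
            return int(values[top])
        K -= 1  # shrink the neighborhood and vote again
    return int(neighbor_labels[0])  # a single neighbor always has a unique winner
```

With neighbor labels ordered nearest-first, a two-way tie at K=2 falls back to the single nearest neighbor, which is the behavior the prompt describes.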
Notes:
Don't even think about implementing nearest-neighbor search or any distance metrics yourself. Instead, go read the documentation for Scikit-Learn's BallTree object. You will find that its query method can do most of the heavy lifting for you.
Do not use Scikit-Learn's KNeighborsClassifier in this problem. We're implementing this ourselves.
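One detail worth double-checking when following that note: BallTree's query method returns the distances first and the indices second, so unpacking the result in the opposite order silently swaps the two arrays. A minimal check on a tiny made-up dataset:

```python
import numpy as np
from sklearn.neighbors import BallTree

X = np.array([[0.0], [1.0], [2.0], [10.0]])
tree = BallTree(X)

# query returns (distances, indices), each with shape (n_queries, k)
dist, ind = tree.query(np.array([[0.2]]), k=3)
print(ind[0])   # -> [0 1 2], the 3 training points nearest to 0.2
print(dist[0])  # their distances, in increasing order
```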
This is my first attempt (import numpy as np added at the top, since the code uses np):

```python
import numpy as np

class KNN:
    """
    Class to store data for regression problems
    """

    def __init__(self, x_train, y_train, K=5):
        """
        Creates a kNN instance

        :param x_train: numpy array with shape (n_rows,1)- e.g. [[1,2],[3,4]]
        :param y_train: numpy array with shape (n_rows,)- e.g. [1,-1]
        :param K: The number of nearest points to consider in classification
        """
        # Import and build the BallTree on training features
        from sklearn.neighbors import BallTree
        self.balltree = BallTree(x_train)
        # Cache training labels and parameter K
        self.y_train = y_train
        self.K = K

    def majority(self, neighbor_indices, neighbor_distances=None):
        """
        Given indices of nearest neighbors in training set, return the majority label.
        Break ties by considering 1 fewer neighbor until a clear winner is found.

        :param neighbor_indices: The indices of the K nearest neighbors in self.X_train
        :param neighbor_distances: Corresponding distances from query point to K nearest neighbors.
        """
        while self.K >= 1:
            # Count the frequency of each label in the neighbors
            labels, count = np.unique(self.y_train[neighbor_indices[:, :self.K].astype(int)],
                                      return_counts=True)
            # Find the label with the highest frequency
            majority_index = np.argmax(count)
            majority_label = labels[majority_index]
            # If there's a clear winner, return it
            if count[majority_index] > self.K / 2:
                return majority_label
            # Reduce K and try again
            self.K -= 1
        # If there's still a tie after reducing K to 1,
        # return the label of the single nearest neighbor
        return self.y_train[neighbor_indices[0, 0]]

    def classify(self, x):
        """
        Given a query point, return the predicted label

        :param x: a query point stored as an ndarray
        """
        # Find the K nearest neighbors in the training set
        neighbor_indices, neighbor_distances = self.balltree.query(x.reshape(1, -1), k=self.K)
        # Predict the label using the majority of the labels in the nearest neighbors
        return self.majority(neighbor_indices, neighbor_distances)

    def predict(self, X):
        """
        Given an ndarray of query points, return yhat, an ndarray of predictions

        :param X: an (m x p) dimension ndarray of points to predict labels for
        """
        # Predict the label for each query point
        return np.array([self.classify(x) for x in X])
```
This is my second attempt, where majority grows the neighborhood instead of shrinking it (import numpy as np added at the top):

```python
import numpy as np

class KNN:
    """
    Class to store data for regression problems
    """

    def __init__(self, x_train, y_train, K=5):
        """
        Creates a kNN instance

        :param x_train: numpy array with shape (n_rows,1)- e.g. [[1,2],[3,4]]
        :param y_train: numpy array with shape (n_rows,)- e.g. [1,-1]
        :param K: The number of nearest points to consider in classification
        """
        # Import and build the BallTree on training features
        from sklearn.neighbors import BallTree
        self.balltree = BallTree(x_train)
        # Cache training labels and parameter K
        self.y_train = y_train
        self.K = K

    def majority(self, neighbor_indices, neighbor_distances=None):
        """
        Given indices of nearest neighbors in training set, return the majority label.
        Break ties by considering 1 fewer neighbor until a clear winner is found.

        :param neighbor_indices: The indices of the K nearest neighbors in self.X_train
        :param neighbor_distances: Corresponding distances from query point to K nearest neighbors.
        """
        max_neighbors = self.K * 2
        num_neighbors = self.K
        while num_neighbors <= max_neighbors:
            # Count the frequency of each label in the neighbors
            labels, count = np.unique(self.y_train[neighbor_indices[:, :num_neighbors].astype(int)],
                                      return_counts=True)
            majority_index = np.argmax(count)
            majority_label = labels[majority_index]
            # If there's a clear winner, return it
            if count[majority_index] > num_neighbors / 2:
                return majority_label
            # Increase the number of neighbors and try again
            num_neighbors += 1
        return self.y_train[neighbor_indices[0, 0]]

    def classify(self, x):
        """
        Given a query point, return the predicted label

        :param x: a query point stored as an ndarray
        """
        # Find the K nearest neighbors in the training set
        neighbor_indices, neighbor_distances = self.balltree.query(x.reshape(1, -1), k=self.K)
        # Predict the label using the majority of the labels in the nearest neighbors
        return self.majority(neighbor_indices, neighbor_distances)

    def predict(self, X):
        """
        Given an ndarray of query points, return yhat, an ndarray of predictions

        :param X: an (m x p) dimension ndarray of points to predict labels for
        """
        # Predict the label for each query point
        return np.array([self.classify(x) for x in X])
```
This is my third attempt (imports added at the top, since this version used BallTree without importing it):

```python
import numpy as np
from sklearn.neighbors import BallTree

class KNN:
    """
    Class to store data for KNN classification
    """

    def __init__(self, X_train, y_train, K=5):
        """
        Creates a KNN instance

        :param X_train: numpy array with shape (n_rows, n_features)
        :param y_train: numpy array with shape (n_rows,)
        :param K: The number of nearest points to consider in classification
        """
        # Build the BallTree on training features
        self.balltree = BallTree(X_train)
        # Cache training labels and parameter K
        self.y_train = y_train
        self.K = K

    def majority(self, neighbor_indices, neighbor_distances=None):
        """
        Given indices of nearest neighbors in training set, return the majority label.
        Break ties by considering more neighbors until a clear winner is found.

        :param neighbor_indices: The indices of the K nearest neighbors in self.X_train
        :param neighbor_distances: Corresponding distances from query point to K nearest neighbors.
        """
        max_neighbors = len(neighbor_indices.astype(int))
        for num_neighbors in range(self.K, max_neighbors + 1):
            # Count the frequency of each label in the neighbors
            labels, count = np.unique(self.y_train[neighbor_indices[:num_neighbors].astype(int)],
                                      return_counts=True)
            # Find the label with the highest frequency
            majority_index = np.argmax(count)
            majority_label = labels[majority_index]
            # If there's a clear winner, return it
            if count[majority_index] > num_neighbors / 2:
                return majority_label
        # If there's still a tie after considering all neighbors,
        # return the label of the single nearest neighbor
        return self.y_train[neighbor_indices[0].astype(int)]

    def classify(self, x):
        """
        Given a query point, return the predicted label

        :param x: a query point stored as an ndarray
        """
        # Find the K nearest neighbors in the training set
        neighbor_indices, neighbor_distances = self.balltree.query(x.reshape(1, -1), k=self.K)
        # Predict the label using the majority of the labels in the nearest neighbors
        return self.majority(neighbor_indices.astype(int), neighbor_distances)

    def predict(self, X):
        """
        Given an ndarray of query points, return yhat, an ndarray of predictions

        :param X: an (m x p) dimension ndarray of points to predict labels for
        """
        # Predict the label for each query point
        return np.array([self.classify(x) for x in X])
```
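A shape detail that matters in the third attempt: BallTree.query returns a 2-D array of shape (n_queries, k), so a single slice like neighbor_indices[:num_neighbors] selects query rows rather than neighbors, and len() of it counts queries, not neighbors. A small demonstration with a made-up index array:

```python
import numpy as np

ind = np.array([[4, 9, 2]])   # shape (1, 3): one query, three neighbor indices
print(len(ind))               # 1 -- number of query rows, not neighbors
print(ind[:2].shape)          # (1, 3) -- a row slice still keeps all three neighbors
print(ind[0])                 # [4 9 2] -- the actual neighbor indices for the query
```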
This is my fourth attempt (import numpy as np added at the top; classify and predict end the same way as in the earlier attempts):

```python
import numpy as np

class KNN:
    """
    Class to store data for regression problems
    """

    def __init__(self, x_train, y_train, K=5):
        """
        Creates a kNN instance

        :param x_train: numpy array with shape (n_rows,1)- e.g. [[1,2],[3,4]]
        :param y_train: numpy array with shape (n_rows,)- e.g. [1,-1]
        :param K: The number of nearest points to consider in classification
        """
        # Import and build the BallTree on training features
        from sklearn.neighbors import BallTree
        self.balltree = BallTree(x_train)
        # Cache training labels and parameter K
        self.y_train = y_train
        self.K = K

    def majority(self, neighbor_indices, neighbor_distances=None):
        """
        Given indices of nearest neighbors in training set, return the majority label.
        Break ties by considering 1 more neighbor until a clear winner is found.

        :param neighbor_indices: The indices of the K nearest neighbors in self.X_train
        :param neighbor_distances: Corresponding distances from query point to K nearest neighbors.
        """
        max_neighbors = self.K * 2
        num_neighbors = self.K
        while num_neighbors <= max_neighbors and num_neighbors <= len(neighbor_indices):
            # Count the frequency of each label in the neighbors
            labels, count = np.unique(self.y_train[neighbor_indices[:, :num_neighbors].astype(int)],
                                      return_counts=True)
            # Find the label with the highest frequency
            majority_index = np.argmax(count)
            majority_label = labels[majority_index]
            # If there's a clear winner, return it
            if count[majority_index] > num_neighbors / 2:
                return majority_label
            # Increase the number of neighbors and try again
            num_neighbors += 1
        # If there's still a tie after trying all possible numbers of neighbors,
        # return the label of the single nearest neighbor
        return self.y_train[neighbor_indices[0, 0]]

    def classify(self, x):
        """
        Given a query point, return the predicted label

        :param x: a query point stored as an ndarray
        """
        # Find the K nearest neighbors in the training set
        neighbor_indices, neighbor_distances = self.balltree.query(x.reshape(1, -1), k=self.K)
        # Predict the label using the majority of the labels in the nearest neighbors
        return self.majority(neighbor_indices, neighbor_distances)

    def predict(self, X):
        """
        Given an ndarray of query points, return yhat, an ndarray of predictions

        :param X: an (m x p) dimension ndarray of points to predict labels for
        """
        # Predict the label for each query point
        return np.array([self.classify(x) for x in X])
```
The error I got in all attempts:
AssertionError: Look at the KNN class. Does it perform as expected with three neighbors?
The question says "If there is a tie for the majority label using K nearest neighbors, reduce K by 1 and try again. Continue reducing until there is a winning label.", but reducing didn't work. Increasing also didn't work. All of these functions are mandatory in the class: __init__, majority, classify and predict.
I understand there's probably something wrong with the majority function in my KNN class. It is not correctly handling cases where there is a tie for the majority label, but I don't know how to fix it. It seems like my code has a problem when using the model with 3 neighbors. Is it an index problem? Is it the vector? Is it the K value? Is it the training and test data? I believe there's some tricky thing about 3 neighbors that I'm not seeing, because the next task would be: "Checking how well your classifier does: Use your KNN class to perform KNN on the validation data with K=3."
TO DO: What's wrong with my code? How can I fix this issue? Can somebody help me make my model work with 3 neighbors?
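One way to narrow this down, independent of MNIST, is a tiny sanity check that any correct implementation should pass with three neighbors (the toy arrays below are made up for illustration; this bypasses the KNN class and votes directly on BallTree's output):

```python
import numpy as np
from sklearn.neighbors import BallTree

# Toy training set: class 0 clustered near the origin, class 1 near (10, 10)
x_train = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

tree = BallTree(x_train)
dist, ind = tree.query(np.array([[0.5, 0.5]]), k=3)  # distances first, indices second

# The three nearest neighbors of (0.5, 0.5) are the three class-0 points,
# so a 3-neighbor majority vote must predict 0
votes = y_train[ind[0]]
pred = np.bincount(votes).argmax()
print(pred)  # prints 0
```

If a classifier disagrees with this on such an unambiguous case, the fault is in how the query result is unpacked or indexed rather than in the tie-breaking itself.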