
Question



```python
# kNN implementation in Python 3
import csv
import random
import math
import operator
import urllib.request

### FUNCTION GET LINES: Get the file, either directly online or by saving it locally:
def getLines(filename):
    lines = []
    if filename.startswith(('http', 'ftp', 'sftp')):
        # Skip downloading it and open directly online:
        response = urllib.request.urlopen(filename)
        lines = csv.reader(response.read().decode('utf-8').splitlines())
    else:
        # TutorialsPoint IDE requires 'r', not 'rb'
        with open(filename, 'r') as csvfile:
            # csvreader is an object that is essentially a list of lists
            csvreader = csv.reader(csvfile)
            for line in csvreader:
                lines.append(line)
    return lines

def loadDatasetFinal(filename, split, trainingSet=[], testSet=[]):
    lines = getLines(filename)
    dataset = list(lines)
    for x in range(len(dataset)-1):
        for y in range(4):
            # legend = ('Sepal length', 'Sepal width', 'Petal length', 'Petal width')
            dataset[x][y] = float(dataset[x][y])
        if random.random() < split:
            trainingSet.append(dataset[x])
        else:
            testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # set our parameters
    k = 3
    split = 0.67
    filename = 'iris.data'
    filename = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    # prepare data
    trainingSet = []
    testSet = []
    loadDatasetFinal(filename, split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
```

Running the example, you will see the result of each prediction compared to the actual class value in the test set. At the end of the run, you will see the accuracy of the model. In this case, a little over 98%:

...
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
Accuracy: 98.0392156862745%

Questions 1. On different runs, you'll get different percentage values. Can you explain why?
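As a hint, the train/test split in loadDatasetFinal draws a fresh random.random() value per row, so each run partitions the data differently. A minimal sketch (the helper name split_count is my own) simulating that per-row coin flip:

```python
import random

def split_count(seed, split=0.67, n=150):
    # Simulate the per-row coin flip used in loadDatasetFinal:
    # each row joins the training set with probability `split`.
    random.seed(seed)
    return sum(1 for _ in range(n) if random.random() < split)

# Different seeds give different train-set sizes, so the test set
# (and hence the accuracy) varies between unseeded runs.
sizes = [split_count(s) for s in range(5)]
print(sizes)
```

Calling random.seed() with a fixed value before loadDatasetFinal would make a run reproducible.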

2. Can you add a wrapper to this function so that it performs multiple runs (a user-specified number) and computes an average accuracy over the multiple runs rather than a single accuracy?
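One possible shape for such a wrapper, sketched generically: it takes any zero-argument function that returns an accuracy percentage (e.g. a refactored main() that returns getAccuracy's result) and averages it over n_runs. The fake_run stand-in below is hypothetical, used only so the sketch is self-contained:

```python
import random

def average_accuracy(run_once, n_runs):
    """Call run_once() n_runs times and return the mean accuracy.

    run_once is assumed to be a zero-argument callable returning a
    single accuracy percentage, e.g. a wrapper around main() above.
    """
    scores = [run_once() for _ in range(n_runs)]
    return sum(scores) / n_runs

# Hypothetical stand-in for a real kNN run: returns a noisy accuracy.
def fake_run():
    return 95.0 + random.random() * 4.0

print(average_accuracy(fake_run, 10))
```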

3. Also, please implement one of the following possible extensions. Please do feel free to include some of the approaches we've already implemented.
1. Regression: You could adapt the implementation to work for regression problems (predicting a real-valued attribute). The summary of the closest instances could involve taking the mean or the median of the predicted attribute.
2. Normalization: When the units of measure differ between attributes, it is possible for attributes to dominate in their contribution to the distance measure. For these types of problems, you will want to rescale all data attributes into the range 0-1 (i.e., normalization) before calculating similarity. Update the model to support data normalization.
3. Alternative Distance Measure: There are many distance measures available, as we've seen, and you can even develop your own domain-specific distance measures if you like. Implement an alternative distance measure, such as Manhattan distance or the vector dot product.
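As a starting point for extensions 2 and 3, here is a minimal sketch (function names normalize and manhattanDistance are my own choices) of min-max rescaling and a Manhattan-distance drop-in replacement for euclideanDistance:

```python
def normalize(dataset):
    # Min-max rescale the numeric columns (all but the class label) to [0, 1].
    n_features = len(dataset[0]) - 1
    mins = [min(row[i] for row in dataset) for i in range(n_features)]
    maxs = [max(row[i] for row in dataset) for i in range(n_features)]
    scaled = []
    for row in dataset:
        new = [(row[i] - mins[i]) / (maxs[i] - mins[i]) if maxs[i] > mins[i] else 0.0
               for i in range(n_features)]
        scaled.append(new + [row[-1]])  # keep the class label unchanged
    return scaled

def manhattanDistance(instance1, instance2, length):
    # Drop-in alternative to euclideanDistance: sum of absolute differences.
    return sum(abs(instance1[x] - instance2[x]) for x in range(length))

data = [[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
        [7.0, 3.2, 4.7, 1.4, 'Iris-versicolor']]
print(normalize(data))
print(manhattanDistance(data[0], data[1], 4))
```

normalize would be applied to the dataset inside loadDatasetFinal before splitting; manhattanDistance can replace euclideanDistance in getNeighbors without any other change.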

4. Two additional ideas for extension include support for distance-weighted contribution of the k-most similar instances to the prediction and more advanced tree-based data structures for searching for similar instances. Please implement one of these and show its results in comparison to the regular kNN above.
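For the distance-weighted variant, one sketch (getWeightedResponse is my own name; it assumes getNeighbors is modified to return the (instance, distance) pairs rather than stripping off the distances) is to let each neighbor vote with weight 1/distance:

```python
def getWeightedResponse(neighbors_with_distances):
    # neighbors_with_distances: list of (instance, distance) pairs, as built
    # inside getNeighbors before the distances are discarded.
    # Each neighbor votes with weight 1/distance, so closer instances count more.
    votes = {}
    for instance, dist in neighbors_with_distances:
        label = instance[-1]
        weight = 1.0 / (dist + 1e-9)   # small epsilon avoids division by zero
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

neighbors = [(['a', 'Iris-setosa'], 0.1),
             (['b', 'Iris-virginica'], 2.0),
             (['c', 'Iris-virginica'], 2.5)]
print(getWeightedResponse(neighbors))  # one close setosa outvotes two far virginicas
```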

5. You can also use open-source implementations of kNN from popular machine learning libraries to analyze the Iris dataset. Two popular options are the implementation of kNN in scikit-learn (https://github.com/scikit-learn/scikitlearn/tree/master/sklearn/neighbors) and the implementation of kNN in Weka (https://github.com/baron/weka/blob/master/weka/src/main/java/weka/classifiers/lazy/IBk.java) (unofficial). Please implement this version and check how it compares to our manual implementation above.
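A scikit-learn version of the same experiment might look like the sketch below (it assumes scikit-learn is installed and uses its bundled copy of the Iris data rather than the UCI download; train_size and k mirror the parameters in main() above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled Iris data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Fixed random_state makes the split reproducible, unlike the
# per-row random.random() split in loadDatasetFinal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.67, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)   # k = 3, as in main()
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test) * 100
print('Accuracy:', acc, '%')
```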

Please do hand these in as a Word document with the code and screenshots included.
