Question


Posted on Feb 29, 2024


Learning Objectives

Identify classification learning algorithms in the supervised learning paradigm
Identify what K-nearest neighbor (KNN) is and how it works
Identify what logistic regression is and how it works
Apply KNN and logistic regression to build classifiers
Analyze and communicate analysis results by applying classifiers to learn from data

Reference Code

The code below, which is similar or identical to the lecture code, can be used in your assignment solutions. Please feel free to use it as you see appropriate. Note that you can reuse as much code as possible from the lecture notes. Indeed, I expect to reuse a lot of code from the lecture notes, even code that is not included below, to approach the problems in this assignment.

import random
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# You can use the styles below or something else.
plt.style.use('classic')
# Note: on Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'.
plt.style.use('seaborn-whitegrid')

# z-scaling. The returned array should have mean 0 and std 1
def scaleAttrs(vals):
    vals = np.array(vals)
    mean = sum(vals)/len(vals)
    sd = np.std(vals)
    vals = vals - mean
    return vals/sd

def accuracy(truePos, falsePos, trueNeg, falseNeg):
    numerator = truePos + trueNeg
    denominator = truePos + trueNeg + falsePos + falseNeg
    return numerator/denominator

def sensitivity(truePos, falseNeg):
    try:
        return truePos/(truePos + falseNeg)
    except ZeroDivisionError:
        return float('nan')

def specificity(trueNeg, falsePos):
    try:
        return trueNeg/(trueNeg + falsePos)
    except ZeroDivisionError:
        return float('nan')

def posPredVal(truePos, falsePos):
    try:
        return truePos/(truePos + falsePos)
    except ZeroDivisionError:
        return float('nan')

def negPredVal(trueNeg, falseNeg):
    try:
        return trueNeg/(trueNeg + falseNeg)
    except ZeroDivisionError:
        return float('nan')

def getStats(truePos, falsePos, trueNeg, falseNeg, toPrint = True):
    accur = accuracy(truePos, falsePos, trueNeg, falseNeg)
    sens = sensitivity(truePos, falseNeg)
    spec = specificity(trueNeg, falsePos)
    ppv = posPredVal(truePos, falsePos)
    if toPrint:
        print(' Accuracy =', round(accur, 3))
        print(' Sensitivity =', round(sens, 3))
        print(' Specificity =', round(spec, 3))
        print(' Pos. Pred. Val. =', round(ppv, 3))
    return (accur, sens, spec, ppv)

def leaveOneOut(examples, method, toPrint = True):
    truePos, falsePos, trueNeg, falseNeg = 0, 0, 0, 0
    for i in range(len(examples)):
        # hold out example i and train on all the others
        testCase = examples[i]
        trainingData = examples[0:i] + examples[i+1:]
        results = method(trainingData, [testCase])
        truePos += results[0]
        falsePos += results[1]
        trueNeg += results[2]
        falseNeg += results[3]
    if toPrint:
        getStats(truePos, falsePos, trueNeg, falseNeg)
    return truePos, falsePos, trueNeg, falseNeg

def split80_20(examples):
    # pick 20% of the indices at random for the test set
    sampleIndices = random.sample(range(len(examples)),
                                  len(examples)//5)
    trainingSet, testSet = [], []
    for i in range(len(examples)):
        if i in sampleIndices:
            testSet.append(examples[i])
        else:
            trainingSet.append(examples[i])
    return trainingSet, testSet

# method is a function that could be k-nn or logistic-regression
def randomSplits(examples, method, numSplits, toPrint = True):
    truePos, falsePos, trueNeg, falseNeg = 0, 0, 0, 0
    random.seed(0)
    for t in range(numSplits):
        trainingSet, testSet = split80_20(examples)
        results = method(trainingSet, testSet)
        truePos += results[0]
        falsePos += results[1]
        trueNeg += results[2]
        falseNeg += results[3]
    # report the counts averaged over all splits
    getStats(truePos/numSplits, falsePos/numSplits,
             trueNeg/numSplits, falseNeg/numSplits, toPrint)
    return (truePos/numSplits, falsePos/numSplits,
            trueNeg/numSplits, falseNeg/numSplits)

def minkowskiDist(v1, v2, p):
    """Assumes v1 and v2 are equal-length arrays of numbers
       Returns Minkowski distance of order p between v1 and v2"""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1/p)
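As a quick sanity check of the distance function above, the order p selects the familiar special cases: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The two points below are made up so the results are easy to verify by hand.

```python
def minkowskiDist(v1, v2, p):
    """Minkowski distance of order p between equal-length sequences."""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1/p)

# Hypothetical points, not taken from the assignment data
a, b = (0, 0), (3, 4)
manhattan = minkowskiDist(a, b, 1)   # |3| + |4|
euclidean = minkowskiDist(a, b, 2)   # sqrt(3**2 + 4**2)
print(manhattan, euclidean)
```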

Data Scientists

Assume that we have a data-science club where data scientists meet and discuss data analysis and visualization. The members in the club are either paid accounts or unpaid accounts. You are provided a list of tuples. Each tuple contains three elements:

tenure, which is the number of years as a data scientist,
salary, which is how much the data scientist earns,
account, which is a number that is either 1 for a paid account or 0 for an unpaid account.

# do not change the statement below
data = [(0.7,48000,1),(1.9,48000,0),(2.5,60000,1),(4.2,63000,0),(6,76000,0),(6.5,69000,0),(7.5,76000,0),(8.1,88000,0),(8.7,83000,1),(10,83000,1),(0.8,43000,0),(1.8,60000,0),(10,79000,1),(6.1,76000,0),(1.4,50000,0),(9.1,92000,0),(5.8,75000,0),(5.2,69000,0),(1,56000,0),(6,67000,0),(4.9,74000,0),(6.4,63000,1),(6.2,82000,0),(3.3,58000,0),(9.3,90000,1),(5.5,57000,1),(9.1,102000,0),(2.4,54000,0),(8.2,65000,1),(5.3,82000,0),(9.8,107000,0),(1.8,64000,0),(0.6,46000,1),(0.8,48000,0),(8.6,84000,1),(0.6,45000,0),(0.5,30000,1),(7.3,89000,0),(2.5,48000,1),(5.6,76000,0),(7.4,77000,0),(2.7,56000,0),(0.7,48000,0),(1.2,42000,0),(0.2,32000,1),(4.7,56000,1),(2.8,44000,1),(7.6,78000,0),(1.1,63000,0),(8,79000,1),(2.7,56000,0),(6,52000,1),(4.6,56000,0),(2.5,51000,0),(5.7,71000,0),(2.9,65000,0),(1.1,33000,1),(3,62000,0),(4,71000,0),(2.4,61000,0),(7.5,75000,0),(9.7,81000,1),(3.2,62000,0),(7.9,88000,0),(4.7,44000,1),(2.5,55000,0),(1.6,41000,0),(6.7,64000,1),(6.9,66000,1),(7.9,78000,1),(8.1,102000,0),(5.3,48000,1),(8.5,66000,1),(0.2,56000,0),(6,69000,0),(7.5,77000,0),(8,86000,0),(4.4,68000,0),(4.9,75000,0),(1.5,60000,0),(2.2,50000,0),(3.4,49000,1),(4.2,70000,0),(7.7,98000,0),(8.2,85000,0),(5.4,88000,0),(0.1,46000,0),(1.5,37000,0),(6.3,86000,0),(3.7,57000,0),(8.4,85000,0),(2,42000,0),(5.8,69000,1),(2.7,64000,0),(3.1,63000,0),(1.9,48000,0),(10,72000,1),(0.2,45000,0),(8.6,95000,0),(1.5,64000,0),(9.8,95000,0),(5.3,65000,0),(7.5,80000,0),(9.9,91000,0),(9.7,50000,1),(2.8,68000,0),(3.6,58000,0),(3.9,74000,0),(4.4,76000,0),(2.5,49000,0),(7.2,81000,0),(5.2,60000,1),(2.4,62000,0),(8.9,94000,0),(2.4,63000,0),(6.8,69000,1),(6.5,77000,0),(7,86000,0),(9.4,94000,0),(7.8,72000,1),(0.2,53000,0),(10,97000,0),(5.5,65000,0),(7.7,71000,1),(8.1,66000,1),(9.8,91000,0),(8,84000,0),(2.7,55000,0),(2.8,62000,0),(9.4,79000,0),(2.5,57000,0),(7.4,70000,1),(2.1,47000,0),(5.3,62000,1),(6.3,79000,0),(6.8,58000,1),(5.7,80000,0),(2.2,61000,0),(4.8,62000,0),(3.7,64000,0),(4.1,85000,0),(2.3,51000,0),(3.5,58000,0),(0.9,43000,0),(0.9,54000,0),(4.5,74000,0),(6.5,55000,1),(4.1,41000,1),(7.1,73000,0),(1.1,66000,0),(9.1,81000,1),(8,69000,1),(7.3,72000,1),(3.3,50000,0),(3.9,58000,0),(2.6,49000,0),(1.6,78000,0),(0.7,56000,0),(2.1,36000,1),(7.5,90000,0),(4.8,59000,1),(8.9,95000,0),(6.2,72000,0),(6.3,63000,0),(9.1,100000,0),(7.3,61000,1),(5.6,74000,0),(0.5,66000,0),(1.1,59000,0),(5.1,61000,0),(6.2,70000,0),(6.6,56000,1),(6.3,76000,0),(6.5,78000,0),(5.1,59000,0),(9.5,74000,1),(4.5,64000,0),(2,54000,0),(1,52000,0),(4,69000,0),(6.5,76000,0),(3,60000,0),(4.5,63000,0),(7.8,70000,0),(3.9,60000,1),(0.8,51000,0),(4.2,78000,0),(1.1,54000,0),(6.2,60000,0),(2.9,59000,0),(2.1,52000,0),(8.2,87000,0),(4.8,73000,0),(2.2,42000,1),(9.1,98000,0),(6.5,84000,0),(6.9,73000,0),(5.1,72000,0),(9.1,69000,1),(9.8,79000,1),]

Problem 1: Plotting the Club Members

For this problem, I need to plot the data scientists in the data-science club so that we can conveniently visualize their years of tenure, their salaries, and whether their accounts are paid. Note that this problem was approached in an earlier assignment. I am expected to explore the data while approaching the question.
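One possible way to approach the plot is a scatter of tenure against salary with a different marker for each account type. The sketch below is only an illustration: it uses six tuples copied from the provided list rather than the full data, and the helper name splitByAccount is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Six tuples copied from the provided list: (tenure, salary, account)
sample = [(0.7, 48000, 1), (1.9, 48000, 0), (2.5, 60000, 1),
          (4.2, 63000, 0), (8.7, 83000, 1), (9.1, 92000, 0)]

def splitByAccount(data):
    """Separate (tenure, salary) pairs into paid and unpaid groups."""
    paid = [(t, s) for t, s, acct in data if acct == 1]
    unpaid = [(t, s) for t, s, acct in data if acct == 0]
    return paid, unpaid

paid, unpaid = splitByAccount(sample)

fig, ax = plt.subplots()
ax.scatter([t for t, _ in paid], [s for _, s in paid],
           marker='o', label='paid')
ax.scatter([t for t, _ in unpaid], [s for _, s in unpaid],
           marker='x', label='unpaid')
ax.set_xlabel('Tenure (years)')
ax.set_ylabel('Salary')
ax.set_title('Data-science club members')
ax.legend()
```

Passing the full data list to splitByAccount instead of sample would plot every member. In a notebook with %matplotlib inline the figure renders on its own.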

Problem 2 Preparing for Building Learning Algorithms

For this problem, I need to write functions/class definition(s) to prepare for building classifiers and testing them using various metrics including accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. In addition, I need to define a class DataScientist from which I can create data examples. Each example represents a data scientist in terms of its features, including salary and years of tenure. The class definition also needs to allow me to label a data scientist so that paid accounts can be distinguished from unpaid accounts in the club.
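To make the five metrics concrete, here is a small worked example computed from confusion-matrix counts. The counts are invented for illustration and are not results from this data set, and confusionMetrics is a hypothetical helper name rather than a function from the lecture.

```python
def confusionMetrics(truePos, falsePos, trueNeg, falseNeg):
    """Return the five metrics as a dict; a ratio with a zero
    denominator is reported as NaN."""
    def ratio(num, den):
        return num / den if den else float('nan')
    total = truePos + falsePos + trueNeg + falseNeg
    return {
        'accuracy': ratio(truePos + trueNeg, total),
        'sensitivity': ratio(truePos, truePos + falseNeg),   # recall on positives
        'specificity': ratio(trueNeg, trueNeg + falsePos),   # recall on negatives
        'posPredVal': ratio(truePos, truePos + falsePos),
        'negPredVal': ratio(trueNeg, trueNeg + falseNeg),
    }

# Hypothetical counts: 40 TP, 10 FP, 130 TN, 20 FN (200 examples in total)
m = confusionMetrics(40, 10, 130, 20)
for name, value in m.items():
    print(name, '=', round(value, 3))
```

With these counts, accuracy is 170/200, sensitivity is 40/60, specificity is 130/140, positive predictive value is 40/50, and negative predictive value is 130/150.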

After I complete the class DataScientist definition, I need to process the provided data into a list of data examples. (Each example is a data scientist.)
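A minimal sketch of what such a class and the conversion step might look like is shown below. The method names, the 'Paid'/'Unpaid' label strings, and the helper buildExamples are assumptions made for illustration; the lecture code may organize this differently.

```python
class DataScientist(object):
    """One club member: features are tenure and salary, label is the account type."""
    featureNames = ('tenure', 'salary')

    def __init__(self, tenure, salary, account):
        self.featureVec = [tenure, salary]
        # Assumed label strings; 1 marks a paid account in the provided tuples
        self.label = 'Paid' if account == 1 else 'Unpaid'

    def distance(self, other, p=2):
        """Minkowski distance of order p between two feature vectors."""
        total = 0.0
        for a, b in zip(self.featureVec, other.featureVec):
            total += abs(a - b)**p
        return total**(1/p)

    def getFeatures(self):
        return self.featureVec[:]

    def getLabel(self):
        return self.label

    def __str__(self):
        return 'tenure={}, salary={}, {}'.format(
            self.featureVec[0], self.featureVec[1], self.label)

def buildExamples(data):
    """Turn (tenure, salary, account) tuples into DataScientist examples."""
    return [DataScientist(t, s, a) for t, s, a in data]

# First three tuples from the provided list
examples = buildExamples([(0.7, 48000, 1), (1.9, 48000, 0), (2.5, 60000, 1)])
for e in examples:
    print(e)
```

Because the two features are on very different scales, scaling them before computing distances is likely to matter; this sketch stores the raw values.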

Problem 3 Logistic Regression

In this problem, I need to write function(s) to build a classifier using the logistic regression algorithm. Then, I need to apply the test methods (leaveOneOut and randomSplits) to evaluate the learned classifier in terms of accuracy, sensitivity, specificity, and positive predictive value.

The output of the evaluation on the logistic-regression classifier I built could be similar to the output shown below, assuming the salary and tenure values are scaled to have mean 0 and standard deviation 1 and the probability threshold is set to 0.3.

To scale the values, I can apply the function scaleAttrs. I can also apply the code that defines the test methods and measures. The function definitions are provided above in the Reference Code section. I should review the lecture if I need to figure out what they do and how they can be applied in a classification system.
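As a small illustration of the scaling step, the sketch below applies scaleAttrs to the first five tenure and salary values from the provided list and checks that each scaled column has mean 0 and standard deviation 1. Scaling each feature column separately over the examples is an assumption about how the function is meant to be used.

```python
import numpy as np

def scaleAttrs(vals):
    """z-scale a sequence of numbers to mean 0 and std 1."""
    vals = np.array(vals)
    mean = sum(vals)/len(vals)
    sd = np.std(vals)
    vals = vals - mean
    return vals/sd

# First five tenure and salary values from the provided list
tenures = [0.7, 1.9, 2.5, 4.2, 6]
salaries = [48000, 48000, 60000, 63000, 76000]

scaledTenures = scaleAttrs(tenures)
scaledSalaries = scaleAttrs(salaries)
print(scaledTenures.round(3))
print(scaledSalaries.round(3))
```

Without scaling, salary differences in the thousands would dominate any distance computed together with tenure differences of a few years.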

More Hint: In addition to the code provided in this document, I will need to reference the code in lecture that is used to build a model using logistic regression, apply the model, and evaluate the model.

Average of 10 80/20 splits LR
Accuracy = 0.857
Sensitivity = 0.881
Specificity = 0.849
Pos. Pred. Val. = 0.664
Average of LOO testing using LR
Accuracy = 0.865
Sensitivity = 0.846
Specificity = 0.872
Pos. Pred. Val. = 0.698
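For reference, the sketch below shows one way a logistic-regression "method" with the (trainingSet, testSet) signature expected by leaveOneOut and randomSplits could be put together. It is not the lecture's implementation: it swaps in plain batch gradient descent with NumPy, represents each example as a (featureVector, label) pair with label 1 for a paid account, and uses hypothetical names (fitLogistic, lrMethod) and made-up toy data. The learning rate and step count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fitLogistic(X, y, lr=0.1, steps=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)              # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient w.r.t. the weights
        b -= lr * np.mean(p - y)            # gradient w.r.t. the intercept
    return w, b

def lrMethod(trainingSet, testSet, threshold=0.3):
    """Train on trainingSet, classify testSet, and return
    (truePos, falsePos, trueNeg, falseNeg)."""
    w, b = fitLogistic([f for f, _ in trainingSet],
                       [lab for _, lab in trainingSet])
    truePos = falsePos = trueNeg = falseNeg = 0
    for features, label in testSet:
        prob = sigmoid(np.dot(features, w) + b)
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            truePos += 1
        elif predicted == 1 and label == 0:
            falsePos += 1
        elif predicted == 0 and label == 0:
            trueNeg += 1
        else:
            falseNeg += 1
    return truePos, falsePos, trueNeg, falseNeg

# Made-up, already-scaled toy data with a single feature
train = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
test = [([-3.0], 0), ([3.0], 1)]
counts = lrMethod(train, test)
print(counts)
```

A threshold below 0.5, such as the 0.3 mentioned above, labels more examples as positive, which tends to raise sensitivity at the cost of specificity.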

Problem 4 KNN

In this problem, I need to write function(s) to build a classifier using the KNN algorithm. Then, I need to apply the test methods (leaveOneOut and randomSplits) to evaluate the learned classifier in terms of accuracy, sensitivity, specificity, and positive predictive value.

More Hint: In addition to the code provided in this document, I will need to reference the code in lecture that is used to define KNN, apply KNN, and evaluate the classification result.

The output of the evaluation on the KNN classifier I built could be similar to the following:

Average of 10 80/20 splits using KNN (k=3)
Accuracy = 0.848
Sensitivity = 0.574
Specificity = 0.94
Pos. Pred. Val. = 0.763
Average of LOO testing using KNN (k=3)
Accuracy = 0.865
Sensitivity = 0.596
Specificity = 0.959
Pos. Pred. Val. = 0.838
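A KNN "method" with the same (trainingSet, testSet) signature could be sketched as follows. This is an illustration under assumptions, not the lecture code: examples are (featureVector, label) pairs with label 1 for a paid account, the name knnMethod and the toy points are made up, and ties cannot occur because k is odd and there are two classes.

```python
def minkowskiDist(v1, v2, p):
    """Minkowski distance of order p between equal-length sequences."""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1/p)

def knnMethod(trainingSet, testSet, k=3, p=2):
    """Classify each test example by majority vote of its k nearest
    training examples; return (truePos, falsePos, trueNeg, falseNeg)."""
    truePos = falsePos = trueNeg = falseNeg = 0
    for features, label in testSet:
        # Sort training examples by distance and keep the k closest
        nearest = sorted(
            trainingSet,
            key=lambda ex: minkowskiDist(features, ex[0], p))[:k]
        votes = sum(lab for _, lab in nearest)  # number of positive neighbors
        predicted = 1 if votes > k / 2 else 0
        if predicted == 1 and label == 1:
            truePos += 1
        elif predicted == 1 and label == 0:
            falsePos += 1
        elif predicted == 0 and label == 0:
            trueNeg += 1
        else:
            falseNeg += 1
    return truePos, falsePos, trueNeg, falseNeg

# Made-up toy points: a cluster of negatives near (0, 0), positives near (5, 5)
train = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0),
         ([5, 5], 1), ([5, 6], 1), ([6, 5], 1)]
test = [([0.5, 0.5], 0), ([5.5, 5.5], 1), ([4.5, 4.5], 0)]
counts = knnMethod(train, test)
print(counts)
```

The third test point is a negative that sits next to the positive cluster, so it is counted as a false positive.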

Loan

In the following problems, I will analyze a set of loan data points. Each data point is presented as a row in the data file (loan_data.csv). Each row contains the customer data including id, outcome, dti, borrower_score, and payment_inc_ratio. The loan outcome should be used to label the data point for my classifier. I should not use id as a feature in the feature vectors.

Problem 5 Preparing for Building Learning Algorithms

For this problem, I need to write functions/class definition(s) to prepare for building classifiers and testing them using various metrics including accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Problem 2 also required me to define the metric functions, so I can reuse the functions I defined there. In addition, I need to define a class Customer from which I can create data examples. Each example represents a customer in terms of its features. The class definition also needs to allow me to label a customer appropriately based on the loan outcome "paid off" or "default".

After I complete the class Customer definition, I need to process the provided data into a list of data examples. (Each example is a customer.)
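The sketch below shows one way the Customer class and the loading step might look. Several things are assumptions: that the CSV header uses exactly the column names listed above, that the outcome strings are "paid off" and "default", and that "default" is treated as the positive class (label 1). The two data rows are made up and stand in for loan_data.csv; loadCustomers is a hypothetical helper name.

```python
import csv
import io

class Customer(object):
    """One loan customer; id is deliberately left out of the features."""
    featureNames = ('dti', 'borrower_score', 'payment_inc_ratio')

    def __init__(self, row):
        self.featureVec = [float(row[name]) for name in self.featureNames]
        # Assumption: 'default' is the positive class
        self.label = 1 if row['outcome'].strip().lower() == 'default' else 0

    def getFeatures(self):
        return self.featureVec[:]

    def getLabel(self):
        return self.label

def loadCustomers(fileObj):
    """Build a list of Customer examples from an open CSV file object."""
    return [Customer(row) for row in csv.DictReader(fileObj)]

# Made-up rows standing in for the real file
sampleCsv = io.StringIO(
    "id,outcome,dti,borrower_score,payment_inc_ratio\n"
    "1,paid off,10.5,0.65,4.2\n"
    "2,default,21.0,0.40,9.8\n")
customers = loadCustomers(sampleCsv)
for c in customers:
    print(c.getFeatures(), c.getLabel())
```

For the real file, something like `with open('loan_data.csv', newline='') as f: customers = loadCustomers(f)` would take the place of the in-memory sample.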

Again, I would emphasize that I can reuse as much code as possible from the lecture notes and share the code to solve the problems in this assignment.

Problem 6 Logistic Regression

In this problem, I need to write function(s) to build a classifier using the logistic regression algorithm. Then, I need to apply the test methods (leaveOneOut and randomSplits) to evaluate the learned classifier in terms of accuracy, sensitivity, specificity, and positive predictive value.

Additionally, I need to plot the ROC curve and compute the AUC score to evaluate the classifier.
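The ROC curve traces the true-positive rate against the false-positive rate as the probability threshold is swept from high to low, and the AUC is the area under that curve. The sketch below computes both by hand from predicted probabilities and true labels. The function names rocPoints and auc and the five scores are made up for illustration; a library routine could be used instead.

```python
def rocPoints(probs, labels):
    """Return (falsePosRate, truePosRate) pairs, starting at (0, 0) and
    lowering the threshold through each distinct predicted probability."""
    numPos = sum(labels)
    numNeg = len(labels) - numPos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(probs), reverse=True):
        truePos = sum(1 for p, lab in zip(probs, labels)
                      if p >= threshold and lab == 1)
        falsePos = sum(1 for p, lab in zip(probs, labels)
                       if p >= threshold and lab == 0)
        points.append((falsePos / numNeg, truePos / numPos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predicted probabilities and true labels (1 = positive)
probs = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
points = rocPoints(probs, labels)
score = auc(points)
print(points)
print(round(score, 3))
```

Here five of the six positive/negative pairs are ranked correctly, so the AUC is 5/6. Plotting the x and y values of points with plt.plot, along with the diagonal from (0, 0) to (1, 1) as a chance baseline, gives the ROC curve.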

Problem 7 Summary Writeup

For this problem, I am expected to reflect on the classifiers I built on the two data sets (data scientists and loan customers). I need to address the below questions:

Having built classifiers using KNN and logistic regression, how do I think the classifiers compare? Is one better than the other, and why?