Question

Posted on Aug 25, 2024

Machine Learning - Banknote Authentication

(PYTHON ONLY) For this assignment, we will make use of a set of data containing 5 different attributes (see below) extracted from pictures taken of paper money. In each case, it was known whether the money was real or counterfeit. This property is indicated by the last attribute in each row as a 1 (real) or 0 (counterfeit). The attributes are as follows (you do not need to understand what the first 4 attributes mean):

1. variance of Wavelet Transformed image
2. skewness of Wavelet Transformed image
3. curtosis of Wavelet Transformed image
4. entropy of the image
5. class (integer -- 1 is real, 0 is counterfeit)

We are going to build a tool that will be trained on this data to be able to determine, from a new picture, whether this picture is of fake or real money. This is a machine learning technique called "Classification". There are many different classification algorithms which are much more complex than the one that we will use here. If you are interested, take a look at this article. In fact, there are whole courses on machine learning techniques.

There are three phases to this project:

1. Preparing the data
2. Building the classifier
3. Testing the classifier on new data and determining its accuracy

Preparing the Data

The data that we will use for this project is found on the web. The data is composed of (approximately) 1/2 counterfeit samples and 1/2 real samples. These are indicated by the 5th attribute in each row (1 (real) or 0 (counterfeit)). After reading the data from the website (using code), you will make two files on your local machine from this data. The first will be called "training.txt" and the second will be called "testing.txt". (Please do not put an absolute path when you create these files -- just use something like outFile = open("training.txt", "w") -- which will write the file to the same location as where your .py files reside. This will eliminate the need for the TAs to make any changes to your code in order to run it on their own local machine.)
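As a rough illustration of writing the samples out with a relative file name, here is a minimal sketch. The function name write_rows and the comma-separated line format are assumptions made for this example, not requirements of the assignment.

```python
def write_rows(filename, rows):
    """Write each sample as one comma-separated line.

    `filename` is expected to be a relative name such as "training.txt",
    so the file lands in the directory the program is run from.
    (write_rows is a hypothetical helper name.)
    """
    with open(filename, "w") as out_file:
        for row in rows:
            out_file.write(",".join(str(value) for value in row) + "\n")
```

Calling write_rows("training.txt", training_rows) would then create the file without any machine-specific path.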

To read the data from the website, you may find the code in web_scraper.py from assignment #1 useful. Please note that the data is found at https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt. It seems that many people are getting errors when using the https protocol. When you put this link into your code, if you change https to http, this should eliminate the error.
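If you do not have web_scraper.py at hand, reading the data could look roughly like the sketch below. It assumes that each line of the file holds four comma-separated numbers followed by the integer class; the names DATA_URL, parse_rows and fetch_rows are made up for this example. The URL is the one given above with https changed to http.

```python
import urllib.request

# URL from the assignment text, using http as suggested above.
DATA_URL = ("http://archive.ics.uci.edu/ml/machine-learning-databases/"
            "00267/data_banknote_authentication.txt")


def parse_rows(text):
    """Turn the raw text into rows of four floats plus an int class.

    Assumes comma-separated values, one sample per line.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(",")
        rows.append([float(p) for p in parts[:-1]] + [int(parts[-1])])
    return rows


def fetch_rows(url=DATA_URL):
    """Download the data file and parse it (needs network access)."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return parse_rows(text)
```

Keeping the parsing separate from the download means parse_rows can be tested on a short string without touching the network.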

When separating the data into the two files, you want to have, for each file, an even mix of real and counterfeit samples. You may NOT physically count the rows to divide them up; you must do this via code. So, perhaps the first half of the counterfeit samples and the first half of the real samples go into "training.txt" and the second half of the counterfeit samples and the second half of the real samples go into "testing.txt".
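One possible way to do the split in code is sketched below. It assumes the samples are already in a list of rows whose last element is the class; the name split_by_class is hypothetical. When a class has an odd number of samples, this version puts the extra one in the testing set, which is just one reasonable choice.

```python
def split_by_class(rows):
    """Return (training, testing), each with about half of every class."""
    counterfeit = [row for row in rows if row[-1] == 0]
    real = [row for row in rows if row[-1] == 1]

    # Halfway points are computed from the data, never counted by hand.
    half_fake = len(counterfeit) // 2
    half_real = len(real) // 2

    training = counterfeit[:half_fake] + real[:half_real]
    testing = counterfeit[half_fake:] + real[half_real:]
    return training, testing
```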

Building the Classifier

Each data sample has 4 attributes (excluding the last one that indicates the classification of the sample). So, the data looks something like this:

 [[2, 4, 6, 8, 0], [4, 6, 8, 10, 0], [1, 3, 5, 7, 1], [3, 5, 7, 9, 1]]

To build the classifier you will use the data in "training.txt":

Calculate the average of each of the attributes across all the samples with the same classification (0 or 1). For the data shown above, the averages for each attribute across the counterfeit samples (0) are [3, 5, 7, 9] and for those that are real (1) the averages are [2, 4, 6, 8].
Find the midpoints between the averages for the 2 groups by adding the average of the counterfeit samples and the average of the real samples and dividing the result by two. This will be done for each of the attributes. So, for the data shown, the midpoints would be [2.5, 4.5, 6.5, 8.5]. The midpoints are what we will use as our classifier.
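The two steps above can be sketched as follows, using the small example data set from this section. The function names class_averages and build_classifier are hypothetical, and the sketch assumes that both classes have at least one sample (otherwise the division would fail).

```python
def class_averages(rows, label):
    """Average each attribute over the rows whose class equals `label`."""
    selected = [row[:-1] for row in rows if row[-1] == label]
    count = len(selected)
    # zip(*selected) groups the values attribute by attribute.
    return [sum(column) / count for column in zip(*selected)]


def build_classifier(rows):
    """Return the per-attribute midpoints between the two class averages."""
    fake_avg = class_averages(rows, 0)
    real_avg = class_averages(rows, 1)
    return [(f + r) / 2 for f, r in zip(fake_avg, real_avg)]


sample = [[2, 4, 6, 8, 0], [4, 6, 8, 10, 0], [1, 3, 5, 7, 1], [3, 5, 7, 9, 1]]
print(build_classifier(sample))  # [2.5, 4.5, 6.5, 8.5]
```

Because the functions loop over whatever columns are present, this version would also handle a different number of attributes.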

Test the Classifier

In this phase we will use the test set of data "testing.txt" that you created to determine how well your classifier works on data that it has not seen before.

We will check the four attributes against our classifier and rate each attribute as follows:

Attribute 1: if the value >= classifier value for this attribute, then this attribute is classified as 0 (fake) - otherwise, real

Attribute 2: if the value >= classifier value for this attribute, then this attribute is classified as 0 (fake) - otherwise, real

Attribute 3: if the value <= classifier value for this attribute, then this attribute is classified as 1 (real) - otherwise, fake

Attribute 4: if the value >= classifier value for this attribute, then this attribute is classified as 0 (fake) - otherwise, real

If there are more real than fake classifications, then we will classify the entire sample as "real". Otherwise, we'll classify as "fake". If there is a tie -- well, you decide what to do in this case.
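A minimal sketch of the voting rule described above is shown here. The name classify_sample is hypothetical, and treating a tie as "fake" is only an assumption for this example, since the tie-breaking decision is left to you. Note that attribute 3 differs from the others only when the value is exactly equal to the classifier value.

```python
def classify_sample(attributes, midpoints):
    """Return 1 (real) or 0 (fake) by majority vote over the attributes."""
    real_votes = 0
    fake_votes = 0
    for index, (value, midpoint) in enumerate(zip(attributes, midpoints)):
        if index == 2:
            # Attribute 3: value <= midpoint counts as real.
            is_real = value <= midpoint
        else:
            # Attributes 1, 2 and 4: value >= midpoint counts as fake.
            is_real = value < midpoint
        if is_real:
            real_votes += 1
        else:
            fake_votes += 1
    # Assumption: a tie is classified as fake (0).
    return 1 if real_votes > fake_votes else 0
```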

Determine the accuracy of your classifier by testing each of the samples in "testing.txt" and comparing your classification with the actual classification (as shown by the 5th attribute in each row). The accuracy should be expressed as a percentage -- that is, if your classifier got 70 samples correct out of 120, you would determine the percent accuracy by multiplying 70 by 100 and dividing the result by 120.
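The accuracy calculation could be sketched like this, assuming you have collected your classifications and the actual classes in two lists of equal, non-zero length. The name percent_accuracy is hypothetical.

```python
def percent_accuracy(predicted, actual):
    """Percentage of positions where the prediction matches the actual class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct * 100 / len(actual)


# 70 correct out of 120 samples, as in the example above.
predicted = [1] * 120
actual = [1] * 70 + [0] * 50
print(f"Accuracy: {percent_accuracy(predicted, actual):.1f}%")  # Accuracy: 58.3%
```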

Your program must be written in a modular format much like assignment 1. We suggest that you have 4 modules; one for your main program (this will really only contain main()), one for all input/output operations (reading the data from the website, creating the text files on your local machine, reading the content of the files), one for building the classifier and one for testing the classifier.

You should use 'if __name__ == "__main__":' and show the testing of each of the functions in each module.
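As an illustration of this pattern, a module might end with a small block of tests that runs only when the file is executed directly, not when it is imported. The function midpoint and the test values below are made up for this sketch.

```python
def midpoint(a, b):
    """Return the value halfway between a and b."""
    return (a + b) / 2


def run_tests():
    """Quick checks for the functions defined in this module."""
    assert midpoint(3, 2) == 2.5
    assert midpoint(9, 8) == 8.5
    print("midpoint tests passed")


if __name__ == "__main__":
    # Runs only when this file is executed directly.
    run_tests()
```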

You may assume that you know the number of attributes that you are dealing with (or you can make your program general to handle any number of attributes).

Your program should not need any input from the user. It should inform the user of the accuracy of the classifier. Be sure that the user understands what your output represents.