Question

1 Approved Answer

Posted on Sep 11, 2024

Machine learning problem

Since I can't upload the CSV files here, I can send them by email or upload them to GitHub.

Decision trees. Throughout the course, we usually rely on implementations of machine learning algorithms in Python's scikit-learn library. This homework problem is very different: you are asked to implement the ID3 algorithm for building decision trees yourself. Refer to p. 56 in Mitchell for pseudocode of the ID3 algorithm that you are expected to implement. You are not allowed to use sklearn or any other library with a built-in implementation of ID3.

Your program should read in a training dataset, a test dataset (both as csv files), and the name of the target variable (= classification attribute), and output to the screen:

- your decision tree in a readable format (see below)
- its accuracy over the test set

Two pairs of sample datasets are available on Canvas, in the folder Files/homeworks/hw2, namely playtennis_train.csv and playtennis_test.csv, and republican_train.csv and republican_test.csv. The first line of the csv files contains the names of the fields. The target variable is not necessarily in the last column.
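Because the first line holds the field names and the target can sit in any column, it helps to look columns up by name rather than by position. Below is a minimal sketch using the standard csv module. The function name read_examples and the in-memory sample data are made up for illustration; the starter code may already load the files in its own way.

```python
import csv
import io

def read_examples(csv_file, target):
    """Return (examples, attributes) from an open CSV file object.

    examples   -- list of dicts mapping column name -> string value
    attributes -- every column name except the target, in file order
    """
    reader = csv.DictReader(csv_file)   # first line supplies the field names
    examples = list(reader)
    fieldnames = reader.fieldnames or []
    if target not in fieldnames:
        raise ValueError(f"target {target!r} is not a column in the file")
    # The target is removed by name, so its column position does not matter.
    attributes = [name for name in fieldnames if name != target]
    return examples, attributes

# Tiny in-memory stand-in for a file; these columns and rows are invented.
sample = io.StringIO("outlook,playtennis,wind\nsunny,no,weak\nrain,yes,strong\n")
examples, attributes = read_examples(sample, "playtennis")
print(attributes)   # ['outlook', 'wind']
```

With a real file, the same function could be called as read_examples(open(path, newline=""), target), ideally inside a with block so the file is closed afterwards.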

Name your Python program hw2.py. It should take command-line parameters for a file with training data, a file with test data, and the name of the target variable. In particular, it should run correctly when executing the following commands at the command-line:

python hw2.py playtennis_train.csv playtennis_test.csv playtennis

and

python hw2.py republican_train.csv republican_test.csv republican
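For reference, the three command-line parameters could be picked up roughly as follows. This is only a sketch: parse_args is a hypothetical helper, and the starter code may already do the equivalent, in which case nothing here needs to be added.

```python
def parse_args(argv):
    """Split an argv-style list into (train_path, test_path, target)."""
    # argv[0] is the script name, followed by exactly three parameters.
    if len(argv) != 4:
        raise SystemExit("usage: python hw2.py <train.csv> <test.csv> <target>")
    train_path, test_path, target = argv[1:]
    return train_path, test_path, target

# Inside hw2.py this would be called with sys.argv; a literal list is used
# here so the snippet can be run on its own.
args = parse_args(["hw2.py", "playtennis_train.csv", "playtennis_test.csv", "playtennis"])
print(args)
```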

The file hw2.py in Files/homeworks/hw2 contains starter code that visualizes a tree and computes accuracy. It will produce some output when run for the playtennis data (using the exact same command and arguments as written above). The output is nonsense, in the sense that the tree is hard-coded and not constructed based on the data. You need to remove the tree = funTree() statement from the body of the id3(examples, target, attributes) function, and write a correct body for this function yourself. This is the only part of the starter code that you are expected to touch. You can of course introduce your own additional functions as you deem appropriate. Your final program hw2.py should also work for the republican data and for other, similarly structured, datasets.
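A rough sketch of what a recursive body for id3(examples, target, attributes) might look like is given below. Several things are assumptions, since the starter code is not shown in this post: examples is taken to be a list of dicts keyed by column name, and the tree is returned as a nested dict with made-up keys "attribute", "branches", and "default". The real starter code almost certainly expects its own tree format, so the return statements would need adapting. The sketch also simplifies Mitchell's pseudocode slightly: it only creates branches for values that occur in the examples at a node, and keeps the majority label as a fallback for unseen values.

```python
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, target, attribute):
    total = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Base cases: every example has the same label, or nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, target, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted(set(ex[best] for ex in examples)):
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = id3(subset, target, remaining)
    # Hypothetical node format; "default" covers values unseen during training.
    return {"attribute": best, "branches": branches, "default": majority}

def classify(tree, example):
    # Walk down until a leaf (a plain label rather than a dict) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attribute"]], tree["default"])
    return tree

# Invented toy data, not the contents of playtennis_train.csv.
train = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
]
tree = id3(train, "play", ["outlook", "wind"])
print(tree["attribute"])
print(classify(tree, {"outlook": "rain", "wind": "strong"}))
```

On this toy data the root split is on outlook, and only the rain branch needs a second split on wind.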

Important notes:

Write your code in Python 3. Python 3 is the first version of Python in the history of the language to break backward compatibility. This means that code written for earlier versions of Python probably won't run on Python 3. I won't be able to run and grade your program if it is written in Python 2.x. Any version of Python 3.x should be fine.

I will not test your code on datasets with continuous-valued attributes. Your implementation of ID3 can assume that all attributes are discrete-valued. You are expected to use information gain to guide the search for the best split attribute.
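To make the information gain criterion concrete, here is a small sketch that computes entropy and the gain of one discrete attribute. The function names and the dict-per-row data layout are assumptions, not something given in the assignment. The label counts follow the PlayTennis illustration in Mitchell's chapter on decision trees (9 yes and 5 no overall; wind = weak with 6 yes and 2 no; wind = strong with 3 yes and 3 no), and the course's csv file may contain different rows.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, target, attribute):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    labels = [ex[target] for ex in examples]
    groups = {}   # attribute value -> labels of the examples with that value
    for ex in examples:
        groups.setdefault(ex[attribute], []).append(ex[target])
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rows(wind, label, n):
    # Helper that builds n identical rows, for this demonstration only.
    return [{"wind": wind, "play": label} for _ in range(n)]

examples = (rows("weak", "yes", 6) + rows("weak", "no", 2)
            + rows("strong", "yes", 3) + rows("strong", "no", 3))

print(round(entropy([ex["play"] for ex in examples]), 3))    # 0.94
print(round(information_gain(examples, "play", "wind"), 3))  # 0.048
```

At each node, ID3 would evaluate this gain for every remaining attribute and split on the one with the largest value.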