Question

Assignment for module 6

In this assignment, you are required to implement a document classifier using the Naive Bayes algorithm in your favorite programming language. You will use the provided training data to train the classifier to predict the owner (that is, the category) of each document, and you will test your classifier using the provided testing data. Your submitted implementation should contain the basic parts: reading the input files, computing the prior probabilities from the training data, computing the likelihood of the terms in each category from the training data, and predicting the owner of each document in the testing data and saving the predictions to an output file.

Data set description

The dataset used in this project (i.e., News-messages.zip) contains messages and documents collected from three categories: Car, Computer hardware, and Sport. After downloading and extracting the file, you will find two folders that contain the training data (under the folder Training) and the testing data (under the folder Testing). There are 5883 documents in total: 3516 in the training dataset and 2367 in the testing dataset. In both datasets, each category is stored in a subdirectory, with each article stored as a separate file. Therefore, all articles under the folder Training/Car are documents or messages related to cars. If you want to look at these documents in more detail, all of the files can be opened with Notepad (on Windows), TextEdit (on Mac OS), or Gedit (on Linux/Ubuntu).

A framework for the Naive Bayes Classifier

We have provided a framework for you. The framework is written in Python 2.7.3. You can find the two additional files on Canvas, i.e., main.py and submit.py.

- main.py: The supporting functions for the Naive Bayes classifier, such as reading the input files, converting the data, and writing the results to a file. You DO NOT need to modify this file; any changes made within this file will NOT be considered.
- submit.py: The place where you implement the Naive Bayes classifier. Please do NOT change the file name or the function names. If you need to create new helper functions, feel free to do so within the submit.py file.

The structure of the provided framework

- In the training stage, the input parameter is a dictionary called training_set. Each document in the Training folder has been converted to lower case, and the punctuation has been removed. Each document is stored under its own document ID (i.e., the filename).
- We use P_C, a Python dictionary, to store the probabilities of the categories observed in the training data.
- We use P_w_C, a Python dictionary of dictionaries, to store the probabilities of the terms in each category from the training data.

How to run the framework

Please make sure that main.py and submit.py are in the same folder. You can simply run the main function as follows:

    python main.py

This requires the News-messages folder to be in the same folder as main.py, and it generates an output file named output.txt. Alternatively, you can specify the path to the training folder, the path to the testing folder, and the name of the output file using the -T, -t, and -o options on the command line:

    python main.py -T PATH/TO/TRAINING -t PATH/TO/TESTING -o OUTPUT_FILE

If you are not familiar with Python

You can choose your preferred programming language to finish the task. Your submitted implementation should contain the basic components: reading the input files, computing the probability of the categories from the training data, computing the probability of the terms in each category from the training data, inferring the category of each document in the testing data set, and any other supporting functions you may need. You should explain your implementation either by including comments in the code or in a section of your report.

Your classifier should generate a predicted owner for each document in the testing dataset. The predictions should be stored in a tab-separated two-column file, with the filename (including the folder name) in the first column and the predicted owner of the file in the second column. For example:

    Testing/Car/103007-Car	Car
    Testing/Car/103008-Car	Car
    Testing/Sport/53879-Sport	Computer_hardware

Please note that you are not allowed to use any pre-implemented categorization functions to solve the problem. Directly calling a pre-implemented categorization algorithm from any package will result in 0 points for this question.

What to submit for this problem:

- Your implementation of this problem.
- Instructions for running your code.
- The output file from your classifier, named output.txt.
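The two training quantities the assignment asks for, the category priors and the per-category term likelihoods, can be sketched as follows. This is a minimal sketch under stated assumptions, not the expected solution: the function name train_sketch, whitespace tokenization, a multinomial word-count model, and add-one (Laplace) smoothing through the alpha parameter are all assumptions that the assignment text does not specify. The returned dictionaries follow the P_C[category] and P_w_C[word][category] layout described for the framework.

```python
import collections


def train_sketch(training_set, alpha=1.0):
    """Estimate P(C) and smoothed P(w|C) from the training documents.

    training_set maps doc id -> {"category": str, "content": str}.
    alpha is an assumed add-one (Laplace) smoothing constant.
    """
    doc_counts = collections.Counter()  # documents per category
    word_counts = collections.defaultdict(collections.Counter)  # category -> word -> count
    vocabulary = set()

    for doc in training_set.values():
        category = doc["category"]
        doc_counts[category] += 1
        for word in doc["content"].split():  # assumption: whitespace tokens
            word_counts[category][word] += 1
            vocabulary.add(word)

    # Prior: fraction of training documents that belong to each category.
    total_docs = float(sum(doc_counts.values()))
    P_C = dict((c, n / total_docs) for c, n in doc_counts.items())

    # Likelihood: smoothed relative frequency of each word within a category.
    P_w_C = dict((w, {}) for w in vocabulary)
    for c in doc_counts:
        denom = float(sum(word_counts[c].values()) + alpha * len(vocabulary))
        for w in vocabulary:
            P_w_C[w][c] = (word_counts[c][w] + alpha) / denom
    return P_C, P_w_C


# Toy data, invented purely for illustration.
toy = {
    "1": {"category": "Car", "content": "engine oil engine"},
    "2": {"category": "Sport", "content": "goal team"},
    "3": {"category": "Car", "content": "oil"},
}
priors, likelihoods = train_sketch(toy)
print(priors["Car"])                 # 2 of the 3 toy documents are Car
print(likelihoods["engine"]["Car"])  # (2 + 1) / (4 + 4) = 0.375
```

Smoothing is used here so that a word never observed in a category still gets a small non-zero probability; without it, a single unseen word would drive the whole product for that category to zero.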

Start submit.py:

import math
import operator


def training(training_set):
    """
    This is the place where you implement the training stage of the Naive Bayes classifier.

    The input parameter is training_set: a dictionary that contains all the training documents.
    The dict is indexed by the document id. Each document is represented as a dictionary
    with two keys, i.e., category and content. The category is the known category
    of the document, while the words of the document are stored in content.
    Every word of the content has been changed to lower case. The punctuation
    has been removed from the content.

    For example, you could get the detailed information of document 101551 (one of
    the documents in the Car category under the training set) as follows:
    The content:
        content = training_set["101551"]["content"]
    and the category:
        category = training_set["101551"]["category"]
    Both the content and the category of one particular document are strings.

    You need to return two variables in this function:
    P_C: The probability of each category shown in the training set. P_C should be
         a dictionary. The keys are the names of the categories, and the values are the
         probabilities of the corresponding categories. For example:
             P_C["Computer_hardware"]=0.5
    P_w_C: The probabilities of each word shown in each category in the training set.
         P_w_C should be a dictionary of dictionaries. The first key should be
         the word, while the second key should be the category, and the value should be the
         probability of the word shown in the category. For example:
             P_w_C["the"]["Computer_hardware"]=0.008

    These two variables have been declared for you. Please do not remove these lines.
    """
    P_C={} # Please do not remove this line
    P_w_C={} # Please do not remove this line

    # Begin your implementation after this line
    ###############################################

    ################################################
    # Finish your implementation before this line

    return P_C, P_w_C # Please do not remove this line


def infer(document, P_C, P_w_C):
    """
    This is the place where you implement the inferring stage of Naive Bayes classification.

    There are three input parameters, which are:
    document: A string that contains only the content of the document that needs to be classified.
         The content has been converted to lower case, and the punctuation has been removed
         before it is fed into this function.
    P_C: The same parameter you generated in the training stage, i.e., it contains the
         probability of each category shown in the training set. P_C is a dictionary.
         The keys are the names of the categories, and the values are the probabilities
         of the corresponding categories. For example:
             P_C["Computer_hardware"]=0.5
    P_w_C: The same parameter you generated in the training stage.
         The probabilities of each word shown in each category in the training set.
         P_w_C is a dictionary of dictionaries. The first key should be
         the word, while the second key should be the category, and the value should be the
         probability of the word shown in the category. For example:
             P_w_C["the"]["Computer_hardware"]=0.008

    The return value of this function should be the inferred category of the document. The
    return value should be a string. The return variable has been declared for you. Please do
    not remove it.
    """
    inferred_category = "" # Please do not remove this line

    # Begin your implementation after this line
    ###############################################

    ################################################
    # Finish your implementation before this line

    return inferred_category # Please do not remove this line

End submit.py:
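One common way to fill in the inferring stage is to score each category by the log of its prior plus the sum of the logs of its word likelihoods, and to return the highest-scoring category. The sketch below is only an illustration under assumptions: the name infer_sketch is hypothetical, words that do not appear in P_w_C are skipped, and every stored probability is assumed to be strictly positive (for example because it was smoothed during training). Working in log space is a design choice that avoids floating-point underflow when many small probabilities are multiplied.

```python
import math


def infer_sketch(document, P_C, P_w_C):
    """Return the category with the highest log-posterior score.

    Assumes P_C[category] and P_w_C[word][category] are strictly positive.
    """
    words = document.split()  # assumption: whitespace tokens
    best_category = ""
    best_score = None
    for category in P_C:
        score = math.log(P_C[category])
        for word in words:
            # Assumption: words never seen in training carry no evidence
            # and are skipped rather than penalized.
            if word in P_w_C and category in P_w_C[word]:
                score += math.log(P_w_C[word][category])
        if best_score is None or score > best_score:
            best_score = score
            best_category = category
    return best_category


# Hand-made probability tables, invented purely for illustration.
toy_P_C = {"Car": 0.5, "Sport": 0.5}
toy_P_w_C = {
    "engine": {"Car": 0.4, "Sport": 0.1},
    "goal": {"Car": 0.1, "Sport": 0.4},
}
print(infer_sketch("engine engine goal", toy_P_C, toy_P_w_C))  # Car
```

In the toy call, Car scores 0.5 * 0.4 * 0.4 * 0.1 against 0.5 * 0.1 * 0.1 * 0.4 for Sport, so Car is returned.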

Start main.py:

import math
import traceback
import argparse
import os
import string
import operator
from sys import argv
from submit import training,infer

parser = argparse.ArgumentParser()
parser.add_argument("-T", "--train", nargs='?', default="News-messages/Training/",
                    help="Path to the Folder of the training sets. Default is News-messages/Training/")
parser.add_argument("-t", "--test", nargs='?', default="News-messages/Testing/",
                    help="Path to the Folder of the testing sets. Default is News-messages/Testing/")
parser.add_argument("-o","--output", nargs='?', default="output.txt",
                    help="Output file name. Default is output.txt")
args = parser.parse_args()

train_folder = args.train
test_folder = args.test
output_file = args.output

if not (os.path.isdir(train_folder) and os.path.isdir(test_folder)):
    print "[Error] The training and/or testing folder does not exist!"
    exit()

category_count={}
word_count_in_category={}
training_set={}


def main():
    try:
        print "[Info] Reading documents from training set ..."
        file_counter = 0
        for root, dirs, files in os.walk(train_folder):
            # Only leaf folders (one per category) contain documents
            if len(dirs) != 0:
                continue
            else:
                for file in files:
                    file_counter += 1
                    content=""
                    with open(os.path.join(root,file),"r") as inputfile:
                        category=os.path.basename(root)
                        for line in inputfile:
                            content+=cleanline(line)
                    training_set[file]={}
                    training_set[file]["content"]=content
                    training_set[file]["category"]=category
        print "[Info] " + str(file_counter) + " documents loaded for training."

        print "[Info] Training stage started."
        P_C, P_w_C = training(training_set)
        print "[Info] Training stage is done."

        print "[Info] Start inferring stage."
        performance={}
        output={}
        for root, dirs, files in os.walk(test_folder):
            if len(dirs) != 0:
                continue
            else:
                # Per-category counters: the folder name is the true category
                correct = 0
                wrong = 0
                for file in files:
                    content=""
                    correct_category = os.path.basename(root)
                    if correct_category not in performance:
                        performance[correct_category]={}
                    with open(os.path.join(root,file),"r") as inputfile:
                        for line in inputfile:
                            content+=cleanline(line)
                    predict_category = infer(content,P_C, P_w_C)
                    output[os.path.join(root,file)]=predict_category
                    if correct_category == predict_category:
                        correct += 1
                    else:
                        wrong += 1
                performance[correct_category]['correct'] = correct
                performance[correct_category]['wrong'] = wrong
        print "[Info] Inferring stage is done."

        print "[Info] Generating output file."
        generate_output(output,output_file)
        print "[Info] Output file generated."

        overall_c = 0
        overall_w = 0
        for cat in performance:
            print "In category %s: \tCorrectly classified %d, %d documents are wrong." \
                % (cat, performance[cat]["correct"],performance[cat]["wrong"])
            overall_w += performance[cat]["wrong"]
            overall_c += performance[cat]["correct"]
        print "Overall correctness: %.4f" % (overall_c/float(overall_c+overall_w))
    except Exception as e:
        print "[Error] There is an error: "
        # print_exc() writes the traceback itself and returns None
        traceback.print_exc()


def generate_output(output,output_file):
    # One tab-separated "filename<TAB>predicted category" row per document
    with open(output_file,"w") as of:
        for key in output:
            of.write(key+"\t"+output[key]+"\n")


def cleanline(line):
    # Remove all punctuation from the string
    line = line.translate(None, string.punctuation)
    # Change every word in the line to lower case
    line = line.lower()
    return line


if __name__ == '__main__':
    main()

End main.py:
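The cleanline helper in main.py relies on the two-argument call line.translate(None, string.punctuation), which works only on Python 2 byte strings, consistent with the framework targeting Python 2.7.3. For anyone reimplementing the helper under Python 3, a rough equivalent might look like the following. The name cleanline_py3 is hypothetical and is not part of the provided framework.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def cleanline_py3(line):
    """Strip ASCII punctuation and lower-case the line (Python 3 only)."""
    return line.translate(_PUNCTUATION_TABLE).lower()


print(repr(cleanline_py3("Hello, World!\n")))  # 'hello world\n'
```

As in the original helper, whitespace (including the trailing newline) is left untouched, so words from consecutive lines stay separated when the cleaned lines are concatenated.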
