Question
Q2. Is it a conference announcement?
The DBWorld e-mails data set contains 64 e-mails collected from the DBWorld mailing list, classified into two classes: "conference announcements" and "everything else". The data has 64 instances and 4702 attributes (note that the number of instances is much smaller than the number of attributes). Each word in the vocabulary of the e-mail collection defines an attribute.
a) Study the description of the data set and plan how to convert the data file into a plain csv file that you can read into an ndarray in sklearn. You may want to consider the loadarff reader in scipy.
b) Learn a MultinomialNB classification model on the dataset. Use 3-fold cross validation to evaluate the performance of the classifier.
c) Read the overview of Ensemble methods and use a Bagging classifier built with the MultinomialNB classifier as the base estimator. The Bagging classifier is quite powerful, as it allows sampling both instances of the labelled data and features (attributes). Experiment with different numbers of base estimators, numbers of samples to draw to train each base estimator, and numbers of features to draw to train each base estimator. Evaluate your choices of hyperparameters using 3-fold cross validation. Use the default values for the remaining BaggingClassifier hyperparameters.
d) Summarize your findings from parts (b) and (c). Which classifier and hyperparameter values performed best?
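One possible way to approach parts (a) to (c) is sketched below. It is only an outline under stated assumptions, not a reference solution: the helper names (`load_arff`, `save_csv`, `cv_accuracy`, `bagging_grid`) are made up for illustration, the class label is assumed to be the last attribute in the ARFF file and to hold numeric codes such as 0/1, and the hyperparameter grids are arbitrary starting values rather than recommended ones.

```python
import itertools

import numpy as np
from scipy.io import arff
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB


def load_arff(source):
    """(a) Read an ARFF file (path or file-like object) into X, y, names.

    Assumption: the class is the LAST attribute and every attribute,
    including the class, holds numeric codes such as 0/1.
    """
    data, meta = arff.loadarff(source)
    names = list(meta.names())

    def as_float(name):
        col = data[name]
        if col.dtype.kind == "S":  # nominal values arrive as bytes, e.g. b"1"
            col = np.char.decode(col, "utf-8")
        return col.astype(float)

    X = np.column_stack([as_float(n) for n in names[:-1]])
    y = as_float(names[-1]).astype(int)
    return X, y, names


def save_csv(path, X, y, names):
    """(a) Write features plus label as plain CSV with one header row.

    Assumption: attribute names contain no commas.
    """
    np.savetxt(path, np.column_stack([X, y]), delimiter=",", fmt="%g",
               header=",".join(names), comments="")


def cv_accuracy(model, X, y, seed=0):
    """(b) Mean accuracy over 3 stratified, shuffled folds."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()


def bagging_grid(X, y, n_estimators_grid=(10, 50, 100),
                 max_samples_grid=(0.5, 1.0),
                 max_features_grid=(0.1, 0.5, 1.0), seed=0):
    """(c) Score every grid combination; return the list sorted best first."""
    results = []
    for n, s, f in itertools.product(n_estimators_grid, max_samples_grid,
                                     max_features_grid):
        # The base estimator is passed positionally because its keyword name
        # differs between scikit-learn versions.
        model = BaggingClassifier(MultinomialNB(), n_estimators=n,
                                  max_samples=s, max_features=f,
                                  random_state=seed)
        results.append(((n, s, f), cv_accuracy(model, X, y, seed)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

With the data file at hand (the path `dbworld_bodies.arff` is an assumed name here), the pieces could be chained as `X, y, names = load_arff("dbworld_bodies.arff")`, then `save_csv("dbworld.csv", X, y, names)`, `cv_accuracy(MultinomialNB(), X, y)` for part (b), and `bagging_grid(X, y)` for part (c). Comparing the single-model score with the top rows of the sorted grid gives the material for the summary in part (d). With only 64 instances, 3-fold scores can vary noticeably between random seeds, so repeating the comparison with a few different `seed` values is probably worthwhile before naming a winner.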