The total number of points for this assignment is 120 points. Please submit your assignment in a Word file. Use this assignment file as a template to enter and copy-paste your answers for your assignment submission. Include both the Python code and the results. Keep the problem descriptions and insert your answers after each question. Please name your assignment with this format: Lastname.Firstname.Assignment1.

1. (10 points) Download the BostonHousing.xls file and read the data description. The target attribute in this dataset is the median value of the homes, denoted MEDV. In Excel, delete the CAT.MEDV attribute (which is a binary attribute converted from the MEDV attribute). Then, save the remaining data to a CSV file (called, say, BostonHousing.csv).

a. Build a regression tree model (with the target attribute MEDV) and draw the tree. Follow the steps in the reg-trees-salary example (but you do not need to do data preprocessing). Set min_samples_leaf=50 so that the tree is small enough to be printed on a page. Also set random_state=1. Keep the other default parameters unchanged.

b. Evaluate the regression tree model using 5-fold cross validation. Use Scikit-Learn's cross_val_score() function (set cv=5). Report the average RMSE and average MAE.
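Parts (a) and (b) might look roughly like the sketch below. This is only an illustration: it uses a small synthetic DataFrame in place of BostonHousing.csv (the feature column names here are placeholders), and the course's reg-trees-salary example may structure the code differently.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for BostonHousing.csv; in the assignment, load the
# real file with df = pd.read_csv('BostonHousing.csv') instead.
rng = np.random.RandomState(1)
df = pd.DataFrame(rng.rand(506, 4), columns=['CRIM', 'RM', 'LSTAT', 'MEDV'])

X = df.drop(columns='MEDV')
y = df['MEDV']

# (a) Regression tree; min_samples_leaf=50 keeps the tree small enough
# to print on a page, random_state=1 makes it reproducible
tree = DecisionTreeRegressor(min_samples_leaf=50, random_state=1)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# (b) 5-fold cross validation; sklearn returns negated errors, so flip
# the sign before averaging
neg_mse = cross_val_score(tree, X, y, cv=5, scoring='neg_mean_squared_error')
neg_mae = cross_val_score(tree, X, y, cv=5, scoring='neg_mean_absolute_error')
print('Average RMSE:', np.sqrt(-neg_mse).mean())
print('Average MAE:', (-neg_mae).mean())
```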

2. (5 points) Build a linear regression model on the BostonHousing dataset (with the target attribute MEDV). Follow the steps in the linear-reg-salary example.
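A minimal sketch of the linear regression fit, again on a synthetic stand-in for BostonHousing.csv (column names are placeholders; the linear-reg-salary example may differ in detail):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in; replace with pd.read_csv('BostonHousing.csv')
rng = np.random.RandomState(1)
df = pd.DataFrame(rng.rand(506, 4), columns=['CRIM', 'RM', 'LSTAT', 'MEDV'])

X = df.drop(columns='MEDV')
y = df['MEDV']

# Fit ordinary least squares and inspect the fitted coefficients
lr = LinearRegression().fit(X, y)
print('Intercept:', lr.intercept_)
print('Coefficients:', dict(zip(X.columns, lr.coef_)))
```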

3. (25 points) Open the AussieCredit.arff file with Notepad or WordPad and read the data description. This is a real-world credit evaluation dataset. Due to confidentiality concerns, the names and values of the attributes were disguised, and the two class values are represented by plus (+) and minus (-). Develop a decision tree on this dataset. Follow the steps in the decision-trees-weather-missing1 example, but data preprocessing is more involved for this problem due to many missing values and rare values. Specifically,

a. There are 6 values in 4 attributes that each occur no more than 7 times (count <= 7): A4: 'l' (lowercase L); A5: 'gg'; A6: 'r'; A7: 'dd', 'n', 'o'. Please treat them as missing values and replace them with the mode of the attribute.

b. Impute missing numeric values using the mean of the attribute.

c. Convert categorical attributes to dummies using one hot encoding.

d. For each raw and transformed data frame, use .info() to provide data description.

e. Set min_samples_leaf=2 and random_state=1 when building decision tree, but do not draw/display the tree.

f. Evaluate the decision tree model using a holdout test set instead of cross validation. Set test_size=0.3 and random_state=1 when calling train_test_split(). Report the testing accuracy and confusion matrix.
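The preprocessing and evaluation steps (a)-(f) can be sketched as below. This uses a small synthetic frame with just one numeric and one categorical attribute standing in for AussieCredit; the real assignment loads the .arff file and must handle every attribute listed there.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the AussieCredit data; 'l' plays the role of a
# rare value in A4, and A2 has some injected missing numeric values
rng = np.random.RandomState(1)
df = pd.DataFrame({
    'A2': rng.rand(100),
    'A4': rng.choice(['u', 'y', 'l'], size=100, p=[0.6, 0.35, 0.05]),
    'class': rng.choice(['+', '-'], size=100),
})
df.loc[rng.choice(100, size=5, replace=False), 'A2'] = np.nan
df.info()  # (d) describe the raw frame

# (a) Treat the rare value 'l' in A4 as missing, then impute with the mode
df['A4'] = df['A4'].replace('l', np.nan)
df['A4'] = df['A4'].fillna(df['A4'].mode()[0])

# (b) Impute missing numeric values with the mean
df['A2'] = df['A2'].fillna(df['A2'].mean())

# (c) One-hot encode categorical attributes
X = pd.get_dummies(df.drop(columns='class'))
y = df['class']
X.info()  # (d) describe the transformed frame

# (e)/(f) Decision tree evaluated on a 30% holdout set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier(min_samples_leaf=2, random_state=1).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print('Testing accuracy:', acc)
print(confusion_matrix(y_te, clf.predict(X_te)))
```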

4. (15 points) Open the CongressVote.arff file with Notepad or WordPad and read the data description. This dataset has many missing values, labeled by ?. Our task is to build a Naive Bayes model to classify each instance (i.e., a House member) as a democrat or republican based on his/her voting records. The steps for this problem are similar to (but not exactly the same as) those in the naive-bayes-fraud-detect example.

a. Encode the categorical attributes using ordinal encoding, treating the missing-value label '?' as another category. As a result, there are 3 categories for each attribute (i.e., ?, n, and y) and no missing values. Since a ? likely represents that the member abstained from the vote, it is more appropriate to treat it as a third category than to impute it with the mode.

b. Build a Naive Bayes model using all the data as training data and evaluate the model also using the training data (i.e., do not split the data into training/test set). Report training accuracy only (no confusion matrix).

c. Verify that if '?' in the original data is treated as a missing value (NaN), then it must be imputed before model building. Do not try to impute the data, but display the error message from the output when you attempt to build the model (ValueError:).
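Parts (a) and (b) might be sketched as follows, on synthetic vote data in place of CongressVote.arff. CategoricalNB is used here as one plausible Naive Bayes variant for ordinal-encoded categories; the course example may use a different one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Synthetic stand-in for the CongressVote data (votes are '?', 'n', 'y');
# column names v1..v4 are placeholders
rng = np.random.RandomState(1)
votes = pd.DataFrame(rng.choice(['?', 'n', 'y'], size=(50, 4)),
                     columns=['v1', 'v2', 'v3', 'v4'])
party = pd.Series(rng.choice(['democrat', 'republican'], size=50))

# (a) Ordinal-encode with '?' kept as a third category -- no imputation
X = OrdinalEncoder().fit_transform(votes)

# (b) Fit and evaluate on the same (full) data, per the assignment
nb = CategoricalNB().fit(X, party)
print('Training accuracy:', nb.score(X, party))

# (c) If '?' were replaced by NaN instead, nb.fit() would raise a
# ValueError complaining that the input contains NaN
```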

5. (15 points) Download the BostonHousing2.xls file and read the data description. The dataset in the FullData sheet (506 instances) is taken from the BostonHousing.xls file used in Problems 1 and 2. The target attribute is CAT.MEDV, a binary attribute converted from MEDV (which has been removed). Within Excel, save the FullData sheet as a CSV file. The dataset in the SmallData sheet includes the first 10 instances of the FullData and a subset of the original predictors. Save it as another CSV file. Perform k-NN classification on these two datasets. Follow similar steps to those in the knn-Admission example.

a. Perform k-NN classification on the SmallData (10 instances). Classify the 6th instance (row 7, highlighted in the SmallData spreadsheet) using 1-NN (k = 1) based on the other 9 instances. Does 1-NN classify the instance correctly?

b. Now, work on the FullData (506 instances). Perform a 10-fold cross validation without data leakage for 5-NN classification (k = 5). Use the following parameters: StratifiedKFold(n_splits=10, random_state=1, shuffle=True).
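For part (b), the usual way to avoid data leakage is to put the scaler inside a Pipeline, so each fold is scaled using only its own training split. A minimal sketch on synthetic data standing in for the FullData CSV:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the FullData CSV (506 instances, binary target)
rng = np.random.RandomState(1)
X = rng.rand(506, 5)
y = rng.choice(['0', '1'], size=506)

# Scaler inside the Pipeline => fitted on each fold's training split
# only, which is what prevents leakage into the validation fold
pipe = Pipeline([('scale', MinMaxScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(pipe, X, y, cv=cv)
print('Average accuracy:', scores.mean())
```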

6. (10 points) Perform a 10-fold cross validation with logistic regression on the AussieCredit dataset (used in Problem 3).

a. There are 6 values in 4 attributes that each occurs no more than 7 times (count<=7): A4: 'l' (lower case L); A5: 'gg'; A6: 'r'; A7 'dd', 'n', 'o' Please treat them as missing values and replace them with the mode of the attribute.

b. After part (a), perform a 10-fold cross validation with logistic regression using Pipeline. Follow the steps in the logistic-reg-pipeline-weather-missing4 example. But do not try to do this problem without Pipeline; do not try grid search; and do not specify penalty='none' when instantiating LogisticRegression(); that is, use LogisticRegression(random_state=1).
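A pipeline for mixed numeric/categorical data typically wraps imputation and encoding in a ColumnTransformer. The sketch below uses a tiny synthetic frame in place of the cleaned AussieCredit data (one numeric and one categorical column, both placeholders); the scaling/encoding choices here are assumptions, so follow the course example where it differs:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for AussieCredit after part (a)'s rare-value cleanup
rng = np.random.RandomState(1)
df = pd.DataFrame({'A2': rng.rand(100),
                   'A4': rng.choice(['u', 'y'], size=100),
                   'class': rng.choice(['+', '-'], size=100)})
num_cols, cat_cols = ['A2'], ['A4']

# Numeric branch: mean-impute then scale; categorical branch:
# mode-impute then one-hot encode
pre = ColumnTransformer([
    ('num', Pipeline([('imp', SimpleImputer(strategy='mean')),
                      ('sc', MinMaxScaler())]), num_cols),
    ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                      ('ohe', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
])
pipe = Pipeline([('pre', pre), ('lr', LogisticRegression(random_state=1))])
scores = cross_val_score(pipe, df.drop(columns='class'), df['class'], cv=10)
print('Average accuracy:', scores.mean())
```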

7. (20 points) Perform 10-fold cross validation with SVM classification (SVC) and regression (SVR) on the Boston housing data, using Pipeline and grid search. Follow the steps in the svc-Admission and svr-salary1 examples respectively, but do not try plotting.

a. Perform a 10-fold cross validation with SVC on the BostonHousing2 dataset (used in Problem 5) using Pipeline. Specifically, (i) report average accuracy, confusion matrix, precision, recall, and F1 score; and (ii) use grid search to find the best C from C = [1, 5, 10, 50, 100, 500, 1000].

b. Perform a 10-fold cross validation with SVR on the BostonHousing dataset (used in Problems 1 and 2) using Pipeline. Specifically, (i) report average RMSE and MAE; and (ii) use grid search to find the best combination of C and epsilon from C = [1, 5, 10, 50, 100, 500, 1000] and epsilon = [0.05, 0.1, 0.15, 0.2], based on the neg_mean_squared_error criterion (by specifying scoring='neg_mean_squared_error' when instantiating GridSearchCV()).
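The grid-search portion of both parts can be sketched as below, on small synthetic data standing in for the two Boston housing datasets. Note the `step__param` naming convention GridSearchCV uses to reach parameters inside a Pipeline; the scaler choice here is an assumption.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC, SVR
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for BostonHousing2 (classification) and
# BostonHousing (regression)
rng = np.random.RandomState(1)
Xc, yc = rng.rand(120, 4), rng.choice(['0', '1'], size=120)
Xr, yr = rng.rand(120, 4), rng.rand(120)

# (a) SVC: grid search over C with 10-fold CV
svc_pipe = Pipeline([('scale', MinMaxScaler()), ('svc', SVC())])
svc_grid = GridSearchCV(svc_pipe,
                        {'svc__C': [1, 5, 10, 50, 100, 500, 1000]}, cv=10)
svc_grid.fit(Xc, yc)
print('Best C:', svc_grid.best_params_)

# (b) SVR: grid search over C and epsilon, scored by negative MSE
svr_pipe = Pipeline([('scale', MinMaxScaler()), ('svr', SVR())])
svr_grid = GridSearchCV(svr_pipe,
                        {'svr__C': [1, 5, 10, 50, 100, 500, 1000],
                         'svr__epsilon': [0.05, 0.1, 0.15, 0.2]},
                        cv=10, scoring='neg_mean_squared_error')
svr_grid.fit(Xr, yr)
print('Best params:', svr_grid.best_params_)
print('RMSE at best params:', np.sqrt(-svr_grid.best_score_))
```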

8. (20 points) Perform 10-fold cross-validation evaluation using Pipeline for (i) decision trees, (ii) k-NN, (iii) logistic regression, (iv) SVC, and (v) naive Bayes on the AussieCredit dataset (used in Problems 3 and 6).

a. There are 6 values in 4 attributes that each occurs no more than 7 times (count<=7): A4: 'l' (lower case L); A5: 'gg'; A6: 'r'; A7 'dd', 'n', 'o' Please treat them as missing values and replace them with the mode of the attribute.

b. Use the following parameters for the five classifiers above:

  • Impute missing numeric values with mean and missing categorical values with mode.
  • For decision trees, k-NN, logistic regression and SVC, use MinMaxScaler() for normalizing numeric features and OneHotEncoder() for encoding categorical features.
  • For naive Bayes, use KBinsDiscretizer(n_bins=5, encode='ordinal') for binning numeric features and OrdinalEncoder() for encoding categorical features.
  • For decision trees, set min_samples_leaf=2, random_state=1.
  • For k-NN, set n_neighbors=1.
  • For logistic regression, set penalty='l2', solver='lbfgs', random_state=1.
  • For SVC, set C=1, kernel='linear'.

c. Report average accuracy and confusion matrix based on 10-fold cross validation results.

d. Plot ROC curves with AUC values for all five classifiers based on the 10-fold cross validation results. Set pos_label='+'.
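The per-classifier evaluation loop for parts (c) and (d) can be sketched as below. To keep it short, only two of the five classifiers are shown, the data is a synthetic numeric stand-in for the preprocessed AussieCredit frame, and the actual plotting (e.g., matplotlib) is omitted; the cross-validated probabilities feeding roc_curve are the key step.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, auc, confusion_matrix

# Synthetic stand-in with class labels '+' and '-', as in AussieCredit
rng = np.random.RandomState(1)
X = rng.rand(200, 5)
y = np.where(X[:, 0] + rng.rand(200) > 1, '+', '-')

models = {'tree': DecisionTreeClassifier(min_samples_leaf=2, random_state=1),
          'nb': GaussianNB()}
aucs = {}
for name, model in models.items():
    # Out-of-fold predictions and probabilities from 10-fold CV
    pred = cross_val_predict(model, X, y, cv=10)
    proba = cross_val_predict(model, X, y, cv=10, method='predict_proba')
    # Probability columns follow np.unique(y), so '+' sorts to column 0
    fpr, tpr, _ = roc_curve(y, proba[:, 0], pos_label='+')
    aucs[name] = auc(fpr, tpr)
    print(name, 'accuracy:', (pred == y).mean(), 'AUC:', aucs[name])
    print(confusion_matrix(y, pred))
```

The (fpr, tpr) arrays per classifier are exactly what a ROC plot would draw, with the AUC value in the legend.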
