Question

1 Approved Answer

Posted on Sep 22, 2024

In this homework assignment, you are to develop at least four machine learning models (including at least the following four: Decision Trees, Random Forest, Artificial

image text in transcribed

In this homework assignment, you are to develop at least four machine learning models (including at least the following four: Decision Trees, Random Forest, Artificial Neural Networks - MLP, Logistic Regression, but you can add more model types), using the Income dataset and KNIME Analytics Platform. Here are the specifics to include in your analytics process: Intimately understand and critically explore the data. You can use tables and charts/graphs to perform this task - remember, Data Explorer node in KNIME is an excellent resource for this task. If you see missing values, perform missing value imputation (show your screenshot). Performs row and column filters, if and when necessary. Use the Color Manager for better visualization of rows/tables and tree structures. Show your complete final workflow picture for ease of assessment. Properly balance the data (you can show the results with and without data balancing). Evaluate the output of each model type with both the Scorer and the ROC nodes. Also, at the end, you need to have a table where you can compare different models on Accuracy, Sensitivity, Specificity, and ROC value. Compare the models' performances by combine the model outputs (using the Column Appender node) into a single ROC chart. Show he first three levels of the decision tree graphical model (the whole tree may be too large to fit into the page with legibility). Based on the decision tree splits, briefly comment on the top variables (i.e., variable importance). Produce a Variable Importance graph using Random Forest variable statistics. In the homework report, follow "General Guidelines for Written Reports or Assignments", and use the six steps of CRISP-DM methodology as major headings to structure your professionally organized report. When you are writing and formatting your homework assignment, make sure to follow the general reporting guidelines. Good luck! D. Delen 1 Appendix - Data Specifics The income data is extracted from 1994 US Census database. Data set includes 32,561 rows and 15 columns. The meaning of the variables (i.e., the data dictionary) is given below: 1. Age: the age of an individual (Integer). 2. Workclass: a general term to represent the employment status of an individual (Nominal). 3. FinalWeight: final weight for data sampling. This is the number of people the census believes the entry represents (Integer). 4. Education: the highest level of education achieved by the individual (Nominal). 5. EducationNum: the highest level of education achieved in numerical form (Integer). 6. MaritalStatus: marital status of the individual (Nominal). 7. Occupation: the general type of occupation of the individual (Nominal). 8. Relationship: represents what the relationship of the individual to others (Nominal). 9. Race: Descriptions of an individual's race (Nominal). 10. Gender: the biological gender of the individual (Nominal). 11. CapitalGain: capital gains for an individual (Integer). 12. CapitalLoss: capital loss for an individual (Integer). 13. HoursPerWeek: the number of hours the individual has reported to work per week (Integer). 14. NativeCountry: country of origin for the individual (Nominal). 15. Incomelevel: whether the individual makes more than $50,000 or not, annually (Binomial). This is usually the target variable for the predictive modeling