Question
Download the DryBeanDataSet874.xlsx dataset. The dataset contains 13611 instances, 20 descriptive features, and the class feature Class in column U.
Without changing anything in the provided dataset, provide an analytics base table wherein you characterize all of the features of the dataset.
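As a rough illustration of what one row of such an analytics base table could contain, the sketch below summarizes a single continuous feature using only the Python standard library. The function name `describe_continuous`, the chosen statistics, and the toy values are assumptions made for illustration; they are not taken from the assignment dataset.

```python
import statistics


def describe_continuous(name, values):
    """Build one analytics-base-table row for a continuous feature.

    `values` may contain None to mark a missing entry. At least two
    non-missing values are needed for the quartiles and standard deviation.
    """
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "feature": name,
        "count": len(values),
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),
        "min": min(present),
        "q1": q1,
        "mean": statistics.mean(present),
        "median": median,
        "q3": q3,
        "max": max(present),
        "std_dev": statistics.stdev(present),
    }


# Toy column with one missing value (made-up numbers, not real data).
row = describe_continuous("Area", [10.0, 20.0, None, 30.0, 40.0])
for key, value in row.items():
    print(f"{key}: {value}")
```

Categorical features would need a different row layout (for example mode and mode frequency instead of quartiles), which this sketch does not cover.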
You now have to very carefully explore the dataset to identify data quality issues. For this part of your report, only identify the data quality issues and provide justifications for these issues. One of the data quality issues is that some of the class labels are missing. Exclude this data quality issue from the discussion.
Based on your analysis above, decide on two different machine learning approaches that you will employ to construct a predictive model for this problem. Give justifications for why you have selected these two approaches for this problem.
For this part of the assignment, only focus on the data quality issues with respect to the descriptive features. For each of the machine learning approaches, discuss the data preprocessing steps that you have implemented to optimally transform the dataset for that specific machine learning approach and to correct data quality issues. Note: do not do unnecessary data transformations. Carefully think about the data transformations needed for your selected machine learning algorithms. Provide justifications for each of these preprocessing steps. Should you decide not to address a data quality issue, justify this decision. When you preprocess the dataset, make sure that you do not change the order of the instances in the dataset.
For this part of the assignment, only use the instances that have a known class label. Develop the two predictive models and evaluate the performance of the two models. Make sure to construct optimal configurations of your chosen models, both with respect to architecture and values for control parameters. Describe the process that you have followed to produce an optimal configuration for each model. For this purpose, carefully decide on the performance metrics that you will use. Conclude on which one of the two approaches is best for this problem, and support your conclusion with justifications. For the purposes of this assignment, make sure to report the performance based on a k-fold cross-validation. Decide on the number of folds, with a justification.
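The mechanics of k-fold cross-validation can be sketched in plain Python as follows. This is a minimal illustration, not a prescribed solution: the helper names (`k_fold_indices`, `cross_validate`, `majority_baseline`), the toy data, and the use of accuracy as the metric are all assumptions, and the folds are not stratified by class.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Shuffle the indices once, then deal them into k near-equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, k, train_and_predict, seed=0):
    """Return the accuracy obtained on each of the k held-out folds.

    `train_and_predict(X_train, y_train, X_test)` is a hypothetical callback
    that fits a model on the training part and returns predicted labels.
    """
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predictions = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / len(test_idx))
    return scores


def majority_baseline(X_train, y_train, X_test):
    """Trivial stand-in model: always predict the most frequent training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


# Toy data (made up): 7 instances of class "A" and 3 of class "B".
X = [[float(i)] for i in range(10)]
y = ["A"] * 7 + ["B"] * 3
scores = cross_validate(X, y, k=5, train_and_predict=majority_baseline)
print(scores, sum(scores) / len(scores))
```

To compare candidate configurations, the same folds (same `seed`) can be reused for every configuration so that score differences are not caused by different splits. With imbalanced classes, a per-class metric may be more informative than the plain accuracy used here.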
For the last part of the assignment, focus returns to those instances that have a missing class label. Make use of k-nearest neighbours to impute a class label for each of these instances. Describe how you have used k-nearest neighbours for this purpose. You have to decide on the value of k, with justification. In a table, list the instance number and the imputed class label. Then, for your best model identified above, retrain the model on the new dataset with the imputed class labels. Report on the performance of the model compared to the results obtained from the step above, and conclude on the efficacy of the k-nearest neighbour self-labeling process.
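One possible way to impute labels with k-nearest neighbours is sketched below, again with the standard library only. It assumes Euclidean distance, a simple majority vote, and neighbours drawn only from instances whose label was originally known; the function name `knn_impute_labels`, the class names "A" and "B", and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_impute_labels(features, labels, k):
    """Fill every None in `labels` by majority vote among the k nearest
    originally labeled instances (Euclidean distance).

    The instance order is preserved and known labels are left untouched.
    """
    labeled = [i for i, label in enumerate(labels) if label is not None]
    imputed = list(labels)
    for i, label in enumerate(labels):
        if label is not None:
            continue
        nearest = sorted(
            labeled, key=lambda j: math.dist(features[i], features[j])
        )[:k]
        votes = Counter(labels[j] for j in nearest)
        imputed[i] = votes.most_common(1)[0][0]
    return imputed


# Toy data (made up): two well-separated clusters and two unlabeled points.
features = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # class "A"
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],   # class "B"
    [0.15, 0.1], [5.0, 5.1],              # label missing
]
labels = ["A", "A", "A", "B", "B", "B", None, None]
imputed = knn_impute_labels(features, labels, k=3)

# Table of instance number and imputed class label.
for i, (old, new) in enumerate(zip(labels, imputed)):
    if old is None:
        print(i, new)
```

Because the vote is distance-based, the features would normally need to be on comparable scales before this is applied. An odd k reduces, but does not eliminate, the chance of tied votes when there are more than two classes.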