Question

Preamble: what the data is about

The dataset you have for this test relates to protein localisation sites for yeast. The data contains 10 columns:

1. Sequence Name: Accession number for the SWISS-PROT database.

2. mcg: McGeoch's method for signal sequence recognition.

3. gvh: von Heijne's method for signal sequence recognition.

4. alm: Score of the ALOM membrane spanning region prediction program.

5. mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.

6. erl: Presence of HDEL substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.

7. pox: Peroxisomal targeting signal in the C-terminus.

8. vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.

9. nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.

10. Localisation Site (see below)

The names and distribution of classes, i.e. localisation sites (column 10), are detailed below:

The dataset is available in Blackboard in the folder for this assessment. The data file is named

.

Task 1: Data Mining

In this task you are required to classify the data into one of the ten classes using a decision tree. When splitting your data into training and test data and for your classification process use a seed of

then classify the data using training data and report statistics for your test data. You have the following 4 sub-tasks:

(a) Use a 70-30 split to create your training and test data.

(b) Use your training data to train a model.

(c) Use your model to predict previously unseen data using the test data.

(d) Produce a confusion matrix showing your predictions and report the accuracy of your model.

Please note: Do not use a seed of 1234. Make sure your solutions are clearly described and free of errors.
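The four sub-tasks above can be sketched in R roughly as follows. This is a minimal illustration, not a model answer: it assumes the rpart package for the decision tree, uses the built-in iris data as a stand-in because the data file name is not given here, and uses a placeholder seed (seed_value) that should be replaced with the seed specified for the assessment.

```r
library(rpart)  # decision trees; a recommended package bundled with standard R installs

seed_value <- 2024   # placeholder only; substitute the seed given in the assignment
yeast  <- iris       # stand-in data; replace with the yeast data frame
target <- "Species"  # stand-in for the localisation-site column (column 10)
# For the real data, drop the Sequence Name column first, because `~ .`
# would otherwise treat the identifier as a predictor.

# (a) 70-30 split into training and test data
set.seed(seed_value)
n <- nrow(yeast)
train_idx <- sample(n, size = floor(0.7 * n))
train <- yeast[train_idx, ]
test  <- yeast[-train_idx, ]

# (b) train a classification tree on the training data
fit <- rpart(as.formula(paste(target, "~ .")), data = train, method = "class")

# (c) predict the class of the unseen test rows
pred <- predict(fit, newdata = test, type = "class")

# (d) confusion matrix (rows = actual, columns = predicted) and accuracy
conf_mat <- table(Actual = test[[target]], Predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Calling set.seed() immediately before sample() is what makes the split reproducible; the accuracy is the share of test rows on the diagonal of the confusion matrix.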

Task 2: Visualization

This task requires you to produce appropriate visualizations of your classification and results.

(a) Produce a visualization of your classification model and how it makes decisions, when using a 70-30 split. You may change the size of the plotting window in RStudio by using {r, fig.width=X, fig.height=Y}, where X and Y are numbers, so that nodes and labels in the tree do not overlap.

(b) Produce a visualization of your confusion matrix as a heatmap. Your heatmap should visualize the predicted variables and normalize these predictions between 0 and 1. Do this task by using the 70-30 split, and use the ggplot package to produce the heatmap visualization.

We are considering:

That you have visualized the correct model.

Your ability to create visualizations that are easy to interpret and understand.

The quality of your visualizations.

That the scale of the heatmap is correct and that ggplot was used to produce the heatmap.
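Both visualizations in this task can be sketched as below. The sketch rests on several assumptions: the tree is an rpart model drawn with rpart's own plot() and text() methods, the heatmap uses ggplot2, iris again stands in for the yeast data, seed_value is a placeholder seed, and "normalize between 0 and 1" is read as dividing each row of the confusion matrix by its row total. Other normalizations (for example by column total) are equally possible readings.

```r
library(rpart)
library(ggplot2)

seed_value <- 2024  # placeholder only; substitute the seed given in the assignment
yeast <- iris       # stand-in data; replace with the yeast data frame

set.seed(seed_value)
train_idx <- sample(nrow(yeast), size = floor(0.7 * nrow(yeast)))
test <- yeast[-train_idx, ]
fit  <- rpart(Species ~ ., data = yeast[train_idx, ], method = "class")
pred <- predict(fit, newdata = test, type = "class")

# (a) draw the fitted tree; margin leaves room so labels are not clipped
plot(fit, margin = 0.1)
text(fit, use.n = TRUE, cex = 0.8)

# (b) row-normalise the confusion matrix so each actual class sums to 1
conf_mat  <- table(Actual = test$Species, Predicted = pred)
conf_norm <- prop.table(conf_mat, margin = 1)
conf_df   <- as.data.frame(conf_norm, responseName = "Proportion")

heat <- ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Proportion)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = round(Proportion, 2))) +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0, 1)) +
  labs(title = "Row-normalised confusion matrix") +
  theme_minimal()
print(heat)
```

Fixing the fill limits at c(0, 1) keeps the colour scale comparable between models. In an R Markdown document, the chunk holding the tree plot would carry the {r, fig.width=X, fig.height=Y} options mentioned in (a).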

Task 3: Data Analysis

For the final task of the project, consider what you have done for this project and reflect on your work.

(a) Write two or three paragraphs to explain:

The results reported in the confusion matrix, with respect to true and false positives.

Whether you think the classifier you have created is acceptable in terms of its effectiveness.

Why you think the model has made the predictions it did: reflect on this especially with respect to the distributions of class variables.
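For a multi-class problem, true and false positives are counted per class. The sketch below shows one way to read them off a confusion matrix in base R. The 3-class matrix and its counts are invented purely for illustration and are not results from the yeast data.

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
# The class names and counts are made up for illustration only.
classes <- c("A", "B", "C")
conf_mat <- matrix(c(8, 1, 1,
                     2, 5, 0,
                     0, 1, 2),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Actual = classes, Predicted = classes))

tp <- diag(conf_mat)            # predicted as the class and actually the class
fp <- colSums(conf_mat) - tp    # predicted as the class but actually another
fn <- rowSums(conf_mat) - tp    # actually the class but predicted as another

per_class <- data.frame(
  class     = classes,
  TP        = tp,
  FP        = fp,
  FN        = fn,
  precision = tp / (tp + fp),
  recall    = tp / (tp + fn)
)
print(per_class)
```

Classes with few examples tend to show low recall because the tree has little data from which to learn splits for them, which is worth relating to the class distribution when discussing the results.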

(b) In the above you have used eight variables (attributes from column 2 to column 9) as the independent variables for predicting the target variable. In this sub-question, you are required to consider how to remove some independent variables to make the decision tree simpler. Usually, we expect that dimensionality reduction does not reduce performance significantly.

Provide a method to remove a few variables (at least two, more is better) from column 2 to column 9. You should provide a justification for why your solution is acceptable.

Use a 70-30 split to train a new model using your selected independent variables and the target variable, localisation site.

Discuss your experimental results (i.e., confusion matrix) against the results in Task 2 (b).
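One possible approach to the variable removal in (b), offered as an assumption rather than the required method, is to rank predictors by the variable.importance component of a fitted rpart tree and keep only the highest-ranked ones. The sketch below uses iris as a stand-in (four predictors, keeping two), a placeholder seed, and a hypothetical helper named accuracy_of.

```r
library(rpart)

seed_value <- 2024  # placeholder only; substitute the seed given in the assignment
yeast <- iris       # stand-in data; replace with the yeast data frame

set.seed(seed_value)
train_idx <- sample(nrow(yeast), size = floor(0.7 * nrow(yeast)))
train <- yeast[train_idx, ]
test  <- yeast[-train_idx, ]

# Hypothetical helper: share of rows in `data` that the model classifies correctly
accuracy_of <- function(model, data) {
  pred <- predict(model, newdata = data, type = "class")
  mean(pred == data$Species)
}

# Fit the full model and rank predictors by rpart's importance score
full_fit   <- rpart(Species ~ ., data = train, method = "class")
importance <- sort(full_fit$variable.importance, decreasing = TRUE)
print(importance)

# Keep the two highest-ranked predictors and refit on the same split
keep <- names(importance)[seq_len(2)]
reduced_fit <- rpart(reformulate(keep, response = "Species"),
                     data = train, method = "class")

comparison <- c(full    = accuracy_of(full_fit, test),
                reduced = accuracy_of(reduced_fit, test))
print(comparison)
```

Reusing the same seed and split for both models means any change in accuracy reflects the dropped variables rather than a different partition. Predictors that never appear in variable.importance were not used in any primary or surrogate split, which makes them natural first candidates for removal.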
