Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 26, 2024

The objectives of this assignment are the following: Use / implement a feature selection / reduction technique. Experiment with various classification models. Think about dealing

The objectives of this assignment are the following:

Use

/

implement a feature selection

/

reduction technique.

Experiment with various classification models.

Think about dealing with imbalanced data.

Use F

1

Scoring Metric

Detailed Description

Develop predictive models that can determine given a particular compound whether it is active

(1)

or not

(0) .

Drugs are typically small organic molecules that achieve their desired activity by binding to a target site on a receptor. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind, followed by testing many small molecules for their ability to bind to the target site. This leaves researchers with the task of determining what separates the active

(

binding

)

compounds from the inactive

(

non

-

binding

)

ones. Such a determination can then be used in the design of new compounds that not only bind, but also have all the other properties required for a drug

(

solubility

,

oral absorption

,

lack of side effects, appropriate duration of action, toxicity, etc.

) .

The goal of this competition is to allow you to develop predictive models that can determine given a particular compound whether it is active

(1)

or not

(0) .

As such, the goal would be develop the best binary classification model.

A molecule can be represented by several thousands of binary features which represent their topological shapes and other characteristics important for binding.

Since the dataset is imbalanced the scoring function will be the F

1 -

score instead of Accuracy.

Caveats:

+

Remember not all features will be good for predicting activity. Think of feature selection, engineering, reduction

(

anything that works

)

+

The dataset has an imbalanced distribution i

.

.,

within the training set there are only

78

actives

(+ 1)

and

722

inactives

(0) .

No information is provided for the test set regarding the distribution.

+

Use your data mining knowledge learned till now wisely to optimize your results.

Data Description

The training dataset consists of

800

records and the test dataset consists of

350

records. We provide you with the training class labels and the test labels are held out. The attributes are binary type and as such are presented in a sparse matrix format within train.dat and test.dat

Train data: Training set

(

a sparse binary matrix, patterns in lines, features in columns: the index of the non

-

zero features are provided with class label

1

0

in the first column

) .

Test data: Testing set

(

a sparse binary matrix, patterns in lines, features in columns: the index of non

-

zero features are provided

) .

Format example: A sample submission with

350

entries randomly chosen to be

0

1 .

Use python, to build the code with the description given. Increase the accuracy by using only decision tree classifier or naive bayes theorem?

Step by Step Solution

There are 3 Steps involved in it

Step: 1

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

Step: 3

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Data And Databases

Authors: Jeff Mapua

1st Edition

★★★★★

Describe Table Structures in RDMSs.

Answered: 1 week ago

Previous Question Next Question