Question
Using Python.
Instructions: In the final project, you will do an end-to-end ML project from raw PubChem bioactivity data to ML or DL models that can classify molecular bioactivity as active or inactive.
Milestone 1. Read and clean the data.
1. Read the data into a data frame. Skip rows that do not contain data. Use only the columns 'PUBCHEM_EXT_DATASOURCE_SMILES' and 'PUBCHEM_ACTIVITY_SCORE'.
2. Delete NaN values and remove duplicate entries in 'PUBCHEM_EXT_DATASOURCE_SMILES'.
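A minimal sketch of Milestone 1 with pandas. It assumes the assay was exported from PubChem as a CSV data table, and that the non-data rows (for example result-type descriptions) carry no numeric activity score, so they can be filtered out by coercing the score to a number. The function name `load_bioassay` and the tiny inline sample are made up for illustration; a real file path would replace the `io.StringIO` object.

```python
import io

import pandas as pd

SMILES_COL = "PUBCHEM_EXT_DATASOURCE_SMILES"
SCORE_COL = "PUBCHEM_ACTIVITY_SCORE"


def load_bioassay(source):
    """Read the two required columns and return a cleaned data frame."""
    # Read everything as text first so metadata rows cannot break parsing.
    df = pd.read_csv(source, usecols=[SMILES_COL, SCORE_COL], dtype=str)
    # Rows without a numeric score (assumed to be metadata rows) become NaN.
    df[SCORE_COL] = pd.to_numeric(df[SCORE_COL], errors="coerce")
    # Drop rows with a missing SMILES string or a missing score.
    df = df.dropna(subset=[SMILES_COL, SCORE_COL])
    # Keep the first occurrence of each SMILES string.
    df = df.drop_duplicates(subset=SMILES_COL)
    return df.reset_index(drop=True)


# Made-up sample that mimics the assumed layout (not real assay data).
sample = io.StringIO(
    "PUBCHEM_RESULT_TAG,PUBCHEM_EXT_DATASOURCE_SMILES,PUBCHEM_ACTIVITY_SCORE\n"
    "RESULT_TYPE,,\n"
    "RESULT_DESCR,,\n"
    "1,CCO,0\n"
    "2,c1ccccc1,85\n"
    "3,CCO,0\n"
    "4,,40\n"
    "5,CC(=O)O,\n"
)
clean = load_bioassay(sample)
print(clean)
```

If the number of header rows in a particular file is known, passing `skiprows` to `pd.read_csv` is an alternative to the numeric filter used here.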
Milestone 2. Describe the molecular structure and the binary classification.
1. Use MHFP (described in Lab 2) to calculate the fingerprint and reformat it into a data frame that can be used as X.
2. Convert 'PUBCHEM_ACTIVITY_SCORE' to a binary activity label and set it as y.
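One possible sketch of Milestone 2. Lab 2 is not shown here, so the encoder is an assumption: the code uses the `mhfp` package (`MHFPEncoder` from `mhfp.encoder`, which needs RDKit). The helper names and the activity threshold of 40 are hypothetical as well; the right cutoff depends on how the specific assay defines its score, so check the assay description before using it.

```python
import numpy as np
import pandas as pd


def encode_mhfp(smiles_list, n_permutations=2048, seed=42):
    """Return one MHFP vector per SMILES string."""
    # Assumption: the `mhfp` package is the encoder used in Lab 2.
    # Imported here so the helpers below stay usable without it.
    from mhfp.encoder import MHFPEncoder

    encoder = MHFPEncoder(n_permutations=n_permutations, seed=seed)
    return [encoder.encode(smiles) for smiles in smiles_list]


def fingerprints_to_frame(fingerprints):
    """Stack equal-length fingerprint vectors into X, one row per molecule."""
    X = pd.DataFrame(np.vstack(fingerprints))
    X.columns = [f"mhfp_{i}" for i in range(X.shape[1])]
    return X


def binarize_activity(scores, threshold=40):
    """Map scores to 1 (active) or 0 (inactive).

    The default threshold is a placeholder, not a PubChem rule.
    """
    active = pd.Series(scores).astype(float) >= threshold
    return active.astype(int).rename("active")


# Hypothetical usage with a cleaned data frame `clean` from Milestone 1:
# X = fingerprints_to_frame(encode_mhfp(clean["PUBCHEM_EXT_DATASOURCE_SMILES"]))
# y = binarize_activity(clean["PUBCHEM_ACTIVITY_SCORE"])
```

The `n_permutations` argument sets the fingerprint length, which is the "dimensionality of the MHFP" that Milestone 5 asks to vary.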
Milestone 3. After X and y are ready, prepare the data for ML and DL.
1. Do the training-test split.
2. Do data scaling.
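A short sketch of Milestone 3 with scikit-learn. The 80/20 split, the stratification on y, and the choice of `StandardScaler` are assumptions rather than requirements of the assignment, and the random arrays are placeholders for the real X and y. The scaler is fitted on the training set only, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(X, y, test_size=0.2, seed=0):
    """Split into training and test sets, then scale using training statistics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)  # fit on training data only
    X_test_s = scaler.transform(X_test)        # reuse the training statistics
    return X_train_s, X_test_s, y_train, y_test, scaler


# Placeholder data standing in for the MHFP matrix and the binary labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 1000, size=(100, 16))
y_demo = np.array([0] * 70 + [1] * 30)

X_train_s, X_test_s, y_train, y_test, scaler = split_and_scale(X_demo, y_demo)
print(X_train_s.shape, X_test_s.shape)
```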
Milestone 4. Machine learning.
1. Try at least three scikit-learn ML models and use cross-validation to select the best model.
2. Optimize the parameters of the best model using grid search.
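Milestone 4 could be sketched as follows. The three models, the ROC AUC scoring, the 5-fold setting, and the parameter grids are illustrative choices, not part of the assignment; `make_classification` only generates placeholder data in place of the scaled training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder for the scaled training data from Milestone 3.
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(random_state=0),
}

# Step 1: mean cross-validated score for each candidate model.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, "best:", best_name)

# Step 2: grid search over the hyperparameters of the selected model.
param_grids = {
    "logreg": {"C": [0.1, 1, 10]},
    "knn": {"n_neighbors": [3, 5, 11]},
    "rf": {"n_estimators": [50, 100], "max_depth": [None, 5]},
}
search = GridSearchCV(
    candidates[best_name], param_grids[best_name], cv=5, scoring="roc_auc"
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

Bioactivity data is often imbalanced toward inactive compounds, which is why this sketch scores with ROC AUC instead of plain accuracy; other metrics may suit a given assay better.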
Milestone 5. Deep learning.
1. Build, compile, and fit a DNN and plot the learning curve (loss vs. epoch).
2. Vary the number of layers, the number of neurons per layer, the dimensionality of the MHFP, and/or the number of epochs to optimize the performance of the model.
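A hedged sketch of Milestone 5, assuming Keras from TensorFlow and matplotlib are the intended tools (the question does not name a framework). The helper names, the layer sizes, the optimizer, and the random placeholder data are all illustrative. The `hidden_units` tuple controls both the number of layers and the neurons per layer, so different architectures can be compared by changing one argument.

```python
import matplotlib.pyplot as plt
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_dnn(n_features, hidden_units=(128, 64)):
    """Build and compile a fully connected binary classifier."""
    model = keras.Sequential([keras.Input(shape=(n_features,))])
    for units in hidden_units:  # one Dense layer per entry in hidden_units
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))  # probability of "active"
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


def plot_learning_curve(history):
    """Plot training (and, if present, validation) loss against epoch."""
    fig, ax = plt.subplots()
    ax.plot(history.history["loss"], label="training loss")
    if "val_loss" in history.history:
        ax.plot(history.history["val_loss"], label="validation loss")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.legend()
    return ax


# Placeholder data standing in for the scaled MHFP matrix and binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 32)).astype("float32")
y_demo = rng.integers(0, 2, size=200)

model = build_dnn(n_features=X_demo.shape[1], hidden_units=(64, 32))
history = model.fit(
    X_demo, y_demo, validation_split=0.2, epochs=5, batch_size=32, verbose=0
)
ax = plot_learning_curve(history)
```

Calling `plt.show()` afterwards displays the curve. To explore part 2, the same two functions can be rerun with other `hidden_units` tuples and `epochs` values, and with X rebuilt at a different MHFP length.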