Question
1. Move code to a library Jupyter Notebooks are not good for managing code. They are best for visualization and quick iteration. So, we'll move
1. Move code to a library
Jupyter Notebooks are not good for managing code. They are best for visualization and quick iteration. So, we'll move useful code to a library. Later, we can import the library and call its methods from the notebooks. Useful things to move to the library:
1. Data preprocessing: load from csv, clean up data.
2. Dataset split.
3. Metrics and score calculation.
Steps:
1. Create a Python package foloder1 . It's just a subdirectory, with an empty file __init__.py in it.
2. Create features.py in the csc665 subdirectory and implement the following functions:
A. def train_test_split(X, y, test_size, shuffle, random_state=None) : X, y - features and the target variable. test_size - between 0 and 1 - how much to allocate to the test set; the rest goes to the train set. shuffle - if True, shuffle the dataset, otherwise not. random_state, integer; if None, then results are random, otherwise fixed to a given seed. Example: X_train, X_test, y_train, y_test = train_test_split(feat_df, y, 0.3, True, 12)
B. create_categories(df, list_columns) Converts values, in-place, in the columns passed in the list_columns to numerical values. Follow the same approach: "string" -> category -> code. Replace values in df, in-place.
C. X, y = preprocess_ver_1(csv_df) Apply the feature transformation steps to the dataframe, return new X and y for entire dataset. Do not modify the original csv_df . Remove all rows with NA values Convert datetime to a number Convert all strings to numbers. Split the dataframe into X and y and return these. 3. Create metrics.py : A. def mse(y_predicted, y_true) - return Mean-Squared Error. B. def rmse(y_predicted, y_true) - return Root Mean-Squared Error. C. def rsq(y_predicted, y_true) - return .
Copy the in-class notebook, and replace all relevant data processing and feature calculations with your own functions.
Evaluate the score as a function of the number of trees (i.e. n_estimators) used in the random forest. That is, set the number of trees to 1, 5, 10, 20, ..., 200, and measure on both the train and test sets. Plot both train and test scores as the function on the number of trees used in the random forest model. Analyze the result. Is it overfitting / underfitting? How overfitting / underfitting changes with the number of tre
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started