Question
Question 1
In this question you investigate how well nightly price can be predicted from the other variables in the dataframe. You need to decide yourself which variables to choose, but make sure you have at least 10 variables. The only requirement is that you use room_type as a predictor. Because room_type is a categorical variable, you first have to use dummy coding to turn it into a number of binary variables (hint: pd.get_dummies()). In the notebook, provide a short explanation for your choice of variables.
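The dummy-coding hint can be illustrated with a minimal sketch. The toy dataframe and its values are made up for illustration, and dropping the first level with drop_first=True is an assumption on our part (it avoids perfect collinearity with the regression intercept), not something the question requires.

```python
import pandas as pd

# Toy stand-in for the listings dataframe (values are invented for illustration).
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "accommodates": [4, 2, 1, 2],
    "price": [120.0, 60.0, 30.0, 55.0],
})

# One binary column per room type. drop_first=True removes one level so the
# remaining columns are not perfectly collinear with the intercept (assumption).
dummies = pd.get_dummies(df["room_type"], prefix="room_type", drop_first=True, dtype=int)

# Replace the categorical column with its binary columns.
df_encoded = pd.concat([df.drop(columns="room_type"), dummies], axis=1)
print(df_encoded.columns.tolist())
```

The new column names (not 'room_type' itself) are what would then go into the predictors list.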
Starting from the variables you have chosen, our goal is to derive a sparse model with fewer variables. This process is called variable selection. In variable selection ('variable' means the same as 'predictor'), variables are iteratively added to or removed from the regression model. Once finished, the model typically contains only a subset of the original variables. A sparser model is easier to interpret, and in some cases it generalises better to new data. To perform variable selection, implement a function
variable_selection(df, predictors, target, alpha)
where df is the listings dataframe, predictors is a list with your initial selection of at least 10 variables, as follows: ['neighbourhood_cleansed','property_type','room_type','accommodates','bedrooms','beds','minimum_nights','maximum_nights','availability_365','number_of_reviews','review_scores_location'],
target is the target variable for the regression (e.g. 'price'), and alpha is the significance level for selecting significant predictors (e.g. 0.05). The function returns pred, the selected subset of the original predictors.
To calculate regression fits and p-values you can use statsmodels. Your approach operates in two stages. In stage 1, you build a model by adding variables one after the other, and you keep adding variables as long as they increase the adjusted R2 coefficient. You do not need to calculate adjusted R2 by hand; it is provided by the statsmodels package. In stage 2, starting from these variables, if any of them are not significant, you keep removing variables until all variables in the model are significant. The output of the second stage is your final set of variables. Let us look at the two stages in detail:
Stage 1 (add variables)
• Start with an empty set of variables.
• Fit multiple one-variable regression models. In each iteration, use one of the variables provided in predictors. The variable that leads to the largest increase in adjusted R2 is added to the model.
• Now proceed by adding a second variable into the model. Starting from the remaining variables, again choose the variable that leads to the largest increase in adjusted R2.
• Continue in the same way for the third, fourth, ... variable.
• You are finished when there is no variable left that increases adjusted R2.
Stage 2 (remove non-significant variables)
It is possible that some of the variables from the previous stage are not significant. We call a variable "significant" if the p-value of its coefficient is smaller than or equal to the given threshold alpha.
• Start by fitting a model using the variables that have been added to the model in Stage 1.
• If there is a variable that is not significant, remove the variable with the largest p-value and fit the model again with the reduced set of variables.
• Keep removing variables and re-fitting the model until all remaining variables are significant.
• The remaining significant variables are the output of your function.
To test your function, add a function call with your selection of predictors and alpha level.
YOU ARE ONLY ALLOWED TO USE BELOW PYTHON LIBRARIES/PACKAGES