The dataset for this assignment contains house prices as well as 19 other features for each property. Those features are detailed below and include information about the house (number of bedrooms, bathrooms…), the lot (square footage…) and the sale conditions (period of the year…). The overall goal of the assignment is to predict the sale price of a house using linear regression. The training set is in the file "house_prices_train.csv" and the test set is in the file "house_prices_test.csv".

Here is a brief description of each feature in the dataset:

SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
YearBuilt: Original construction date
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Number of bedrooms above basement level
KitchenAbvGr: Number of kitchens
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
PoolArea: Pool area in square feet
MoSold: Month Sold
YrSold: Year Sold

I completed the code correctly for question 1a (open the training dataset and remove all rows that contain at least one missing value (NA), then return the new clean dataset and the number of rows in that dataset), but I need help with the rest of the question. This is my code:

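def clean_data():
    import pandas as pd
    # Read the training set, using the first CSV column as the index
    data = pd.read_csv('house_prices_train.csv', index_col=0)
    # Drop every row that has at least one missing value
    data_train = data.dropna()
    nb_rows = data_train.shape[0]

    return([nb_rows, data_train])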

Question 1b:

For the training dataset, print a summary of the variables “LotArea”, “YearBuilt”, “GarageArea”, “BedroomAbvGr” and “SalePrice”. Return the whole summary and a list containing (in that order):

The maximum sale price
The minimum garage area
The first quartile of lot area
The second most common year built
The mean of BedroomAbvGr

Hint: Use the built-in method describe() for a pandas.DataFrame

Here's the sample code I was given to start off:

def summary(data_train):
    # Code goes here
    # max_sale = maximum sale price in the training dataset
    # min_garea = minimum garage area
    # fstq_lotarea = first quartile of lot area
    # scd_ybuilt = second most common year built
    # mean_bed = mean number of bedrooms above ground
    ### YOUR CODE HERE
    return([max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed])
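One possible way to fill in the template above, as a minimal sketch rather than the official solution. It assumes data_train is the cleaned DataFrame returned by clean_data() and follows the template in returning only the list (the summary is printed). Note that describe() does not report frequencies for numeric columns, so the second most common year is taken from value_counts() instead; how ties between equally frequent years should be broken is not specified in the question, so that part is an assumption.

```python
import pandas as pd


def summary(data_train):
    # Columns named in Question 1b
    cols = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr", "SalePrice"]

    # describe() gives count, mean, std, min, quartiles and max per column
    desc = data_train[cols].describe()
    print(desc)

    max_sale = desc.loc["max", "SalePrice"]
    min_garea = desc.loc["min", "GarageArea"]
    fstq_lotarea = desc.loc["25%", "LotArea"]
    mean_bed = desc.loc["mean", "BedroomAbvGr"]

    # value_counts() sorts by frequency (most common first), so position 1
    # is the second most common year. Assumes at least two distinct years.
    scd_ybuilt = data_train["YearBuilt"].value_counts().index[1]

    return [max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed]
```

If the grader really expects the summary to be returned as well, returning desc alongside the list would be a small change.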

Question 1c:

Run a linear regression on "SalePrice" using the variables "LotArea", "YearBuilt", "GarageArea", and "BedroomAbvGr". For each variable, return its regression coefficient in a dictionary similar to this: {"LotArea": 1.888, "YearBuilt": -0.06, ...} (this is only an example, not the right answer).

Compute the Root Mean Squared Error (RMSE) using the file "house_prices_test.csv" to measure the out-of-sample performance of the model.

################# Function to fit your Linear Regression Model ###################
def linear_regression_all_variables(data_train):
    from sklearn import linear_model

    # Code goes here
    # dict_coeff = dictionary (key = name of the variable, value = coefficient in the linear
    # regression model)
    # lreg = your linear regression model
    ###
    ### YOUR CODE HERE
    ###

    return([dict_coeff, lreg])
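A hedged sketch of how this function could be completed with scikit-learn's LinearRegression. The module-level FEATURES list is a name introduced here for illustration, not part of the assignment; the sketch also assumes data_train has already had its missing rows removed.

```python
from sklearn import linear_model

# Hypothetical helper constant: the four predictors named in Question 1c
FEATURES = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr"]


def linear_regression_all_variables(data_train):
    X = data_train[FEATURES]
    y = data_train["SalePrice"]

    # Ordinary least squares with an intercept (the scikit-learn default)
    lreg = linear_model.LinearRegression()
    lreg.fit(X, y)

    # coef_ is ordered like the columns of X, so zip pairs names with values
    dict_coeff = dict(zip(FEATURES, lreg.coef_))

    return [dict_coeff, lreg]
```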

Question 1d:

Refit the model on the training set using all the variables and return the RMSE on the test set.

(The first column "unnamed: 0" is not a variable)

################# Function to compute the Root Mean Squared Error ###################
def compute_mse_test(data_train, data_test):
    from sklearn import linear_model, metrics

    dict_coeff, lreg = linear_regression_all_variables(data_train)
    ###
    ### YOUR CODE HERE
    ###
    # rmse = Root Mean Squared Error
    return(rmse)
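A possible completion of compute_mse_test, offered as a sketch under a few assumptions. To keep the example self-contained, the model is fitted inline where the template calls linear_regression_all_variables(data_train); FEATURES is again a hypothetical helper name. Dropping test rows with missing values is an assumption too (the assignment does not say how to treat them, but predict() fails on NaN inputs).

```python
import numpy as np
from sklearn import linear_model, metrics

# Hypothetical helper constant: the four predictors named in Question 1c
FEATURES = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr"]


def compute_mse_test(data_train, data_test):
    # Stand-in for: dict_coeff, lreg = linear_regression_all_variables(data_train)
    lreg = linear_model.LinearRegression()
    lreg.fit(data_train[FEATURES], data_train["SalePrice"])

    # Assumption: ignore test rows with missing values in the columns used
    test = data_test.dropna(subset=FEATURES + ["SalePrice"])

    # Predict on the held-out data and compare with the true sale prices
    y_pred = lreg.predict(test[FEATURES])
    mse = metrics.mean_squared_error(test["SalePrice"], y_pred)

    # RMSE is the square root of the mean squared error
    rmse = float(np.sqrt(mse))
    return rmse
```

Despite its name, the function returns the RMSE, which is what the template's final comment asks for.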

def linear_regression_all(data_train, data_test):

    import numpy as np
    from sklearn import linear_model, metrics

    #Code goes here

    #rmse = root mean squared error of the second linear regression on the test dataset
    ###
    ### YOUR CODE HERE
    ###
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

    return (rmse)
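For Question 1d, a sketch along the same lines that uses every column except the target as a predictor. It assumes both files were loaded with index_col=0 (as in clean_data), so the unnamed first column is the index and not a feature, and that data_test has the same columns as data_train. As before, dropping incomplete test rows is an assumption.

```python
import numpy as np
from sklearn import linear_model, metrics


def linear_regression_all(data_train, data_test):
    # Every column other than the target is treated as a predictor
    features = [c for c in data_train.columns if c != "SalePrice"]

    # Assumption: ignore test rows that contain missing values
    test = data_test.dropna(subset=features + ["SalePrice"])

    # Refit on the training set with all the variables
    lreg = linear_model.LinearRegression()
    lreg.fit(data_train[features], data_train["SalePrice"])

    # Out-of-sample predictions and their root mean squared error
    y_test = test["SalePrice"]
    y_pred = lreg.predict(test[features])
    rmse = float(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

    return rmse
```

A call might look like linear_regression_all(data_train, pd.read_csv('house_prices_test.csv', index_col=0)), with data_train taken from clean_data().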