Question

The dataset for this assignment contains house prices as well as 19 other features for each property. Those features are detailed below and include information about the house (number of bedrooms, bathrooms…), the lot (square footage…), and the sale conditions (period of the year…). The overall goal of the assignment is to predict the sale price of a house using linear regression. The training set is in the file "house_prices_train.csv" and the test set is in the file "house_prices_test.csv".

Here is a brief description of each feature in the dataset:

  • SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • YearBuilt: Original construction date
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • BedroomAbvGr: Number of bedrooms above basement level
  • KitchenAbvGr: Number of kitchens
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • PoolArea: Pool area in square feet
  • MoSold: Month Sold
  • YrSold: Year Sold

I completed the code for question 1a correctly (open the training dataset, remove all rows that contain at least one missing value (NA), and return the new clean dataset and the number of rows in that dataset), but I need help with the rest of the question. This is my code:

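def clean_data():
    import pandas as pd
    # Read the training set; the first CSV column is the row index, not a feature
    data = pd.read_csv('house_prices_train.csv', index_col=0)
    # Drop every row that has at least one missing value (NA)
    data_train = data.dropna()
    nb_rows = data_train.shape[0]

    return([nb_rows, data_train])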

Question 1b:

For the training dataset, print a summary of the variables “LotArea”, “YearBuilt”, “GarageArea”, “BedroomAbvGr”, and “SalePrice”. Return the whole summary and a list containing (in that order):

  • The maximum sale price
  • The minimum garage area
  • The first quartile of lot area
  • The second most common year built
  • The mean of BedroomAbvGr

Hint: Use the built-in method describe() for a pandas.DataFrame

Here's the sample code I was given to start off:

def summary(data_train):
    # Code goes here
    # max_sale = maximum sale price in the training dataset
    # min_garea = minimum garage area
    # fstq_lotarea = first quartile of lot area
    # scd_ybuilt = second most common year built
    # mean_bed = mean number of bedrooms above ground
    ### YOUR CODE HERE
    return([max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed])
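One way the placeholder in summary() might be filled in is sketched below. It is not the official solution. It assumes data_train is the cleaned DataFrame returned by clean_data(), keeps the return shape of the starter code (the list only) while printing the describe() table, and takes "second most common year" to mean the second entry of value_counts(). The constant name SUMMARY_COLS is made up for this example.

```python
import pandas as pd

# Columns named in question 1b (SUMMARY_COLS is an illustrative name)
SUMMARY_COLS = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr", "SalePrice"]


def summary(data_train):
    # describe() returns count, mean, std, min, quartiles and max per column
    desc = data_train[SUMMARY_COLS].describe()
    print(desc)

    max_sale = desc.loc["max", "SalePrice"]
    min_garea = desc.loc["min", "GarageArea"]
    fstq_lotarea = desc.loc["25%", "LotArea"]
    mean_bed = desc.loc["mean", "BedroomAbvGr"]

    # describe() reports no frequency ranking for numeric columns, so count
    # the years directly; value_counts() sorts by frequency (descending),
    # which puts the second most common year at position 1.
    scd_ybuilt = data_train["YearBuilt"].value_counts().index[1]

    return [max_sale, min_garea, fstq_lotarea, scd_ybuilt, mean_bed]
```

If the grader also expects the summary table itself, returning desc alongside the list would be a small change. Ties in the year counts are not handled specially here.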

Question 1c:

Run a linear regression on "SalePrice" using the variables “LotArea”, “YearBuilt”, “GarageArea”, and “BedroomAbvGr”. For each variable, return the coefficient from the regression in a dictionary similar to this: {“LotArea”: 1.888, “YearBuilt”: -0.06, ...} (This is only an example, not the right answer.)

Compute the Root Mean Squared Error (RMSE) using the file "house_prices_test.csv" to measure the out-of-sample performance of the model.

################# Function to fit your Linear Regression Model ###################
def linear_regression_all_variables(data_train):
    from sklearn import linear_model

    # Code goes here
    # dict_coeff = dictionary (key = name of the variable, value = coefficient in the linear
    # regression model)
    # lreg = your linear regression model
    ###
    ### YOUR CODE HERE
    ###

    return([dict_coeff, lreg])
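A minimal sketch of how the fit in question 1c could look, assuming an ordinary least squares model from scikit-learn (linear_model.LinearRegression) with an intercept, fitted on the four named variables only. The constant name FEATURES is an illustrative choice and not part of the assignment.

```python
from sklearn import linear_model

# Predictors named in question 1c (FEATURES is an illustrative name)
FEATURES = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr"]


def linear_regression_all_variables(data_train):
    X = data_train[FEATURES]
    y = data_train["SalePrice"]

    # Ordinary least squares; the intercept is fitted by default
    lreg = linear_model.LinearRegression()
    lreg.fit(X, y)

    # coef_ follows the column order of X, so zip pairs each name
    # with its own coefficient
    dict_coeff = {name: float(c) for name, c in zip(FEATURES, lreg.coef_)}

    return [dict_coeff, lreg]
```

The intercept is available as lreg.intercept_ and is deliberately left out of the dictionary, since the question asks for one coefficient per variable.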

Question 1d:

Refit the model on the training set using all the variables and return the RMSE on the test set.

(The first column "unnamed: 0" is not a variable)

################# Function to compute the Root Mean Squared Error ###################
def compute_mse_test(data_train, data_test):
    from sklearn import linear_model, metrics

    dict_coeff, lreg = linear_regression_all_variables(data_train)
    ###
    ### YOUR CODE HERE
    ###
    # rmse = Root Mean Squared Error
    return(rmse)
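The out-of-sample step in compute_mse_test() can be sketched as a small helper that takes an already fitted model. The name rmse_on_test and its parameters are hypothetical. The sketch assumes the test file should be cleaned the same way as the training file (rows with missing values dropped), which the assignment text does not state explicitly.

```python
import numpy as np
from sklearn import metrics


def rmse_on_test(lreg, data_test, features, target="SalePrice"):
    # Assumption: drop test rows with missing values in the columns used,
    # because predict() cannot handle NaN inputs
    clean = data_test.dropna(subset=list(features) + [target])

    # Predict with the model fitted on the training set only
    y_pred = lreg.predict(clean[list(features)])

    # RMSE = sqrt(mean((y_true - y_pred)^2)), in the same unit as SalePrice
    mse = metrics.mean_squared_error(clean[target], y_pred)
    return float(np.sqrt(mse))
```

Inside compute_mse_test(), the placeholder would then reduce to a call such as rmse = rmse_on_test(lreg, data_test, list(dict_coeff)), since the dictionary keys are the four variable names in fitting order.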

def linear_regression_all(data_train, data_test):
    import numpy as np
    from sklearn import linear_model, metrics

    # Code goes here

    # rmse = root mean squared error of the second linear regression on the test dataset
    ###
    ### YOUR CODE HERE
    ###
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

    return (rmse)
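For question 1d, a possible version of linear_regression_all() is sketched below. It assumes that every column other than SalePrice is a predictor, that a leftover "Unnamed: 0" column (present when the CSV is read without index_col=0) should be ignored, and that rows with missing values are dropped from both sets. The target parameter is an addition for illustration.

```python
import numpy as np
from sklearn import linear_model, metrics


def linear_regression_all(data_train, data_test, target="SalePrice"):
    # Assumption: clean both sets the same way as in question 1a
    train = data_train.dropna()
    test = data_test.dropna()

    # Use every column except the target and the row-index column
    features = [c for c in train.columns if c not in (target, "Unnamed: 0")]

    # Refit on the training set with all variables
    lreg = linear_model.LinearRegression()
    lreg.fit(train[features], train[target])

    # Evaluate on the held-out test set
    y_test = test[target]
    y_pred = lreg.predict(test[features])
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))

    return float(rmse)
```

Comparing this value with the RMSE of the four-variable model shows whether the extra features help on unseen data.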
