Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

# run this code to load and process the data import pickle, sklearn import matplotlib.pyplot as plt import pandas as pd import numpy as np

# run this code to load and process the data
import pickle, sklearn
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split,KFold,GridSearchCV,LeaveOneOut
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import r2_score,accuracy_score,classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
Sales=pd.read_csv("NYC_Sales.csv")
Sales=Sales.replace("-",np.nan).dropna().astype({"LAND_SQUARE_FEET":int,
"GROSS_SQUARE_FEET":int,
"YEAR_BUILT":int })
Sales["SALE_PRICE_LOG"]=np.log(Sales.SALE_PRICE)
Sales["GROSS_SQUARE_FEET_LOG"]=np.log(Sales.GROSS_SQUARE_FEET)
Sales["LAND_SQUARE_FEET_LOG"]=np.log(Sales.LAND_SQUARE_FEET)
Sales=Sales[['BOROUGH', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS',
'YEAR_BUILT','SALE_PRICE_LOG', 'GROSS_SQUARE_FEET_LOG','LAND_SQUARE_FEET_LOG']]
ID:
cell-a23d23c83fc72e63
Read-only
Setting
This dataset contains a sample of the building or building unit (apartment, etc.) sold in the New York City property market over 12-month with a sales price higher than 100,000 USD.
This dataset contains the following important properties of the building units sold.
BOROUGH: A digit code for the borough the property is located in; in order, these are
Manhattan (1)
Bronx (2)
Brooklyn (3)
Queens (4)
Staten Island (5)
RESIDENTIAL_UNITS: Number of residential units in the property
COMMERCIAL_UNITS: Number of commercial units in the property
YEAR_BUILT: Year when the property was built
SALE_PRICE_LOG: log of the sales price
GROSS_SQUARE_FEET_LOG: log of the gross square feet
LAND_SQUARE_FEET_LOG: log of the land square feet
Question 1
Assume we want to use a decision tree model to predict SALE_PRICE_LOG. For the predictors, we want to use the following
'RESIDENTIAL_UNITS',
'COMMERCIAL_UNITS',
'YEAR_BUILT',
'GROSS_SQUARE_FEET_LOG',
'LAND_SQUARE_FEET_LOG'
For this question:
Split the dataset into training(80%)/test set (20%) with random_state=10.
Use the training set to perform 10-fold cross-validation to tune one parameter for this decision model. Feel free to which parameter to tune and the range of values for the gridsearch.
For the model you found in the previous step, compute r2_score based on the function in sklearn documentation.
Question 2
Assume we want to use a linear regression model to predict SALE_PRICE_LOG. For the predictors, we want to use the following
'RESIDENTIAL_UNITS'
'COMMERCIAL_UNITS',
'YEAR_BUILT'
'GROSS_SQUARE_FEET_LOG'
'LAND_SQUARE_FEET_LOG'
For the question:
Split the dataset into training(80%)/test set (20%) with random_state=10.
Train this model on the training set. On the testing set, compute r2_score based on the function here.
Pay attention to the interpretation of r2_score given here. Discuss whether your model in Q1 or Q2 is better.
Question 3
Assume we want to use a random forest model to predict BOROUGH. For the predictors, we want to consider the following
'RESIDENTIAL_UNITS'
'COMMERCIAL_UNITS',
'YEAR_BUILT'
'GROSS_SQUARE_FEET_LOG'
'LAND_SQUARE_FEET_LOG'
'SALE_PRICE_LOG'
For this question:
Split the dataset into training(80%)/test set (20%) with random_state=10.
Use the training set to perform 10-fold cross-validation to tune a parameter for the random forest. Feel free to which parameter to tune and the range of values for the gridsearch.
For the model you found in the previous step, compute the following based on the testing set
The average accuracy
The name and the precision for the borough with the highest precision
The name and the recall for the borough with the highest recall
Question 4
Assume we want to use a multi-nomial logit model with L2 penalty to predict BOROUGH. For the predictors, we want to use the following
'RESIDENTIAL_UNITS'
'COMMERCIAL_UNITS',
'YEAR_BUILT'
'GROSS_SQUARE_FEET_LOG'
'LAND_SQUARE_FEET_LOG'
'SALE_PRICE_LOG'
For this question:
Split the dataset into training(80%)/test set (20%) with random_state=10.
Use the training set to perform 10-fold cross-validation to tune
inside range(1,11). Set max_iter=1000 for the regression.
For the model you found in the previous step, compute the following based on the testing set
The average accuracy
The name and the precision for the borough with the highest precision
The name and the recall for the borough with the highest recall
Question 5
Perform Q4 again. This time, use StandardScaler and Pipeline to incorporate features standardization.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Fundamentals Of Database Systems

Authors: Sham Navathe,Ramez Elmasri

5th Edition

B01FGJTE0Q, 978-0805317558

More Books

Students also viewed these Databases questions

Question

Provide examples of Dimensional Tables.

Answered: 1 week ago