Question

Posted on Sep 26, 2024

# run this code to load and process the data
import pickle, sklearn
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, LeaveOneOut
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import r2_score, accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Sales = pd.read_csv("NYC_Sales.csv")
Sales = Sales.replace(" - ", np.nan).dropna().astype({"LAND_SQUARE_FEET": int, "GROSS_SQUARE_FEET": int, "YEAR_BUILT": int})
Sales["SALE_PRICE_LOG"] = np.log(Sales.SALE_PRICE)
Sales["GROSS_SQUARE_FEET_LOG"] = np.log(Sales.GROSS_SQUARE_FEET)
Sales["LAND_SQUARE_FEET_LOG"] = np.log(Sales.LAND_SQUARE_FEET)
Sales = Sales[['BOROUGH', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'YEAR_BUILT', 'SALE_PRICE_LOG', 'GROSS_SQUARE_FEET_LOG', 'LAND_SQUARE_FEET_LOG']]

Setting

This dataset contains a sample of the buildings or building units (apartments, etc.) sold in the New York City property market over a 12-month period with a sale price higher than 100,000 USD.

This dataset contains the following important properties of the building units sold.

BOROUGH: A digit code for the borough the property is located in; in order, these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5)

RESIDENTIAL_UNITS: Number of residential units in the property

COMMERCIAL_UNITS: Number of commercial units in the property

YEAR_BUILT: Year when the property was built

SALE_PRICE_LOG: log of the sales price

GROSS_SQUARE_FEET_LOG: log of the gross square feet

LAND_SQUARE_FEET_LOG: log of the land square feet

Question 1

Assume we want to use a decision tree model to predict SALE_PRICE_LOG. For the predictors, we want to use the following:

'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'YEAR_BUILT', 'GROSS_SQUARE_FEET_LOG', 'LAND_SQUARE_FEET_LOG'

For this question:

Split the dataset into a training set (80%) and a test set (20%) with random_state = 10.

Use the training set to perform 10-fold cross-validation to tune one parameter for this decision tree model. Feel free to choose which parameter to tune and the range of values for the grid search.

For the model you found in the previous step, compute r2_score based on the function in the sklearn documentation.
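A minimal sketch of the Question 1 workflow, under stated assumptions: NYC_Sales.csv is not available here, so the block builds a small synthetic stand-in with the same column names (all of its values are made up), and it tunes max_depth over 1 to 10, which is only one of several reasonable choices. Scoring on the held-out test set is also an assumption, since the question does not say which split to use.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the processed Sales frame (synthetic values only).
rng = np.random.default_rng(0)
n = 500
Sales = pd.DataFrame({
    "RESIDENTIAL_UNITS": rng.integers(0, 10, n),
    "COMMERCIAL_UNITS": rng.integers(0, 3, n),
    "YEAR_BUILT": rng.integers(1900, 2017, n),
    "GROSS_SQUARE_FEET_LOG": rng.normal(7.5, 0.5, n),
    "LAND_SQUARE_FEET_LOG": rng.normal(7.8, 0.5, n),
})
Sales["SALE_PRICE_LOG"] = 6 + 0.9 * Sales["GROSS_SQUARE_FEET_LOG"] + rng.normal(0, 0.3, n)

predictors = ["RESIDENTIAL_UNITS", "COMMERCIAL_UNITS", "YEAR_BUILT",
              "GROSS_SQUARE_FEET_LOG", "LAND_SQUARE_FEET_LOG"]
X, y = Sales[predictors], Sales["SALE_PRICE_LOG"]

# 80% / 20% split with the random_state required by the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# 10-fold cross-validation on the training set only; max_depth is the
# (arbitrarily chosen) parameter being tuned.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=10),
    param_grid={"max_depth": list(range(1, 11))},
    cv=KFold(n_splits=10, shuffle=True, random_state=10),
    scoring="r2",
)
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the full training set by default,
# so search.predict uses the tuned tree. r2_score takes (y_true, y_pred).
tree_r2 = r2_score(y_test, search.predict(X_test))
print("best params:", search.best_params_)
print("test R^2:", round(tree_r2, 3))
```

With the real data, the synthetic frame would be replaced by the Sales frame produced by the loading code; the rest of the sketch should carry over unchanged.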

Question 2

Assume we want to use a linear regression model to predict SALE_PRICE_LOG. For the predictors, we want to use the following:

'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'YEAR_BUILT', 'GROSS_SQUARE_FEET_LOG', 'LAND_SQUARE_FEET_LOG'

For this question:

Split the dataset into a training set (80%) and a test set (20%) with random_state = 10.

Train this model on the training set. On the test set, compute r2_score based on the function in the sklearn documentation.

Pay attention to the interpretation of r2_score given in that documentation. Discuss whether your model in Q1 or Q2 is better.
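The following sketch illustrates Question 2 and the interpretation of r2_score. It again assumes a synthetic stand-in for the Sales frame (made-up values, same column names). In scikit-learn, r2_score takes the true values first and the predictions second; the best possible score is 1.0, a constant model that always predicts the mean of the true values scores 0.0, and the score can be negative for a model that does worse than that constant.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed Sales frame (synthetic values only).
rng = np.random.default_rng(0)
n = 500
Sales = pd.DataFrame({
    "RESIDENTIAL_UNITS": rng.integers(0, 10, n),
    "COMMERCIAL_UNITS": rng.integers(0, 3, n),
    "YEAR_BUILT": rng.integers(1900, 2017, n),
    "GROSS_SQUARE_FEET_LOG": rng.normal(7.5, 0.5, n),
    "LAND_SQUARE_FEET_LOG": rng.normal(7.8, 0.5, n),
})
Sales["SALE_PRICE_LOG"] = 6 + 0.9 * Sales["GROSS_SQUARE_FEET_LOG"] + rng.normal(0, 0.3, n)

predictors = ["RESIDENTIAL_UNITS", "COMMERCIAL_UNITS", "YEAR_BUILT",
              "GROSS_SQUARE_FEET_LOG", "LAND_SQUARE_FEET_LOG"]
X, y = Sales[predictors], Sales["SALE_PRICE_LOG"]

# Same split as Question 1 so the two test-set scores are comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

lin = LinearRegression().fit(X_train, y_train)
lin_r2 = r2_score(y_test, lin.predict(X_test))  # order is (y_true, y_pred)

# Reference point: predicting the test-set mean everywhere gives R^2 = 0.
baseline_pred = np.full(len(y_test), y_test.mean())
baseline_r2 = r2_score(y_test, baseline_pred)

print("linear regression test R^2:", round(lin_r2, 3))
print("constant-mean baseline R^2:", round(baseline_r2, 3))
```

Because both questions use the same split, the model with the higher test-set R^2 explains more of the variance in SALE_PRICE_LOG and would be the better one by this metric; on the synthetic data here the comparison says nothing about the real dataset.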

Question 3

Assume we want to use a random forest model to predict BOROUGH. For the predictors, we want to consider the following:

'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'YEAR_BUILT', 'GROSS_SQUARE_FEET_LOG', 'LAND_SQUARE_FEET_LOG', 'SALE_PRICE_LOG'

For this question:

Split the dataset into a training set (80%) and a test set (20%) with random_state = 10.

Use the training set to perform 10-fold cross-validation to tune a parameter for the random forest. Feel free to choose which parameter to tune and the range of values for the grid search.

For the model you found in the previous step, compute the following based on the test set:

The average accuracy

The name and the precision for the borough with the highest precision

The name and the recall for the borough with the highest recall
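One possible way to carry out Question 3 is sketched below. The data is a synthetic stand-in (made-up values with the same column names), max_depth is an arbitrary choice of tuning parameter, and n_estimators=50 is chosen only to keep the sketch fast. The per-borough precision and recall are read from classification_report with output_dict=True.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

borough_names = {1: "Manhattan", 2: "Bronx", 3: "Brooklyn",
                 4: "Queens", 5: "Staten Island"}

# Hypothetical stand-in for the processed Sales frame (synthetic values only).
rng = np.random.default_rng(0)
n = 500
borough = rng.integers(1, 6, n)  # borough codes 1..5
Sales = pd.DataFrame({
    "BOROUGH": borough,
    "RESIDENTIAL_UNITS": rng.integers(0, 10, n),
    "COMMERCIAL_UNITS": rng.integers(0, 3, n),
    "YEAR_BUILT": rng.integers(1900, 2017, n),
    "GROSS_SQUARE_FEET_LOG": rng.normal(7.5, 0.5, n),
    "LAND_SQUARE_FEET_LOG": 7.0 + 0.3 * borough + rng.normal(0, 0.4, n),
    "SALE_PRICE_LOG": 14.5 - 0.4 * borough + rng.normal(0, 0.5, n),
})

predictors = ["RESIDENTIAL_UNITS", "COMMERCIAL_UNITS", "YEAR_BUILT",
              "GROSS_SQUARE_FEET_LOG", "LAND_SQUARE_FEET_LOG", "SALE_PRICE_LOG"]
X, y = Sales[predictors], Sales["BOROUGH"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# 10-fold cross-validation on the training set; max_depth is the tuned parameter.
rf_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=10),
    param_grid={"max_depth": [2, 4, 6, 8]},
    cv=KFold(n_splits=10, shuffle=True, random_state=10),
    scoring="accuracy",
)
rf_search.fit(X_train, y_train)
pred = rf_search.predict(X_test)

rf_accuracy = accuracy_score(y_test, pred)
report = classification_report(
    y_test, pred,
    labels=list(borough_names),                 # codes 1..5
    target_names=list(borough_names.values()),  # report keys become names
    output_dict=True,
    zero_division=0,
)

names = list(borough_names.values())
best_precision = max(names, key=lambda name: report[name]["precision"])
best_recall = max(names, key=lambda name: report[name]["recall"])

print("accuracy:", round(rf_accuracy, 3))
print("highest precision:", best_precision, round(report[best_precision]["precision"], 3))
print("highest recall:", best_recall, round(report[best_recall]["recall"], 3))
```

Passing target_names makes the report keys the borough names rather than the digit codes, which answers the "name" part of the question directly.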

Question 4

Assume we want to use a multinomial logit model with an L2 penalty to predict BOROUGH. For the predictors, we want to use the following:

'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'YEAR_BUILT', 'GROSS_SQUARE_FEET_LOG', 'LAND_SQUARE_FEET_LOG', 'SALE_PRICE_LOG'

For this question:

Split the dataset into a training set (80%) and a test set (20%) with random_state = 10.

Use the training set to perform 10-fold cross-validation to tune the penalty parameter inside range(1, 11). Set max_iter = 1000 for the regression.

For the model you found in the previous step, compute the following based on the test set:

The average accuracy

The name and the precision for the borough with the highest precision

The name and the recall for the borough with the highest recall
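A hedged sketch of Question 4 follows. The name of the parameter to tune is missing from the question text, so the block assumes it is C, the inverse regularization strength of scikit-learn's LogisticRegression; that is an assumption, not something the source confirms. The block relies on LogisticRegression's default L2 penalty and, as far as I know, its default lbfgs solver fits a multinomial model for multi-class targets in recent scikit-learn versions. The data is again a synthetic stand-in with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

borough_names = {1: "Manhattan", 2: "Bronx", 3: "Brooklyn",
                 4: "Queens", 5: "Staten Island"}

# Hypothetical stand-in for the processed Sales frame (synthetic values only).
rng = np.random.default_rng(0)
n = 300
borough = rng.integers(1, 6, n)
Sales = pd.DataFrame({
    "BOROUGH": borough,
    "RESIDENTIAL_UNITS": rng.integers(0, 10, n),
    "COMMERCIAL_UNITS": rng.integers(0, 3, n),
    "YEAR_BUILT": rng.integers(1900, 2017, n),
    "GROSS_SQUARE_FEET_LOG": rng.normal(7.5, 0.5, n),
    "LAND_SQUARE_FEET_LOG": 7.0 + 0.3 * borough + rng.normal(0, 0.4, n),
    "SALE_PRICE_LOG": 14.5 - 0.4 * borough + rng.normal(0, 0.5, n),
})

predictors = ["RESIDENTIAL_UNITS", "COMMERCIAL_UNITS", "YEAR_BUILT",
              "GROSS_SQUARE_FEET_LOG", "LAND_SQUARE_FEET_LOG", "SALE_PRICE_LOG"]
X, y = Sales[predictors], Sales["BOROUGH"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# Assumption: the tuned parameter is C. The L2 penalty is the default.
logit_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": list(range(1, 11))},
    cv=KFold(n_splits=10, shuffle=True, random_state=10),
    scoring="accuracy",
)
logit_search.fit(X_train, y_train)
pred = logit_search.predict(X_test)

logit_accuracy = accuracy_score(y_test, pred)
report = classification_report(
    y_test, pred,
    labels=list(borough_names),
    target_names=list(borough_names.values()),
    output_dict=True,
    zero_division=0,
)
names = list(borough_names.values())
best_precision = max(names, key=lambda name: report[name]["precision"])
best_recall = max(names, key=lambda name: report[name]["recall"])

print("best C:", logit_search.best_params_["C"])
print("accuracy:", round(logit_accuracy, 3))
print("highest precision:", best_precision, round(report[best_precision]["precision"], 3))
print("highest recall:", best_recall, round(report[best_recall]["recall"], 3))
```

Because the predictors are on very different scales (YEAR_BUILT is in the thousands while the log features are below 20), the solver may emit a convergence warning even with max_iter = 1000.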

Question 5

Perform Q4 again. This time, use StandardScaler and Pipeline to incorporate feature standardization.
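The sketch below shows one way to wrap standardization and the logit model in a Pipeline so that the scaler is refit on the training part of every cross-validation fold. The step names "scaler" and "logreg" are hypothetical labels chosen for this example, the tuned parameter is assumed to be C as in the Question 4 sketch, and the data is a synthetic stand-in with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the processed Sales frame (synthetic values only).
rng = np.random.default_rng(0)
n = 300
borough = rng.integers(1, 6, n)
Sales = pd.DataFrame({
    "BOROUGH": borough,
    "RESIDENTIAL_UNITS": rng.integers(0, 10, n),
    "COMMERCIAL_UNITS": rng.integers(0, 3, n),
    "YEAR_BUILT": rng.integers(1900, 2017, n),
    "GROSS_SQUARE_FEET_LOG": rng.normal(7.5, 0.5, n),
    "LAND_SQUARE_FEET_LOG": 7.0 + 0.3 * borough + rng.normal(0, 0.4, n),
    "SALE_PRICE_LOG": 14.5 - 0.4 * borough + rng.normal(0, 0.5, n),
})

predictors = ["RESIDENTIAL_UNITS", "COMMERCIAL_UNITS", "YEAR_BUILT",
              "GROSS_SQUARE_FEET_LOG", "LAND_SQUARE_FEET_LOG", "SALE_PRICE_LOG"]
X, y = Sales[predictors], Sales["BOROUGH"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# Scaling lives inside the pipeline, so each CV fold standardizes using only
# its own training portion and no validation data leaks into the scaler.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
pipe_search = GridSearchCV(
    pipe,
    param_grid={"logreg__C": list(range(1, 11))},
    cv=KFold(n_splits=10, shuffle=True, random_state=10),
    scoring="accuracy",
)
pipe_search.fit(X_train, y_train)

pipe_accuracy = accuracy_score(y_test, pipe_search.predict(X_test))
print("best params:", pipe_search.best_params_)
print("accuracy:", round(pipe_accuracy, 3))
```

The per-borough precision and recall can be obtained from classification_report on pipe_search.predict(X_test), in the same way as for Question 4.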
