Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

SHOW STEP BY STEP HOW TO DO IN XLMINER Boston Housing This dataset contains information collected by the US Census Bureau concerning housing prices in

SHOW STEP BY STEP HOW TO DO IN XLMINER
Boston Housing
This dataset contains information collected by the US Census Bureau concerning housing prices in
the area of Boston, Massachusetts. The dataset has 506 cases.
For the classification exercise, we have created a new variable CAT.MEDV that assigns a class
based on the median value of the tract. The class definition is as follows:
1= less than or equal to $15,000(Low value)
2= less than or equal to $30,000 but more than $15,000(Medium value)
3= More than $30,000(High value)
The following steps are recommended:
1. Start XLMiner
2. Open the Boston Housing file.
3. Partition the data (you can use the defaults)
4. Run classification tree, using the Prune tree with validation data option. Use the
"normalize data" option, and set both the "Min. # of records in a terminal node" and the
"Max. # of levels displayed in tree drawing" to 5. Use the output options specified below.
The key ingredient in a tree algorithm is the splitting of the data according to the value of a
variable, and for the XLMiner tree algorithm the split for normalized data is the same as
that for non-normalized data.
a. Full Tree
b. Best Pruned Tree
Study the results and answer the following questions:
Questions:
1. What is the error rate for the validation data, using the best pruned tree?
2. Study the best pruned tree and develop a classification rule, using if-then statements.
3. What is the difference in terms of number of nodes between the best pruned tree and full
tree?
4. If you were to apply these trees to new data from the same source as these data, how
would you expect the error rate for the full tree to compare to the error rate for the best
pruned tree?
5. Supposing the local government decided to levy tax based on the predicted class of
houses and the tax rate is fixed as:
Class 1= $250
Class 2= $500
Class 3= $800
2/3
Estimate the overall net gain or loss the government will make due to misclassification
(based on the misclassification of the validation data.) Use the Assessment sheet in
BostonHousing.xls file to explain your computation.
Charles Book Club
Identify the customers to whom you will send the mailers, using as a cut-off criterion a 20%
probability of a purchase, as predicted by a logistic regression model. In other words, you would
propose mailing to any customer who has a predicted probability of purchase >=0.20(as predicted
by your logistic regression models). Build 3 models by selecting variables as follows (partitioning
50% to training, 30% to validation, and 20% to test; we will not be using the test partition right
now):
a. The full set of 15 predictors as independent variables and "Florence" as the dependent
variable,
b. A best subset selection of six variables (note: the best subset selection process in
XLMiner is similar to that for multiple linear regression for subset selection)
c. Only the R, F, M variables.
(Note: Once you have generated XLMiner output LR_ClassifyValid sheet, you can update the
classification of records by modifying the cut-off criterion field. Verify this (for illustrative
purposes) by sorting the data in descending order of values in the "Predicted Prob. of
Success" field in order to locate the cut-off value of 0.2 or 20%. You will need this sort to
answer question 2 below.)
2. Indicate the trial that produced better results and explain why.
Tabulate the results of the three experiments conducted above and report which experiment
produced the best result. Tabulate your conclusion in the Assessment table below.
Assessment Table
Srl Evaluation Technique Target
Customer
Nos.
Actual
Response
Nos.
Response
Rate (%)
1. If we sent mailers to all the customers in
validation data set.
2. Logistics Regression: cut-off rate of 20%
probability of success.
a. The full set of 15 predictors as
3/3
dependent variables and "Florence"
as the independent variable,
b. Best subset selection of six
variables, including constant*
c. Only the R, F, M variables.
*If you wish, you can use the "best subset" option in Logistic Regression. When you implement it,
should set "size of best subset" to "8" and "# of best subsets" to "1"(this will produce one best
subset for each size of 8 and smaller i.e. one best subset with 8 variables, one with 7, one with
six, etc.). After you run logistic regression with best subset selected, the summary output page will
give you information you can use in choosing which variables to use. You then need to run logistic
regression again, using those variables.
3. For the best model, indicate the lift ratio for first 250 cases.
Note: The lift ratio is calculated as follows:
1. Sort by predicted probability of success
2. For the first 250 rows, determine response rate ("actuals" divided by 250)
3. Divide this response rate by the response rate obtained by mailing to everyone (#1 above).
4. This is the "lift" obtained by mailing to 250 cases.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access with AI-Powered Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Students also viewed these Databases questions