Question
Multiple linear regression predicts the value of one dependent variable from the values of more than one independent variable, so consistent notation becomes very important. In STATDISK, the multiple linear regression model (under the "analysis" and "multiple regression" drop-down menus) can use up to sixteen independent variables to predict the value of one dependent variable. This multiple linear regression model has the form:
y = b0 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + b7x7 + b8x8 + b9x9 + b10x10 + b11x11 + b12x12 + b13x13 + b14x14 + b15x15 + b16x16 + b17x17
where
- y is the dependent variable selected by choosing the appropriate column value for "dependent variable column." For consistency purposes, I usually list the dependent variable in column 1.
- xi, for all i ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}, are the independent variables selected by "select[ing] the columns to include in the regression analysis." Notice that the first column is also selected, because it is included in the regression analysis as the dependent variable.
- b0 is the constant coefficient.
- bi is the coefficient of the independent variable xi, describing the strength of that variable's effect on y.
It is not necessary to select all of the available independent variables! The model will have fewer bixi terms when fewer columns are included in the regression analysis.
Using this model, several independent variable values may predict the value of one dependent variable. For instance, if b0 = 7, b2 = -2, b3 = 3, b4 = -1, b5 = 8, b6 = 123, b7 = 2, b8 = 1, b9 = -5, b10 = 1, b11 = 4, b12 = 6, b13 = 9, b14 = -13, b15 = -4, b16 = 11, and b17 = 12, then the regression model is expressed as:
y = 7 - 2x2 + 3x3 - x4 + 8x5 + 123x6 + 2x7 + x8 - 5x9 + x10 + 4x11 + 6x12 + 9x13 - 13x14 - 4x15 + 11x16 + 12x17
If the values of the independent variables are x2 = 3, x3 = -1, x4 = 5, x5 = 2, x6 = 0, x7 = -9, x8 = -19, x9 = -2, x10 = 66, x11 = 2, x12 = 0, x13 = 0, x14 = 0, x15 = 1, x16 = 0, and x17 = 0, then the linear regression equation predicts that the corresponding dependent variable value is:
y = 7 - 2(3) + 3(-1) - (5) + 8(2) + 123(0) + 2(-9) + (-19) - 5(-2) + (66) + 4(2) + 6(0) + 9(0) - 13(0) - 4(1) + 11(0) + 12(0) = 52
This means that if x2 = 3, x3 = -1, x4 = 5, x5 = 2, x6 = 0, x7 = -9, x8 = -19, x9 = -2, x10 = 66, x11 = 2, x12 = 0, x13 = 0, x14 = 0, x15 = 1, x16 = 0, and x17 = 0, then y = 52.
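The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation, not anything STATDISK produces; the names `coefficients`, `values`, and `predict` are chosen here for illustration.

```python
# Coefficients from the example, keyed by subscript (0 is the constant b0).
coefficients = {0: 7, 2: -2, 3: 3, 4: -1, 5: 8, 6: 123, 7: 2, 8: 1, 9: -5,
                10: 1, 11: 4, 12: 6, 13: 9, 14: -13, 15: -4, 16: 11, 17: 12}

# Independent variable values from the example, keyed by the same subscripts.
values = {2: 3, 3: -1, 4: 5, 5: 2, 6: 0, 7: -9, 8: -19, 9: -2, 10: 66,
          11: 2, 12: 0, 13: 0, 14: 0, 15: 1, 16: 0, 17: 0}


def predict(coefficients, values):
    """Return b0 plus the sum of bi * xi over the supplied columns."""
    return coefficients[0] + sum(coefficients[i] * x for i, x in values.items())


y = predict(coefficients, values)
print(y)  # 52
```

Because `predict` only sums over the keys present in `values`, the same function works when fewer columns are included in the model.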
The multiple regression function in STATDISK is under the analysis menu. It behaves similarly to the correlation and regression function. The multiple regression function must be used when there is more than one independent variable predicting the dependent variable.
Background: Dummy Variables
A dummy variable (also known as an indicator variable) is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect. Dummy variables are used as devices to sort data into mutually exclusive categories (such as whether or not a home has central air conditioning). The numerical value of a dummy variable is either 0 (usually for "no") or 1 (usually for "yes").
In some cases, there is a need for more than one dummy variable to sort data into mutually exclusive categories. For example, Hollister homes were sold during one of six time frames:
- 1998JAN-JUN,
- 1998JUL-DEC,
- 1999JAN-JUN,
- 1999JUL-DEC,
- 2000JAN-JUN, or
- 2000JUL-DEC.
Six mutually exclusive categories yield only five degrees of freedom; therefore, only five dummy variables are present in the data. Specifically, 1998JAN-JUN is not present; it serves as the baseline category.
For example, if a home was sold in March, 2000, then
- 1998JUL-DEC = 0,
- 1999JAN-JUN = 0,
- 1999JUL-DEC = 0,
- 2000JAN-JUN = 1, and
- 2000JUL-DEC = 0.
Notice that if a home was sold in January, 1998, then
- 1998JUL-DEC = 0,
- 1999JAN-JUN = 0,
- 1999JUL-DEC = 0,
- 2000JAN-JUN = 0, and
- 2000JUL-DEC = 0
meaning that the home was not sold in any of the dummy variable categories present, which forces the sale into the only remaining (unlisted) category, 1998JAN-JUN.
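The encoding described above can be sketched as a small Python function. The function names `period_label` and `encode_sale_period` are hypothetical; only the five column names come from the text.

```python
# The five dummy-variable columns present in the data.
# 1998JAN-JUN is the omitted baseline, so it has no column of its own.
DUMMY_PERIODS = ["1998JUL-DEC", "1999JAN-JUN", "1999JUL-DEC",
                 "2000JAN-JUN", "2000JUL-DEC"]


def period_label(year, month):
    """Map a sale year and month (1-12) to a half-year label like '2000JAN-JUN'."""
    half = "JAN-JUN" if month <= 6 else "JUL-DEC"
    return f"{year}{half}"


def encode_sale_period(year, month):
    """Return the five 0/1 dummy values for a sale in the given month."""
    if year not in (1998, 1999, 2000) or not 1 <= month <= 12:
        raise ValueError("sale date is outside the six time frames")
    label = period_label(year, month)
    return {name: int(name == label) for name in DUMMY_PERIODS}


print(encode_sale_period(2000, 3))   # only 2000JAN-JUN is 1
print(encode_sale_period(1998, 1))   # all zeros: the baseline category
```

A sale in the baseline period produces all zeros, which is why a sixth column would carry no extra information.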
Question:
Using the data_1998-2000_MREG.csv dataset, find the multiple linear regression model predicting the sale price of Hollister homes, and predict the sale price of a 1600 square foot, 5-year-old Hollister home on a 10000 square foot lot with central air conditioning, a pool, and a two-car garage, sold in February 2000 and located in the Cerra Vista Elementary School District (CV).
Hint: After finding the multiple linear regression model including all sixteen independent variables, use the following variable values to predict the sale price of the home:
- 1998JUL-DEC = 0
- 1999JAN-JUN = 0
- 1999JUL-DEC = 0
- 2000JAN-JUN = 1
- 2000JUL-DEC = 0
- CAL = 0
- CV = 1
- GAB = 0
- ROH = 0
- SS = 0
- CENTRAL = 1
- POOL = 1
- GARAGE = 2
- SQUARE FOOTAGE = 1600
- AGE = 5
- LOT SIZE = 10000
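For readers who want to reproduce the fit-then-predict workflow outside STATDISK, here is a minimal pure-Python sketch. It makes several assumptions: it solves the least-squares normal equations directly, which is not necessarily how STATDISK computes its coefficients; the function names are hypothetical; and because the dataset is not reproduced here, the fit is shown on made-up toy data with two independent variables. `HINT_VALUES` lists the sixteen values above in the order given, which is assumed (not confirmed) to match the column order of the file.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ...] minimizing squared error via the normal equations."""
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("independent variables are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def predict(coefs, x):
    """Evaluate the constant plus the sum of coefficient * value."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))


# Toy data (made up for illustration): y = 10 + 2*x1 + 3*x2 exactly.
toy_rows = [[1, 0], [0, 1], [1, 1], [2, 1], [3, 5]]
toy_y = [12, 13, 15, 17, 31]
coefs = fit_least_squares(toy_rows, toy_y)
print([round(b, 6) for b in coefs])
print(round(predict(coefs, [4, 2]), 6))

# The hint values, in the order listed above (assumed to match the file's columns).
HINT_VALUES = [0, 0, 0, 1, 0,      # 1998JUL-DEC ... 2000JUL-DEC
               0, 1, 0, 0, 0,      # CAL, CV, GAB, ROH, SS
               1, 1, 2,            # CENTRAL, POOL, GARAGE
               1600, 5, 10000]     # SQUARE FOOTAGE, AGE, LOT SIZE
```

With coefficients fitted from the real dataset, `predict(coefs, HINT_VALUES)` would give the requested sale price. For data with columns on very different scales, such as lot size next to 0/1 dummies, a library least-squares routine is likely to be more numerically reliable than this hand-rolled solver.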