
ECONOMICS 2123B - ECONOMETRICS II
LECTURE 5: SPECIFICATION OF REGRESSION VARIABLES
George Orlov, Department of Economics, Western University

Due Credit and Copyright

These slides are by no means a substitute for the textbook readings and represent a modified version of Christopher Dougherty's slides. Original citation: Dougherty, C. (2012) EC220 - Introduction to Econometrics (Chapter 6). [Teaching Resource] (c) 2012 The Author. This version available at: http://learningresources.lse.ac.uk/132/ Available in LSE Learning Resources Online: May 2012. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License. This license allows the user to remix, tweak, and build upon the work even for commercial purposes, as long as the user credits the author and licenses their new creations under the identical terms. http://creativecommons.org/licenses/by-sa/3.0/

Overview of Today's Lecture

In this lecture block we will look at the specification of the explanatory variables in the regression model and at interpreting the results. We will look at the following concepts:
- Omitted variable bias
- Proxy variables
- Testing linear restrictions (F-tests and t-tests)

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

In this section and the next we will investigate the consequences of misspecifying the regression model in terms of explanatory variables. To keep the analysis simple, we will assume that there are only two possibilities: either Y depends only on X2, or it depends on both X2 and X3. The consequences of each combination of true model and fitted model are:

- True model $Y = \beta_1 + \beta_2 X_2 + u$, fitted model $\hat Y = b_1 + b_2 X_2$: correct specification, no problems, assuming of course that the regression model assumptions are valid.
- True model $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$, fitted model $\hat Y = b_1 + b_2 X_2 + b_3 X_3$: likewise, correct specification, no problems.
- True model $Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$, fitted model $\hat Y = b_1 + b_2 X_2$: the omission of a relevant explanatory variable causes the regression coefficients to be biased (in general) and the standard errors to be invalid.

In the present case, the omission of X3 causes b2 to be biased by the second term in

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (X_{2i} - \bar X_2)(X_{3i} - \bar X_3)}{\sum (X_{2i} - \bar X_2)^2}$

We will explain this first intuitively and then demonstrate it mathematically.
[Diagram: the direct effect of X2 on Y, holding X3 constant, is β2; X3 affects Y with coefficient β3; X2 also has an apparent effect on Y by acting as a mimic for X3.]

The intuitive reason is that, in addition to its direct effect β2, X2 has an apparent indirect effect as a consequence of acting as a proxy for the missing X3. The strength of the proxy effect depends on two factors: the strength of the effect of X3 on Y, which is given by β3, and the ability of X2 to mimic X3. The ability of X2 to mimic X3 is determined by the slope coefficient obtained when X3 is regressed on X2, which is the ratio in the bias term above.

We will now derive the expression for the bias mathematically. It is convenient to start by deriving an expression for the deviation of Yi about its sample mean, which can be expressed in terms of the deviations of X2, X3, and u about their sample means:

$Y_i - \bar Y = \beta_2 (X_{2i} - \bar X_2) + \beta_3 (X_{3i} - \bar X_3) + (u_i - \bar u)$

Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only. The slope coefficient is therefore

$b_2 = \dfrac{\sum (X_{2i} - \bar X_2)(Y_i - \bar Y)}{\sum (X_{2i} - \bar X_2)^2}$

Substituting for the Y deviations and simplifying, we see that b2 has three components:

$b_2 = \beta_2 + \beta_3 \dfrac{\sum (X_{2i} - \bar X_2)(X_{3i} - \bar X_3)}{\sum (X_{2i} - \bar X_2)^2} + \dfrac{\sum (X_{2i} - \bar X_2)(u_i - \bar u)}{\sum (X_{2i} - \bar X_2)^2}$

To investigate biasedness or unbiasedness, we take the expected value of b2. The first two terms are unaffected because they contain no random components, so we focus on the expectation of the error term.
$E\left[\dfrac{\sum (X_{2i} - \bar X_2)(u_i - \bar u)}{\sum (X_{2i} - \bar X_2)^2}\right] = \dfrac{1}{\sum (X_{2i} - \bar X_2)^2}\, E\left[\sum (X_{2i} - \bar X_2)(u_i - \bar u)\right] = \dfrac{1}{\sum (X_{2i} - \bar X_2)^2} \sum (X_{2i} - \bar X_2)\, E(u_i - \bar u) = 0$

X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be taken outside the expectation. In the numerator, the expectation of a sum is equal to the sum of the expectations (first expected value rule), and in each product the factor involving X2 may be taken out of the expectation because X2 is nonstochastic. By Assumption A.3, the expected value of u is 0, and it follows that the expected value of the sample mean of u is also 0. Hence the expected value of the error term is 0, and

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (X_{2i} - \bar X_2)(X_{3i} - \bar X_3)}{\sum (X_{2i} - \bar X_2)^2}$

Thus we have shown that the expected value of b2 is equal to the true value plus a bias term. (Note: the bias of an estimator is the difference between its expected value and the true value of the parameter being estimated.) As a further consequence of the misspecification, the standard errors, t tests, and F test are invalid.
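The result can also be seen in a small simulation. Below is a minimal Stata sketch (hypothetical data, separate from the EAEF examples that follow): X2 and X3 are generated to be positively correlated, the true model involves both, and the misspecified regression omits X3, so b2 should be centred not on its true value of 2 but on roughly 2 plus β3 times the slope of X3 on X2.

* Minimal simulation sketch of omitted variable bias (hypothetical data)
clear
set obs 1000
set seed 12345
gen X2 = rnormal()
gen X3 = 0.5*X2 + rnormal()      // X3 positively correlated with X2
gen u  = rnormal()
gen Y  = 1 + 2*X2 + 3*X3 + u     // true model
reg Y X2 X3                      // correct specification: b2 close to 2
reg Y X2                         // X3 omitted: b2 close to 2 + 3*0.5 = 3.5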
We will illustrate the bias using an educational attainment model. To keep the analysis simple, we will assume that in the true model, $S = \beta_1 + \beta_2 ASVABC + \beta_3 SM + u$, S depends only on ASVABC and SM. The output below shows the corresponding regression using a subsample of EAEF data from before.

. reg S ASVABC SM

  Number of obs = 540; F(2, 537) = 147.36; Prob > F = 0.0000
  R-squared = 0.3543; Adj R-squared = 0.3519; Root MSE = 1.963
  Residual SS = 2069.30861 (537 df); Total SS = 3204.98333 (539 df)

           S |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
      ASVABC |  .1328069   .0097389   13.64   0.000     .1136758    .151938
          SM |  .1235071   .0330837    3.73   0.000     .0585178   .1884963
       _cons |  5.420733   .4930224   10.99   0.000     4.452244   6.389222

We will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC:

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (ASVABC_i - \overline{ASVABC})(SM_i - \overline{SM})}{\sum (ASVABC_i - \overline{ASVABC})^2}$

It is reasonable to suppose, as a matter of common sense, that β3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant. The correlation between ASVABC and SM is positive:

. cor SM ASVABC  (obs=540)

          |      SM  ASVABC
       SM |  1.0000
   ASVABC |  0.4202  1.0000

Since the correlation is positive, the numerator of the bias term must be positive. The denominator is automatically positive since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive.
Here is the regression omitting SM.

. reg S ASVABC

  Number of obs = 540; F(1, 538) = 274.19; Prob > F = 0.0000
  R-squared = 0.3376; Adj R-squared = 0.3364; Root MSE = 1.9865

           S |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
      ASVABC |   .148084   .0089431   16.56   0.000     .1305165   .1656516
       _cons |  6.066225   .4672261   12.98   0.000     5.148413   6.984036

As you can see, the coefficient of ASVABC is indeed higher when SM is omitted (.148084 versus .1328069). Part of the difference may be due to pure chance, but part is attributable to the bias.
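In fact, the sample counterpart of the bias expression holds exactly as an algebraic identity for OLS: the slope in the short regression equals the slope in the multiple regression plus the coefficient of the omitted variable multiplied by the slope from regressing the omitted variable on the included one. A quick check with the estimates above (the implied auxiliary slope is an inference from the reported coefficients, not shown in the output):

$0.148084 = 0.1328069 + 0.1235071 \times \hat d \quad\Rightarrow\quad \hat d = \dfrac{0.148084 - 0.1328069}{0.1235071} \approx 0.124$

so the data are consistent with a positive slope of SM on ASVABC, as the positive correlation suggested.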
Here is the regression omitting ASVABC instead of SM. We would expect b3 to be upwards biased:

$E(b_3) = \beta_3 + \beta_2 \dfrac{\sum (ASVABC_i - \overline{ASVABC})(SM_i - \overline{SM})}{\sum (SM_i - \overline{SM})^2}$

We anticipate that β2 is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive.

. reg S SM

  Number of obs = 540; F(1, 538) = 80.93; Prob > F = 0.0000
  R-squared = 0.1308; Adj R-squared = 0.1291; Root MSE = 2.2756

           S |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
          SM |  .3130793   .0348012    9.00   0.000     .2447165   .3814422
       _cons |  10.04688   .4147121   24.23   0.000     9.232226   10.86153

In this case the bias is quite dramatic: the coefficient of SM has more than doubled (.3130793 versus .1235071). The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while β2 and β3 are similar in size, judging by their estimates.

Finally, we will investigate how R2 behaves when a variable is omitted. In the simple regression of S on ASVABC, R2 is 0.34, and in the simple regression of S on SM it is 0.13. Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because the multiple regression reveals that their joint explanatory power is 0.35, not 0.47.
In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy for ASVABC, again inflating its apparent explanatory power.

However, it is also possible for omitted variable bias to lead to a reduction in the apparent explanatory power of a variable. This will be demonstrated using a simple earnings function model, supposing the logarithm of hourly earnings to depend on S and EXP: $LGEARN = \beta_1 + \beta_2 S + \beta_3 EXP + u$.

. reg LGEARN S EXP

  Number of obs = 540; F(2, 537) = 100.86; Prob > F = 0.0000
  R-squared = 0.2731; Adj R-squared = 0.2704; Root MSE = .50274

      LGEARN |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
           S |  .1235911   .0090989   13.58   0.000     .1057173    .141465
         EXP |  .0350826   .0050046    7.01   0.000     .0252515   .0449137
       _cons |  .5093196   .1663823    3.06   0.002     .1824796   .8361596

. cor S EXP  (obs=540)

          |       S     EXP
        S |  1.0000
      EXP | -0.2179  1.0000

If we omit EXP from the regression, the coefficient of S should be subject to a downward bias:

$E(b_2) = \beta_2 + \beta_3 \dfrac{\sum (S_i - \bar S)(EXP_i - \overline{EXP})}{\sum (S_i - \bar S)^2}$

β3 is likely to be positive.
The numerator of the other factor in the bias term is negative, since S and EXP are negatively correlated, and the denominator is positive. For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP should be downwards biased:

$E(b_3) = \beta_3 + \beta_2 \dfrac{\sum (EXP_i - \overline{EXP})(S_i - \bar S)}{\sum (EXP_i - \overline{EXP})^2}$

. reg LGEARN S

      LGEARN |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
           S |  .1096934   .0092691   11.83   0.000     .0914853   .1279014
       _cons |  1.292241   .1287252   10.04   0.000     1.039376   1.545107

. reg LGEARN EXP

      LGEARN |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
         EXP |  .0202708   .0056564    3.58   0.000     .0091595    .031382
       _cons |   2.44941   .0988233   24.79   0.000     2.255284   2.643537

As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.
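The same exact OLS identity as before can be used to back out the implied auxiliary slope (again an inference from the reported estimates, not shown in the output):

$0.1096934 = 0.1235911 + 0.0350826 \times \hat d \quad\Rightarrow\quad \hat d = \dfrac{0.1096934 - 0.1235911}{0.0350826} \approx -0.396$

a negative slope of EXP on S, as the negative correlation of -0.2179 suggested.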
Comparing R2 across the three regressions:

. reg LGEARN S EXP    R-squared = 0.2731 (adj. 0.2704)
. reg LGEARN S        R-squared = 0.2065 (adj. 0.2051)
. reg LGEARN EXP      R-squared = 0.0233 (adj. 0.0215)

The sum of R2 in the simple regressions (about 0.23) is actually less than R2 in the multiple regression (0.27). This is because the apparent explanatory power of S in the second regression has been undermined by the downwards bias in its coefficient. The same is true for the apparent explanatory power of EXP in the third equation.

VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

In this section we will investigate the consequences of including an irrelevant variable in a regression model: the true model is $Y = \beta_1 + \beta_2 X_2 + u$, but we fit $\hat Y = b_1 + b_2 X_2 + b_3 X_3$. The effects are different from those of omitted variable misspecification. In this case the coefficients in general remain unbiased, but they are inefficient. The standard errors remain valid (in general), but are needlessly large.

These results can be demonstrated quickly. Rewrite the true model adding X3 as an explanatory variable, with a coefficient of 0:

$Y = \beta_1 + \beta_2 X_2 + 0 \cdot X_3 + u$

Now the true model and the fitted model coincide. Hence b2 will be an unbiased estimator of β2 and b3 will be an unbiased estimator of 0.
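A small simulation sketch (hypothetical data again) makes the point: including an irrelevant X3 that is correlated with X2 leaves b2 centred on its true value but inflates its standard error.

* Minimal simulation sketch: inclusion of an irrelevant variable
clear
set obs 1000
set seed 54321
gen X2 = rnormal()
gen X3 = 0.8*X2 + rnormal()      // irrelevant but correlated with X2
gen Y  = 1 + 2*X2 + rnormal()    // true model involves X2 only
reg Y X2                         // correct specification: smaller s.e.
reg Y X2 X3                      // b2 still unbiased; s.e. of b2 larger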
However, the variance of b2 will be larger than it would have been if the correct simple regression had been run, because it includes the factor $1/(1 - r^2_{X_2,X_3})$, where $r_{X_2,X_3}$ is the correlation between X2 and X3:

$\sigma^2_{b_2} = \dfrac{\sigma^2_u}{\sum (X_{2i} - \bar X_2)^2} \times \dfrac{1}{1 - r^2_{X_2,X_3}}$

The estimator b2 using the multiple regression model will therefore be less efficient than the alternative using the simple regression model. The intuitive reason for this is that the simple regression model exploits the information that X3 should not be in the regression, while with the multiple regression model you find this out from the regression results. The standard errors remain valid, because the model is formally correctly specified, but they will tend to be larger than those obtained in a simple regression, reflecting the loss of efficiency. These are the results in general. Note that if X2 and X3 happen to be uncorrelated, there will be no loss of efficiency after all.

The analysis will be illustrated using a regression of LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household. The source of the data was the 1995 US Consumer Expenditure Survey. The sample size was 868.

. reg LGFDHO LGEXP LGSIZE

  Number of obs = 868; F(2, 865) = 460.92; Prob > F = 0.0000
  R-squared = 0.5159; Adj R-squared = 0.5148; Root MSE = .388

      LGFDHO |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
       LGEXP |  .2866813   .0226824   12.639   0.000    .2421622   .3312003
      LGSIZE |  .4854698   .0255476   19.003   0.000    .4353272   .5356124
       _cons |  4.720269   .2209996   21.359   0.000    4.286511   5.154027
Now add LGHOUS, the logarithm of annual expenditure on housing services. It is safe to assume that LGHOUS is an irrelevant variable and, not surprisingly, its coefficient is not significantly different from zero.

. reg LGFDHO LGEXP LGSIZE LGHOUS

  Number of obs = 868; F(3, 864) = 307.22; Prob > F = 0.0000
  R-squared = 0.5161; Adj R-squared = 0.5145; Root MSE = .38812

      LGFDHO |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
       LGEXP |  .2673552   .0370782    7.211   0.000    .1945813    .340129
      LGSIZE |  .4868228   .0256383   18.988   0.000    .4365021   .5371434
      LGHOUS |  .0229611   .0348408    0.659   0.510   -.0454214   .0913436
       _cons |  4.708772   .2217592   21.234   0.000    4.273522   5.144022

LGHOUS is, however, highly correlated with LGEXP (correlation coefficient 0.81), and also, to a lesser extent, with LGSIZE (correlation coefficient 0.33).

. cor LGHOUS LGEXP LGSIZE  (obs=869)

          |  LGHOUS   LGEXP  LGSIZE
   LGHOUS |  1.0000
    LGEXP |  0.8137  1.0000
   LGSIZE |  0.3256  0.4491  1.0000

Its inclusion does not cause the coefficients of LGEXP and LGSIZE to be biased, but it does increase their standard errors, particularly that of LGEXP, as you would expect, reflecting the loss of efficiency.
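The size of the increase is roughly in line with the variance expression above. Treating the two-regressor formula as an approximation here (the fitted model has three regressors, so this is only indicative), the inflation factor implied for LGEXP by its correlation of 0.8137 with LGHOUS is $1/(1 - 0.8137^2) \approx 2.96$, suggesting a standard error about $\sqrt{2.96} \approx 1.7$ times larger; the reported standard error in fact rises from .0226824 to .0370782, a factor of about 1.6.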
VARIABLE MISSPECIFICATION III: CONSEQUENCES FOR DIAGNOSTICS

Here is a regression of the logarithm of hourly earnings on years of schooling and weight in pounds.

. reg LGEARN S WEIGHT85

  Number of obs = 540; F(2, 537) = 78.89; Prob > F = 0.0000
  R-squared = 0.2271; Adj R-squared = 0.2242; Root MSE = .51839

      LGEARN |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
           S |  .1092273   .0091576   11.93   0.000     .0912382   .1272164
    WEIGHT85 |  .0024192   .0006402    3.78   0.000     .0011616   .0036769
       _cons |  .9194011   .1609538    5.71   0.000     .6032248   1.235577

The weight coefficient implies that an extra pound leads to a 0.24% increase in earnings, so four extra pounds lead to a 1% increase. Can you really believe this? Perhaps not, but the t statistic is very highly significant. What is going on? Older people tend to have more work experience, which increases their earnings. They also tend to weigh more. This could be an explanation.
Here we have controlled for work experience.

. reg LGEARN S EXP WEIGHT85

  Number of obs = 540; F(3, 536) = 70.13; Prob > F = 0.0000
  R-squared = 0.2819; Adj R-squared = 0.2779; Root MSE = .50015

      LGEARN |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
           S |  .1222516   .0090671   13.48   0.000     .1044402    .140063
         EXP |  .0324871   .0050807    6.39   0.000     .0225066   .0424676
    WEIGHT85 |  .0016163   .0006303    2.56   0.011     .0003781   .0028545
       _cons |   .318147   .1815401    1.75   0.080    -.0384704   .6747644

The weight coefficient is lower, but still almost significant at the 1% level. Can you think of any other variable that might be correlated with both earnings and weight? The MALE dummy is such a variable. When it is included, the weight effect disappears.

. reg LGEARN S EXP MALE WEIGHT85

  Number of obs = 540; F(4, 535) = 64.31; Prob > F = 0.0000
  R-squared = 0.3247; Adj R-squared = 0.3197; Root MSE = .48546

      LGEARN |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
           S |  .1197587   .0088112   13.59   0.000       .10245   .1370674
         EXP |  .0282462   .0049849    5.67   0.000     .0184538   .0380386
        MALE |  .2953164   .0506962    5.83   0.000     .1957283   .3949045
    WEIGHT85 | -.0006213   .0007224   -0.86   0.390    -.0020404   .0007978
       _cons |  .6269889   .1840109    3.41   0.001     .2655164   .9884614

The point of this example is that model misspecification - variable misspecification or indeed any kind of misspecification - in general will invalidate the regression diagnostics, and as a consequence the diagnostics may lead you to the wrong conclusions.
In the original model, we had two kinds of variable misspecification: we omitted EXP and MALE, and we included the irrelevant variable WEIGHT85. Including an irrelevant variable is one of the few types of misspecification that does not lead to the invalidation of the regression diagnostics. However, omitting relevant variables certainly does. This is why the t statistic in the original specification misled us.

PROXY VARIABLES

Suppose that a variable Y is hypothesized to depend on a set of explanatory variables X2, ..., Xk,

$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + ... + \beta_k X_k + u$

and suppose that for some reason there are no data on X2. As we have seen, a regression of Y on X3, ..., Xk would yield biased estimates of the coefficients and invalid standard errors and tests.

Sometimes, however, these problems can be reduced or eliminated by using a proxy variable in the place of X2. A proxy variable is one that is hypothesized to be linearly related to the missing variable. In the present example, Z could act as a proxy for X2:

$X_2 = \lambda + \mu Z$

The validity of the proxy relationship must be justified on the basis of theory, common sense, or experience. It cannot be checked directly because there are no data on X2. If a suitable proxy has been identified, the regression model can be rewritten as

$Y = \beta_1 + \beta_2 (\lambda + \mu Z) + \beta_3 X_3 + ... + \beta_k X_k + u = (\beta_1 + \beta_2 \lambda) + \beta_2 \mu Z + \beta_3 X_3 + ... + \beta_k X_k + u$

We thus obtain a model with all variables observable. If the proxy relationship is an exact one, and we fit this relationship, most of the regression results will be rescued:

1. The estimates of the coefficients of X3, ..., Xk will be the same as those that would have been obtained if it had been possible to regress Y on X2, ..., Xk.
2. The standard errors and t statistics of the coefficients of X3, ..., Xk will be the same as those that would have been obtained if it had been possible to regress Y on X2, ..., Xk.
3. R2 will be the same as it would have been if it had been possible to regress Y on X2, ..., Xk.
4. The coefficient of Z will be an estimate of $\beta_2 \mu$, and so it will not be possible to obtain an estimate of β2 unless you are able to guess the value of μ.
5. However, the t statistic for Z will be the same as that which would have been obtained for X2 if it had been possible to regress Y on X2, ..., Xk, and so you are able to assess the significance of X2, even if you are not able to estimate its coefficient.
6. It will not be possible to obtain an estimate of β1, since the intercept in the revised model is $(\beta_1 + \beta_2 \lambda)$, but usually β1 is of relatively little interest anyway.

It is generally more realistic to hypothesize that the relationship between X2 and Z is approximate, rather than exact. In that case the results listed above will hold approximately. However, if Z is a poor proxy for X2, the results will effectively be subject to measurement error (see Chapter 8). Further, it is possible that some of the other X variables will try to act as proxies for X2, and there will still be a problem of omitted variable bias.
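A small simulation sketch (hypothetical data) of the exact-proxy case: with $X_2 = \lambda + \mu Z$ holding exactly, the proxy regression reproduces the coefficient, standard error, and t statistic of X3, the R2, and the t statistic of the missing variable; only the coefficient of Z rescales to $\beta_2 \mu$.

* Minimal simulation sketch of an exact proxy (hypothetical data)
clear
set obs 1000
set seed 2123
gen Z  = rnormal()
gen X2 = 2 + 0.5*Z               // exact proxy relationship, lambda=2, mu=0.5
gen X3 = rnormal()
gen Y  = 1 + 2*X2 + 3*X3 + rnormal()
reg Y X2 X3                      // infeasible in practice; shown for comparison
reg Y Z X3                       // same R2, same results for X3, same t on Z;
                                 // coefficient of Z estimates 2*0.5 = 1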
The use of a proxy variable will be illustrated with an educational attainment model. We will suppose that educational attainment depends jointly on cognitive ability and family background:

$S = \beta_1 + \beta_2 ASVABC + \beta_3 INDEX + u$

As usual, ASVABC will be used as the measure of cognitive ability. However, there is no 'family background' variable in the data set. Indeed, it is difficult to conceive how such a variable might be defined. Instead, we will try to find a proxy. One obvious variable is the mother's educational attainment, SM. However, father's educational attainment, SF, may also be relevant. So we will hypothesize that the family background index depends on both:

$INDEX = \lambda_1 SM + \lambda_2 SF$

Thus we obtain a relationship expressing S as a function of ASVABC, SM, and SF:

$S = \beta_1 + \beta_2 ASVABC + \beta_3 (\lambda_1 SM + \lambda_2 SF) + u = \beta_1 + \beta_2 ASVABC + \beta_3 \lambda_1 SM + \beta_3 \lambda_2 SF + u$

Here is the corresponding regression using the subsample of the EAEF data set.

. reg S ASVABC SM SF

  Number of obs = 540; F(3, 536) = 104.30; Prob > F = 0.0000
  R-squared = 0.3686; Adj R-squared = 0.3651; Root MSE = 1.943
  Residual SS = 2023.61353 (536 df); Total SS = 3204.98333 (539 df)

           S |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
      ASVABC |  .1257087   .0098533   12.76   0.000     .1063528   .1450646
          SM |  .0492424   .0390901    1.26   0.208     -.027546   .1260309
          SF |  .1076825   .0309522    3.48   0.001       .04688   .1684851
       _cons |  5.370631   .4882155   11.00   0.000      4.41158   6.329681

For comparison, here is the regression of S on ASVABC alone, reported earlier.

. reg S ASVABC

           S |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
      ASVABC |   .148084   .0089431   16.56   0.000     .1305165   .1656516
       _cons |  6.066225   .4672261   12.98   0.000     5.148413   6.984036

A comparison of the regressions indicates that the coefficient of ASVABC is biased upwards if we make no attempt to control for family background. This is what we should expect: both SM and SF are likely to have positive effects on educational attainment, and they are both positively correlated with ASVABC.

. cor ASVABC SM SF  (obs=570)

          |  ASVABC      SM      SF
   ASVABC |  1.0000
       SM |  0.4202  1.0000
       SF |  0.4090  0.6241  1.0000
LIBRARY (a dummy variable equal to 1 if anyone in the family owned a library card when the respondent was 14) and SIBLINGS (the number of brothers and sisters of the respondent) are two other variables in the data set which might act as proxies for family background.

. reg S ASVABC SM SF LIBRARY SIBLINGS

  Number of obs = 540; F(5, 534) = 63.21; Prob > F = 0.0000
  R-squared = 0.3718; Adj R-squared = 0.3659; Root MSE = 1.9418

           S |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
  -----------+----------------------------------------------------------------
      ASVABC |  .1245327   .0099875   12.47   0.000      .104913   .1441523
          SM |  .0388414    .039969    0.97   0.332    -.0396743   .1173571
          SF |  .1035001   .0311842    3.32   0.001     .0422413   .1647588
     LIBRARY | -.0355224   .2134634   -0.17   0.868    -.4548534   .3838086
    SIBLINGS | -.0665348   .0408795   -1.63   0.104    -.1468392   .0137696
       _cons |  5.846517   .5681221   10.29   0.000     4.730489   6.962546
The LIBRARY variable was one of three variables included in the National Longitudinal Survey of Youth to help pick up the influence of family background on education. Surprisingly, it has a negative coefficient, but it is not significant. There is a tendency for parents who are ambitious for their children to limit their number, so SIBLINGS should be expected to have a negative coefficient. It does, but it is also not significant.

There are further background variables which may be relevant for educational attainment: faith, ethnicity, and region of residence. These variables are supplied in the data set, but it will be left to you to experiment with them.

F TEST OF A LINEAR RESTRICTION

In the last sequence it was argued that educational attainment might be related to cognitive ability and family background, with mother's and father's educational attainment proxying for the latter:

$S = \beta_1 + \beta_2 ASVABC + \beta_3 SM + \beta_4 SF + u$

However, when we run the regression (reg S ASVABC SM SF, shown above), we find that the coefficient of mother's education is not significant (t = 1.26, p = 0.208).
In the discussion of multicollinearity, several measures for alleviating the problem were suggested, among them the use of an appropriate theoretical restriction. 4

In particular, in the case of the present model, it was suggested that the impact of parental education might be the same for both parents, that is, that the restriction β4 = β3 might hold. 5

If this is the case, the model may be rewritten as

    S = β1 + β2 ASVABC + β3 (SM + SF) + u
      = β1 + β2 ASVABC + β3 SP + u,      where SP = SM + SF.

We now have a total parental education variable, SP, instead of separate variables for mother's and father's education, and the multicollinearity caused by the correlation between the latter has been eliminated. 6

. g SP=SM+SF
. reg S ASVABC SP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  156.04
       Model |  1177.98338     2  588.991689           Prob > F      =  0.0000
    Residual |  2026.99996   537  3.77467403           R-squared     =  0.3675
-------------+------------------------------           Adj R-squared =  0.3652
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9429

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1253106   .0098434    12.73   0.000     .1059743    .1446469
          SP |   .0828368   .0164247     5.04   0.000     .0505722    .1151014
       _cons |    5.29617   .4817972    10.99   0.000     4.349731    6.242608
------------------------------------------------------------------------------

Here is the regression with SP replacing SM and SF. 7
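Generating SP by hand is one way to impose the restriction; Stata can also impose it directly through constrained least squares. A sketch (not in the original slides) using the constraint and cnsreg commands, which should reproduce the fit of the regression of S on ASVABC and SP:

. * Define the linear restriction that the SM and SF coefficients are equal
. constraint 1 SM = SF
. * Constrained least squares: SM and SF are forced to share one coefficient
. cnsreg S ASVABC SM SF, constraints(1)

With the constraint imposed, the common coefficient reported for SM and SF equals the SP coefficient above, .0828.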
A comparison of the two regressions reveals that the standard error of the coefficient of SP (.0164) is much smaller than those of SM (.0391) and SF (.0310), and consequently its t statistic is higher. Its coefficient is a compromise between those of SM and SF, as might be expected. 8

However, the use of a restriction will lead to a gain in efficiency only if the restriction is valid. If it is not valid, its use will lead to biased coefficients and invalid standard errors and tests. 9
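Whether the restriction is valid can be checked with the F test that gives this sequence its name. The residual sums of squares reported above are all that is needed: RSS is 2026.99996 for the restricted model (with SP) and 2023.61353 for the unrestricted one (with SM and SF), with one restriction and 536 residual degrees of freedom. A minimal sketch of the computation, together with Stata's built-in equivalent:

. * F = ((RSS_restricted - RSS_unrestricted)/restrictions) / (RSS_unrestricted/(n - k))
. display ((2026.99996 - 2023.61353)/1) / (2023.61353/536)
. * Built-in test of the same restriction after the unrestricted regression
. quietly reg S ASVABC SM SF
. test SM = SF

The computed F statistic is about 0.90, well below the critical value of F(1, 536) at the 5 percent level (roughly 3.86), so the restriction is not rejected and the use of SP appears justified.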
