Week 4 Lecture 10

We have been examining the question of equal pay for equal work for several weeks now, but have been somewhat frustrated with the "equal work" part. We suspect that salary varies with grade level, so that equal work is not being compared if we look at salaries across grades. We found that we could control for the effect of grades with either of two techniques. The first is choosing a variable that does not include grade-level variation, such as the compa-ratio (the salary divided by the grade midpoint). The second is statistically removing the impact of grade level using the ANOVA two-factor without replication. Both of these gave us different outcomes on the question of male and female pay equality than examining salary alone.

However, we still have not gotten a "clean" measure of equal work, as there are still other factors that may impact the work done, such as performance level (measured by the performance appraisal rating), seniority, education, etc. And there could be gender bias (and, for real-world companies, ethnic bias as well; we will not cover this, but it can be dealt with the same way as we will examine gender). We need to find a way to eliminate the impact of these variables on our pay measure as well. This week we will look at two techniques that are very good at examining and explaining the influence of variables on outcomes: correlation and regression.

Linear Correlation

Correlation is a measure of how variables relate - that is, if one variable changes, does another variable change in a predictable pattern as well? One very well-known example is the correlation (or relationship) between the length/height of children and their weight. As children become longer/taller, their weight also increases (Tanner & Youssef-Morgan, 2013). Using this relationship, we can make predictions (using the technique of regression discussed in Lecture 11 for this week) about how heavy a child should be for any given height. For variables that are at least interval in nature, two types of correlation exist for a bivariate (two variables only) relationship: linear and curvilinear. As they sound, linear correlations show the extent to which the data variables move in a straight line. Curvilinear correlations - which we will not cover - show the extent to which variables move in curved lines.

Scatter Diagrams

An effective way to see if the data relate in predictable ways is to generate a scatter diagram (AKA scatter chart) - a visual display of how the data points (variable 1 value, corresponding variable 2 value) relate to each other (Lind, Marchal, & Wathen, 2008).

Example 1. One relationship we might expect to show a positive (both values increasing) relationship is salary and performance rating, either across the entire salary range or at least within grades. The following scatter diagram (made with the Excel Insert Chart functions) shows the relationship, with Performance Rating on the horizontal axis and Salary on the vertical axis. It shows that if we put a straight line through the data points, there is a very modest increase from the lower left to the upper right.

Salary (Y-axis) and Performance Rating (X-axis)

Example 2. If we look at the same variables but include Grade as a factor, we get the second graph (below) and see the data separated by grade. Each grade seems to show (again, if we were to put a straight line through the data points for each grade) level lines, indicating no correlation at all.
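For those who prefer to work outside Excel, both charts can be sketched in Python as well. This is a minimal sketch only - the file name salary_data.csv and the column names rating, salary, and grade are stand-ins for however your copy of the data set is actually stored:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("salary_data.csv")   # hypothetical file name

# Chart 1: all employees together (Performance Rating vs. Salary)
plt.scatter(df["rating"], df["salary"])
plt.xlabel("Performance Rating")
plt.ylabel("Salary")
plt.show()

# Chart 2: one series per grade, like the Excel chart split by grade
for grade, grp in df.groupby("grade"):
    plt.scatter(grp["rating"], grp["salary"], label=str(grade))
plt.xlabel("Performance Rating")
plt.ylabel("Salary")
plt.legend(title="Grade")
plt.show()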
Neither graph gives us much hope that Performance Rating is related to Salary - something HR would probably not be happy with.

Salary Grades (Y-axis) and Performance Appraisal Rating (X-axis)

Correlation

We will be focusing our efforts on the Pearson Correlation Coefficient - a mathematical value that shows the strength of the linear (straight-line) relationship between two variables (Lind, Marchal, & Wathen, 2008). The math formula is a bit tedious, so we will not bother with it; but, if interested, you can ask Excel to display it (either with Help or the "Tell me what you want to do" box. With the latter, I typed "show help on Pearson Correlation" and then selected the "show help..." line, getting a description and the math formula.)

Pearson correlation ranges in value from -1.00 to +1.00. Any value outside this range indicates an error in the math or setup. A perfect negative correlation (-1.00) means that the data points all fit exactly on a line that runs from the upper left corner of a graph to the lower right - a negative slope. A perfect positive correlation (+1.00) has a line with a positive slope, running from the lower left to the upper right (Tanner & Youssef-Morgan, 2013). As the values move away from the perfect extremes, the data points move away from a line to a spread around the line. If we look at our first graph above, the overall Salary and Performance Rating relationship, we have a correlation of +0.15 - considered very low and not particularly impressive.

Pearson Correlation. Excel finds the Pearson Correlation Coefficient using either the fx function CORREL or the Data Analysis function Correlation. The former is used for a single pair of variables, while the latter can be used for one pair or for many variables at once. The Correlation output for the Performance Rating and Salary result is:

           Column 1   Column 2
Column 1   1
Column 2   0.151307   1

Note that the variable names are not included, and we have three correlations. Two will always show a perfect +1.00 correlation - column 1 with column 1, and column 2 with column 2; this diagonal convention makes more sense with the larger correlation table we will look at below. The third correlation is column 1 with column 2. It does not matter which variable is placed in column 1 or column 2, as switching the columns gives the same result.

We can use the Correlation function to identify correlations between multiple variables at the same time, much as Descriptive Statistics could work with multiple variables at once. In trying to identify what variables might be impacting Salary, we could generate the following table. Remember that Pearson's correlation requires at least interval-level data, so not all of our variables are used. In addition, since Salary and Compa-ratio are two measures of the same thing (pay), we do not want to include them in the same table.

           Sal      Mid      Age      Perf Rat  Service  Raise
Sal        1.000
Mid        0.989    1.000
Age        0.544    0.567    1.000
Perf Rat   0.151    0.192    0.139    1.000
Service    0.452    0.471    0.565    0.226     1.000
Raise     -0.041   -0.029   -0.180    0.674     0.103    1.000

To identify all of the correlations for a single variable, find its name in the left column, go across until you reach the 1.000 value, then go down. For Age, we find that the correlation with Sal = 0.544, Mid = 0.567, Age (itself) = 1.000, Perf Rat = 0.139, Service = 0.565, and Raise = -0.180. Side note: now we can see why the correlation of a variable with itself is shown in these tables - it provides the pivot point for reading the table outcomes. The values above this diagonal of 1.000 values would be identical to those below it, so they are omitted to make the table visually easier to read.
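The same table can be produced outside Excel. Below is a minimal Python sketch (again, the file and column names are stand-ins); note that pandas prints the full symmetric matrix rather than Excel's lower triangle, but the values match. The last two lines preview the coefficient of determination discussed next by squaring the Age-Salary correlation:

import pandas as pd

df = pd.read_csv("salary_data.csv")   # hypothetical file name
cols = ["salary", "midpoint", "age", "rating", "service", "raise"]
print(df[cols].corr().round(3))       # Pearson correlation by default

# Coefficient of determination for Age and Salary: square the correlation
r = df["salary"].corr(df["age"])
print(round(r ** 2, 2))               # 0.544 squared is about 0.30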
Coefficient of Determination

We will look at determining the statistical significance of correlations in Lecture 12 for this week. In the meantime, we can consider the coefficient of determination as a rough measure of usefulness (we will look at the effect size measure in Lecture 12 as well). The coefficient of determination is the square of the correlation and represents the percent of variation that the variables share in common - that is, the amount of variation in one variable's changes that is explained by the variation in the other variable. So, for Age and Salary, the coefficient equals 0.544^2 = 0.30 (rounded). As a rule of thumb, variable pairs with coefficients of determination less than (<) 70% are generally not very valuable for prediction purposes.

References

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical techniques in business & economics (13th ed.). Boston: McGraw-Hill Irwin.

Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.

Week 4 Lecture 11

Regression Analysis

Regression analysis is the development of an equation that shows the impact of the independent variables (the inputs we can generally control) on the output result. While the mathematical language may sound strange, most of you are quite familiar with regression-like instructions and use them quite regularly. To make a cake, we take 1 box mix and add 1.25 cups of water, 0.5 cup of oil, and 3 eggs; all of this is combined and cooked. The recipe is an example of a regression equation. The output (or result, or dependent variable) is the cake; the inputs (or independent variables) are the ingredients used. Each input is accompanied by a coefficient (AKA weight or amount) that tells us how "much" of the variable is "used" or weighted in the outcome. So, in equation format, this cake recipe might look like:

Y = 1*X1 + 1.25*X2 + 0.5*X3 + 3*X4, where:
Y = cake
X1 = box mix
X2 = cups of water
X3 = cups of oil
X4 = an egg

Of course, for the cake, the recipe needs to go through the cooking process; for other regression equations, the inputs need to go through whatever "process" turns them into the output - this is often called "life."

Example

With a regression analysis, we can identify what factors influence an outcome. So, with our salary issue, the natural question to help us answer our research question of whether males and females get equal pay for equal work would be: what factors influence or explain an individual's pay? This is a perfect question for a multivariate regression. Multivariate simply means we have multiple input variables with a single output variable (Lind, Marchal, & Wathen, 2008).

Variables. A regression analysis uses two distinct types of data. The first is variables that are at least interval level or better (the same as the other techniques we have used so far). The other is called a dummy variable - a variable that can be coded 0 or 1, indicating the presence or absence of some characteristic. In our data set, we have two variables that can be used as dummy-coded variables in a regression, Degree and Gender, both coded 0 or 1. In the case of Degree, 0 stands for having a bachelor's degree and 1 stands for having an advanced degree. For Gender, 0 means male and 1 means female. How these are interpreted in a regression output will be discussed below. For now, the significance of dummy coding is that it allows us to include nominal or ordinal data in our analysis.
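If your raw data arrive as text categories rather than 0/1 codes, creating the dummy variables is a one-line step. Here is a minimal sketch with made-up labels, matching the lecture's coding scheme (0 = male or bachelor's, 1 = female or advanced degree):

import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "F", "M"],
                   "degree": ["BS", "MS", "BS", "MS"]})
df["gender01"] = (df["gender"] == "F").astype(int)   # 0 = male, 1 = female
df["degree01"] = (df["degree"] == "MS").astype(int)  # 0 = bachelor's, 1 = advanced
print(df)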
Excel Approach

For our question of what factors influence pay, we will use Excel's Regression function, found in the Data Analysis section. This function produces two output tables of interest. The first table tests whether the entire regression equation is statistically significant - that is, whether the input variables significantly impact the output variable. If so, we then examine the second table - the coefficients used in the regression equation for each of the variables. We have a second set of hypothesis statements for each variable: the null is that the coefficient equals 0, versus the alternate that the coefficient is not equal to 0. Typically, we list these before we start the analysis.

Step 1: State the hypotheses.
For the regression equation:
Ho: The regression equation is not significant.
Ha: The regression equation is significant.
For the coefficients, if the regression equation is significant:
Ho: The regression coefficient equals 0.
Ha: The regression coefficient is not equal to 0.
Note: We would write one pair of statements for each variable; for space reasons, we include only one general statement that should be applied to each variable.

Step 2: Reject each null hypothesis claim if the related p-value is less than (<) alpha = .05.

Step 3: Test to use: Regression Analysis.

Step 4: Perform the test. Selecting the Regression option in Data Analysis will open a familiar data entry box. The Input Y Range is the salary range, including the label. The Input X Range is the labels and data for our input variables; in this case we will use Midpoint, Age, Performance Rating, Service, Raise, Degree, and Gender. Be sure to check the Labels box and pick an output range upper left corner. This produces the regression output (values rounded to three decimal places).

Step 5: Conclusions and interpretation. Let's look at each table separately.

The Regression Statistics table shows a Multiple R and an R Square value. Multiple R is the multiple correlation value; similar to our Pearson coefficient, it shows the relationship between the dependent (output, or Salary in this case) variable and all of the independent (input) variables. R Square is the multiple coefficient of determination; similar to the Pearson coefficient of determination, it displays the percent of variation in common between the dependent variable and all of the independent variables. The Adjusted R Square reduces R Square by a factor that involves the number of variables and the sample size - a suggestion of whether the design impacted the outcome more than the variables did; we have an insignificant reduction. The Standard Error is a measure of variation in the outcome, used for predictions. The Observations count shows the number of cases used in the regression.

The ANOVA table, sometimes called ANOR (analysis of regression), provides our test of significance. Similar to the ANOVA covered in Week 3, we look at the Significance F value (AKA the p-value) to see if we reject or fail to reject the null hypothesis of no significance. In this case, the p-value of 8.44E-36 (equaling 0.00000000000000000000000000000000000844) is less than .05, so we reject the null of no significance: the regression equation explains a significant proportion of the variation in our dependent variable of Salary. Now that we have a significant regression equation, we move on to the final table, which presents and tests the coefficients for each variable.
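For reference, the same regression can be run outside Excel. A minimal sketch using Python's statsmodels package follows (the file and column names are stand-ins); its summary output reports the same R Square, the ANOVA F-test with its p-value, and the per-coefficient t statistics and p-values that Excel's Regression tool produces:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("salary_data.csv")   # hypothetical file name
X = sm.add_constant(df[["midpoint", "age", "rating",
                        "service", "raise", "degree", "gender"]])
model = sm.OLS(df["salary"], X).fit()
print(model.summary())   # R Square, F-test, and coefficient t-tests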
One of the important features of a regression equation is that it shows us the impact of each factor with all other factors held constant. A regression has the form:

Y = A + B1*X1 + B2*X2 + B3*X3 + ...

where Y is the output, A is the intercept (which places the line up or down on the Y axis when all other values are 0), the B's are the coefficient values, and the X's are the variable names. Before considering whether each coefficient is statistically significant or not, our equation would be:

Salary = -4.009 + 1.22*Midpoint + 0.029*Age - 0.096*Perf Rat - 0.074*Service + 0.834*Raise + 1.002*Degree + 2.552*Gender

Whew! What does this mean? The intercept is an adjustment factor, one that we do not need to analyze. For Midpoint, it means that as midpoint goes up by a thousand dollars (remember, salary and midpoint are measured in thousands), salary goes up by 1.22 thousand - higher-graded employees are paid relatively more compared to midpoint than others (all other things equal). For Performance Rating, employees lose $96 (-0.096 thousand) for every additional rating point they have - certainly not what HR would like!

Now, let's look at our dummy variables, Degree and Gender. For Degree, an extra $1,002 is added for employees having a Degree code of 1 (if Degree = 0, then 1.002*0 = 0); so graduate degree holders get an extra $1,002 per year. The same thing applies to Gender: those coded 0 get nothing extra, and those coded 1 get $2,552 more per year (all other things equal). Since females are coded 1, if this factor is significant, they would be paid $2,552 more than males with all other factors equal (the definition of equal work).

So, now let's take a look at the statistical significance of each of the variables. This is determined with the P-value column (next to the t Stat value). It is read the same way as in the t-test and ANOVA tables: if the value is less than 0.05, we reject the null hypothesis of no significance. While the intercept has a significance value, we tend to ignore it and include the intercept in all equations. Among the other variables, the only significant ones are Midpoint, Perf Rating (unrounded, its p-value was 0.0497994...), and Gender. So, the regression equation including only our statistically significant factors is:

Salary = -4.009 + 1.22*Midpoint - 0.096*Perf Rat + 2.552*Gender

So, we now have a clear answer to our question about males and females getting equal pay for equal work. Not only is the answer no (as gender is a significant factor in determining salary), but females are paid $2,552 more annually, all other things equal! This is certainly not the outcome most of us expected when we began this journey. What we see is that variation within any measure has some often unanticipated outcomes, and unless we examine the inputs into our results, we often do not understand them very well. Single-measure tests such as the t and ANOVA tests are quite valuable for comparing similar results, but they do not always get to the root of what causes differences.
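Continuing the statsmodels sketch from above, the reduced equation can be pulled directly from the fitted model by keeping the intercept plus any coefficient whose p-value is below .05:

# Keep the intercept plus the statistically significant coefficients
alpha = 0.05
keep = (model.pvalues < alpha) | (model.pvalues.index == "const")
print(model.params[keep].round(3))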
Reference

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical techniques in business & economics (13th ed.). Boston: McGraw-Hill Irwin.

Week 4 Lecture 12

Significance

Earlier we discussed correlations without going into how we can identify statistically significant values. Our approach uses the t-test. Unfortunately, Excel does not automatically produce this form of the t-test, but setting it up within an Excel cell is fairly easy. And, with some slight algebra, we can determine the minimum value that is statistically significant for any table of correlations, all of which have the same number of pairs (for example, a correlation table for our data set would use 50 pairs of values, since we have 50 members in our sample).

The t-test formula for a correlation (r) is t = r*sqrt(n-2)/sqrt(1-r^2); the associated degrees of freedom are n-2 (number of pairs minus 2) (Lind, Marchal, & Wathen, 2008). For some this might look a bit off-putting, but remember that we can translate this into Excel cells and functions and have Excel do the arithmetic for us.

Excel Example

If we go back to our correlation table for Sal, Mid, Age, Perf Rat, Service, and Raise, using Excel to create the formula with cell references for our key values allows us to quickly create a result; the T.DIST.2T function then gives us a p-value easily.

The formula for finding the minimum correlation value that is statistically significant is r = sqrt(t^2/(t^2 + n - 2)). We find the appropriate t value using T.INV.2T(alpha, df) with alpha = 0.05 and df = n-2 = 48. Plugging these values in gives us a t value of 2.0106, or 2.011 (rounded). Putting 2.011 and 48 (n-2) into our formula gives us an r value of 0.278; therefore, in a correlation table based on 50 pairs, any correlation greater than or equal to 0.278 is statistically significant.

Technical Point. If you are interested in how we obtained the formula for determining the minimum r value, the approach is shown below. If you are not interested in the math, you can safely skip this paragraph.

Starting with: t = r*sqrt(n-2)/sqrt(1-r^2)
Multiplying both sides by sqrt(1-r^2) gives us: t*sqrt(1-r^2) = r*sqrt(n-2)
Squaring gives us: t^2*(1-r^2) = r^2*(n-2)
Multiplying out gives us: t^2 - t^2*r^2 = n*r^2 - 2*r^2
Adding t^2*r^2 to both sides gives us: t^2 = n*r^2 - 2*r^2 + t^2*r^2
Factoring gives us: t^2 = r^2*(n - 2 + t^2)
Dividing gives us: t^2/(n - 2 + t^2) = r^2
Taking the square root gives us: r = sqrt(t^2/(n - 2 + t^2))

Effect Size Measures

As we have discussed, there is a difference between statistical and practical significance. Virtually any statistic can become statistically significant if the sample is large enough. In practical terms, a correlation of .30 or below is generally considered too weak to be of any practical significance. Additionally, the effect size measure for Pearson's correlation is simply the absolute value of the correlation; the outcome has the same general interpretation as Cohen's d for the t-test (0.8 is strong and 0.2 is quite weak, for example) (Tanner & Youssef-Morgan, 2013).
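Both the t-test for a correlation and the minimum significant r above are just as easy to script as to build in an Excel cell. Here is a minimal sketch using Python's scipy package, with n = 50 pairs and the Age-Salary correlation of 0.544 as the example (stats.t.sf and stats.t.ppf play the roles of T.DIST.2T and T.INV.2T):

from math import sqrt
from scipy import stats

n = 50
r = 0.544                                  # the Age-Salary correlation from our table
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)     # t-test statistic for a correlation
p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-tailed p-value, like T.DIST.2T
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)         # like T.INV.2T(0.05, 48): 2.011
r_min = sqrt(t_crit ** 2 / (t_crit ** 2 + n - 2))    # minimum significant r: 0.278
print(round(t, 3), p, round(t_crit, 3), round(r_min, 3))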
Spearman's Rank Correlation

Another type of correlation is Spearman's rank order correlation. This correlation, which is interpreted the same way as Pearson's correlation, can be performed on ordinal or any ranked data. If the data used are ordinal (rankable), we use Spearman's rank order correlation, rho (Tanner & Youssef-Morgan, 2013). Using the same data, but assuming at least one variable is ordinal, gives us the results below. Note that in ranking from low to high, tied values are each given the average rank of the places they occupy; for example, in the table below the raise of 4.7 occurs twice (the 3rd and 4th places), so each gets a rank of 3.5.

Performance Rating   PR Rank   Raise   Raise Rank   Difference in rank   Difference squared
 55                   1         3.0     1             0                    0
 75                   2         3.6     2             0                    0
 80                   4         4.7     3.5           0.5                  0.25
100                   9         4.7     3.5           5.5                 30.25
100                   9         4.8     5             4                   16
 80                   4         4.9     6            -2                    4
 80                   4         5.6     7            -3                    9
100                   9         5.7     8             1                    1
 90                   6.5       5.8     9            -2.5                  6.25
 90                   6.5       6.0    10            -3.5                 12.25
                                                      Sum =               79

Spearman's rank order correlation: rho = 1 - 6*(sum of differences squared)/(n*(n^2 - 1)).

For this data, the sum of the squared differences = 79 and n = 10. This gives us rho = 1 - 6*79/(10*(10^2 - 1)) = 1 - 6*(79/990) = 1 - 6*0.0798 = 1 - 0.479 = 0.52. For comparison purposes, the Pearson correlation equals 0.686. Note that we have less information about the data when we use ranks, particularly with several ties in the data; this reduced information results in a lower correlation value with Spearman's. This correlation is tested and interpreted the same way as Pearson's coefficient (Lind, Marchal, & Wathen, 2008).
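As a check on the hand computation, here is a minimal Python sketch of the same ten pairs using scipy. One caution: with tied ranks, scipy's spearmanr applies a tie correction and returns a value slightly different from the 1 - 6*sum(d^2) shortcut used above, so both are shown:

from scipy import stats

rating = [55, 75, 80, 100, 100, 80, 80, 100, 90, 90]
raises = [3, 3.6, 4.7, 4.7, 4.8, 4.9, 5.6, 5.7, 5.8, 6]

rank_x = stats.rankdata(rating)        # tied values get the average rank
rank_y = stats.rankdata(raises)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
n = len(rating)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rho, 3))               # 79.0 and 0.521, matching the table
print(stats.spearmanr(rating, raises)) # tie-corrected rho, about 0.51
print(stats.pearsonr(rating, raises))  # Pearson for comparison, about 0.686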
References

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical techniques in business & economics (13th ed.). Boston: McGraw-Hill Irwin.

Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for managers. San Diego, CA: Bridgepoint Education.