Respond to question 1 (a, b, c, d):
c. Testing environment
d. Test administrator
Here is the information from the book:
Objective 2: Identify those factors that influence reliability and objectivity for norm-referenced test scores.

1. Acceptable reliability is essential for all measurements. Many factors affect the reliability of a measurement. Identify the conditions under which the highest reliability can be expected for the following factors: a. People tested; b. Test length.

…into two groups and each group is tested by a different person. If a high degree of objectivity is lacking because the two scorers of the test use different administrative procedures or scoring standards, a person's score is dependent on the identity of the scorer. If one scorer is more lenient than the other, the people tested by that scorer have an advantage. A high degree of objectivity is also needed when one person scores on several occasions. For example, a scorer may measure one-third of a group on each of 3 days, or the entire group at the beginning and end of a teaching or training unit. In the first case, it is essential that the same administrative procedures and scoring standards be used each day. This is true in the second case as well, where any variation in a person's scores should represent changed performance, not changed procedures or standards.

Some authors discuss objectivity as being intrajudge or interjudge. Intrajudge objectivity is the degree of agreement between scores assigned to each person by one judge when viewing each person's performance on two different occasions. For example, if people were videotaped as they performed a physical activity or test, a judge could view the videotape and score each person on each of two different days. Interjudge objectivity is the degree of agreement between the scores assigned each person by two or more judges. Objectivity as defined and discussed in this chapter is interjudge. Intrajudge objectivity is vital to interjudge objectivity, but high intrajudge objectivity does not guarantee high interjudge objectivity. In most measurement situations, high intrajudge objectivity is assumed. Usually intrajudge objectivity will be higher than interjudge objectivity.

Estimation

To estimate the degree of objectivity for a physical performance test score in a pilot study, two or more judges score each person as he or she is tested. Then we calculate an intraclass correlation coefficient on the basis of the judges' scores of each person. Probably in the actual measurement program or research study a single judge will be used; thus the formula for estimating the reliability of a criterion score that is a single score must be used.

To calculate the objectivity coefficient, we think of the judges as trials, inserting their individual scores into the trial terms of our reliability formulas. If all judges are supposed to be using the same standards, we could consider a difference among judges to be measurement error and would estimate objectivity using the one-way ANOVA (Formula 4.2). If all judges are not expected to use the same standards, we would estimate objectivity using the two-way ANOVA (Formula 4.4). Because objectivity is a type of reliability (rater reliability), information presented earlier in this chapter (sample size for R, acceptable reliability, and so on) generally applies when calculating and interpreting objectivity coefficients.
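The excerpt cites Formulas 4.2 and 4.4 without reproducing them, so the following is only a hedged sketch: it computes a one-way ANOVA intraclass correlation for a single judge's score from a people-by-judges matrix, treating judges as trials. The exact algebraic form of the book's formulas may differ, and the function name and ratings data are invented for illustration (Python is used here; the book itself refers only to SPSS).

```python
import numpy as np

def objectivity_one_way(scores):
    """Rough one-way ANOVA intraclass correlation for a single score.

    `scores` is an n_people x n_judges array; judges are treated as
    trials, and judge-to-judge differences count as measurement error.
    This follows the common ICC(1,1) form, which may not match the
    book's Formula 4.2 exactly.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()
    person_means = scores.mean(axis=1)

    ss_between = k * ((person_means - grand_mean) ** 2).sum()
    ss_within = ((scores - person_means[:, None]) ** 2).sum()

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical pilot-study data: five people each scored by three judges.
ratings = [[7, 8, 7],
           [5, 5, 6],
           [9, 9, 8],
           [4, 5, 4],
           [6, 7, 7]]
print(round(objectivity_one_way(ratings), 2))
```

If the judges are not expected to share the same standards, a two-way (judge-adjusted) intraclass correlation would be used instead, in the spirit of the book's Formula 4.4.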
RELIABILITY OF CRITERION-REFERENCED TEST SCORES

Based on a criterion-referenced standard, a person is classified as either proficient or nonproficient, either pass or fail. For example, in a Red Cross certification course a person either meets the minimum requirements and is certified or does not meet the minimum requirements and is not certified. In another example, the criterion-referenced standard for an adult fitness course could be the ability to jog continuously for 30 minutes. Based on this standard, a person is classified as either proficient or nonproficient. A very nice characteristic of a criterion-referenced standard is that there is no predetermined quota as to how many people are classified as proficient.

Criterion-referenced reliability is defined differently from norm-referenced reliability. In the criterion-referenced case, reliability is defined as consistency of classification. Thus, if a criterion-referenced test score is reliable, a person will be classified the same on each of two occasions. This could be trial-to-trial within a day or day-to-day. To estimate the reliability of a criterion-referenced test score, create a double-classification table, as presented in Table 4.10. In the A-box is the number of people who passed on both days, and in the D-box is the number of people who failed on both days. Notice that the B-box and C-box are the numbers of people who were not classified the same on both occasions. Obviously, larger numbers in the A-box and D-box and smaller numbers in the B-box and C-box are desirable, because reliability is consistency of classification on both occasions. All packages of statistical computer programs have a cross-tabulation program that will provide the double-classification table (see Table 4.10) needed to estimate the reliability of a criterion-referenced test.

TABLE 4.10 Estimating Reliability of a Criterion-Referenced Test

                      Day 2
                  Pass    Fail
Day 1    Pass       A       B
         Fail       C       D

TABLE 4.11 Data for Determining the Reliability of a Criterion-Referenced Test

                      Trial 2
                  Pass    Fail
Trial 1  Pass      84      21
         Fail       5      40

The most popular way to estimate reliability from this double-classification table is to calculate the proportion of agreement coefficient (P), where

P = (A + D) / (A + B + C + D)    (4.8)

Problem 4.7
Determine the proportion of agreement (P) for the criterion-referenced test scores using the data in Table 4.11.

Solution
Where the sum of the A-box and D-box is 124 and the sum of the four boxes is 150, P = 0.83:

P = (84 + 40) / (84 + 21 + 5 + 40) = 124 / 150 = 0.827 ≈ 0.83
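To make the arithmetic in Problem 4.7 easy to re-run, here is a small sketch in Python (the book itself only mentions SPSS); the function name is ours, not the book's.

```python
def proportion_of_agreement(a, b, c, d):
    """Proportion of agreement, P = (A + D) / (A + B + C + D),
    for a 2 x 2 pass/fail double-classification table (Formula 4.8)."""
    return (a + d) / (a + b + c + d)

# Cells from Table 4.11: A = 84, B = 21, C = 5, D = 40.
print(round(proportion_of_agreement(84, 21, 5, 40), 2))  # 0.83
```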
The proportion of agreement (P) does not allow for the fact that some same classifications on both occasions may have happened totally by chance. The kappa coefficient (k) corrects for chance agreements. The formula for kappa is

k = (Pa - Pc) / (1 - Pc)    (4.9)

where Pa is the proportion of agreement and Pc is the proportion of agreement expected by chance,

Pc = [(A + B)(A + C) + (C + D)(B + D)] / (A + B + C + D)²

in a 2 × 2 double-classification table.

Problem 4.8
Determine the kappa coefficient (k) for the criterion-referenced test scores using the data in Table 4.11.

Solution
To solve for k, we use a three-step procedure:

Step 1. Calculate Pa:
Pa = (A + D) / (A + B + C + D) = (84 + 40) / (84 + 21 + 5 + 40) = 124 / 150 = 0.827 ≈ 0.83

Step 2. Calculate Pc:
Pc = [(A + B)(A + C) + (C + D)(B + D)] / (A + B + C + D)²
   = [(105)(89) + (45)(61)] / 150² = (9,345 + 2,745) / 22,500 = 12,090 / 22,500 = 0.537 ≈ 0.54

Step 3. Calculate kappa:
k = (Pa - Pc) / (1 - Pc) = (0.83 - 0.54) / (1 - 0.54) = 0.29 / 0.46 = 0.63

Notice that for the data in Table 4.11 there is a definite difference between the proportion of agreement (P) and the kappa coefficient (k). A more extensive discussion of both coefficients is presented in statistics books with a chapter on measurement theory.

Because P can be affected by classifications by chance, values of P less than 0.50 are interpreted as unacceptable. Thus, values of P need to be closer to 1.0 than to 0.50 to be quite acceptable. Values of k also should be closer to 1.0 than to 0.0 to be quite acceptable.
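A similar sketch for Problem 4.8, again in Python with an invented function name:

```python
def kappa(a, b, c, d):
    """Kappa coefficient, k = (Pa - Pc) / (1 - Pc), for a 2 x 2
    double-classification table (Formula 4.9)."""
    n = a + b + c + d
    pa = (a + d) / n                                      # observed agreement
    pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (pa - pc) / (1 - pc)

# Cells from Table 4.11: A = 84, B = 21, C = 5, D = 40.
print(round(kappa(84, 21, 5, 40), 2))  # about 0.63
```

The unrounded value is about 0.625; the book's 0.63 comes from rounding Pa to 0.83 and Pc to 0.54 before dividing.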
RELIABILITY OF DIFFERENCE SCORES

In the first edition of Baumgartner and Jackson (1975) and the next two editions of the book, difference scores were discussed. Difference scores are sometimes called change scores or improvement scores. Many people are still using difference scores, and Kane and Lazarus (1999) have presented an improved technique for calculating difference scores. So, difference scores are briefly discussed here.

If people start an instructional program or training program with markedly different scores, evaluating people in terms of their scores at the end of the program puts those who start the program with low scores at a disadvantage to people who start the program with high scores. So, difference scores are sometimes calculated to determine the degree to which each person's score has changed over time, from the beginning to the end of an instructional program or training program. A difference score can be quickly and easily calculated as

Difference score = Final score - Initial score

where the initial score is at the beginning and the final score is at the end of an instructional or training program.

Two real problems are posed by the use of difference scores for purposes of comparison among individuals, comparison among groups, and development of performance standards. First, individuals who perform well initially do not have the same opportunity to achieve large difference scores as individuals who begin poorly. For example, the person who initially runs a 6-minute mile has less opportunity for improvement than the person who initially runs a 9-minute mile. Second, difference scores are tremendously unreliable. The formula for estimating the reliability of difference scores (Y - X) is as follows:

R_d = (R_x·s_x² + R_y·s_y² - 2·r_xy·s_x·s_y) / (s_x² + s_y² - 2·r_xy·s_x·s_y)    (4.10)

where
X = initial score
Y = final score
R_x and R_y = reliability coefficients for tests X and Y
r_xy = correlation between tests X and Y
R_d = reliability of the difference scores
s_x and s_y = standard deviations for the initial and final scores

The following is an example of estimating the reliability of difference scores. A test with a reliability of 0.90 was administered at the beginning and end of a training program. As is usually the case, the reliability of the test did not change between administrations. The correlation between the two sets of scores was 0.75. The standard deviation was 2.5 for the initial scores and 2.0 for the final scores. Thus, the reliability of the difference scores is:

R_d = [0.90(2.5)² + 0.90(2.0)² - 2(0.75)(2.5)(2.0)] / [(2.5)² + (2.0)² - 2(0.75)(2.5)(2.0)] = 0.63

It is apparent from the formula that highly reliable measures between which the correlation is low are necessary if the difference scores are going to be reliable. Because difference scores are usually calculated from scores of a test administered at the beginning and end of an instructional or training program, and the correlation between these two sets of scores is normally 0.70 or greater, highly reliable difference scores appear impossible.
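A brief sketch that plugs the example's numbers into Formula 4.10 (Python, with an invented function name):

```python
def diff_score_reliability(r_x, r_y, s_x, s_y, r_xy):
    """Reliability of difference scores (Formula 4.10):
    R_d = (R_x*s_x^2 + R_y*s_y^2 - 2*r_xy*s_x*s_y)
          / (s_x^2 + s_y^2 - 2*r_xy*s_x*s_y)."""
    numerator = r_x * s_x**2 + r_y * s_y**2 - 2 * r_xy * s_x * s_y
    denominator = s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y
    return numerator / denominator

# Example from the text: both reliabilities 0.90, s_x = 2.5, s_y = 2.0, r_xy = 0.75.
print(round(diff_score_reliability(0.90, 0.90, 2.5, 2.0, 0.75), 2))  # 0.63
```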
Other techniques for calculating difference scores have been presented. Kane and Lazarus (1999) discuss the problems in using these techniques and present a difference score technique that does not have the problems. With the Kane and Lazarus technique:

1. Each person has an initial (I) and a final (F) score.
2. A change score (C) is calculated for each person, where C = F - I.
3. Using simple prediction, which is available in the SPSS package of computer programs, a change score (C') is predicted for each person with the independent or predictor variable being the initial score (I), so that C' = (b)(I) + a.
4. A difference score (D) is calculated for each person, where D = C - C'.

A positive difference score is better than a negative difference score, because the change score (C) is greater than was predicted.
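The Kane and Lazarus steps amount to regressing the change scores on the initial scores and keeping each person's residual. The book performs the prediction in SPSS; the sketch below is a rough Python equivalent using made-up initial and final scores, just to show the mechanics.

```python
import numpy as np

# Hypothetical initial (I) and final (F) scores for six people.
initial = np.array([10.0, 12.0, 15.0, 9.0, 14.0, 11.0])
final = np.array([14.0, 15.0, 16.0, 13.0, 16.0, 14.0])

# Step 2: change score C = F - I.
change = final - initial

# Step 3: predict the change score from the initial score (C' = b*I + a)
# with a simple linear regression.
b, a = np.polyfit(initial, change, 1)
predicted_change = b * initial + a

# Step 4: difference score D = C - C'; a positive D means the person
# improved more than predicted from his or her starting point.
difference = change - predicted_change
print(np.round(difference, 2))
```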
SUMMARY

Three characteristics are essential to good measurement: reliability, objectivity, and validity. Reliability and objectivity were the focus of this chapter. A measurement has reliability when it is consistently the same for each person over a short period of time. Two types of reliability are applicable to norm-referenced tests: stability and internal consistency.

Objectivity, the second vital characteristic of a sound measurement, is the degree to which different judges agree in their scoring of each individual in a group. A fair test is one in which qualified judges rate individuals similarly and/or offer the same conditions of testing to all individuals equally.

Reliability of criterion-referenced test scores was also discussed. The definition of reliability and the techniques for estimating reliability differ from those for norm-referenced test scores.