Questions and Answers of Measurement Theory in Action
3. Are there alternatives to developing a new test?
4. What content will the test cover?
5. What is the test’s intended dimensionality?
6. What is the ideal format for the test?
7. Who will the test be administered to?
8. What type of responses will be required of test takers?
9. Should more than one form of the test be developed?
10. Who will administer the test?
11. What special training will be required of test users for administering or interpreting the test?
12. How long will the test take to administer?
13. How many items will compose the test?
14. What is the intended difficulty level of the test?
15. How will meaning be attributed to scores on the test?
16. What benefits will result from use of the test?
17. What potential harm could result from use of the test?
EXERCISE 4.3: COMPARING TWO MEASURES OF THE SAME CONSTRUCT OBJECTIVE: To illustrate the potential impact that test specifications can have on the development of measures of the same construct. For
1. Did both test developers define the construct in the same way? (Be sure to review Step 2 in the module overview before answering this question.) If not, identify the differences in the definitions
2. Did each measure use open-ended items, closed-ended items, or both?
3. Which item formats were used in each measure of the construct?
4. Is each measure intended to be individually administered, or can it be administered in a group setting?
5. Are the measures of similar length?
6. What is the intended population of each test? Does the difficulty of the items appear appropriate for this population?
7. Based on your responses to questions 1–6, do you feel one of the two measures might be a better measure of the construct? Explain whether you believe each test developer’s decisions regarding
1. CTT is a highly useful, though limited, framework for understanding test reliability. CTT posits that an individual's observed score on a test is a combination of the individual's true score and random measurement error (a simulation sketch of this decomposition follows this list).
2. The desired magnitude of a reliability estimate depends upon the purpose of the testing. For research uses, a reliability estimate of .70 may be considered acceptable. For some applied purposes, such as high-stakes decisions about individuals, substantially higher reliability (e.g., .90 or above) is typically expected.
3. Longer tests are generally more reliable than shorter tests. The Spearman-Brown formula can be used to estimate the reliability of a test that is increased or decreased in length by a specific factor.
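The decomposition in point 1 can be made concrete with a small simulation. The sketch below is not from the textbook; the variances are arbitrary assumptions, chosen only to show that when errors are random and uncorrelated with true scores, reliability works out to the ratio of true-score variance to observed-score variance.

```python
# Minimal sketch of the CTT decomposition X = T + E; the simulated
# variances are illustrative assumptions, not textbook values.
import numpy as np

rng = np.random.default_rng(0)
n_people = 10_000

true_scores = rng.normal(loc=50, scale=10, size=n_people)   # T
errors = rng.normal(loc=0, scale=5, size=n_people)           # E: mean zero, uncorrelated with T
observed = true_scores + errors                               # X = T + E

# Under CTT, reliability is the proportion of observed-score variance
# that is true-score variance: r_xx = var(T) / var(X).
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))   # ~ 10**2 / (10**2 + 5**2) = 0.80
```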
2. In your own words, explain the concept of a true score.
3. What are some of the major assumptions of CTT?
4. What are some of the limitations of CTT?
5. How is reliability defined in terms of CTT?
9. How much can we shorten an existing measure and still maintain adequate reliability? (See Case Study 5.2.)
1. Caleb was unable to detect any pattern in the content of the multiple-choice items that he had gotten wrong. Does this challenge the reliability of the test in any way?
2. Explain why Caleb’s performance on the essay items might provide some small evidence of reliability for the exam.
3. Caleb appears to have a good understanding of the concept of a true score. How would you define error according to CTT?
4. Explain why, according to CTT, a student’s true score might differ depending upon which items among several alternatives the student answered.
5. How convincing is Dr. Zavala’s argument for not allowing students to choose among several alternative items? Explain your opinion.
1. Do you think an r_XX' = .74 for the optimism scale would be acceptable or unacceptable for the purpose described above? Explain.
2. Should Sheila have randomly selected which items to keep and which to delete? What other options did she have?
3. How else might Sheila maintain her reliability levels yet still maintain (or increase) the number of usable responses she obtains?
4. Why do you think Sheila is using .80 as her lower acceptable bound for reliability?
EXERCISE 5.1: IDENTIFYING CTT COMPONENTS OBJECTIVE: Correctly identify each component of CTT. Mark each of the following as observed score (X), true score (T), or error (E)
1. During a timed, 2-hour exam, both of Celia’s mechanical pencils ran out of lead, causing her temporary distress and a loss of several minutes of precious time.
2. After completing an online measure, Shana was informed that she was in the 73rd percentile of extraversion.
3. Despite not knowing the content, Tomás provided totally lucky guesses to three out of five multiple-choice questions on a recent Business Ethics quiz.
4. After dedicating his life to science, Jerry repeatedly took the same 20-item IQ test every month for 15 years. A researcher then averaged Jerry’s IQ scores to derive an overall score.
5. Jeff gloated that the score on his Early Elementary Education final exam was five points higher than Stephanie’s score.
EXERCISE 5.2: EXAMINING THE EFFECTS OF THE SPEARMAN-BROWN PROPHECY FORMULA OBJECTIVE: To practice using the Spearman-Brown prophecy formula for estimating reliability levels. Using the Spearman-Brown
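Exercise 5.2 and summary point 3 above both rely on the Spearman-Brown prophecy formula, r_new = k·r_old / (1 + (k − 1)·r_old), where k is the factor by which the test length changes. A minimal sketch follows; the function name and the example reliabilities are my own.

```python
# Hedged sketch of the Spearman-Brown prophecy formula; the example
# values are hypothetical and the function name is my own.
def spearman_brown(r_old: float, k: float) -> float:
    """Estimate the reliability of a test whose length is changed by factor k.

    r_old : reliability of the current test
    k     : new length / old length (k > 1 lengthens, k < 1 shortens)
    """
    return (k * r_old) / (1 + (k - 1) * r_old)

# Doubling a test with reliability .70:
print(round(spearman_brown(0.70, 2.0), 2))   # ~0.82

# Cutting the same test in half:
print(round(spearman_brown(0.70, 0.5), 2))   # ~0.54
```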
1. Different types of reliability emphasize different sources of measurement error. Consider the most relevant sources of error when choosing which type(s) of reliability should be reported.
2. Report inter-rater reliability when considering consistency across two raters, but report inter-rater agreement when the focus is on the degree to which the raters reported the same exact score.
3. Recognize the difference between a reliability coefficient and the standard error of measurement. A reliability coefficient is the correlation between the scores of test takers on two independent administrations of the test, whereas the standard error of measurement expresses, in raw-score units, how much an individual's observed scores are expected to vary around that person's true score. (A computational sketch follows this list.)
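The contrast in point 3 can be illustrated numerically. The sketch below assumes a hypothetical observed-score standard deviation and reliability, and uses the standard relationship SEM = SD·sqrt(1 − r_xx).

```python
# Sketch contrasting a reliability coefficient with the standard error of
# measurement (SEM); the SD and reliability values are hypothetical.
import math

sd_observed = 15.0   # standard deviation of observed scores (raw-score units)
r_xx = 0.90          # reliability coefficient (a correlation, unitless)

# SEM is expressed in raw-score units and describes the expected spread of
# an individual's observed scores around his or her true score.
sem = sd_observed * math.sqrt(1 - r_xx)
print(round(sem, 1))   # ~4.7

# An approximate 95% band around an observed score of 100:
print(100 - 1.96 * sem, 100 + 1.96 * sem)
```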
1. What are the different sources of error that can be assessed with classical test theory reliability analysis?
2. Which sources of error are of primary concern in test-retest reliability? In parallel forms and internal consistency reliability? In inter-rater reliability?
3. Which sources of error tend to decrease the reliability of a measure? Which source of error tends to lead to an overestimate of the reliability of a measure?
4. How is Cohen's kappa different from the other forms of reliability? (A short kappa sketch follows these questions.)
5. Why are some authors (e.g., Cortina, 1993; Schmitt, 1996) cautious about the interpretation of coefficient alpha?
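Question 4 turns on the fact that Cohen's kappa is an index of agreement corrected for chance, not a correlation. A hedged sketch follows; the two raters' category assignments are made up for illustration.

```python
# Hedged sketch of Cohen's kappa for two raters assigning categories;
# the ratings below are hypothetical.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater_a)
observed_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: sum over categories of the product of each rater's
# marginal proportions.
pa, pb = Counter(rater_a), Counter(rater_b)
chance_agreement = sum((pa[c] / n) * (pb[c] / n) for c in set(rater_a) | set(rater_b))

kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(round(kappa, 2))
```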
1. In terms of Table 6.1, what type of reliability coefficient did Chad estimate? What source of error is being estimated?
2. Did Chad make the right interpretation of his negative reliability estimate? What else might cause a negative reliability estimate?
3. In practice, how does one know which items to recode and which to keep the same?
4. Both positively and negatively worded items are frequently included on tests. Assuming you recode the negatively worded items before you run your reliability analysis, will the inclusion of both item types affect your reliability estimate? (A reverse-scoring sketch follows these questions.)
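The case study above hinges on recoding negatively worded items before running a reliability analysis. A minimal sketch of the usual recode (new score = scale maximum + scale minimum − old score) follows, assuming a hypothetical 5-point response scale.

```python
# Hedged sketch of reverse-scoring negatively worded items before a
# reliability analysis; the 5-point scale and responses are hypothetical.
scale_min, scale_max = 1, 5

def reverse_score(response: int) -> int:
    """Recode a negatively worded item so that high scores mean the same
    thing on every item: 1 -> 5, 2 -> 4, ..., 5 -> 1."""
    return scale_max + scale_min - response

negatively_worded_responses = [1, 2, 5, 4, 3]
print([reverse_score(r) for r in negatively_worded_responses])   # [5, 4, 1, 2, 3]
```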
1. Why is coefficient alpha such a commonly reported reliability estimate in psychology and education?
2. Provide additional examples of instances in which each type of reliability (test-retest, parallel forms, internal consistency, and inter-rater reliability) might be used.
3. In judging the reliability of judges' ratings of a student research competition, would we be satisfied with inter-rater reliability as computed by a correlation coefficient, or would computation of an index of inter-rater agreement be more appropriate?
4. Are there times when we might be interested in obtaining more than one type of reliability? Explain by providing an example.
1. Perform alpha, split-half, and parallel forms reliability analyses for each of the five scales. How do the three different types of reliability compare for each scale listed above? Is one form of reliability estimation preferable for these scales? (A coefficient alpha sketch follows these questions.)
2. Using alpha reliability, with item and scale information, what items should be included in the final versions of each scale in order to maximize the alpha reliability for that scale? (Note: You
3. For the life satisfaction and depression scales, determine if the alpha reliabilities are different for men and women (SEX). If yes, do you have any guesses why? (Note: This requires using the
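The exercise above is normally run in a statistics package, but the computation behind coefficient alpha is simple enough to sketch directly. The data matrix below is made up and the scale is hypothetical; the point is only the formula alpha = (k/(k − 1))·(1 − Σ item variances / variance of total scores).

```python
# Hedged sketch of coefficient (Cronbach's) alpha from a respondents-by-items
# matrix; the data are made up and the variable names are my own.
import numpy as np

# rows = respondents, columns = items (e.g., a hypothetical 4-item scale)
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])

k = responses.shape[1]
item_variances = responses.var(axis=0, ddof=1)
total_scores = responses.sum(axis=1)

# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_scores.var(ddof=1))
print(round(alpha, 2))
```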
1. Develop a clear, complete definition of the content domain prior to test administration.
2. Select a diverse sample of SMEs who have undeniable knowledge of the content domain.
1. Validity is a unified construct. In what ways does content validity provide validity evidence?
3. The content approach to test validation relies heavily on expert judgment. Discuss the degree to which you feel it is appropriate to rely on judgment to provide evidence of validity.
4. Would content validity alone provide sufficient evidence of validity for (a) an employment exam, (b) an extraversion inventory, and (c) a test to determine the need for major surgery? In each case,
8. Consider a test or inventory of your choosing. If you wanted to examine the content validity of this measure, how would you go about choosing experts to provide judgments?
10. Imagine the case in which 14 SMEs were asked to provide CVR ratings for a five-item test. Compute the CVR for each of the items based on the ratings shown in Table 7.1.
12. What is the CVI for the five-item test in question 10 prior to deletion of any items due to low CVR? (A CVR/CVI computation sketch follows these questions.)
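Questions 10 and 12 use Lawshe's content validity ratio, CVR = (n_e − N/2)/(N/2), where n_e is the number of SMEs rating an item "essential" and N is the total number of SMEs, and the CVI as the mean CVR across items. Table 7.1 is not reproduced here, so the counts in the sketch below are hypothetical.

```python
# Hedged sketch of Lawshe's content validity ratio (CVR) and the content
# validity index (CVI); the "essential" counts below are hypothetical.
def cvr(n_essential: int, n_raters: int) -> float:
    """CVR = (n_e - N/2) / (N/2)."""
    return (n_essential - n_raters / 2) / (n_raters / 2)

n_raters = 14
essential_counts = [14, 11, 9, 7, 12]   # hypothetical counts for five items

cvrs = [cvr(n, n_raters) for n in essential_counts]
print([round(v, 2) for v in cvrs])       # [1.0, 0.57, 0.29, 0.0, 0.71]

# CVI is the mean CVR across the items (here, all five, before any deletion).
print(round(sum(cvrs) / len(cvrs), 2))   # ~0.51
```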
1. Are Juanita's efforts sufficient to provide evidence of content validity? Explain.
2. To what degree does Juanita’s purpose for the test influence your response to question 1?
3. What additional sources might Juanita seek to help define the trait of loneliness?
4. Is Juanita’s choice of individuals to serve as SMEs appropriate? Explain.
5. Has Juanita used an appropriate number of SMEs? Explain.
6. How might Juanita identify other SMEs who would be useful in the content validation of her scale?
7. Is it possible that 33 items could capture the complexity of a construct such as the trait of loneliness?
1. Would seven SMEs serve as a sufficient number of expert raters to provide adequate evidence of content validity for this employment selection test? Why or why not?
2. Do the criteria Lester used for inclusion of an item seem appropriate? Defend your response.
3. Why would Lester be happy with a mean CVR rating of .78?
4. What other validation strategies might Lester have employed? What additional information would be needed to adopt a different validation strategy?
EXERCISE 7.1: IDENTIFYING SMEs OBJECTIVE: To gain practice identifying appropriate samples to provide content validation ratings. For each of the following tests, identify two different samples of
EXERCISE 7.2: ENSURING REPRESENTATIVE ASSESSMENT OF TEST DIMENSIONS OBJECTIVE: To consider the relative importance of various dimensions of a test. Given the limited number of items that can be
EXERCISE 7.3: DETERMINING THE CVI OF A MEASURE OF UNDERGRADUATE ACADEMIC WORK ETHIC OBJECTIVE: To gain experience obtaining and computing content validity ratings. INSTRUCTIONS: Below you will find a
1. Compute the CVR for each item on the scale.
2. Compute the CVI for the entire set of 20 items that make up the initial scale.
3. Based on statistical significance, Lawshe (1975) recommended that with 10 raters the CVR should be at least .62 to retain an item. Which items would be deleted using this criterion?
4. If the items identified in the preceding item were deleted, what would be the CVI of the remaining items? (A sketch applying Lawshe's criterion follows this exercise.)
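A short sketch of applying Lawshe's (1975) .62 cutoff and recomputing the CVI follows. Because the 20-item rating table is not reproduced here, the CVR values below are placeholders.

```python
# Hedged sketch of applying Lawshe's (1975) .62 cutoff for 10 raters and
# recomputing the CVI; the CVR values are placeholders, not real ratings.
item_cvrs = [0.8, 0.6, 1.0, 0.4, 0.62, 0.2, 0.8, 1.0, 0.6, 0.62]  # hypothetical

cvi_before = sum(item_cvrs) / len(item_cvrs)
retained = [v for v in item_cvrs if v >= 0.62]      # keep items with CVR >= .62
cvi_after = sum(retained) / len(retained)

print(round(cvi_before, 2), round(cvi_after, 2))
```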
1. When conducting criterion-related validation, choose the most relevant criterion, not necessarily the most easily measured one.
2. Attempt criterion-related validation only if the sample is sufficiently large (perhaps 200 or more) to provide a stable validity estimate.
3. Correct for statistical artifacts (such as criterion unreliability or range restriction) that may lower the estimated criterion-related validity. (A correction-for-attenuation sketch follows this list.)
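Point 3 refers to the usual statistical corrections. One common example is the correction for attenuation due to an unreliable criterion, r_corrected = r_xy / sqrt(r_yy); the sketch below uses hypothetical values.

```python
# Hedged sketch of one common artifact correction: attenuation due to an
# unreliable criterion. The validity and reliability values are hypothetical.
import math

r_xy = 0.30    # observed test-criterion correlation
r_yy = 0.60    # reliability of the criterion (e.g., supervisor ratings)

# Corrected validity estimates what the correlation would be if the
# criterion were measured without error.
r_corrected = r_xy / math.sqrt(r_yy)
print(round(r_corrected, 2))   # ~0.39
```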
1. What research design did Cecilia use to conduct her criterion-related validation study? What was the major determinant of this decision?
2. Why might Cecilia have preferred another research design for her criterion-related validation study?
3. Would Cecilia have to be concerned with criterion contamination in conducting this validation study? Explain.
4. Identify three alternate criteria that Cecilia might have used to assess job performance, rather than supervisor ratings. What concerns do you have with each possible criterion?
5. Given the sample used to validate the proposed selection tests, which correction formulas would be most important to use?