PSYC 421 ITEM DEVELOPMENT AND ANALYSIS WORKSHEET

Student Name:
Section: PSYC421-

PART 1: Writing Multiple Choice Test Items (Cohen et al., 2013, pg. 252)

Develop one multiple choice question that covers content from each of the four chapters listed below. When writing your sample questions, please keep in mind the specifications regarding item construction discussed in the textbook. Also, remember the importance of carefully crafted distractor options. Finally, please limit the number of response options to 4 (1 correct response and 3 distractors), and avoid options such as "all of the above," "none of the above," or the like. Be sure to indicate which of the response options is the correct one.

Chapter 3 Multiple Choice Question (2.5 points)
Chapter 4 Multiple Choice Question (2.5 points)
Chapter 5 Multiple Choice Question (2.5 points)
Chapter 6 Multiple Choice Question (2.5 points)

PART 2: Item Analysis: Item Difficulty Index (Cohen et al., 2013, pg. 263)

A test is only as good as its questions! When researchers, test constructors, and educators create items for ability or achievement tests, we have a responsibility to evaluate the items and make sure that they are useful and high quality. The process that we use to evaluate test items is known as Item Analysis. When bad items are identified and eliminated from a test, that increases the efficiency, reliability, and validity of the entire test! One way that we can distinguish between good and bad items is with the Item Difficulty Index.

Part 2A: Calculating Item Difficulty

Using the data below, calculate the Item Difficulty Index for the first 6 items on Quiz 1 from a recent section of PSYC101. For each item, "1" means the item was answered correctly and "0" means it was answered incorrectly. Type your answers in the spaces provided at the bottom of the table. (1 pt. each)

PSYC101 Quiz 1 Item Distribution and Total Scores

Examinee     Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Total Score
Andre           1       1       1       1       1       1        16
Allison         1       1       1       1       0       0         7
Heather         1       1       1       1       0       0        10
Corey           1       1       0       1       1       1        17
Christina       0       0       1       1       0       1         3
Jeffrey         0       1       1       1       0       0        11
Shawn           1       1       1       1       0       1        14
Dana            0       0       1       1       0       1        10
Megan           1       1       1       1       0       1        13
David           0       1       1       1       0       1        12
Isabel          0       0       0       1       0       0         4
Lance           1       1       1       1       0       0         9
Aliyah          1       1       1       1       0       1        15
Blaire          0       1       1       1       0       1        12
Gabriel         0       0       1       1       0       0         6
Item Difficulty ___     ___     ___     ___     ___     ___

Part 2B: Calculating Optimal Item Difficulty (.5 pt. each)

1. For a test item with two response options (e.g., true/false), what is the probability of selecting the correct answer by chance? ____%
2. Calculate the optimum level of difficulty for a test question with two response options.
3. For a test item with three response options, what is the probability of selecting the correct answer by chance? ____%
4. Calculate the optimum level of difficulty for a test question with three response options.
5. For a test item with four response options, what is the probability of selecting the correct answer by chance? ____%
6. Calculate the optimum level of difficulty for a test question with four response options.
7. For a test item with five response options, what is the probability of selecting the correct answer by chance? ____%
8. Calculate the optimum level of difficulty for a test question with five response options.
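For anyone who wants to check their Part 2 work by hand, here is a minimal Python sketch of the two calculations described above: the item difficulty index p (the proportion of examinees answering an item correctly) and the optimal difficulty for an item with a given number of response options, taken as the midpoint between the chance success proportion and 1.00 (Cohen et al., 2013, pg. 263). The variable names and helper functions are illustrative only, not anything specified by the worksheet.

```python
# Minimal sketch: item difficulty (p) and optimal difficulty for k response options.
# Responses for the first 6 items of PSYC101 Quiz 1, copied from the table above
# (1 = correct, 0 = incorrect); examinee names kept as comments.
responses = [
    [1, 1, 1, 1, 1, 1],  # Andre
    [1, 1, 1, 1, 0, 0],  # Allison
    [1, 1, 1, 1, 0, 0],  # Heather
    [1, 1, 0, 1, 1, 1],  # Corey
    [0, 0, 1, 1, 0, 1],  # Christina
    [0, 1, 1, 1, 0, 0],  # Jeffrey
    [1, 1, 1, 1, 0, 1],  # Shawn
    [0, 0, 1, 1, 0, 1],  # Dana
    [1, 1, 1, 1, 0, 1],  # Megan
    [0, 1, 1, 1, 0, 1],  # David
    [0, 0, 0, 1, 0, 0],  # Isabel
    [1, 1, 1, 1, 0, 0],  # Lance
    [1, 1, 1, 1, 0, 1],  # Aliyah
    [0, 1, 1, 1, 0, 1],  # Blaire
    [0, 0, 1, 1, 0, 0],  # Gabriel
]

def item_difficulty(item_scores):
    """p = number answering the item correctly / total number of examinees."""
    return sum(item_scores) / len(item_scores)

def optimal_difficulty(num_options):
    """Midpoint between the chance success proportion and 1.00."""
    chance = 1 / num_options
    return (chance + 1.0) / 2

num_items = len(responses[0])
for i in range(num_items):
    p = item_difficulty([examinee[i] for examinee in responses])
    print(f"Item {i + 1}: p = {p:.2f}")

for k in (2, 3, 4, 5):
    print(f"{k} options: chance = {1 / k:.2f}, optimal difficulty = {optimal_difficulty(k):.2f}")
```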
PART 3: Item Analysis: Item Discrimination Index (Cohen et al., 2013, pg. 265-266)

Another way that test creators can distinguish between good and bad items is with an analysis called the Discrimination Index. The discrimination index measures how well an individual test item distinguishes between high scorers and low scorers on the test. An item is considered to be "good" if most of the high scorers get it right and most of the low scorers get it wrong.

Interpreting the Discrimination Index (d)

The discrimination index can range from -1.0 to 1.0.
- The closer d is to 1.0, the better the item discriminates between high and low scorers.
- The closer d is to 0, the more poorly the item discriminates between high and low scorers.
- An item with a negative discrimination index is considered a "negative discriminator" because more low scorers get the item correct than high scorers.
- A discrimination index of 1.0 means all of the high scorers got the item correct and all of the low scorers got it incorrect.
- A discrimination index of -1.0 means all of the low scorers got the item correct and all of the high scorers got it incorrect.
- Items with d's close to 0 or with negative d's ought to be eliminated from the test!

Calculating the Item Discrimination Index (d)

Calculate the item discrimination index (d) for the 7 hypothetical test items presented below. Type your answers in the spaces provided at the right of the table. (1 pt. each)

Item #    U     L     n     d
Item 1    21    17    25
Item 2    23     7    25
Item 3    25     0    25
Item 4     3    24    25
Item 5    22     3    25
Item 6     0    25    25
Item 7    19     6    25

Based on your calculations above, answer the following questions (1 pt. each).

1. Which item discriminates the best?
2. Which item discriminates most poorly?
3. Based on your analysis, identify which two items you would choose to eliminate from this test and explain why you would eliminate each.
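As a quick check on the Part 3 calculations, the sketch below applies the formula the textbook gives for the discrimination index, d = (U - L) / n, where U and L are the numbers of examinees in the upper and lower scoring groups who answered the item correctly and n is the number of examinees in each group (Cohen et al., 2013, pg. 266). The data are the U, L, and n values from the table above; the rest of the code is illustrative.

```python
# Minimal sketch: discrimination index d = (U - L) / n for the seven hypothetical items.
items = {
    "Item 1": (21, 17, 25),
    "Item 2": (23, 7, 25),
    "Item 3": (25, 0, 25),
    "Item 4": (3, 24, 25),
    "Item 5": (22, 3, 25),
    "Item 6": (0, 25, 25),
    "Item 7": (19, 6, 25),
}

for name, (upper_correct, lower_correct, group_size) in items.items():
    d = (upper_correct - lower_correct) / group_size
    print(f"{name}: d = {d:+.2f}")
```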
Part 4: Item Characteristic Curves (Cohen et al., pg. 268-270)

Another method that test creators can use to assess the usefulness of test items is with Item Characteristic Curves. Item characteristic curves provide a graphical depiction of examinees' performance on individual test items. As indicated in the figure below, Total Test Score is plotted on the x-axis of the graph, while the proportion of examinees who got the item correct is plotted on the y-axis.

Using the figure above, provide a written description of how test items A-E discriminate among examinees at various levels of performance. In your responses, discuss why each item would be considered a "good" or a "bad" item.

EXAMPLE: "This item discriminates well among high scorers, but doesn't discriminate well among low scorers. So this item would be considered a good item because it discriminates at the highest levels of performance." (2 pt. each)

Item A:
Item B:
Item C:
Item D:
Item E:

Part 5: Qualitative Item Analysis (Cohen et al., pg. 272-274)

Qualitative item analysis refers to a set of non-statistical procedures used to gather information about the usefulness of test items. These analyses typically involve interviews, panel discussions, questionnaires, and other forms of verbal exchange with test-takers to explore how individual test items work. As an online student, you have a very different test-taking experience than residential students. Based on your readings from Chapter 8, identify 4 topics related to online test-taking, and create 4 qualitative questions that you could ask online test-takers to gain an understanding of their experiences with test-taking.

Also, as students at a Christian institution of higher education, course assignments/assessments are supposed to give students an opportunity to integrate course content with their Christian worldview. Given the topic of faith and learning, create one qualitative question that you could ask test-takers.

Topic (1 pt. each)                      Sample Question for Test-Takers (1 pt. each)

Assignment Scoring (Instructor Use Only)
Part 1 Subtotal: /10
Part 2 Subtotal: /10
Part 3 Subtotal: /10
Part 4 Subtotal: /10
Part 5 Subtotal: /10
TOTAL SCORE: /50

PG 252

For psychological tests designed to be used by school psychologists, interviews with teachers, administrative staff, educational psychologists, and others may be invaluable. Searches through the academic research literature may prove fruitful, as may searches through other databases. Considerations related to variables such as the purpose of the test and the number of examinees to be tested at one time enter into decisions regarding the format of the test under construction.

Item format. Variables such as the form, plan, structure, arrangement, and layout of individual test items are collectively referred to as item format. Two types of item format we will discuss in detail are the selected-response format and the constructed-response format. Items presented in a selected-response format require testtakers to select a response from a set of alternative responses. Items presented in a constructed-response format require testtakers to supply or to create the correct answer, not merely to select it.

If a test is designed to measure achievement and if the items are written in a selected-response format, then examinees must select the response that is keyed as correct. If the test is designed to measure the strength of a particular trait and if the items are written in a selected-response format, then examinees must select the alternative that best answers the question with respect to themselves. As we further discuss item formats, for the sake of simplicity we will confine our examples to achievement tests. The reader may wish to mentally substitute other appropriate terms for words such as correct for personality or other types of tests that are not achievement tests.

Three types of selected-response item formats are multiple-choice, matching, and true-false. An item written in a multiple-choice format has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options variously referred to as distractors or foils. Two illustrations follow (despite the fact that you are probably all too familiar with multiple-choice items).

Item A
A psychological test, an interview, and a case study are:
a. psychological assessment tools
b. standardized behavioral samples
c. reliable assessment instruments
d. theory-linked measures
(In Item A, the lead-in statement is the stem, alternative "a" is the correct alternative, and the remaining alternatives are distractors.)

Now consider Item B:

A good multiple-choice item in an achievement test:
a. has one correct alternative
b. has grammatically parallel alternatives
c. has alternatives of similar length
d. has alternatives that fit grammatically with the stem
e. includes as much of the item as possible in the stem to avoid unnecessary repetition
f. avoids ridiculous distractors
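The excerpt above describes a multiple-choice item as a stem, one correct alternative, and several distractors, and Part 1 of the worksheet adds its own constraints (exactly 4 response options, no "all of the above" or "none of the above"). The sketch below is one hypothetical way to represent such an item and check those constraints in code; the class name, fields, and banned-phrase list are assumptions for illustration, not anything specified by the worksheet or the textbook.

```python
# Minimal sketch: a multiple-choice item as stem + keyed alternative + distractors,
# with checks mirroring the Part 1 constraints (4 options, no "all/none of the above").
from dataclasses import dataclass

BANNED_OPTIONS = ("all of the above", "none of the above")  # illustrative list

@dataclass
class MultipleChoiceItem:
    stem: str
    options: list[str]   # one keyed (correct) alternative plus distractors
    key: str             # letter of the correct alternative, e.g. "a"

    def problems(self) -> list[str]:
        issues = []
        if len(self.options) != 4:
            issues.append("item should have exactly 4 response options")
        if self.key not in "abcd"[: len(self.options)]:
            issues.append("key must point at one of the listed options")
        for text in self.options:
            if text.strip().lower() in BANNED_OPTIONS:
                issues.append(f"avoid the option '{text}'")
        return issues

# Example: Item A from the excerpt, keyed to alternative "a".
item_a = MultipleChoiceItem(
    stem="A psychological test, an interview, and a case study are:",
    options=[
        "psychological assessment tools",
        "standardized behavioral samples",
        "reliable assessment instruments",
        "theory-linked measures",
    ],
    key="a",
)
print(item_a.problems() or "no problems found")
```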
PG 263

The Item-Difficulty Index

Suppose every examinee answered item 1 of the AHT correctly. Can we say that item 1 is a good item? What if no one answered item 1 correctly? In either case, item 1 is not a good item. If everyone gets the item right, the item is too easy; if everyone gets the item wrong, the item is too difficult. Just as the test as a whole is designed to provide an index of degree of knowledge about American history, so each individual item on the test should be passed (scored as correct) or failed (scored as incorrect) on the basis of testtakers' differential knowledge of American history.(4)

An index of an item's difficulty is obtained by calculating the proportion of the total number of testtakers who answered the item correctly. A lowercase italic "p" (p) is used to denote item difficulty, and a subscript refers to the item number (so p1 is read "item-difficulty index for item 1"). The value of an item-difficulty index can theoretically range from 0 (if no one got the item right) to 1 (if everyone got the item right). If 50 of the 100 examinees answered item 2 correctly, then the item-difficulty index for this item would be equal to 50 divided by 100, or .5 (p2 = .5). If 75 of the examinees got item 3 right, then p3 would be equal to .75 and we could say that item 3 was easier than item 2. Note that the larger the item-difficulty index, the easier the item. Because p refers to the percent of people passing an item, the higher the p for an item, the easier the item. The statistic referred to as an item-difficulty index in the context of achievement testing may be an item-endorsement index in other contexts, such as personality testing. Here, the statistic provides not a measure of the percent of people passing the item but a measure of the percent of people who said yes to, agreed with, or otherwise endorsed the item.

An index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test's items. This is accomplished by summing the item-difficulty indices for all test items and dividing by the total number of items on the test. For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8. Note, however, that the possible effect of guessing must be taken into account when considering items of the selected-response variety. With this type of item, the optimal average item difficulty is usually the midpoint between 1.00 and the chance success proportion, defined as the probability of answering correctly by random guessing. In a true-false item, the probability of guessing correctly on the basis of chance alone is 1/2, or .50. Therefore, the optimal item difficulty is halfway between .50 and 1.00, or .75. In general, the midpoint representing the optimal item difficulty is obtained by summing the chance success proportion and 1.00 and then dividing the sum by 2: .50 + 1.00 = 1.50, and 1.50/2 = .75.

(4) An exception here may be a giveaway item. Such an item might be inserted near the beginning of an achievement test to spur motivation and a positive test-taking attitude and to lessen testtakers' test-related anxiety. In general, however, if an item analysis suggests ...
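To make the midpoint rule from the passage above easier to reuse (it is also what Part 2B of the worksheet asks for), here it is in display form, with the true-false case worked out as in the excerpt and, as an additional illustration computed the same way, the five-option case.

```latex
% Optimal item difficulty as the midpoint between the chance success proportion and 1.00
\[
p_{\text{optimal}} \;=\; \frac{p_{\text{chance}} + 1.00}{2},
\qquad
p_{\text{chance}} \;=\; \frac{1}{\text{number of response options}}
\]
% Worked instances:
\[
\text{two options (true-false): } \frac{.50 + 1.00}{2} = .75
\qquad\qquad
\text{five options: } \frac{.20 + 1.00}{2} = .60
\]
```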
PG 265

The Item-Validity Index

The item-validity index is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. The higher the item-validity index, the greater the test's criterion-related validity. The item-validity index can be calculated once the following two statistics are known: the item-score standard deviation, and the correlation between the item score and the criterion score. The item-score standard deviation of item 1 (denoted by the symbol s1) can be calculated using the index of the item's difficulty (p1) in the following formula: s1 = √(p1(1 − p1)). The correlation between the score on item 1 and a score on the criterion measure (denoted by the symbol r1C) is multiplied by item 1's item-score standard deviation (s1), and the product is equal to an index of an item's validity (s1 r1C). Calculating the item-validity index will be important when the test developer's goal is to maximize the criterion-related validity of the test. A visual representation of the best items on a test (if the objective is to maximize criterion-related validity) can be achieved by plotting each item's item-validity index and item-reliability index (Figure 8-5).

The Item-Discrimination Index

Measures of item discrimination indicate how adequately an item separates or discriminates between high scorers and low scorers on an entire test. In this context, a multiple-choice item on an achievement test is a good item if most of the high scorers answer correctly and most of the low scorers answer incorrectly. If most of the high scorers fail a particular item, these testtakers may be making an alternative interpretation of a response intended to serve as a distractor. In such a case, the test developer should interview the examinees to understand better the basis for the choice and then appropriately revise (or eliminate) the item.

PG 266

JUST THINK . . . Try writing two items of your own: the first, one that you predict will have a very high d, and the second, one that you predict will have a high negative d.

Suppose a history teacher gave the AHT to a total of 119 students who were just weeks away from completing ninth grade. The teacher isolated the upper (U) and lower (L) 27% of the test papers, with a total of 32 papers in each group. Data and item-discrimination indices for Items 1 through 5 are presented in Table 8-2. Observe that 20 testtakers in the U group answered Item 1 correctly and that 16 testtakers in the L group answered Item 1 correctly, for an item-discrimination index equal to .13.

Common sense dictates that an item on an achievement test is not doing its job if it is answered correctly by respondents who least understand the subject matter. Similarly, an item on a test purporting to measure a particular personality trait is not doing its job if responses indicate that people who score very low on the test as a whole (indicating absence or low levels of the trait in question) tend to score very high on the item (indicating that they are very high on the trait in question, contrary to what the test as a whole indicates).

The item-discrimination index is a measure of item discrimination, symbolized by a lowercase italic "d" (d). This estimate of item discrimination, in essence, compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores. The optimal boundary lines for what we refer to as the "upper" and "lower" areas of a distribution of scores will demarcate the upper and lower 27% of the distribution of scores, provided the distribution is normal (Kelley, 1939). As the distribution of test scores becomes more platykurtic (flatter), the optimal boundary line for defining upper and lower increases to near 33% (Cureton, 1957). Allen and Yen (1979, p. 122) assure us that "for most applications, any percentage between 25 and 33 will yield similar estimates."
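The passage above describes d as a comparison between an upper and a lower scoring group, conventionally the top and bottom 27% of the total-score distribution. As a rough illustration of that full procedure, the sketch below sorts examinees by total score, forms upper and lower groups, and computes d for each item. It reuses the PSYC101 Quiz 1 data from Part 2A purely for illustration; the grouping fraction and helper names are assumptions, and with only 15 examinees the groups are tiny, so this demonstrates the mechanics rather than a meaningful analysis.

```python
# Minimal sketch: item discrimination from raw responses by splitting examinees
# into upper and lower groups on total score (here the top and bottom 27%).
examinees = [
    # (total quiz score, [item 1..6 responses]), taken from the Part 2A table.
    (16, [1, 1, 1, 1, 1, 1]),  # Andre
    (7,  [1, 1, 1, 1, 0, 0]),  # Allison
    (10, [1, 1, 1, 1, 0, 0]),  # Heather
    (17, [1, 1, 0, 1, 1, 1]),  # Corey
    (3,  [0, 0, 1, 1, 0, 1]),  # Christina
    (11, [0, 1, 1, 1, 0, 0]),  # Jeffrey
    (14, [1, 1, 1, 1, 0, 1]),  # Shawn
    (10, [0, 0, 1, 1, 0, 1]),  # Dana
    (13, [1, 1, 1, 1, 0, 1]),  # Megan
    (12, [0, 1, 1, 1, 0, 1]),  # David
    (4,  [0, 0, 0, 1, 0, 0]),  # Isabel
    (9,  [1, 1, 1, 1, 0, 0]),  # Lance
    (15, [1, 1, 1, 1, 0, 1]),  # Aliyah
    (12, [0, 1, 1, 1, 0, 1]),  # Blaire
    (6,  [0, 0, 1, 1, 0, 0]),  # Gabriel
]

def discrimination(examinees, item_index, fraction=0.27):
    """d = proportion correct in the upper group minus proportion correct in the lower group."""
    ranked = sorted(examinees, key=lambda e: e[0], reverse=True)
    n = max(1, round(fraction * len(ranked)))          # size of each group
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(e[1][item_index] for e in upper) / n
    p_lower = sum(e[1][item_index] for e in lower) / n
    return p_upper - p_lower

for item in range(6):
    print(f"Item {item + 1}: d = {discrimination(examinees, item):+.2f}")
```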
The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly; the higher the value of d, the greater the number of high scorers answering the item correctly. A negative d-value on a particular item is a red flag because it indicates that low-scoring examinees are more likely to answer the item correctly than high-scoring examinees. This situation calls for some action such as revising or eliminating the item. Item 1 is probably a reasonable item because more U-group members than L-group members answered it correctly. The higher the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring testtakers. For this reason, Item 2 is a better item than Item 1 because Item 2's item-discrimination index is .63. The highest possible value of d is +1.00. This value indicates that all members of the U group answered the item correctly whereas all members of the L group answered the item incorrectly. If the same proportion of members of the U and L groups pass the item, then the item is not discriminating between testtakers at all and d, appropriately enough, will be equal to 0.

Table 8-2: Item-Discrimination Indices for Five Hypothetical Items

Item     U     L    U - L    n    d [(U - L)/n]
  1     20    16      4     32        .13
  2     30    10     20     32        .63
  3     32     0     32     32       1.00
  4     20    20      0     32       0.00
  5      0    32    -32     32      -1.00

[Figure 8-5: Maximizing Criterion-Related Validity. A plot of the item-validity index against the item-reliability index, indicating the best items for maximizing criterion-related validity. Source: Allen and Yen (1979).]

PG 268

Alternatives for Item 5:

         a     b     c     d     e
U       14     0     0     5    13
L        7     0     0    16     9

Item 5 is a poor item because more L group members than U group members answered the item correctly. Furthermore, none of the examinees chose the "b" or "c" distractors.

Before moving on to a consideration of the use of item-characteristic curves in item analysis, let's pause to "bring home" the real-life application of some of what we have discussed so far. In his capacity as a consulting industrial/organizational psychologist, our featured test user in this chapter, Dr. Scott Birkeland, has had occasion to create tests and improve them with item-analytic methods. He shares some of his thoughts in his Meet an Assessment Professional essay, an excerpt of which is presented here.

Item-Characteristic Curves

As you may have surmised from the introduction to item response theory (IRT) that was presented in Chapter 5, IRT can be a powerful tool not only for understanding how test items perform but also for creating or modifying individual test items, building new tests, and revising existing tests. We will have more to say about that later in the chapter. For now, let's review how item-characteristic curves (ICCs) can play a role in decisions about which items are working well and which items are not. Recall that an item-characteristic curve is a graphic representation of item difficulty and discrimination. Figure 8-6 presents several ICCs with ability plotted on the horizontal axis and probability of correct response plotted on the vertical axis. Note that the extent to which an item discriminates high- from low-scoring examinees is apparent from the slope of the curve. The steeper the slope, the greater the item discrimination. An item may also vary in terms of its difficulty level.
An easy item will shift the ICC to the left along the ability axis, indicating that many people will likely get the item correct. A difficult item will shift the ICC to the right along the horizontal axis, indicating that fewer people will answer the item correctly. In other words, it takes high ability levels for a person to have a high probability of their response being scored as correct.

Now focus on the item-characteristic curve for Item A. Do you think this is a good item? The answer is that it is not. The probability of a testtaker's responding correctly is high for testtakers of low ability and low for testtakers of high ability. What about Item B; is it a good test item? Again, the answer is no. The curve tells us that testtakers of moderate ability have the highest probability of answering this item correctly. Testtakers with the greatest amount of ability, as well as their counterparts at the other end of the ability spectrum, are unlikely to respond correctly to this item. Item B may be one of those items to which people who know too much (or think too much) are likely to respond incorrectly.

Item C is a good test item because the probability of responding correctly to it increases with ability. What about Item D? Its ICC profiles an item that discriminates at only one point on the continuum of ability. The probability is great that all testtakers at or above this point will respond correctly to the item, and the probability of an incorrect response is great for testtakers who fall below that particular point in ability. An item such as D therefore has excellent discriminative ability and would be useful in a test designed, for example, to select applicants on the basis of some cutoff score. However, such an item might not be desirable in a test designed to provide detailed information on testtaker ability across all ability levels. This might be the case, for example, in a diagnostic reading or arithmetic test.
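The excerpt describes an ICC qualitatively: the slope of the curve reflects discrimination and its left-right position reflects difficulty. The textbook does not give a formula here, so the sketch below uses the standard two-parameter logistic model from IRT as one common way to draw such curves; the parameter values are arbitrary and purely illustrative.

```python
# Minimal sketch: drawing illustrative item-characteristic curves with a
# two-parameter logistic model, P(correct | ability) = 1 / (1 + exp(-a * (ability - b))).
# Here a controls the slope (discrimination) and b the left-right shift (difficulty).
import math

import matplotlib.pyplot as plt

def icc(ability, a, b):
    """Probability of a correct response at a given ability level."""
    return 1.0 / (1.0 + math.exp(-a * (ability - b)))

abilities = [x / 10 for x in range(-40, 41)]          # ability axis from -4 to +4
example_items = {
    "easy, discriminating (a=1.5, b=-1)": (1.5, -1.0),
    "hard, discriminating (a=1.5, b=+1)": (1.5, 1.0),
    "weakly discriminating (a=0.4, b=0)": (0.4, 0.0),
}

for label, (a, b) in example_items.items():
    plt.plot(abilities, [icc(x, a, b) for x in abilities], label=label)

plt.xlabel("Ability (total test score)")
plt.ylabel("Probability of correct response")
plt.legend()
plt.show()
```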
PG 269

Meet an Assessment Professional: Dr. Scott Birkeland
Scott Birkeland, Ph.D., Stang Decision Systems, Inc.

I also get involved in developing new test items. Given that these tests are used with real-life candidates, I place a high level of importance on a test's face validity. I want applicants who take the tests to walk away feeling as though the questions that they answered were truly relevant for the job for which they applied. Because of this, each new project leads to the development of new questions so that the tests "look and feel right" for the candidates. For example, if we have a reading and comprehension test, we make sure that the materials that the candidates read are materials that are similar to what they would actually read on the job. This can be a challenge in that by having to develop new questions, the test development process takes more time and effort. In the long run, however, we know that this enhances the candidates' reactions to the testing process. Additionally, our research suggests that it enhances the test's predictability. Once tests have been developed and administered to candidates, we continue to look for ways to improve them. This is where statistics comes into play. We conduct item-level analyses of each question to determine if certain questions are performing better than others. I am often amazed at the power of a simple item analysis (i.e., calculating item difficulty and item discrimination). Oftentimes, an item analysis will flag a question, causing me to go back and reexamine the item only to find something about it to be confusing. An item analysis allows us to fix those types of issues and continually enhance the quality of a test.

Read more of what Dr. Birkeland had to say (his complete essay) at www.mhhe.com/cohentesting8.

Other Considerations in Item Analysis

Guessing. In achievement testing, the problem of how to handle testtaker guessing is one that has eluded any universally acceptable solution. Methods designed to detect guessing (S.-R. Chang et al., 2011), minimize the effects of guessing (Kubinger et al., 2010), and statistically correct for guessing (Espinosa & Gardeazabal, 2010) have been proposed, but no such method has achieved universal acceptance. Perhaps it is because the issues surrounding guessing are more complex than they appear at first glance. To better appreciate the complexity of the issues, consider the following three criteria that any correction for guessing must meet, as well as the other interacting issues that must be addressed:

1. A correction for guessing must recognize that, when a respondent guesses at an answer on an achievement test, the guess is not typically made on a totally random basis. It is more reasonable to assume that the testtaker's guess is based on some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives. However, the individual testtaker's amount of knowledge of the subject matter will vary from one item to the next.

PG 270

[Figure 8-6: Some Sample Item-Characteristic Curves, with panels for Items A through D. For simplicity, we have omitted scale values for the axes. The vertical axis in such a graph lists probability of correct response in values ranging from 0 to 1. Values for the horizontal axis, which we have simply labeled "ability," are total scores on the test. In other sources, you may find the vertical axis of an item-characteristic curve labeled something like "proportion of examinees who respond correctly to the item" and the horizontal axis labeled "total test score." Source: Ghiselli et al. (1981).]

PG 272

JUST THINK . . . Write an item that is purposely designed to be biased in favor of one group over another. Members of what group would do well on this item? Members of what group would do poorly on this item?

If a relatively large number of items biased in favor of one group coexist with approximately the same number of items biased in favor of another group, it cannot be claimed that the test measures the same abilities in the two groups. This is true even though overall test scores of the individuals in the two groups may not be significantly different (Jensen, 1980). Establishing the presence of differential item functioning requires a statistical test of the null hypothesis of no difference between the item-characteristic curves of the two groups. The pros and cons of different statistical tests for detecting differential item functioning have long been a matter of debate (Raju et al., 1993). What is not a matter of debate is that items exhibiting significant difference in item-characteristic curves must be revised or eliminated from the test.
Speed tests. Item analyses of tests taken under speed conditions yield misleading or uninterpretable results. The closer an item is to the end of the test, the more difficult it may appear to be. This is because testtakers simply may not get to items near the end of the test before time runs out. In a similar vein, measures of item discrimination may be artificially high for late-appearing items. This is so because testtakers who know the material better may work faster and are thus more likely to answer the later items. Items appearing late in a speed test are consequently more likely to show positive item-total correlations because of the select group of examinees reaching those items.

Given these problems, how can items on a speed test be analyzed? Perhaps the most obvious solution is to restrict the item analysis of items on a speed test only to the items completed by the testtaker. However, this solution is not recommended, for at least three reasons: (1) item analyses of the later items would be based on a progressively smaller number of testtakers, yielding progressively less reliable results; (2) if the more knowledgeable examinees reach the later items, then part of the analysis is based on all testtakers and part is based on a selected sample; and (3) because the more knowledgeable testtakers are more likely to score correctly, their performance will make items occurring toward the end of the test appear to be easier than they are.

If speed is not an important element of the ability being measured by the test, and because speed as a variable may produce misleading information about item performance, the test developer ideally should administer the test to be item-analyzed with generous time limits to complete the test. Once the item analysis is completed, norms should be established using the speed conditions intended for use with the test in actual practice.

JUST THINK . . . Provide an example of what, in your opinion, is the best, as well as the worst, use of a speed test.

Qualitative Item Analysis

Test users have had a long-standing interest in understanding test performance from the perspective of testtakers (Fiske, 1967; Mosier, 1947). The calculation of item-validity, item-reliability, and other such quantitative indices represents one approach to understanding testtakers. Another general class of research methods is referred to as qualitative. In contrast to quantitative methods, qualitative methods are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures. Encouraging testtakers, on a group or individual basis, to discuss aspects of their test-taking experience is, in essence, eliciting or generating "data" (words).

PG 273

These data may then be used by test developers, users, and publishers to improve various aspects of the test. Qualitative item analysis is a general term for various nonstatistical procedures designed to explore how individual test items work. The analysis compares individual test items to each other and to the test as a whole. In contrast to statistically based procedures, qualitative methods involve exploration of the issues through verbal means such as interviews and group discussions conducted with testtakers and other relevant parties. Some of the topics researchers may wish to explore qualitatively are summarized in Table 8-3.
Table 8-3: Potential Areas of Exploration by Means of Qualitative Item Analysis

This table lists sample topics and questions of possible interest to test users. The questions could be raised either orally or in writing shortly after a test's administration. Additionally, depending upon the objectives of the test user, the questions could be placed into other formats, such as true-false or multiple choice. Depending upon the specific questions to be asked and the number of testtakers being sampled, the test user may wish to guarantee the anonymity of the respondents.

Cultural Sensitivity: Did you feel that any item or aspect of this test was discriminatory with respect to any group of people? If so, why?

Face Validity: Did the test appear to measure what you expected it would measure? If not, what was contrary to your expectations?

Test Administrator: Did the behavior of the test administrator affect your performance on this test in any way? If so, how?

Test Environment: Did any conditions in the room affect your performance on this test in any way? If so, how?

Test Fairness: Do you think the test was a fair test of what it sought to measure? Why or why not?

Test Language: Were there any instructions or other written aspects of the test that you had difficulty understanding?

Test Length: How did you feel about the length of the test with respect to (a) the time it took to complete and (b) the number of items?

Testtaker's Guessing: Did you guess on any of the test items? What percentage of the items would you estimate you guessed on? Did you employ any particular strategy for guessing, or was it basically random?

Testtaker's Integrity: Do you think that there was any cheating during this test? If so, please describe the methods you think may have been used.

Testtaker's Mental/Physical State Upon Entry: How would you describe your mental state going into this test? Do you think that your mental state in any way affected the test outcome? If so, how? How would you describe your physical state going into this test? Do you think that your physical state in any way affected the test outcome? If so, how?

Testtaker's Mental/Physical State During the Test: How would you describe your mental state as you took this test? Do you think that your mental state in any way affected the test outcome? If so, how? How would you describe your physical state as you took this test? Do you think that your physical state in any way affected the test outcome? If so, how?

Testtaker's Overall Impressions: What is your overall impression of this test? What suggestions would you offer the test developer for improvement?

Testtaker's Preferences: Did you find any part of the test educational, entertaining, or otherwise rewarding? What, specifically, did you like or dislike about the test? Did you find any part of the test anxiety-provoking, condescending, or otherwise upsetting? Why?

Testtaker's Preparation: How did you prepare for this test? If you were going to advise others how to prepare for it, what would you tell them?

PG 274

One cautionary note: Providing testtakers with the opportunity to describe a test can be like providing students with the opportunity to describe their instructors. In both cases, there may be abuse of the process, especially by respondents who have extra-test (or extra-instructor) axes to grind. Respondents may be disgruntled for any number of reasons, from failure to prepare adequately for the test to disappointment in their test performance.
In such cases, the opportunity to evaluate the test is an opportunity to lash out. The test, the administrator of the test, and the institution, agency, or corporation responsible for the test administration may all become objects of criticism. Testtaker questionnaires, much like other qualitative research tools, must be interpreted with an eye toward the full context of the experience for the respondent(s).

"Think aloud" test administration. An innovative approach to cognitive assessment entails having respondents verbalize thoughts as they occur. Although different researchers use different procedures (Davison et al., 1997; Hurlburt, 1997; Klinger, 1978), this general approach has been employed in a variety of research contexts, including studies of adjustment (Kendall et al., 1979; Sutton-Simon & Goldfried, 1979), problem solving (Duncker, 1945; Kozhevnikov et al., 2007; Montague, 1993), educational research and remediation (Munoz et al., 2006; Randall et al., 1986; Schellings et al., 2006), clinical intervention (Gann & Davison, 1997; Haaga et al., 1993; Schmitter-Edgecombe & Bales, 2005; White et al., 1992), and jury modeling (Wright & Hall, 2007). Cohen et al. (1988) proposed the use of "think aloud" test administration as a qualitative research tool designed to shed light on the testtaker's thought processes during the administration of a test. On a one-to-one basis with an examiner, examinees are asked to take a test, thinking aloud as they respond to each item. If the test is designed to measure achievement, such verbalizations may be useful in assessing not only if certain students (such as low or high scorers on previous examinations) are misinterpreting a particular item but also why and how they are misinterpreting the item. If the test is designed to measure personality or some aspect of it, the "think aloud" technique may also yield valuable insights regarding the way individuals perceive, interpret, and respond to the items.

JUST THINK (ALOUD) . . . How might thinking aloud to evaluate test items be more effective than thinking silently?

Expert panels. In addition to interviewing testtakers individually or in groups, expert panels may also provide qualitative analyses of test items. A sensitivity review is a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations. Since the 1990s or so, sensitivity reviews have become a standard part of test development (Reckase, 1996). For example, in an effort to root out any possible bias in the Stanford Achievement Test series, the test publisher formed an advisory panel of twelve minority group members, each a prominent member of the educational community. Panel members met with the publisher to obtain an understanding of the history and philosophy of the test battery and to discuss and define the problem of bias. Some of the possible forms of content bias that may find their way into any achievement test were identified as follows (Stanford Special Report, 1992, pp. 3-4).

Status: Are the members of a particular group shown in situations that do not involve authority or leadership?

Stereotype: Are the members of a particular group portrayed as uniformly having certain (1) aptitudes, (2) interests, (3) occupations, or (4) personality characteristics?
