Questions and Answers of Measurement Theory In Action

3. Are there any unique issues concerning the use of IRT for certification and/or licensing exams? If so, what are they?
4. Would the development and use of a test using IRT procedures be any different for the civil service exam (which typically rank orders job applicants) and the certification exam (which typically
5. Based on the information presented in the case study, does it appear that a new certification exam is the answer to the state’s reincarceration problem? What unique information do you think it
6. What would be the advantage of using IRT methods over CTT-IA procedures to develop the certification test in this instance?
7. Should the applicants and current incumbents be treated any differently in this situation?
EXERCISE 20.1: 1-PL (RASCH), 2-PL, AND 3-PL COMPUTER RUNS OBJECTIVE: To provide a brief introduction to common IRT programs by downloading a demo version and running 1-PL (Rasch), 2-PL, and 3-PL
1. The website http://www.sscicentral.com provides information on several IRT programs, including BILOG-MG, MULTILOG, PARSCALE, and IRTPro. The website has a student version of IRTPro available for
2. Once IRTPro is downloaded and installed on your computer, start the program. You should get a screen that looks like Figure 20.3a. From the menu, select “Open,” and from type of file choose
3. Examine the output file. There are lots of interesting bits of information there. You can see the a and b parameters for each item. Be aware that the column that has the c parameter should be
4. Now that you have estimated the 2-PL model for these 10 items, you can rerun the program using the 3-PL model and the 1-PL Rasch model. To rerun the 3-PL model, go back to the models section and
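The three models run in this exercise differ only in which item parameters are free to vary. The sketch below is a minimal illustration of the standard logistic item response function and item information; it is not IRTPro's implementation, and it assumes the plain logistic metric (no 1.7 scaling constant). The item names and parameter values are hypothetical, and `most_informative` is a helper invented for illustration.

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3-PL model.

    With c = 0 this reduces to the 2-PL model; with c = 0 and a common
    a for every item it reduces to the 1-PL (Rasch-type) model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a=1.0, b=0.0, c=0.0):
    """Item information at ability theta (equals a^2 * P * Q when c = 0)."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return a ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def most_informative(items, theta, k=3):
    """Return the names of the k items with the most information at theta."""
    ranked = sorted(items, key=lambda name: item_information(theta, *items[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical (a, b, c) estimates, not values from the exercise data.
items = {
    "item1": (1.2, -1.0, 0.0),
    "item2": (0.8, 0.0, 0.0),
    "item3": (1.5, 1.0, 0.0),
    "item4": (1.0, -1.5, 0.0),
}
low_ability_picks = most_informative(items, theta=-1.0, k=2)
```

Evaluating `most_informative` at a low and then a high value of theta is one way to approach the item-selection question in this exercise once the actual parameter estimates are read from the output file.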
1. What are the basic descriptive statistics for the data? (N, average number correct, number of items, etc.)
2. Which items were most and least discriminating, and how did that change across models?
3. If you had to choose just three items for a test that provided the most information at low ability, which would you choose? What about for high ability?
4. Which of the three models seems to best fit the data? What did you base your answer on?
EXERCISE 20.2: IRT LITERATURE SEARCH OBJECTIVE: To become familiar with applications of IRT in the
1. CAT procedures work best when large numbers of unidimensional items can be written and it is possible to collect large amounts of data to calibrate the item response parameter estimates.
2. CATs are effective in that they provide superior measurement while being more efficient. In addition, they increase test security by providing different tests for each respondent. IRT-based DIF
3. What is the difference between a test that is simply administered on a computer (sometimes called CBT) and a computer adaptive test?
4. How do DIF procedures extend CTT-IA analyses?
5. What are the advantages and disadvantages of using non-IRT DIF versus IRT-based DIF?
1. If you were Scott, how would you go about explaining what CAT was to Gail?
2. What are some of the major differences between CAT and a paper-and-pencil test that might highlight the advantages of CAT over paper-and-pencil testing for Gail?
3. Are there other reasons that Tammy might have had to answer more questions than Gail? Is there a better way to explain this than what Scott said?
4. What other stopping procedures might CAT use to decide when to end the testing session besides a maximum number of items or a time limit? Will it be different for licensing exams versus other more
5. Are there other examples of the use of CAT that you can think of that might help Gail better understand what CAT is?
1. What additional information might the use of IRT procedures for DIF analysis provide that is not available with use of the Delta statistic?
2. Psychometricians at ETS no doubt know a lot about IRT procedures. In fact, ETS psychometricians were among the early pioneers in IRT research and development. Why, then, do you think they opted
3. As noted in the module overview, in order to perform IRT DIF procedures, item parameters need to be linked or equated across groups. Do you think such equating would also be required if other
EXERCISE 21.1: CAT ONLINE REVIEW OBJECTIVE: To become familiar with IRT and CAT through an investigation of current adaptive tests. BACKGROUND: In the module overview, we discussed the key elements of
1. Find two different adaptive tests and compare and contrast how these two tests explain the adaptive testing procedure to test takers.
2. If you were publishing your own adaptive test, how would you communicate the test-taking experience to test takers?
EXERCISE 21.2: ITEM BIAS/FAIRNESS REVIEW OBJECTIVE: To provide an opportunity to use item bias/fairness review to critically evaluate test items for possible bias. The web page
1. Did you find any questions that appear to demonstrate bias based on sex? If so, which items and on what basis do they appear to show bias?
2. Did you find any questions that appear to demonstrate bias based on race (Caucasian vs. African American)? If so, which items and on what basis do they appear to show bias?
3. Did you find any questions that appear to demonstrate bias based on ethnic group status (Caucasian vs. Hispanic)? If so, which items and on what basis do they appear to show bias?
EXERCISE 21.3: A CAT/DIF LITERATURE SEARCH OBJECTIVE: To become familiar with the CAT and DIF literature. Either individually or in small groups, perform a literature search to find a recent empirical
1. Brainstorm as many potential sources of error as possible that could influence your measure and then consider how you could collect data manipulating these sources of error.
2. Collect a comprehensive set of data, manipulating as many sources of variance as you can.
3. Estimate variance components to determine which sources of variance matter most and which matter little.
4. Decide on whether an absolute or relative D study is appropriate and compute the appropriate statistics. If you are unsure, compute the statistics for both types of decisions.
5. Based on the D study, update test recommendations to provide adequate levels of accuracy for test usage.
6. Realize that the test construction process is a continuous one. As a test sees expanded uses, consider that additional sources of error may be encountered. Develop future GT-based studies to
1. What are the main differences between GT and CTT?
3. What does the variance component estimate tell us?
4. What is the difference between a fixed and a random facet?
6. What is the purpose of a D study?
8. Why has GT been slow to be adopted in many research areas?
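For the questions above on variance components and D studies, a small worked sketch may help. It assumes the simplest possible design, a fully crossed persons-by-items layout with one observation per cell and both facets random, which is a simplification of the multi-facet designs the module discusses. The function names and the data are hypothetical, and the estimates use the usual expected-mean-square equations for this one-facet design.

```python
def variance_components(scores):
    """Estimate G-study variance components for a crossed p x i design.

    scores: list of rows (persons), each a list of item scores.
    Returns the person, item, and residual (interaction plus error) components.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Negative estimates are set to zero, a common convention.
    return {
        "p": max((ms_p - ms_res) / n_i, 0.0),
        "i": max((ms_i - ms_res) / n_p, 0.0),
        "pi,e": ms_res,
    }

def d_study(vc, n_items):
    """Relative and absolute coefficients for a test of n_items items."""
    rel_error = vc["pi,e"] / n_items
    abs_error = (vc["i"] + vc["pi,e"]) / n_items
    return {
        "relative": vc["p"] / (vc["p"] + rel_error),
        "absolute": vc["p"] / (vc["p"] + abs_error),
    }

# Hypothetical scores for three persons on three items.
data = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
vc = variance_components(data)
coefficients = d_study(vc, n_items=3)
```

The absolute coefficient counts item difficulty differences as error while the relative coefficient does not, which is why the absolute value can never exceed the relative one. Calling `d_study` with different values of `n_items` shows how lengthening or shortening the measure would be expected to change both.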
1. What are the factors of variation that you would compute for your G study? For each of these factors, would you consider that factor random or fixed? Is your design fully crossed, or are some
2. What statistics would you report back to your advisor to help her better understand whether the projective test is working well?
3. When conducting a D study, would you be most interested in absolute or relative decisions for this new measure?
4. Explain to your advisor how the information gathered from your GT analysis compares to information that could have been found from traditional CTT-based investigations of reliability as well as
5. Suppose that the GT study shows that there is reasonable consistency over time and raters. Develop a new GT-based study to investigate other sources of variation and error. Think of these other
1. What are the factors that you will analyze in your GT study? For each factor, decide whether it is fixed or random. Are any of the factors nested within each other?
2. What type of D study will you conduct to satisfy the VP-HR?
3. From a pragmatic perspective, which factors will be necessary to focus on in the D study?
4. The VP-HR remembers something about inter-rater reliability from his single testing class. He asks, “Why do we need to do this fancy design? Can’t we just correlate how my ratings compare to
5. Do you think the additional effort needed to conduct the GT study is worth it, compared to alternatives?
EXERCISE 22.1: REVIEW TWO G STUDIES OBJECTIVE: To observe how GT studies are reported in practice and to observe the different rationales that researchers use for motivating GT analyses. Find two
1. What was the rationale given for using GT? What rationale did the authors give for using GT analyses compared to other techniques discussed in this book?
2. What sources of error did the authors manipulate in their articles? Did they treat these factors as random or fixed?
3. If the authors conducted a D study, did they use relative or absolute criteria to guide the analyses?
4. What factors were found to be significant sources of error, and which were found to be trivial?
5. How did the authors use their GT findings to refine their instrument?
EXERCISE 22.2: EVALUATE COMPUTER PROGRAMS TO RUN GT ANALYSES OBJECTIVE: To gain some familiarity with popular GT analysis programs and to better understand some of the design features used to develop
1. What types of designs can each piece of software best handle? How are the three different pieces of software distinguished from each other?
2. Exercise 22.1 presents a scenario where you have a series of items being responded to by two judges over a period of time. Which program would be best to analyze those data?
3. The Highhouse et al. project used one of these programs to analyze their data. Based on your understanding of their project, which of the three software programs would be most appropriate? (Hint:
There are a large number of personality differences that, to date, have little or no means of assessment. While some of these constructs are well defined, others suffer the disgrace of poor construct
Assume that you are currently working as a(n) a. clinical psychologist for a local parole board, or b. industrial/organizational psychologist for the U.S. Postal Service, or c. school psychologist
For this exercise, identify two different measures of the same construct. For example, you could identify two different measures of the personality construct, agreeableness. Also, obtain the test
Caleb knew he had to pay a visit to Dr. Zavala, the instructor of his psychometrics course. Caleb’s grade on the first exam was, shall we say, less than impressive. While his performance on the
Sheila was frustrated. Although she was happy with both the topic and the constructs she had chosen to examine in her senior honors thesis, she had hit several roadblocks in determining what measures
Mark each of the following as observed score (X), true score (T), or error (E). 1. During a timed, two-hour exam, both of Celia’s mechanical pencils ran out of lead, causing her temporary distress
Using the Spearman–Brown prophecy formula provided in Case Study 5.2, estimate Sheila’s reliability for the dogmatism scale if she used only one third of the number of original items. Is this an
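As a reminder of how the Spearman–Brown prophecy formula behaves when a scale is shortened, here is a minimal sketch. The general formula is r_new = k·r / (1 + (k − 1)·r), where k is the ratio of new to original test length. The starting reliability of .80 is a hypothetical value chosen for illustration; it is not the value from Case Study 5.2.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    length_factor > 1 lengthens the test; length_factor < 1 shortens it.
    """
    k = length_factor
    return (k * reliability) / (1.0 + (k - 1.0) * reliability)

# Hypothetical original reliability; substitute the value from the case study.
original_reliability = 0.80
one_third_length = spearman_brown(original_reliability, 1.0 / 3.0)
```

With these illustrative numbers, cutting the scale to one third of its length lowers the predicted reliability from .80 to roughly .57. Whether a drop of that size is acceptable depends on how the scores will be used.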
It didn't make sense. It just didn't. How could the reliability be so low? Chad scratched his head and thought. Chad had agreed to help analyze the data from his graduate advisor's most
Olivia and Gavin had been studying for their upcoming psychometrics midterm for over an hour, but the past few minutes had been less than productive. While the two thought they both had a good
Using the data set “Reliability.sav”, perform the reliability analyses outlined below. The scales provided here include a depression scale (14 items, V1–V14), a life satisfaction scale (10
What is the CVI for the five-item test in question 10 prior to deletion of any items due to low CVR? Question 10: Imagine the case in which 14 SMEs were asked to provide CVR ratings for a five-item
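The arithmetic behind this kind of question can be sketched as follows, assuming Lawshe's definition of the content validity ratio, CVR = (n_e − N/2) / (N/2), where n_e is the number of SMEs rating an item "essential" and N is the total number of SMEs. The CVI is taken here as the mean CVR across the items considered. The "essential" counts below are hypothetical, because the ratings for question 10 are not shown in this listing.

```python
def cvr(n_essential, n_smes):
    """Lawshe's content validity ratio for a single item."""
    half = n_smes / 2.0
    return (n_essential - half) / half

def cvi(essential_counts, n_smes):
    """Content validity index: mean CVR across the given items."""
    ratios = [cvr(n, n_smes) for n in essential_counts]
    return sum(ratios) / len(ratios)

# Hypothetical counts of "essential" ratings from 14 SMEs for five items.
essential_counts = [14, 12, 10, 7, 5]
item_cvrs = [cvr(n, 14) for n in essential_counts]
test_cvi = cvi(essential_counts, 14)
```

For these made-up counts the item CVRs range from 1.0 down to about −.29, and the CVI before any items are deleted is about .37. Recomputing `cvi` on only the retained items shows how dropping low-CVR items raises the index.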
In her years of experience as a clinical therapist, Juanita had come to suspect that some of her clients seemed to share a common trait. Specifically, a significant portion of her clients expressed
Reflecting for a moment on the results of his ambitious undertaking, Lester smiled. His boss at the Testing and Personnel Services Division of this large midwestern city had assigned him the task of
For each of the following tests, identify two different samples of people who would have the expertise to serve as subject matter experts (SMEs) for providing judgments regarding the content validity
Given the limited number of items that can be included on a test or inventory, test developers must often make difficult decisions regarding the proportion of items that can be used to assess each
For this exercise, choose an appropriate sample of at least ten individuals to act as SMEs for this scale. Ask these SMEs to familiarize themselves with the proposed dimensions of the scale. Then ask
“This will be a cinch,” Cecilia had thought when she first received the assignment to conduct a criterion-related validity study. She’d been a human resource (HR) specialist at Joyco for only
Principal Andrew Dickerson of Mountain Central High School had a hunch. Actually, it was more like a strong suspicion. At nearly 15%, the student dropout rate in his high school was well above the
For each of the criteria presented in items 1–3, identify at least two psychological or cognitive measures that might serve as useful predictors in a criterion-related validation study. 1. Grades
Use the data set “Bus driver.sav” to correlate the possible predictors with the job performance measures and then answer the following questions. 1. Overall, how highly are the possible
A sales manager hoping to improve the selection process for the position of product sales compiled the data file “Sales.sav.” The manager administered several tests to her current employees and
Ever since learning about the concept in her psychology class, Khatera had known that she would complete her thesis on the construct of emotional intelligence. Since her initial introduction to the
DiAnn wasn’t too surprised to see Edgar arrive shortly after her office hours began. The material recently presented by the instructor in the psychological testing course for which she served as
Imagine that you have recently developed the following construct measures. For each of these newly developed instruments, identify two actual measures that could be used to examine the new
An industrial/organizational psychologist developed a personality-based measure to assess the integrity of potential job applicants. The measure she developed was intended to mask the purpose of the
Assume you wanted to carry out a meta-analysis to determine how effective typing software is in improving typing speed and accuracy. What is the best way to get started in conducting such a meta
How do you decide which studies to include or exclude? What information to code?
Raul, a second-year master’s student, was very excited that his thesis committee had just approved his proposal to conduct a meta-analysis looking at what predicts employees’ satisfaction with
Ming-Yu, a new PhD student, was just given her first assignment as a graduate research assistant for Professor Riggs. Professor Riggs was a quantitative psychologist who studied the effects of sport
Individually or in small groups of three to five, students will select a topic on which to perform a meta-analysis. They will then outline, in detail, the steps to be taken if they were to actually
In Table 10.3, you will find data from the Albemarle Paper Co. v. Moody (1975) Supreme Court case. The case involved looking at the use of meta-analysis (more specifically, validity generalization)
Larry had just completed his master’s degree in industrial and organizational (I/O) psychology and obtained his first job with the human resources department of a large local school district. The
Joelle, a life span developmental psychology graduate student, recently completed an internship at an adult day care facility. The majority of the facility’s clientele were elderly individuals who
Using the data set of Mersman and Shultz (1998), recreate the results presented in Figures 11.3 and 11.4. In addition, recreate the regression equations presented in the module overview. Finally, run
