
Question

...
1 Approved Answer


5. Putting Together an Evaluation Matrix

An evaluation plan is a written document that describes how the evaluation will be carried out. Questions to ask yourself when putting together an evaluation matrix:

Evaluation Questions
- Are these clearly justified in your evaluation narrative?
- Are these questions relevant to your key stakeholders?
- Do your questions tie to key components of your program logic model?
- Can they feasibly be answered based on the available evaluation resources (time, budget, staff, etc.)?
- Will an answer to this question help meaningfully guide actionable decision making and program improvements?

Indicators
- Is the indicator specific, observable, and measurable?
- Does each question have an indicator to reflect achievement? (e.g., number of staff trained, percent of health clinics implementing a policy)
- Is there more than one indicator that may help answer a specific question?
- Is there baseline data available?
- What qualitative aspects of your question aren't necessarily reflected in a particular indicator? (e.g., the nature of a partnership, or the process by which a clinic offers screening)

Data Source
- Where is information on each question and indicator available? (e.g., surveillance systems, program documents, in-person interviews)
- Will you be working from existing data, or collecting new data?
- Are there multiple questions that you can collect from one data source?
- How easy/accessible is the data source?

Collection Methods
- For each indicator and source, what specific, detailed steps are you going to take to gather data? (e.g., is a survey via phone or email? From what sample size? Will it be piloted before it's sent to everyone?)
- If you're collecting from existing data, what specific elements/fields are going to be collected? At what level? (e.g., line-level data, or by provider group, county, or region?)
- Do you need an account to collect or store data? (e.g., SurveyMonkey, REDCap)
- Do you need to design a new collection instrument or revise an existing tool?

Analysis Procedures
- What data analysis software will you use? (e.g., SAS, SPSS, NVivo, Excel)
- How will you clean and prepare data for analysis?
- With quantitative data, what types of descriptive (frequencies, averages, percentages) and inferential (t-tests, ANOVA, regressions, etc.) statistics will you be using?
- With qualitative data, how will you code the text by themes?
- How would you rate the data quality and robustness of your findings?
- What outputs (tables, graphs, charts) will you use to communicate your analyses?

An evaluation matrix might also have the data collection time periods and persons responsible, but these are often more usefully displayed in a timeline format, such as a Gantt chart. Depending on how novel or complex your data collection and analyses are, your project might benefit from an additional data analysis plan to detail your procedures. While a matrix provides a 100-foot view of your evaluation process, the analysis plan provides a 10-foot view and can help make sure you're both efficient and effective in your analysis. Sample evaluation plans can be found in Appendix D of the Practical Use of Program Evaluation among STD Programs guidebook (pp. 331-354).
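As a rough illustration of the analysis procedures named in the matrix, the short sketch below computes descriptive and inferential statistics for a small made-up indicator data set. The column names, the made-up values, and the choice of pandas and SciPy are assumptions for the example only; any of the software packages named in the matrix would serve equally well.

```python
# Sketch: the descriptive and inferential statistics named under "Analysis
# Procedures", computed with pandas and SciPy. The data frame below stands in
# for a real indicator file; every column name is a hypothetical placeholder.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "clinic_group":       ["intervention"] * 4 + ["comparison"] * 4,
    "staff_trained":      [12, 9, 15, 11, 6, 7, 5, 8],
    "policy_implemented": [1, 1, 1, 0, 0, 1, 0, 0],   # 1 = clinic implements the policy
})

# Descriptive statistics: frequencies, averages, percentages
print(df["clinic_group"].value_counts())                          # frequencies by group
print(f"average staff trained: {df['staff_trained'].mean():.1f}")
print(f"percent implementing policy: {df['policy_implemented'].mean() * 100:.0f}%")

# Inferential statistics: a two-sample (Welch) t-test comparing the groups
a = df.loc[df["clinic_group"] == "intervention", "staff_trained"]
b = df.loc[df["clinic_group"] == "comparison", "staff_trained"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```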
Handbook of Practical Program Evaluation, Fourth Edition. By Kathryn E. Newcomer, Harry P. Hatry, and Joseph S. Wholey. Copyright 2015 by Kathryn E. Newcomer, Harry P. Hatry, and Joseph S. Wholey.

CHAPTER ONE

PLANNING AND DESIGNING USEFUL EVALUATIONS

Kathryn E. Newcomer, Harry P. Hatry, Joseph S. Wholey

The demand for systematic data on the performance of public and nonprofit programs continues to rise across the world. The supply of such data rarely matches the level of demand of the requestors. Diversity in the types of providers of pertinent data also continues to rise. Increasingly, elected officials, foundations and other nonprofit funders, oversight agencies, and citizens want to know what value is provided to the public by the programs they fund. Members of program staff want to know how their programs are performing so that they can improve them and learn from the information they gather. Increasingly, executives want to lead learning organizations, where staff systematically collect data, learn what works and does not work in their programs, and use this information to improve their organizational capacity and services provided. Leaders and managers also want to make evidence-based policy and management decisions, informed by data evaluating past program performance.

As we use the term in this handbook, a program is a set of resources and activities directed toward one or more common goals, typically under the direction of a single manager or management team. A program may consist of a limited set of activities in one agency or a complex set of activities implemented at many sites by two or more levels of government and by a set of public, nonprofit, and even private providers.

Program evaluation is the application of systematic methods to address questions about program operations and results. It may include ongoing monitoring of a program as well as one-shot studies of program processes or program impact. The approaches used are based on social science research methodologies and professional standards. The field of program evaluation provides processes and tools that agencies of all kinds can apply to obtain valid, reliable, and credible data to address a variety of questions about the performance of public and nonprofit programs. Program evaluation is presented here as a valuable learning strategy for enhancing knowledge about the underlying logic of programs and the program activities under way, as well as about the results of programs.

We use the term practical program evaluation because most of the procedures presented here are intended for application at reasonable cost and without extensive involvement of outside experts. We believe that resource constraints should not rule out evaluation. Ingenuity and leveraging of expertise can and should be used to produce useful, but not overly expensive, evaluation information. Knowledge of how trade-offs in methodological choices affect what we learn is critical.

A major theme throughout this handbook is that evaluation, to be useful and worth its cost, should not only assess program implementation and results but also identify ways to improve the program evaluated. Although accountability continues to be an important goal of program evaluation, the major goal should be to improve program performance, thereby giving the public and funders better value for money. When program evaluation is used only for external accountability purposes and does not help managers learn and improve their programs, the results are often not worth the cost of the evaluation. The objective of this handbook is to strengthen program managers' and staff members' abilities to meet the increasing demand for evaluation information, in particular information to improve the program evaluated.
This introductory chapter identifies fundamental elements that evaluators and organizations sponsoring evaluations should consider before undertaking any evaluation work, including how to match the evaluation approach to information needs, identify key contextual elements shaping the conduct and use of evaluation, produce methodological rigor needed to support credible findings, and design responsive and useful evaluations. A glossary of some key evaluation terms is provided at the end of this chapter.

Matching the Evaluation Approach to Information Needs

Selecting among evaluation options is a challenge to program personnel and evaluators interested in allocating resources efficiently and effectively. The value of program evaluation endeavors will be enhanced when clients for the information know what they are looking for. Clients, program managers, and evaluators all face many choices.

Since the turn of the twenty-first century, the demand for evidence to inform policymaking both inside the United States and internationally has grown, as has the sophistication of the public dialogue about what qualifies as strong evidence. Relatedly, the program evaluation profession has grown in terms of both numbers and professional guidance. There are many influential organizations that provide useful standards for evaluation practice and identify competencies needed in the conduct of evaluation work. Three key sources of guidance that organizations and evaluators should consult before entering into evaluation work include:

- Joint Committee on Standards for Educational Evaluation (2010). This organization has provided four key watchwords for evaluators for many years: utility, feasibility, propriety, and accuracy (see the committee's website, www.jcsee.org/program-evaluation-standards, for more information on the standards).
- American Evaluation Association (2004). The AEA's Guiding Principles for Evaluators is a detailed list of guidelines that has been vetted regularly by evaluators to ensure its usefulness (see www.eval.org/p/cm/ld/fid=51).
- The Essential Competencies for Program Evaluators Self-Assessment, at www.cehd.umn.edu/OLPD/MESI/resources/ECPESelfAssessmentInstrument709.pdf.

Select Programs to Evaluate

Resources for evaluation and monitoring are typically constrained. Prioritization among evaluation approaches should therefore reflect the most urgent information needs of decision makers. There may be many demands for information on program performance. Not all of these can likely be met at reasonable cost. What criteria can guide choices? Five basic questions should be asked when any program is being considered for evaluation or monitoring:

1. Can the results of the evaluation influence decisions about the program?
2. Can the evaluation be done in time to be useful?
3. Is the program significant enough to merit evaluation?
4. Is program performance viewed as problematic?
5. Where is the program in its development?

One watchword of the evaluation profession has been utilization-focused evaluation (see Patton, 2008). An evaluation that is utilization-focused is designed to answer specific questions raised by those in charge of a program so that the information provided by these answers can affect decisions about the program's future. This test is the first criterion for an evaluation.
Programs for which decisions must be made about continuation, modification, or termination are good candidates for evaluation, at least in terms of this first criterion. Programs for which there is considerable political support are less likely candidates under this criterion.

Timing is important in evaluation. If an evaluation cannot be completed in time to affect decisions to be made about the program (the second criterion), evaluation will not be useful. Some questions about a program may be unanswerable at the time needed because the data are not currently available and cannot be collected in time.

Significance can be defined in many ways. Programs that consume a large amount of resources or are perceived to be marginal in performance are likely candidates for evaluation using this third test, assuming that evaluation results can be useful and evaluation can be done in a reasonable amount of time.

The fourth criterion, perceptions of problems by at least some program stakeholders, matters as well. When citizens or interest groups publicly make accusations about program performance or management, evaluation can play a pivotal role. Evaluation findings and performance data may be used to justify decisions to cut, maintain, or expand programs in order to respond to the complaints.

Placement of a program in its life cycle, the fifth criterion, makes a big difference in determining need for evaluation. New programs, and in particular pilot programs for which costs and benefits are unknown, are good candidates for evaluation.

Select the Type of Evaluation

Once a decision has been made to design an evaluation study or a monitoring system for a program, there are many choices to be made about the type of approach that will be most appropriate and useful. Figure 1.1 displays six important continua on which evaluation approaches differ.

FIGURE 1.1. SELECT AN EVALUATION APPROACH THAT IS APPROPRIATE GIVEN THE INTENDED USE. [Figure: paired continua, including Formative to Summative, Ongoing to One-Shot, Objective Observers to Participatory, Goal-Oriented to "Goal-Free", Quantitative to Qualitative, Ex Ante to Post Program, and Problem Orientation to Non-Problem.]

Formative evaluation uses evaluation methods to improve the way a program is delivered. At the other end of this continuum is summative evaluation, which measures program outcomes and impacts during ongoing operations or after program completion. Most evaluation work will examine program implementation to some extent, if only to ensure that the assessment of outcomes or impacts can be logically linked to program activities. There are a variety of designs for formative evaluation, including implementation evaluation, process studies, and evaluability assessment, and they are covered later in this handbook. And there are a variety of specific designs intended to capture outcomes and impacts, and they are covered later in this text as well.

The timing of the evaluation can range across a continuum from a one-shot study of a specific aspect of implementation or one set of outcomes to an ongoing assessment system. The routine measurement of program inputs, outputs, or intermediate outcomes may be extremely useful for assessment of trends and should provide data that will be useful for more focused one-shot studies.

Traditional social science research methods have called for objective, neutral, and detached observers to measure the results of experiments and studies.
However, as professional evaluation standards prescribe, program stakeholders should also be involved to ensure that the results of evaluation work of any kind will be used. The issue really is the level of participation of these stakeholders, who can include program staff, clients, beneficiaries, funders, and volunteers, to name a few. For example, various stakeholders could be consulted or given some degree of decision-making authority in evaluation design, data collection, interpretation of findings, and framing of recommendations.

Evaluators make judgments about the value, or worth, of programs (Scriven, 1980). When making determinations about the appropriateness, adequacy, quality, efficiency, or effectiveness of program operations and results, evaluators may rely on existing criteria provided in laws, regulations, mission statements, or grant applications. Goals may be clarified, and targets for performance may be given in such documentation. But in some cases evaluators are not given such criteria, and may have to seek guidance from stakeholders, professional standards, or other evaluation studies to help them make judgments. When there are no explicit expectations for program outcomes given, or unclear goals are espoused for a program (i.e., it appears to be "goal-free"), evaluators find themselves constructing the evaluation criteria. In any case, if the evaluators find unexpected outcomes (whether good or bad), these should be considered in the evaluation.

The terms qualitative and quantitative have a variety of connotations in the social sciences. For example, a qualitative research approach or mind-set means taking an inductive and open-ended approach in research and broadening questions as the research evolves. Qualitative data are typically words or visual images, whereas quantitative data are typically numbers. The most common qualitative data collection methods are interviews (other than highly structured interviews), focus groups, and participant observation. Open-ended responses to survey questions can provide qualitative data as well. The most common sources of quantitative data are administrative records and structured surveys conducted via Internet and mail. Mixed-method approaches in evaluation are very common; that means that both quantitative and qualitative data are used, and quantitative and qualitative data collection methods are used in combination (see Greene, 2007, for more on use of mixed methods). The extent to which an evaluation uses more quantitative or more qualitative methods, and the relative reliance on quantitative or qualitative data, should be driven by the questions the evaluation needs to answer and the audiences for the work.

And finally, the primary reason for the evaluation matters: that is, are assumptions that problems exist driving the demand for the application of evaluation methods? When evaluators are asked to investigate problems, especially if they work for government bodies such as the U.S. Government Accountability Office, state audit agencies, or inspector general offices, the approaches and strategies they use for engaging stakeholders and collecting data may be different from those used by evaluators in situations in which they are not perceived as collecting data due to preconceptions of fault.
Identify Contextual Elements That May Affect Evaluation Conduct and Use

The context for employing evaluation matters. The context includes both the broader environment surrounding evaluation and the immediate situation in which an evaluation study is planned. Since the beginning of the twenty-first century, daunting standards for evaluation of social programs have been espoused by proponents of evidence-based policy, management, and practice. Nonprofit organizations have promoted the use of evaluation to inform policy deliberations at all levels of government (for example, see Pew-MacArthur, 2014). The Cochrane and Campbell Collaborations and similar organizations have given guidance that randomized controlled trials (RCTs) are the "gold standard" for evaluation. Yet ethical prohibitions, logistical impossibilities, and constrained resources frequently do not allow random assignment of subjects in evaluation of some social services and some government programs with broad public mandates, such as environmental protection and national security. In such situations, less sophisticated approaches can provide useful estimates of program impact.

The key question facing evaluators is what type and how much evidence will be sufficient. Will the evidence be convincing to the intended audiences, be they nonprofit boards, legislators, or the public? The stakes have risen for what constitutes adequate evidence, and for many social service providers the term evidence-based practice is intimidating. There is not full agreement in virtually any field about when evidence is sufficient. And funders are likely to be aware of the rising standards for hard evidence, and some may be unrealistic about what can be achieved by evaluators operating with finite resources.

It is usually difficult to establish causal links between program interventions and behavioral change. Numerous factors affect outcomes. Human as well as natural systems are complex and adaptive; they evolve in ways that evaluators may not be able to predict. Increasingly, attention has been drawn to using systems theory to inform evaluations of interventions designed to change behaviors in such complex systems.

Programs are typically located in multicultural environments. Cultural competence (also discussed as cultural humility) is a skill that has become more crucial for evaluators to develop than ever before. There are many important differences across program stakeholders, and expectations for evaluators to understand and address these differences in their work are high. Adequate knowledge of the social, religious, ethnic, and cultural norms and values of program stakeholders, especially beneficiaries who may present a large number of different backgrounds, presents another very important challenge to evaluators trying to understand the complex context in which a program operates. Evaluators need to understand the human environment of programs so that data collection and interpretation are appropriate and realistic. Chapter Twelve describes culturally responsive evaluation and provides guidance on incorporating cultural competency into evaluation work.

Characteristics of the particular program to be evaluated can also affect the evaluation approach to be used. Evaluators may find themselves working with program staff who lack any experience with evaluation or, worse, have had bad experiences with evaluation or evaluators.
Many organizations are simply not evaluation-friendly. A compliance culture has grown up in many quarters in which funders' requirements for data have risen, and so managers and administrators may feel that providing data to meet reporting demands is simply part of business as usual but has nothing to do with organizational learning to improve programs (for example, see Dahler-Larsen, 2012).

Finally, the operational issues facing evaluators vary across contexts. Challenging institutional processes may need to be navigated. Institutional review board processes and other clearances, such as the U.S. federal requirements for clearance of survey instruments when more than nine persons will be surveyed, take time and institutional knowledge. Site-specific obstacles to obtaining records and addressing confidentiality concerns can arise. Obtaining useful and sufficient data is not easy, yet it is necessary for producing quality evaluation work.

Produce the Methodological Rigor Needed to Support Credible Findings

The strength of findings, conclusions, and recommendations about program implementation and results depends on well-founded decisions regarding evaluation design and measurement. Figure 1.2 presents a graphical depiction of the way that credibility is supported by the methodological rigor ensured by wise decisions about measurement and design. This section focuses first on getting the most appropriate and reliable measures for a given evaluation, and then on designing the evaluation to assess, to the extent possible, the extent to which the program being evaluated affected the measured outcomes.

FIGURE 1.2. DESIGN EVALUATION STUDIES TO PROVIDE CREDIBLE FINDINGS: THE PYRAMID OF STRENGTH. [Figure: a pyramid in which building a strong base of measurement and design improves credibility as the levels increase, supporting findings and recommendations that are credible and supportable.]

Choose Appropriate Measures

Credible evaluation work requires clear, valid measures that are collected in a reliable, consistent fashion. Strong, well-founded measurement provides the foundation for methodological rigor in evaluation as well as in research and is the first requirement for useful evaluation findings. Evaluators must begin with credible measures and strong procedures in place to ensure that both quantitative and qualitative measurement is rigorous. The criteria used to assess the rigor of quantitative and qualitative data collection, and inferences based on the two types of data, vary in terminology, but the fundamental similarities across the criteria are emphasized here.

The validity or authenticity of measurement is concerned with the accuracy of measurement, so that the measure accurately assesses what the evaluator intends to evaluate. Are the data collection procedures appropriate, and are they likely to provide reasonably accurate information? (See Part Two for discussions of various data collection procedures.) In practical evaluation endeavors, evaluators will likely use both quantitative and qualitative measures, and for both, the relevance, legitimacy, and clarity of measures to program stakeholders and to citizens will matter. Often the items or concepts to measure will not be simple, nor will measurement processes be easy. Programs are composed of complex sets of activities to be measured. Outcomes to be measured may include both individual and group behaviors and may be viewed as falling on a short-term to long-term continuum, depending on their proximity to program implementation.
Measures may be validated, that is, tested for their accuracy, through several different processes. For example, experts may be asked to comment on the face validity of the measures. In evaluation work the term experts means the persons with the most pertinent knowledge about and experience with the behaviors to be measured. They may be case workers involved in service delivery, they may be principals and teachers, or they may be the program's customers, who provide information on what is important to them. Box 1.1 provides tips for probing the validity and authenticity of measures.

Box 1.1. Questions to Ask When Choosing Measures
- Are the measures relevant to the activity, process, or behavior being assessed?
- Are the measures important to citizens and public officials?
- What measures have other experts and evaluators in the field used?
- What do program staff, customers, and other stakeholders believe is important to measure?
- Are newly constructed measures needed, and are they credible?
- Do the measures selected adequately represent the potential pool of similar measures used in other locations and jurisdictions?

Credibility can also be bolstered through testing the measures after data are collected. For example, evaluators can address the following questions with the data (a minimal check of the first is sketched below):
- Do the measures correlate to a specific agreed-upon standard or criterion measure that is credible in the field?
- Do the measures correlate with other measures in ways consistent with existing theory and knowledge?
- Do the measures predict subsequent behaviors in ways consistent with existing theory and knowledge?
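Once both measures are in hand, the first of these checks can be run in a few lines. The sketch below is illustrative only: the scores are made-up values for the same set of program sites, and SciPy's Pearson correlation is one reasonable choice among several.

```python
# Sketch: a criterion-validity check. The score lists are hypothetical values
# for the same program sites; a strong correlation in the expected direction
# supports the credibility of the newly constructed measure.
from scipy import stats

new_measure = [12, 18, 9, 22, 15, 11, 20]    # newly constructed indicator (hypothetical)
criterion = [14, 20, 10, 25, 13, 12, 21]     # established criterion measure (hypothetical)

r, p = stats.pearsonr(new_measure, criterion)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```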
Choose Reliable Ways to Obtain the Chosen Measures

The measures should be reliable. For quantitative data, reliability refers to the extent to which a measure can be expected to produce similar results on repeated observations of the same condition or event. Having reliable measures means that operations consistently measure the same phenomena and consistently record data with the same decision criteria. For example, when questions are translated into multiple languages for respondents of different cultural backgrounds, evaluators should consider whether the questions will still elicit comparable responses from all. Data entry can also be a major source of error. Evaluators need to take steps to minimize the likelihood of errors in data entry.

For qualitative data, the relevant criterion is the auditability of measurement procedures. Auditability entails clearly documenting the procedures used to collect and record qualitative data, such as documenting the circumstances in which data were obtained and the coding procedures employed. See Chapter Twenty-Two for more on coding qualitative data in a clear and credible manner.

In order to strengthen reliability or auditability of measures and measurement procedures, evaluators should adequately pretest data collection instruments and procedures and then plan for quality control procedures when in the field and when processing the information back home. (Also see Box 1.2.)

Box 1.2. Tips on Enhancing Reliability
- Pretest data collection instruments with representative samples of intended respondents before going into the field.
- Implement adequate quality control procedures to identify inconsistencies in interpretation of words by respondents in surveys and interviews.
- When problems with the clarity of questions are uncovered, the questions should be revised, and evaluators should go back to resurvey or re-interview if the responses are vital.
- Adequately train observers and interviewers so that they consistently apply comparable criteria and enter data correctly.
- Implement adequate and frequent quality control procedures to identify obstacles to consistent measurement in the field.
- Test levels of consistency among coders by asking all of them to code the same sample of the materials.

There are statistical tests that can be used to test for intercoder and interobserver reliability of quantitative data, such as Cronbach's alpha. When statistical tests are desired, research texts or websites should be consulted (for example, see the Sage Research Methods website at http://srmo.sagepub.com/view/encyclopediaofsurveyresearchmethods/n228.xml).
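As a rough illustration of such a test, the sketch below computes Cronbach's alpha across coders who each rated the same sample of text segments. The ratings are invented for the example, and treating coders as the "items" in the alpha formula is just one simple way to gauge consistency; an evaluator might equally use an intraclass correlation or a kappa statistic.

```python
# Sketch: checking consistency among coders who rated the same material.
# Rows are text segments, columns are coders; the scores are hypothetical.
# Values of alpha near 1 indicate that coders are rating consistently.
import numpy as np

ratings = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 2, 3],
    [4, 4, 5],
])

k = ratings.shape[1]                             # number of coders
coder_var = ratings.var(axis=0, ddof=1).sum()    # sum of each coder's score variance
total_var = ratings.sum(axis=1).var(ddof=1)      # variance of the summed scores
alpha = (k / (k - 1)) * (1 - coder_var / total_var)
print(f"Cronbach's alpha across coders: {alpha:.2f}")
```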
Supporting Causal Inferences

In order to test the effectiveness of programs, researchers must ensure their ability to make well-founded inferences about (1) relationships between a program and the observed effects (internal validity) and (2) generalizability or transferability of the findings. With quantitative data this may include testing for the statistical conclusion validity of findings.

Internal Validity

Internal validity is concerned with the ability to determine whether a program or intervention has produced an outcome and to determine the magnitude of that effect. When considering the internal validity of an evaluation, the evaluator should assess whether a causal connection can be established between the program and an intended effect and what the extent is of this relationship. Internal validity is also an issue when identifying the unintended effects (good or bad) of the program. When employing case studies and other qualitative research approaches in an evaluation, the challenge is typically to identify and characterize causal mechanisms needed to produce desired outcomes, and the term confirmability is more often applied to this process.

When making causal inferences, evaluators must measure several elements:
- the timing of the outcomes, to ensure that observed outcomes occurred after the program was implemented;
- the extent to which the changes in outcomes occurred after the program was implemented; and
- the presence of confounding factors, that is, factors that could also have produced desired outcomes.

In addition, observed relationships should be in accordance with expectations from previous research or evaluation work. It can be very difficult to draw causal inferences. There are several challenges in capturing the net impacts of a program, because other events and processes are occurring that affect achievement of desired outcomes. The time needed for the intervention to change attitudes or behavior may be longer than the time given to measure outcomes. And there may be flaws in the program design or implementation that reduce the ability of the program to produce desired outcomes. For such reasons, it may be difficult to establish causation credibly. It may be desirable to use terms such as plausible attribution when drawing conclusions about the effects of programs on intended behaviors. Box 1.3 offers tips about strengthening causal inferences about program results.

Some evaluations may be intended to be relevant to and used by only the site where the evaluation was conducted. However, in other situations the evaluation is expected to be relevant to other sites as well. This situation is discussed in the next section, on generalizing findings.

Box 1.3. Tips on Strengthening Inferences About Program Effects
- Measure the extent to which the program was actually implemented as intended.
- Ask key stakeholders about other events or experiences they may have had that also affected decisions relevant to the program, before and during the evaluation time frame.
- Given existing knowledge about the likely time period needed to see effects, explore whether enough time has elapsed between implementation of the program and measurement of intended effects.
- Review previous evaluation findings for similar programs to identify external factors and unintended effects, and build in capacity to measure them.

Generalizability

Evaluation findings possess generalizability when they can be applied beyond the groups or context being studied. With quantitative data collection, the ability to generalize findings from a statistical sample to a larger population (or other program sites or future clients) refers to statistical conclusion validity (discussed below). For qualitative data, the transferability of findings from one site to another (or to the future) may present different, or additional, challenges. Concluding that findings from work involving qualitative data are fit to be transferred elsewhere will likely require more extensive contextual understanding of both the evaluation setting and the intended site for replication (see Cartwright, 2013, and Patton, 2011, for guidance on replicating and scaling up interventions).

All the conditions discussed previously for internal validity also need to be met for generalizing evaluation findings. In addition, it is desirable that the evaluation be conducted in multiple sites, but at the least, evaluators should select the site and individuals so they are representative of the populations to which the evaluators hope to generalize their results.

Special care should be taken when trying to generalize results to other sites in evaluations of programs that may have differential effects on particular subpopulations such as youths, rural groups, or racial or ethnic groups. In order to enhance generalizability, evaluators should make sampling choices to identify subpopulations of interest and should ensure that subsamples of the groups are large enough to analyze. However, evaluators should still examine each sample to ensure that it is truly representative of the larger population to which they hope to generalize on demographic variables of interest (for example, age or ethnic grouping). Box 1.4 offers tips about strengthening the generalizability of findings.
Statistical Conclusion Validity

Statistical generalizability requires testing the statistical significance of findings from probability samples, and it is greatly dependent on the size of the samples used in an evaluation. Chapter Twenty-Three provides more background on the use of statistics in evaluation. But it bears noting that the criterion of statistical significance and the tests related to it have been borrowed from the physical sciences, where the concern is to have the highest levels of confidence possible. In program evaluation practice, where obstacles may exist to obtaining large samples, it is reasonable to consider confidence levels lower than the 95 or 99 percent often used in social science research. For instance, it may be reasonable to accept a 90 percent level of confidence. It is entirely appropriate to report deliberations on this issue, reasons why a certain level was chosen, and the exact level of significance the findings were able to obtain. This is more realistic and productive than assuming that evaluation results will not be discussed unless a perhaps unrealistically high level of confidence is reached.

Box 1.4. Questions to Ask to Strengthen the Generalizability of Findings
- To what groups or sites will generalization be desired?
- What are the key demographic (or other) groups to be represented in the sample?
- What sample size, with adequate sampling of important subgroups, is needed to make generalizations about the outcomes of the intervention?
- What aspects of the intervention and the context in which it was implemented merit careful measurement to enable generalizability or transferability of findings?

In order to report properly on an evaluation, evaluators should report both on the statistical significance of the findings (or whether the sample size allows conclusions to be drawn about the evaluation's findings) and on the importance and relevance of the size of the measured effects. Because statistical significance is strongly affected by sheer sample size, other pertinent criteria should be identified to characterize the policy relevance of the measured effects.
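To make that distinction concrete, the sketch below reports a significance test alongside a simple effect size for two made-up groups of outcome scores, using the 90 percent confidence level discussed above. The values, the Welch t-test, and Cohen's d are illustrative choices, not prescriptions from the handbook.

```python
# Sketch: report both statistical significance and the size of the measured
# effect. The outcome scores for program and comparison groups are hypothetical.
import numpy as np
from scipy import stats

program = np.array([68, 74, 71, 80, 65, 77, 73, 70])
comparison = np.array([64, 69, 66, 72, 63, 70, 68, 65])

t_stat, p_value = stats.ttest_ind(program, comparison, equal_var=False)
effect = program.mean() - comparison.mean()                  # effect in raw score points
pooled_sd = np.sqrt((program.var(ddof=1) + comparison.var(ddof=1)) / 2)
cohens_d = effect / pooled_sd                                # standardized effect size

alpha = 0.10                                                 # 90 percent confidence level
print(f"difference = {effect:.1f} points, Cohen's d = {cohens_d:.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, "
      f"{'significant' if p_value < alpha else 'not significant'} at the 90% level")
```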
Reporting

In the end, even careful planning and reasoned decision making about both measurement and design will not ensure that all evaluations will produce perfectly credible results. There are a variety of pitfalls that frequently constrain evaluation findings, as described in Chapter Twenty-Six. Clarity in reporting findings and open discussion about methodological decisions and any obstacles encountered during data collection will bolster confidence in findings.

Planning a Responsive and Useful Evaluation

Even with the explosion of quantitative and qualitative evaluation methodologies since the 1970s, designing evaluation work requires both social science knowledge and skills and cultivated professional judgment. The planning of each evaluation effort requires difficult trade-off decisions as the evaluator attempts to balance the feasibility and cost of alternative evaluation designs against the likely benefits of the resulting evaluation work. Methodological rigor must be balanced with resources, and the evaluator's professional judgment will arbitrate the trade-offs.

Wherever possible, evaluation planning should begin before the program does. The most desirable window of opportunity for evaluation planning opens when new programs are being designed. Desired data can be more readily obtained if provision is made for data collection from the start of the program, particularly for such information as clients' preprogram attitudes and experiences. These sorts of data might be very difficult, if not impossible, to obtain later.

Planning an evaluation project requires selecting the measures that should be used, an evaluation design, and the methods of data collection and data analysis that will best meet information needs. To best inform choices, evaluators learn how the evaluation results might be used and how decision making might be shaped by the availability of the performance data collected. However, it is important to recognize that evaluation plans are organic and likely to evolve. Figure 1.3 displays the key steps in planning and conducting an evaluation. It highlights many feedback loops in order to stress how important it is for evaluators to be responsive to changes in context, data availability, and their own evolving understanding of context.

FIGURE 1.3. REVISE QUESTIONS AND APPROACHES AS YOU LEARN MORE DURING THE EVALUATION PROCESS. [Figure: a cycle with feedback loops running from pre-evaluation scoping (formulate evaluation objectives; frame evaluation questions; match methodology to questions; identify constraints on implementing the methodology; identify means to ensure quality of work; anticipate problems and develop contingency plans) through design, data collection and analysis, and report preparation to reporting (enhance reliability and validity of data; identify caveats; ensure findings will address information needs; ensure the presentation addresses the audience or audiences).]

Planning Evaluation Processes

Identification of the key evaluation questions is the first, and frequently quite challenging, task faced during the design phase. Anticipating what clients need to know is essential to effective evaluation planning. For example, the U.S. Government Accountability Office (GAO) conducts many program evaluations in response to legislative requests. These requests, however, are frequently fairly broad in their identification of the issues to be addressed. The first task of GAO evaluators is to more specifically identify what the committees or members of Congress want to know, and then to explore what questions should be asked to acquire this information. (See Box 1.5 for more information on the GAO's evaluation design process.)

Box 1.5. GAO's Evaluation Design Process
Stephanie Shipman, U.S. Government Accountability Office

Each year, GAO receives hundreds of requests to conduct a wide variety of studies, from brief descriptions of program activities to in-depth assessments of program or policy effectiveness. Over time, GAO has drawn lessons from its experience to develop a systematic, risk-based process for selecting the most appropriate approach for each study. Policies and procedures have been created to ensure that GAO provides timely, quality information to meet congressional needs at reasonable cost; they are summarized in the following four steps: (1) clarify the study objectives; (2) obtain background information on the issue and design options; (3) develop and test the proposed approach; and (4) reach agreement on the proposed approach.

Clarify the Study Objectives

The evaluator's first step is to meet with the congressional requester's staff to gain a better understanding of the requester's need for information and the nature of the research questions, and to discuss GAO's ability to respond within the desired time frame. Discussions clarify whether the questions are primarily descriptive (such as how often something occurs) or evaluative (involving assessment against a criterion). It is important to learn how the information is intended to be used and when that information will be needed. Is it expected to inform a particular decision or simply to explore whether a topic warrants a more comprehensive examination?
Once the project team has a clearer understanding of the requester's needs, the team can begin to assess whether additional information will be needed to formulate the study approach or whether the team has enough information to commit to an evaluation plan and schedule.

In a limited number of cases, GAO initiates work on its own to address significant emerging issues or issues of broad interest to the Congress. In these studies, GAO addresses the same considerations in internal deliberations and informs majority and minority staff of the relevant congressional committees of the planned approach.

Obtain Background Information

GAO staff review the literature and other work to understand the nature and background of the program or agency under review. The project team will consult prior GAO and inspector general work to identify previous approaches and recommendations, agency contacts, and legislative histories for areas in which GAO has done recent work. The team reviews the literature and consults with external experts and program stakeholders to gather information about the program and related issues, approaches used in prior studies, and existing data sources. Evaluators discuss the request with agency officials to explore their perspectives on these issues.

GAO evaluators explore the relevance of existing data sources to the research questions and learn how data are obtained or developed in order to assess their completeness and reliability. Evaluators search for potential evaluative criteria in legislation, program design materials, agency performance plans, professional standards, and elsewhere, and assess their appropriateness to the research question, objectivity, suitability for measurement, and credibility to key program stakeholders.

Develop and Test the Proposed Approach

The strengths and limitations of potential data sources and design approaches are considered in terms of which ones will best answer the research questions within available resource and time constraints. Existing data sources are tested to assess their reliability and validity. Proposed data collection approaches are designed, reviewed, and pretested for feasibility given conditions in the field. Evaluators outline work schedules and staff assignments in project plans to assess what resources will be required to meet the desired reporting timelines. Alternative options are compared to identify the trade-offs involved in feasibility, data validity, and the completeness of the answer likely to be obtained.

Evaluation plans are outlined in a design matrix to articulate the proposed approach in table format for discussion with senior management (see Figure 1.4 later in this chapter). The project team outlines, for each research question, the information desired, data sources, how the data will be collected and analyzed, the data's limitations, and what this information will and will not allow the evaluators to say. Discussions of alternative design options focus on the implications that any limitations identified will have on the analysis and the evaluator's ability to answer the research questions. What steps might be taken to address (reduce or counterbalance) such limitations? For example, if the primary data source relies on subjective self-reports, can the findings be verified through more objective and reliable documentary evidence?
Discussion of "what the analysis will allow GAO to say" concerns not what the likely answer will be but what sort of conclusion one can draw with confidence. How complete or definitive will the answer be to the research question? Alternatively, one might characterize the types of statements one will not be able to make: for example, statements that generalize the findings from observed cases to the larger population or to time periods preceding or following the period examined.

Reach Agreement on the Proposed Approach

Finally, the proposed approach is discussed both with GAO senior management, in terms of the conclusiveness of the answers provided for the resources expended, and with the congressional requester's staff, in terms of whether the proposed information and timelines will meet the requester's needs. GAO managers review the design matrix and accompanying materials to determine whether the proposed approach adequately addresses the requester's objectives, the study's risks have been adequately identified and addressed, and the proposed resources are appropriate given the importance of the issues involved and other work requests. The GAO team then meets with the requester's staff to discuss the engagement methodology and approach, including details on the scope of work to be performed and the product delivery date. The agreed-upon terms of work are then formalized in a commitment letter.

Matching evaluation questions to a client's information needs can be a tricky task. When there is more than one client, as is frequently the case, there may be multiple information needs, and one evaluation may not be able to answer all the questions raised. This is frequently a problem for nonprofit service providers, who may need to address multiple evaluation questions for multiple funders.

Setting goals for information gathering can be like aiming at a moving target, for information needs change as programs and environmental conditions change. Negotiating evaluable questions with clients can be fraught with difficulties for evaluators as well as for managers who may be affected by the findings.

The selection of questions should drive decisions on appropriate data collection and analysis. As seen in Figure 1.4, the GAO employs a design tool it calls the design matrix, which arrays the decisions on data collection and analysis by each question. This brief, typically one-page blueprint for the evaluation is used to secure agreement from various stakeholders within the GAO, such as technical experts and substantive experts, and to ensure that answers to the questions will address the information needs of the client, in this case the congressional requestor. Although there is no one ideal format for a design matrix, or evaluation blueprint, the use of some sort of design tool to facilitate communication about evaluation design among stakeholders is very desirable. An abbreviated design matrix can be used to clarify how evaluation questions will be addressed through surveying (this is illustrated in Chapter Fourteen).

FIGURE 1.4. SAMPLE DESIGN MATRIX. [Figure: a table with an issue/problem statement at the top (guidance: put the issue into context and identify the potential users) and one row per researchable question. Columns: Researchable Question(s) (what question(s) is the team trying to answer?); Criteria and Information Required and Source(s) (what information does the team need to address the question, and where will they get it?); Scope and Methodology, Including Data Reliability (how will the team answer each question?); Limitations (what are the engagement's design limitations, and how will they affect the product?); and What This Analysis Will Likely Allow GAO to Say (what are the expected results of the work?). Source: U.S. Government Accountability Office.]

A great deal of evaluation work performed for public and nonprofit programs is contracted out, and given current pressures toward outsourcing along with internal evaluation resource constraints, this trend is likely to continue. Contracting out evaluation places even more importance on identifying sufficiently targeted evaluation questions. Statements of work are typically prepared by internal program staff working with contract professionals, and these documents may set in stone the questions the contractors will address, along with data collection and analysis specifications. Unfortunately, the contract process may not leave evaluators (or program staff) much leeway in reframing the questions in order to make desired adjustments when the project gets under way and confronts new issues or when political priorities shift. Efforts should be made to allow the contractual process to permit
Statements of work are typically prepared by internal program staff working with contract professionals, and these documents may set in stone the questions the contractors will address, along with data collection and analysis specications. Unfortunately, the con tract process may not leave evaluators (or program staff) much leeway in reframing the questions in order to make desired adjustments when the project gets under way and confronts new issues or when political priori ties shift. Efforts should be made to allow the contractual process to permit 26 Handbook of Practical Program Evaluation FIGURE 1.4. SAMPLE DESIGN MATRIX. Issue problem statement: Guidance: 1. Put the issue into context. 2. Identify the potential users. Scope and Methodology, What This Analysis Will Researchable Criteria and Information Including Data Likely Allow GAO to Question(S) Required and Source(s) Reliability Limitations Say Whatquestion(s) What information does the How will the team | What are the What are the expected is the team trying | team need to address the |answer each engagement's design results of the work? to answer? question? Where will they question? limitations and how get it will they affect the product? Question 1 Question 2 Question 3 Question 4 Source: U.S. Government Accountability Office. contextually-driven revisions. See Chapter Twenty-Nine for more guidance on effectively contracting out evaluation work. Balancing clients' information needs with resources affects selection of an evaluation design as well as specific strategies for data collection and analy- sis. Selecting a design requires the evaluator to anticipate the amount of rigor that will be required to produce convincing answers to the client's questions. Evaluators must specify the comparisons that will be needed to demonstrate whether a program has had the intended effects and the additional compar- isons needed to clarify differential effects on different groups. The actual nature of an evaluation design should reflect the objectives and the specific questions to be addressed. This text offers guidance on the wide variety of evaluation designs that are appropriate given certain objectives and questions to address. Table 1.1 arrays evaluation objectives with designs and also identifies the chapters in this text to consult for guidance on design. The wide range of questions that be framed about programs is matched by the variety of approaches and designs that are employed by professional evaluators. Resource issues will almost always constrain design choices; staff costs, travel costs, data collection burdens on program staff, and political and bureau- cratic costs may limit design options. Evaluation design decisions, in turn, affect where and how data will be collected. To help evaluators and programPlanning and Designing Useful Evaluations 27 TABLE 1.1. MATCHING DESIGNS AND DATA COLLECTION TO THE EVALUATION QUESTIONS. Evaluation Objective 1 . Describe program activities 2. Probe imple- mentation and targeting Illustrative Questions Who does the program affectboth targeted organizations and affected populations? What activities are needed to implement the program (or policy)? By whom? How extensive and costly are the program components? How do implementation efforts vary across delivery sites, subgroups of beneficiaries, andfor across geographical regions? Has the program (policy) been implemented sufficiently to be evaluated? To what extent has the program been implemented? 
Evaluation objective 3: Measure program impact.
Illustrative questions:
- Has implementation of the program produced results consistent with its design (espoused purpose)?
- How have measured effects varied across implementation approaches, organizations, and/or jurisdictions?
- For which targeted populations has the program (or policy) consistently failed to show intended impact?
- Is the implementation strategy more (or less) effective in relation to its costs?
- Is the implementation strategy more cost-effective than other implementation strategies also addressing the same problem?
Possible designs: Experimental Designs, that is, Randomized Controlled Trials (RCTs); Difference-in-Difference Designs; Propensity Score Matching (PSM); Statistical Adjustments with Regression Estimates of Effects; Multiple Time Series Designs; Regression Discontinuity Designs; Cost-Effectiveness Studies; Benefit-Cost Analysis; Systematic Reviews; Meta-Analyses.
Corresponding handbook chapters: Chapters 6, 7, and 25.

Evaluation objective 4: Explain how and why programs produce intended and unintended effects.
Illustrative questions:
- What are the average effects across different implementations of the program (or policy)?
- How and why did the program have the intended effects?
- Under what circumstances did the program produce the desired effects?
- To what extent have program activities had important unanticipated negative spillover effects?
- What are unanticipated positive effects of the program that emerge over time, given the complex web of interactions between the program and other programs, and who benefits?
- For whom (which targeted organizations and/or populations) is the program more likely to produce the desired effects?
- What is the likely impact trajectory of the program (over time)?
- How likely is it that the program will have similar effects in other contexts (beyond the context studied)?
- How likely is it that the program will have similar effects in the future?
Possible designs: Multiple Case Studies; Meta-Analyses; Impact Pathways and Process Tracing; Contribution Analysis; Non-Linear Modeling, System Dynamics; Configurational Analysis (e.g., Qualitative Comparative Analysis, QCA); Realist-Based Synthesis.
Corresponding handbook chapters: Chapters 8 and 25.
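As one concrete illustration of the impact designs in Table 1.1, the sketch below fits a minimal difference-in-differences regression. The tiny data set, the column names, and the use of statsmodels are illustrative assumptions, not part of the handbook's guidance.

```python
# Sketch: a minimal difference-in-differences estimate, one of the impact
# designs listed in Table 1.1. Data and column names are invented.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "site":    ["A", "A", "B", "B", "C", "C", "D", "D"],
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],    # 1 = site received the program
    "post":    [0, 1, 0, 1, 0, 1, 0, 1],    # 1 = period after program rollout
    "outcome": [50, 62, 48, 61, 47, 52, 51, 55],
})

model = smf.ols("outcome ~ treated + post + treated:post", data=df).fit()
print(model.params)
# Under the usual parallel-trends assumption, the coefficient on treated:post
# is the difference-in-differences estimate of program impact.
```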
To help evaluators and program personnel make the best design decisions, a pilot test of proposed data collection procedures should be considered. Pilot tests may be valuable in refining evaluation designs; they can clarify the feasibility and costs of data collection as well as the likely utility of different data analysis strategies.

Data Collection

Data collection choices may be politically as well as bureaucratically tricky. Exploring the use of existing data involves identifying potential political barriers as well as more mundane constraints, such as incompatibility of computer systems. Planning for data collection in the field should be extensive in order to help evaluators obtain the most relevant data in the most efficient manner. Chapters Thirteen through Twenty-One present much detail on both selecting and implementing a variety of data collection strategies.

Data Analysis

Deciding how the data will be analyzed affects data collection, for it forces evaluators to clarify how each data element will be used. Collecting too much data is an error that evaluators frequently commit. Developing a detailed data analysis plan as part of the evaluation design can help evaluators decide which data elements are necessary and sufficient, thus avoiding the expense of gathering unneeded information.

An analysis plan helps evaluators structure the layout of a report, for it identifies the graphs and tables through which the findings will be presented. Anticipating how the findings might be used forces evaluators to think carefully about presentations that will address the original evaluation questions in a clear and logical manner. Identifying relevant questions and answering them with data that have been analyzed and presented in a user-oriented format should help to ensure that evaluation results will be used. However, communicating evaluation results entails more than simply drafting attractive reports. If the findings are indeed to be used to improve program performance, as well as respond to funders' requests, the evaluators must understand the bureaucratic and political contexts of the program and craft their findings and recommendations in such a way as to facilitate their use in these contexts.
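A data analysis plan of the kind described above can be kept as a simple structured table, mapping each evaluation question to the data elements, methods, and outputs it needs. The rows below are hypothetical placeholders, not content from the handbook, and pandas is only one convenient way to keep and share such a table.

```python
# Sketch: a minimal data analysis plan as a structured table. Every row pairs
# an evaluation question with its data elements, analysis method, and planned
# output; all entries are hypothetical examples.
import pandas as pd

analysis_plan = pd.DataFrame([
    {"question": "Did staff training increase?",
     "data_elements": "staff_trained per clinic, by quarter",
     "method": "descriptive trend; paired t-test",
     "output": "line chart, summary table"},
    {"question": "Did the policy change screening rates?",
     "data_elements": "screening counts, clinic group, period",
     "method": "difference-in-differences regression",
     "output": "coefficient table"},
])

analysis_plan.to_csv("analysis_plan.csv", index=False)  # share with the evaluation team
print(analysis_plan.to_string(index=False))
```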
Using Evaluation Information

The goal of conducting any evaluation work is certainly to make positive change. When one undertakes any evaluation work, understanding from the outset how the work may contribute to achieving important policy and program goals is important. Program improvement is the ultimate goal for most evaluators. Consequently, they should use their skills to produce useful, convincing evidence to support their recommendations for program and policy change.

Understanding how program managers and other stakeholders view evaluation is also important for evaluators who want to produce useful information. Box 1.6 lists some fairly typical reactions to evaluation in public and nonprofit organizations that may make it difficult for evaluators to develop their approaches and to promote the use of findings (for example, see Hatry, 2006; Mayne, 2010; Newcomer, 2008; Pawson, 2013; and Preskill and Torres, 1999). Clear and visible commitment by leadership is always critical, as are incentives within the organization that reward use. The anticipation that evaluation will place more burdens on program staff and clients is a perception that evaluators need to confront in any context.

Box 1.6. Anticipate These Challenges to the Use of Evaluation and Performance Data
- Lack of visible appreciation and support for evaluation among leaders
- Unrealistically high expectations of what can be measured and "proven"
- A compliance mentality among staff regarding collection and reporting of program data and a corresponding disinterest in data use
- Resistance to adding the burden of data collection to staff workloads
- Lack of positive incentives for learning about and using evaluation and data
- Lack of compelling examples of how evaluation findings or data have been used to make significant improvements in programs
- Poor presentation of evaluation findings

The most effective evaluators are those who plan, design, and implement evaluations that are sufficiently relevant, responsive, and credible to stimulate program or policy improvement. Evaluation effectiveness may be enhanced by efficiency and the use of practical, low-cost evaluation approaches that encourage the evaluation clients (the management and staff of the program) to accept the findings and use them to improve their services.

Efforts to enhance the likelihood that evaluation results will be used should start during the planning and design phase. From the beginning, evaluators must focus on mediating obstacles and creating opportunities to promote use. Box 1.7 provides tips for increasing the likelihood that the findings will be used. Six of these tips refer to actions that need to be taken during evaluation design. Evaluators must understand and typically shape their audiences' expectations, and then work consistently to ensure that the expectations are met. Producing methodologically sound findings and explaining why they are sound both matter.

Box 1.7. Tips on Using Evaluation Findings and Data
- Understand and appreciate the relevant perspectives and preferences of the audience (or audiences!) to shape communication of evaluation findings and performance data.
- Address the questions most relevant to the information needs of the audience.
- Early in the design phase, envision what the final evaluation products should contain.
- Design sampling procedures carefully to ensure that the findings can be generalized to whomever or wherever the key stakeholders wish.
- Work to ensure the validity and authenticity of measures, and report on the efforts taken to do so.
