After nearly 28 years, The Communication Initiative (The CI) Global is entering a new chapter. Following a period of transition, the global website has been transferred to the University of the Witwatersrand (Wits) in South Africa, where it will be administered by the Social and Behaviour Change Communication Division. Wits' commitment to social change and justice makes it a trusted steward for The CI's legacy and future.
 
Co-founder Victoria Martin is pleased to see this work continue under Wits' leadership. Victoria knows that co-founder Warren Feek (1953–2024) would have felt deep pride in The CI Global's Africa-led direction.
 
We honour the team and partners who sustained The CI for decades. Meanwhile, La Iniciativa de Comunicación (CILA) continues independently at cila.comminitcila.com and is linked with The CI Global site.
How Do We Know If a Program Made a Difference? A Guide to Statistical Methods for Program Impact Evaluation

"It is our hope that this manual opens the door for the reader to a truly powerful set of tools that allow for us to understand the implications of human welfare programs, in the process providing crucial information about what works and what does not in shaping health, education, fertility and a slew of other outcomes that ultimately shape the length and happiness of our lives. Program design informed by such tools has the power to change our world."

From the United States Agency for International Development (USAID)-funded MEASURE Evaluation project, this manual provides an overview of core statistical and econometric methods for programme impact evaluation (and, more generally, causal modelling). More detailed and advanced than typical brief reviews of the subject, it also strives to be more approachable to a wider range of readers than the advanced theoretical literature on programme impact evaluation estimators. "It is impossible to eschew mathematics altogether in explaining effectively program impact evaluation methodologies...but we decided to adopt the simplest, most consistent mathematics (including at the mundane level of notation) possible."

The manual is designed for:

  • public health professionals at programmes, government agencies, and non-governmental organisations (NGOs) who are the consumers of the information generated by programme impact evaluations;
  • professionals serving the same role in any area of programming that influences human welfare;
  • graduate students in public health, public policy, and the social sciences;
  • technical staff at evaluation projects;
  • journalists looking for a more nuanced understanding of the steady stream of impact (and, more broadly, causal) studies on which they are asked to report;
  • analysts at health analytics organisations; and so on.

As explained here, the fundamental identification problem of programme impact evaluation is that we cannot observe the value of an outcome of interest for any given individual both when they participate and when they do not participate in a programme: only one of the two can be observed for a particular individual. This makes it impossible to estimate programme impact at the individual level. A natural response is to shift focus from programme impact at the individual level to average impact at the population level. In principle, estimating average impact would entail, essentially, comparing outcomes across participants with those across non-participants. However, when participants and non-participants differ on average in their characteristics, it becomes impossible to say whether any differences in outcomes between the two groups are due to programme participation or to differences in other characteristics that might also generate differences in outcomes.
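
In the potential-outcomes notation standard in this literature (a sketch of the logic just described; the manual develops its own, deliberately simple notation):

```latex
% Potential outcomes for individual i: Y_i(1) if participating, Y_i(0) if not.
% Individual-level impact (never observable, since only one of the two is seen):
\tau_i = Y_i(1) - Y_i(0)

% Population-level target: the average treatment effect
\mathrm{ATE} = E[Y(1)] - E[Y(0)]

% A naive comparison of participants (D=1) with non-participants (D=0)
% decomposes into the average impact on participants plus a selection-bias term:
E[Y \mid D=1] - E[Y \mid D=0]
  = \underbrace{E[Y(1)-Y(0) \mid D=1]}_{\text{impact on participants}}
  + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}
```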

In that context, the manual covers 4 major traditions of programme impact evaluation methods designed to recover estimates of impact that reflect the causal impact of programme participation on an outcome of interest. "Empirical examples run through much of the manual to provide illustrations of these behavioral models and impact evaluation estimators using simulated data that will hopefully strengthen the link for the reader between behavior, data structure, impact evaluation models and model performance." The 4 traditions are:

  1. Randomised Controlled Trials (RCTs) rely on a simple mechanism (randomisation) for generating estimates of programme impact. However, there are limitations. Randomisation can be tricky to implement in practice, and whether it actually succeeded cannot always be tested. To the extent that the latter is true, the distinction between RCTs and quasi-experimental methods becomes blurred: both essentially rely on assumptions in order to interpret programme impact estimates as the causal impact of programme participation on an outcome. Moreover, many interesting programmes cannot be evaluated by RCTs (and, more broadly, many interesting causal questions cannot be addressed by them) because randomisation is not always feasible; even when it is, some simple parameters of potential interest (such as median programme impact) cannot be identified. More broadly, there will always be lingering doubt as to whether humans will ever passively accept their experimental assignment. (A simulation sketch contrasting a naive comparison with randomisation appears after this list.)
  2. Selection on Observables Models essentially assume that we can observe all of the factors (background characteristics, constraints, environmental circumstances, etc.) that influence both the programme participation decision and the outcome of interest. The 2 major branches of this approach, regression and matching, are not altogether distinct. In principle, both estimation approaches are simple, but, in practice, there is considerable methodological disagreement about the specifics of implementation (though it is unclear how often these differences matter in practice). (A regression-based sketch appears after this list.)
  3. Within Models assume that any factors that influence programme participation and the outcome of interest, and that we cannot observe, are in some sense fixed. The classic example involves longitudinal data and assumes that such factors are fixed over time, so that every individual in some sense becomes a control for themselves; the basic approach has also sometimes been applied in a strictly cross-sectional context (e.g., assuming that the unobserved confounding factor is constant within communities). However, this estimation approach can actually worsen some types of bias (such as measurement error bias) and cannot be applied to many modelling circumstances of potential interest (e.g., options for limited dependent variable models are limited). (A difference-in-differences sketch appears after this list.)
  4. Instrumental Variables methods rely on instruments to isolate channels of variation in programme participation that are effectively random, out of the overall variation in participation (which might not be random). A valid instrument must be correlated with programme participation, have no independent role in determining the outcome of interest, and be uncorrelated with any unobserved determinants of that outcome. The latter 2 assumptions can generally be tested only when there are multiple instruments per endogenous variable. Even when these assumptions are met, interpretation can be complicated by the possibility of local average treatment effects: in that case, instrumental variables generate a consistent estimate of programme impact only for the subpopulation whose programme participation is responsive to variation in the instrument. Thus, instrumental variables do consistently identify programme impact for some subpopulation about which something can be learned (e.g., its share of the population and its basic characteristics), but it remains unclear how a local average treatment effect might relate to, say, the average programme impact across the population. (An instrumental variables sketch appears after this list.)
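
For (1), here is a minimal simulation sketch in the spirit of the manual's simulated-data examples, contrasting a naive participant/non-participant comparison with a randomised design. The data-generating process, variable names, and true effect size of 2.0 are illustrative assumptions, not taken from the manual:

```python
# Minimal simulation sketch (illustrative, not from the manual): a naive
# participant/non-participant comparison versus a randomised design.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
TRUE_IMPACT = 2.0  # assumed true programme impact

ability = rng.normal(size=n)  # unobserved confounder

# Self-selected participation: higher-"ability" individuals opt in more often,
# and ability also raises the outcome, so the naive comparison is biased.
d_self = (ability + rng.normal(size=n) > 0).astype(float)
y_self = TRUE_IMPACT * d_self + ability + rng.normal(size=n)
naive = y_self[d_self == 1].mean() - y_self[d_self == 0].mean()

# Randomised assignment severs the link between participation and ability.
d_rct = rng.integers(0, 2, size=n).astype(float)
y_rct = TRUE_IMPACT * d_rct + ability + rng.normal(size=n)
rct = y_rct[d_rct == 1].mean() - y_rct[d_rct == 0].mean()

print(f"true impact:              {TRUE_IMPACT}")
print(f"naive self-selected diff: {naive:.3f}")  # biased well above 2.0
print(f"RCT difference in means:  {rct:.3f}")    # close to 2.0
```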
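For (2), a sketch of the regression branch of selection on observables under the same kind of assumed data-generating process: once the confounder is observed and conditioned on, the impact is recovered. Matching on the same observed variables follows the same logic:

```python
# Sketch of selection on observables via regression (illustrative DGP).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
TRUE_IMPACT = 2.0

x = rng.normal(size=n)                          # observed confounder
d = (x + rng.normal(size=n) > 0).astype(float)  # participation driven by x
y = TRUE_IMPACT * d + x + rng.normal(size=n)

# OLS of y on (1, d, x): the coefficient on d estimates programme impact.
X = np.column_stack([np.ones(n), d, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"regression estimate (x included): {beta[1]:.3f}")  # ~2.0

# Omitting the confounder reproduces the biased naive comparison.
X_bad = np.column_stack([np.ones(n), d])
beta_bad, *_ = np.linalg.lstsq(X_bad, y, rcond=None)
print(f"regression estimate (x omitted):  {beta_bad[1]:.3f}")  # biased
```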
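For (3), a difference-in-differences sketch of the within idea, again under an assumed data-generating process: a time-invariant unobserved factor cancels when each unit is compared with itself across periods:

```python
# Difference-in-differences sketch of the "within" idea (illustrative DGP).
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
TRUE_IMPACT = 2.0
TREND = 0.5  # common time trend affecting everyone

alpha = rng.normal(size=n)              # fixed unobserved confounder
treat = alpha + rng.normal(size=n) > 0  # participation driven by alpha

y_pre = alpha + rng.normal(size=n)
y_post = alpha + TREND + TRUE_IMPACT * treat + rng.normal(size=n)

# Naive post-period comparison is contaminated by alpha.
naive = y_post[treat].mean() - y_post[~treat].mean()

# Within-unit changes difference out alpha; differencing the two groups'
# changes then removes the common trend.
did = (y_post[treat] - y_pre[treat]).mean() - (y_post[~treat] - y_pre[~treat]).mean()

print(f"naive post-period comparison: {naive:.3f}")  # biased
print(f"difference-in-differences:    {did:.3f}")    # ~2.0
```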
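For (4), a sketch of instrumental variables using the simple Wald estimator, under an assumed data-generating process with a randomly assigned binary instrument (e.g., an encouragement to participate):

```python
# Instrumental-variables sketch (illustrative DGP): a binary instrument z
# shifts participation but touches the outcome only through participation.
# The Wald ratio cov(z, y) / cov(z, d) is the IV estimate; with
# heterogeneous effects it would be interpreted as a LATE.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
TRUE_IMPACT = 2.0

u = rng.normal(size=n)                        # unobserved confounder
z = rng.integers(0, 2, size=n).astype(float)  # instrument, assigned at random
d = (0.8 * z + u + rng.normal(size=n) > 0.4).astype(float)  # participation
y = TRUE_IMPACT * d + u + rng.normal(size=n)

naive = y[d == 1].mean() - y[d == 0].mean()     # biased by u
wald = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]  # IV (Wald) estimate

print(f"naive comparison: {naive:.3f}")  # biased
print(f"IV/Wald estimate: {wald:.3f}")   # ~2.0
```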

"A major implicit message of this manual would seem to be that there is no 'Gold Standard' method of impact evaluation. All of the methods discussed involve assumptions, not all of which are testable, that allow one to interpret the estimates generated by them as program impact (or, more broadly, as reflecting a causal relationship). Some present inherent limitations in terms of the parameters that can be estimated. In the absence of a Gold Standard, it can be difficult to know what method is preferred. This is why careful impact evaluation work is so important. One must have a good sense of the institutional and environmental framework in which a program operates, as well as a good understanding of the design and procedures of the program itself and the types of populations motivated to participate and why they would be. This allows the evaluator to have an informed sense of what assumptions are (probably) reasonable, and hence which impact evaluation methods might be preferred and how much weight to assign to the estimates generated by them. It is true that even then assumptions (or, at the least, certainly untestable assumptions) are glorified opinions, but they will at least be informed opinions."

Additional related resources:

  • Click here to download the Stata do-files for the programmes behind the numerical examples in the manual. The first number of each do-file indicates its associated chapter, and the second number indicates its order within the chapter.
  • The MEASURE Evaluation project is hosting a series of webinar discussions to help participants understand and apply the resources offered in the manual described above. Each webinar in the series, designed to be a highly interactive learning opportunity, reviews key topics from a chapter through verbal discussion and graphical presentation. The series also provides stand-alone training tools for the topics covered. Here are further details about the webinars and access to resources (e.g., slides) from them:
    1. Held March 31 2016, the first webinar, entitled "Fundamentals of Program Impact Evaluation", addressed the basic challenges of programme impact estimation. Click here for more information.
    2. The second webinar, held on June 29 2016 and entitled "Randomization and Its Discontents", considered programme impact estimation based on randomisation of programme participation status (typically through an RCT). Click here for more information.
    3. The third webinar in the series, "Selection on Observables", is scheduled to take place September 13 2016 at 10am EDT and again on September 15 2016 at the same time. This webinar will consider impact evaluation estimation methods based on an identification strategy that assumes we can observe all factors that influence both programme participation and the outcome of interest. This 2-part webinar will examine the 2 most popular selection on observables estimation strategies: regression; and matching and related methods (e.g., propensity score-based weighting methods). Click here to register to attend and/or for more information.
    4. Subsequent webinars will cover the major remaining quasi-experimental impact estimation techniques: within estimation (such as difference-in-differences) and instrumental variables methods.
Number of Pages

346

Source

Posting from the IBP Consortium to The Communication Initiative on August 23 2016; and MEASURE Evaluation website, August 23 2016. Image credit: Jim Thomas, PhD