Design-replication studies assess the ability of non-experimental designs to replicate unbiased (experimental) estimators of program impact (e.g., LaLonde, American Economic Review, 1986). Our design-replication study uses, as a benchmark, a large-scale randomized field experiment that tested the effectiveness of norm-based messages designed to induce voluntary reductions in water consumption during a drought. Assuming no randomization failures or general equilibrium (spillover) biases, and none were detected, random assignment of households to control and treatment groups, followed by a comparison of each group's mean water consumption, provides an unbiased estimator of the average treatment effect. To our knowledge, our study is the first design-replication study in the environmental policy context and the first to assess evaluation designs that use repeated observations before and after treatment. It is also, to our knowledge, the first to include both a treatment with a large, statistically significant estimated effect and a treatment with a small, statistically insignificant one.
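In notation (ours, for illustration, not the paper's), let T and C index the treatment and control groups, with sizes N_T and N_C, and let Y_i be household i's mean water consumption during the experimental period. The benchmark estimator is then the simple difference in group means:

```latex
\hat{\tau}_{\mathrm{ATE}}
  = \bar{Y}_T - \bar{Y}_C
  = \frac{1}{N_T}\sum_{i \in T} Y_i - \frac{1}{N_C}\sum_{i \in C} Y_i
```

Under random assignment, the expectation of this difference equals the average treatment effect; that unbiasedness is the property the non-experimental designs are asked to reproduce.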
Policies and programs are frequently implemented or piloted in administrative units such as towns, counties, or states. To estimate impacts, evaluators typically look to neighboring administrative units for comparison groups and apply various statistical techniques to control for observable and unobservable sources of bias. To form a non-experimental comparison group, we use data on approximately 67,000 households from a neighboring county. During the information campaign experiment, the neighboring county experienced similar water pricing policies, water sources, weather patterns, state and metropolitan regulatory environments, and other regional confounding factors. Participants do not self-select into the program, but they may have sorted themselves across counties based on characteristics that also affect water consumption. Our administrative data comprise monthly water use for 17 months, including pre- and post-experiment periods. By merging the treatment and non-experimental control group data with tax assessor and census records, we create a unique data set that also includes home characteristics and U.S. Census block-group characteristics. We use bootstrapping methods to determine the sensitivity of a design's performance to changes in the sample (a failure to assess such sensitivity has been a criticism of previous design-replication studies).
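As a rough sketch of the resampling involved (the data layout, column names, and the household-level pairs bootstrap below are our illustrative assumptions, not the study's exact procedure), the sensitivity check can be implemented by resampling households with replacement and re-running the full design on each draw:

```python
import numpy as np
import pandas as pd

def bootstrap_design(panel, estimate_fn, n_boot=1000, seed=0):
    """Household-level (cluster) pairs bootstrap: resample household IDs
    with replacement, rebuild the monthly panel, and re-run the full
    evaluation design on each resampled data set.

    panel       : long-format DataFrame, one row per household-month
    estimate_fn : callable that runs the full design (e.g., matching
                  followed by estimation) and returns a point estimate
    """
    rng = np.random.default_rng(seed)
    ids = panel["household"].unique()
    draws = np.empty(n_boot)
    for b in range(n_boot):
        sampled = rng.choice(ids, size=len(ids), replace=True)
        # Keep every monthly observation for each sampled household;
        # households drawn more than once appear more than once.
        boot = pd.DataFrame({"household": sampled}).merge(panel, on="household")
        draws[b] = estimate_fn(boot)
    return draws
```

The spread of `draws` (e.g., its standard deviation or a percentile interval) then summarizes how sensitive a given design's estimate is to the particular sample drawn.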
When matching methods are used to pre-process the data, aligning the treated and control units' pre-treatment outcome trends and distributions of baseline (time-invariant) covariates, simple fixed-effects panel data estimators generate estimates almost identical to the experimental estimates, and statistical inferences are likewise identical. Under alternative designs (e.g., panel data estimators without matching pre-processing, or trimming based on propensity scores), however, the non-experimental estimators can be grossly inaccurate. We explore the reasons for this inaccuracy.
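The following is a minimal sketch of the kind of design that performs well here, under simplifying assumptions of our own (nearest-neighbor matching on pre-treatment mean consumption only, a balanced panel, and hypothetical column names); the paper's actual matching also balances baseline covariates, and its estimator may differ in detail:

```python
import numpy as np
import pandas as pd

def matched_did(panel, k=1):
    """Nearest-neighbor matching on pre-treatment mean use, followed by
    a two-way fixed-effects (household and month) DID estimate.

    Expects a balanced long panel with columns:
    household, month, use, treated (0/1), post (0/1).
    """
    # 1. Matching: for each treated household, keep the k control
    #    households with the closest pre-treatment mean consumption.
    pre = (panel[panel["post"] == 0]
           .groupby("household")
           .agg(mean_use=("use", "mean"), treated=("treated", "first")))
    treat, ctrl = pre[pre["treated"] == 1], pre[pre["treated"] == 0]
    keep = set(treat.index)
    for _, row in treat.iterrows():
        dists = (ctrl["mean_use"] - row["mean_use"]).abs()
        keep.update(dists.nsmallest(k).index)
    m = panel[panel["household"].isin(keep)].copy()

    # 2. Two-way within transformation (exact for a balanced panel),
    #    then OLS of demeaned use on the demeaned treated*post dummy.
    m["d"] = m["treated"] * m["post"]
    for col in ("use", "d"):
        m[col + "_w"] = (m[col]
                         - m.groupby("household")[col].transform("mean")
                         - m.groupby("month")[col].transform("mean")
                         + m[col].mean())
    x, y = m["d_w"].to_numpy(), m["use_w"].to_numpy()
    return float(x @ y / (x @ x))  # DID point estimate
```

Skipping step 1 and running the fixed-effects regression on the full, unmatched panel corresponds to one of the alternative designs described above, which can produce grossly inaccurate estimates.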