Using Multi-Armed Experiments to Test “Improvement Versions” of Programs: When Beneficence Matters
We argue for a multi-armed trial as a response to a dilemma that arose from the results of an experiment we conducted. In the preliminary study, exploratory analysis (not previously published; to be presented at APPAM) showed a statistically significant differential impact of a STEM program: non-minority participants received a modest positive impact, while minority participants received no benefit. The confirmatory estimate of impact for both groups combined was small, positive, and statistically significant.
The result involving minorities is potentially highly consequential: if real, it implies the program widens an already existing performance gap and thereby limits beneficence. However, standard canons of experimental research dictate that exploratory results be replicated before they are accepted as true. The concern is that an exploratory finding can easily be spurious if it is one of many and was identified post hoc. A reasonable response is to develop and assess the impact of an “improvement version” of the program instead of waiting for replication to verify the adverse effect. The dilemma: each choice – replicating the trial of the standard program or developing an “improvement version” – carries a cost.
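The worry that a post-hoc subgroup finding may be spurious is, at bottom, a multiple-comparisons problem. The following sketch (a simplified simulation, not an analysis of the study's data; all parameter values are illustrative) estimates how often at least one of several exploratory subgroup tests comes out “significant” when every true subgroup effect is zero:

```python
import random

random.seed(0)

def spurious_subgroup_rate(n_subgroups=10, n_trials=2000, alpha=0.05):
    """Estimate the family-wise false-positive rate when n_subgroups
    post-hoc tests are run under a true null. Each test is modeled as
    an independent draw with false-positive probability alpha (an
    idealized assumption)."""
    hits = 0
    for _ in range(n_trials):
        if any(random.random() < alpha for _ in range(n_subgroups)):
            hits += 1
    return hits / n_trials

rate = spurious_subgroup_rate()
# Analytically, with 10 independent tests at alpha = 0.05 the chance of
# at least one false positive is 1 - 0.95**10, roughly 0.40, so about
# two in five such studies would report some spurious subgroup effect.
```

This is why replication (or a pre-specified confirmatory test) is the standard canon before an exploratory subgroup result is accepted as true.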
One cost, if the differential effect is in fact real, is a widening of the minority gap through the program’s continued implementation while the exploratory result awaits corroboration. The other cost is that of developing an improvement version where none is needed, should replication fail to confirm the adverse impact. This cost is exacerbated if the new version fails to eliminate the deficit, or proves inferior for the inference group as a whole.
We propose that under the following three conditions, a socially responsible evaluation solution is to propose and conduct (if funded) a multi-armed trial that compares both the standard treatment and an “improvement version” created to address the deficit against business-as-usual:
1) When criteria for accepting differences in subgroup effects are largely met, including (based on the literature, to be discussed): the subgroups are pre-specified; the impact for the subgroup differs from that for the rest of the sample; the impact for the full study sample is statistically significant; and the secondary result is supported by pre-existing empirical and theoretical findings.
2) When the potential costs of negative consequences for the subgroup of not acting (e.g., the social cost of widening an achievement gap) are greater than the costs of acting by developing a new version (e.g., creating a new version when there is nothing wrong with the current one).
3) When there is a high level of flux in the conditions of implementation, and in the program itself, blurring the distinction between primary and secondary analyses. Under these conditions the program is tested anew with each iteration.
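The cost comparison in condition 2 can be made concrete as an expected-cost calculation. The sketch below is purely illustrative: the function name, parameters, and all numbers are hypothetical placeholders, not estimates from the study.

```python
def act_now(p_real, gap_cost_per_year, years_to_replicate,
            dev_cost, p_new_version_fails, failure_cost):
    """Hypothetical decision rule for condition 2.

    Cost of waiting: if the adverse subgroup effect is real (with
    subjective probability p_real), the gap keeps widening for the
    years it takes replication to run.
    Cost of acting: developing an improvement version that may be
    unneeded or may fail to fix the deficit.
    Returns True when acting is the cheaper option in expectation."""
    expected_cost_wait = p_real * gap_cost_per_year * years_to_replicate
    expected_cost_act = dev_cost + p_new_version_fails * failure_cost
    return expected_cost_act < expected_cost_wait

# Example: a real effect judged 50% likely, 3 years of replication at a
# social cost of 100 units/year (expected cost 150), versus a 60-unit
# development cost plus a 30% chance of a 50-unit failed redesign
# (expected cost 75).
act_now(0.5, 100, 3, 60, 0.3, 50)  # → True (75 < 150)
```

The point of the sketch is only that condition 2 turns on an inequality between two expected costs; in practice each input would itself be contested and would need to be elicited or estimated.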