Panel Paper: Multi-Arm Tests of Welfare and Employment Programs

Judith Gueron and Gayle Hamilton, MDRC

In the early 1970s, it was an open question whether RCTs could be used to measure the effectiveness of operating welfare reform and employment programs. By the mid-1970s, the Supported Work Demonstration (SW) had shown that, under certain conditions, this was feasible and valuable. However, the experimental design for SW was simple: an experimental group in 10 sites was enrolled in the program, and a control group was excluded. Although a more complex, multi-arm design was proposed as a means to disentangle the effectiveness of different aspects of the multi-dimensional program model, it was rejected as unrealistic.

Less than ten years later, a three-group, multi-arm design was successfully implemented to test not only the overall effectiveness of a multi-component welfare reform program operated in real-world conditions by social service agencies in a major county in California but also the effect of adding a specific component to the program. Why was this attempted? How was it structured? What explains the willingness of public agency staff to go along with this demanding experimental design that randomly assigned 7,000 people? What was learned and how were the findings received in California and Washington?

Less than another ten years later, a vastly more complex multi-group design was set up to test the relative effectiveness of different multi-component approaches (really different welfare reform strategies) operated by public agency staff in the same locations. Moreover, layered on this were multi-arm designs to test the impact of different messages regarding welfare requirements as well as a separate multi-arm design that examined the effect of staffing the same program in different ways. Overall, approximately 35,000 people were involved in multi-arm tests in this project. Surprisingly, this too was successful.

This paper will address what conditions and actions explain this evolution. Why did people attempt these designs? What did they seek to learn? What strategies were used to sell multi-arm designs? How did the researchers determine whether the different experimental “arms” were implemented in ways that truly represented the strategies intended to be tested? What explains the success or failure of implementation of the multi-arm designs? Why and how did staff cooperate? What was learned and what does this experience suggest about the potential, requirements, and limitations of multi-arm RCTs?

After examining the evolution of designs and their increasing complexity, the authors will provide lessons on (at a minimum) the following issues:

  • What research, policy, and political questions and judgments drove these designs?
  • What conditions and actions facilitated the implementation of multi-arm designs?
  • What were the strengths and limitations of the different designs?
  • What questions remain about the feasibility, potential, and conditions for multi-arm tests?