Panel Paper: Statistical Power When Adjusting for Multiple Hypothesis Tests

Saturday, November 5, 2016 : 3:50 PM
Kalorama (Washington Hilton)

*Names in bold indicate Presenter

Kristin E. Porter, MDRC


Researchers are often interested in testing the effectiveness of an intervention on multiple outcomes, for multiple subgroups, at multiple points in time, or across multiple treatment groups. The resulting multiplicity of statistical hypothesis tests can lead to spurious findings of effects. Multiple testing procedures (MTPs) are statistical procedures that counteract this problem by adjusting the p-values of effect estimates upward. When no MTP is used, the probability of false positive findings (Type I errors) increases dramatically with the number of tests; MTPs reduce this probability.
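To make the multiplicity problem concrete, the brief sketch below (an illustrative example, not taken from the paper) computes the familywise error rate for independent true-null tests and shows how standard adjustment routines raise p-values. The raw p-values and the use of statsmodels' multipletests function are assumptions chosen for illustration.

```python
# Illustrative sketch: how the chance of at least one false positive grows with
# the number of independent tests, and how common MTPs adjust p-values upward.
# The raw p-values below are hypothetical.
import numpy as np
from statsmodels.stats.multitest import multipletests

alpha = 0.05
for m in (1, 5, 10, 20):
    # P(at least one Type I error) across m independent tests with true nulls
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests: familywise error rate = {fwer:.3f}")

raw_p = np.array([0.004, 0.020, 0.030, 0.045, 0.120])  # hypothetical unadjusted p-values
for method in ("bonferroni", "holm", "fdr_bh"):        # Bonferroni, Holm, Benjamini-Hochberg
    reject, adj_p, _, _ = multipletests(raw_p, alpha=alpha, method=method)
    print(f"{method:10s} adjusted p-values: {np.round(adj_p, 3)}  reject: {reject}")
```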

However, an important consequence of MTPs is a change in statistical power, which can be substantial. That is, compared to when the multiplicity problem is ignored, the use of MTPs changes the probability of detecting effects when they truly exist. Unfortunately, while researchers are increasingly using MTPs, they frequently ignore the power implications of their use when designing studies. Consequently, in some cases, sample sizes may be too small, and studies may be underpowered to detect effects as small as the desired size. In other cases, sample sizes may be larger than needed, or studies may be powered to detect smaller effects than anticipated.

In studies with multiplicity, alternative definitions of power exist and in some cases may be more appropriate. For example, one might instead consider 1-minimal power – the probability of detecting at least one effect of a particular size in a set of effects that truly exist. Similarly, one might consider ½-minimal power – the probability of detecting at least half of all effects of a particular size in a set of effects that truly exist. One might also consider complete power – the power to detect all effects in a set of effects that truly exist. The choice of definition of power depends on the objectives of the study and on how success of the intervention is defined. The choice of definition also substantially affects the overall extent of power.
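The simulation sketch below (a hypothetical setup, not the paper's method) shows how these alternative definitions can be estimated: draw test statistics for several outcomes with true effects of a given size, adjust the p-values with one MTP (Holm is used here), and tally how often at least one, at least half, or all of the true effects are detected. The effect size, the number of tests, and the independence of the tests are all assumptions made for illustration.

```python
# Hypothetical simulation: estimate individual, 1-minimal, 1/2-minimal, and
# complete power for a set of independent tests adjusted with the Holm MTP.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_sims, n_tests = 2000, 5
effect_z = 2.5            # assumed standardized effect (noncentrality) for each test
alpha = 0.05

hits = np.zeros((n_sims, n_tests), dtype=bool)
for s in range(n_sims):
    z = rng.normal(loc=effect_z, size=n_tests)   # test statistics under the alternative
    p = 2 * stats.norm.sf(np.abs(z))             # two-sided p-values
    reject, _, _, _ = multipletests(p, alpha=alpha, method="holm")
    hits[s] = reject

n_detected = hits.sum(axis=1)
print("individual power  :", hits.mean(axis=0).round(3))
print("1-minimal power   :", (n_detected >= 1).mean())
print("1/2-minimal power :", (n_detected >= n_tests / 2).mean())
print("complete power    :", (n_detected == n_tests).mean())
```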

This paper presents methods for estimating statistical power, under each of these definitions, when applying any of five common MTPs – Bonferroni, Holm, single-step and step-down versions of Westfall-Young, and Benjamini-Hochberg. The paper also presents empirical findings on how power is affected by the use of MTPs. The extent to which studies are underpowered or overpowered varies with circumstances particular to a study, which may include one or more of the following: the definition of power, the number of tests, the proportion of tests that are truly null, the correlation between tests, the specified probability of making a Type I error, and the particular MTP used to adjust p-values. The paper explores all of these factors and discusses the implications for practice.
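As one illustration of how such factors can interact, the sketch below (again a hypothetical setup, not the paper's analysis) estimates 1-minimal power under three of the MTPs while varying the correlation between test statistics. The Westfall-Young procedures are omitted here because they are permutation-based and not provided by statsmodels; all other parameters are illustrative assumptions.

```python
# Hypothetical comparison: 1-minimal power under correlated test statistics
# for Bonferroni, Holm, and Benjamini-Hochberg adjustments.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_sims, n_tests, effect_z, alpha = 2000, 5, 2.5, 0.05

for rho in (0.0, 0.5, 0.8):
    # Equicorrelated test statistics with a common assumed effect size
    cov = np.full((n_tests, n_tests), rho) + (1 - rho) * np.eye(n_tests)
    z = rng.multivariate_normal(np.full(n_tests, effect_z), cov, size=n_sims)
    p = 2 * stats.norm.sf(np.abs(z))
    for method in ("bonferroni", "holm", "fdr_bh"):
        reject = np.array([multipletests(p[s], alpha=alpha, method=method)[0]
                           for s in range(n_sims)])
        power_1min = (reject.sum(axis=1) >= 1).mean()
        print(f"rho={rho:.1f}  {method:10s}  1-minimal power = {power_1min:.3f}")
```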
