3.2.1. Power analyses and rules of thumb

Power (1-β) of a statistical test is its ability to detect an effect of a particular size (see section 3.1.3.), and it is intrinsically linked with sample size (N) and the error probability level (α) at which we accept an effect as statistically significant (see section 1., Table 1). Once any two of these values are known, the remaining one can be calculated; in this case: for a given α and β, what is N? Power analyses can incorporate a variety of data distributions (normal, Poisson, binomial, etc.), but the computations are beyond the scope of this paper. Fortunately, there are many freely available computer programs that can conduct these calculations (e.g. G*Power (Faul et al., 2007), the R package “pwr”, and the online “sample size” calculators at www.statpages.org/#Power), and all major commercial packages also have routines for calculating power and required sample sizes.
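As a minimal illustration of this three-way relationship (our own sketch, not taken from the packages listed above), the required N per group for a two-sample comparison can be approximated from the normal distribution using only the Python standard library:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sample test of
    standardised effect size d (Cohen's d), two-sided error level
    alpha, and desired power, via the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile corresponding to 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# A medium-to-large effect (d = 0.5) at alpha = 0.05 and power = 0.8:
print(n_per_group(0.5))  # 63 per group (the exact t-test answer is ~64)
```

The normal approximation slightly understates the exact t-test requirement for small samples; dedicated software such as G*Power or “pwr” handles the exact case and other distributions.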

A variety of ‘rules of thumb’ exist regarding minimum sample sizes, the most common being that you should have at least 10-15 data points per predictor parameter in a model; e.g. with three predictors such as location, colony and infection intensity, you would need 30 to 45 experimental units (Field et al., 2012). For regression models (ANOVA, GLM, etc.) with k predictors, the recommended minimum sample size is 50 + 8k to adequately test the overall model, and 104 + k to adequately test each predictor in the model (Green, 1991). Alternatively, at a high level of statistical power (using Cohen’s (1988) benchmark of 0.8) and with three predictors in a regression model: i) a large effect size (> 0.5) requires a minimum sample size of 40 experimental units; ii) a medium effect size (ca. 0.3) requires a sample size of 80; iii) a small effect size (ca. 0.1) requires a sample size of 600 (Miles and Shevlin, 2001; Field et al., 2012).
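Green's (1991) rules are simple arithmetic, and can be sketched in a few lines (the function name is ours, for illustration only):

```python
def green_minimum_n(k):
    """Minimum sample sizes following Green (1991) for a regression
    model with k predictors: 50 + 8k to test the overall model,
    104 + k to test each individual predictor."""
    return {"overall_model": 50 + 8 * k, "each_predictor": 104 + k}

# With three predictors (e.g. location, colony, infection intensity):
print(green_minimum_n(3))  # {'overall_model': 74, 'each_predictor': 107}
```

Taking the larger of the two numbers covers both goals; with three predictors that is 107 experimental units, well above the 30-45 suggested by the 10-15-per-predictor rule.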

These numbers need to be considerably larger when there are random effects in the model (or temporal or spatial correlations due to some kind of repeated measures, which decrease the effective sample size). Random effects introduce additional parameters that need to be estimated, and they also inflate the standard errors of the fixed parameters. The fewer the levels of a random effect (e.g. only three colonies used as blocks in the experiment), the larger this inflation will be. Since random factors are estimated as additional variance parameters, and approximately 30 units are needed to estimate a variance well, increasing the number of levels of each random effect will lessen the effect on fixed-parameter standard errors. It will also help accomplish the goals of including random effects in a designed experiment in the first place: a larger inference space and a more realistic partitioning of the sources of variation. As a general principle for improving the experimental design, we recommend increasing the number of blocks (up to 30), with fewer experimental units in each block (i.e. more, smaller blocks); three (or the more common five) blocks are too few. Fortunately, there are open source tools (the R packages “pamm” and “longpower”) and a few commercial products (NCSS PASS, SPSS, STATISTICA) that can help with estimating sample sizes for experiments that include random effects (or temporally or spatially correlated data). If random effects are treated as fixed effects and the methods described above are used for sample size estimation or power, the required sample sizes will be seriously underestimated and power seriously overestimated. The exemplary data set method (illustrated for GLMMs in SAS code in Stroup (2013), though easily ported to other software that estimates GLMMs) and Monte-Carlo methods (simulation; an example is explained below, though not for a model with random effects) are the current recommendations.
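The logic of a Monte-Carlo power estimate can be shown with a deliberately simple case (our own toy illustration, assuming independent normal data and a z-test; a real application would instead fit the planned model, e.g. a GLMM, to each simulated data set):

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def simulated_power(n, d, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power by repeatedly simulating two groups of n normal
    observations whose means differ by d standard deviations, testing
    each simulated data set, and counting how often the test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(d, 1.0) for _ in range(n)]
        se = sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
        if abs(mean(b) - mean(a)) / se > z_crit:
            hits += 1
    return hits / n_sims

# Power for n = 64 per group at d = 0.5 should come out near 0.8:
print(simulated_power(64, 0.5))
```

The same recipe extends to any model one can simulate from and fit: generate data under the assumed effect sizes, variance components and sample sizes, analyse each data set exactly as planned, and report the proportion of significant results as the estimated power.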
For count data (binomially or Poisson distributed), one should always assume there will be over-dispersion: the variance of the data will exceed that implied by the nominal distribution, so power calculations based on the nominal variance will overestimate power and underestimate the required sample size.
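A small simulation shows why (our own illustration; the gamma-Poisson mixture used here is one standard way over-dispersed counts arise, e.g. when each colony has its own underlying infection rate):

```python
import random
from math import exp
from statistics import mean, variance

def rpois(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's multiplication method,
    adequate for the small means used in this illustration)."""
    limit, k, p = exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(42)
# Over-dispersed counts: each unit gets its own Poisson mean drawn from
# a gamma distribution (overall mean 5), mimicking unit-to-unit variation.
counts = [rpois(rng, rng.gammavariate(2.0, 2.5)) for _ in range(5000)]
dispersion = variance(counts) / mean(counts)
print(round(dispersion, 1))  # well above 1; a pure Poisson would give ~1
```

A variance-to-mean ratio well above 1 means the effective information per observation is lower than the Poisson model assumes, so sample sizes from a Poisson-based power analysis should be inflated accordingly.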