5.6. Resampling techniques

Statistical methodology has benefited enormously from fast and ubiquitous computing power. The two largest beneficiaries are methods that rely on numerical techniques, such as estimating parameters in GLMMs, and methods that rely on sampling, either from known distributions (as in most Bayesian methods, often called “Monte-Carlo” methods) or from the data themselves (resampling or “bootstrapping”). Resampling techniques are essentially non-parametric: the only assumption is that the data are representative of the population about which you want to make inferences. The data set must also be large enough to resample from, following the rules stated earlier for sample sizes for parametric models (i.e. at least 10 observations per “parameter”, so a difference between two medians would require at least 20 observations).

As a simple example, if we want to estimate a 95% confidence interval around a median, based on 30 observations, we can draw 1,000 random resampled data sets (with replacement) from the original data set, each of size 30, calculate the median for each of these resampled data sets, and rank those values. The 95% confidence interval is then the interval from the 25th to the 975th ranked median, which cuts off roughly 2.5% of the values in each tail. Even though the original data set and the resampled data sets are the same size (n = 30), they are unlikely to be identical because we are sampling with replacement: each resampled data set will contain duplicates (or even triplicates) of some of the original values, and other values will be missing.
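The procedure above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a definitive implementation: the function name bootstrap_median_ci, the default of 1,000 resampled data sets, and the fixed seed are assumptions made for the example. The limits are taken as the ranked medians that leave about 2.5% of the values in each tail.

```python
import random
import statistics


def bootstrap_median_ci(data, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median (sketch)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n = len(data)

    # Draw n_resamples data sets of size n, with replacement, and rank
    # the median of each one.
    medians = sorted(
        statistics.median(rng.choices(data, k=n))
        for _ in range(n_resamples)
    )

    # Number of ranked medians to cut from each tail (25 of 1,000 for 95%).
    k = round((1.0 - level) / 2.0 * n_resamples)
    return medians[k], medians[n_resamples - 1 - k]
```

Calling bootstrap_median_ci(values) on a list of 30 observations returns the lower and upper limits as a pair. Changing level to 0.90 or 0.99 gives the corresponding intervals from the same ranked medians.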

Resampling can be used for statistical testing in a similar way. For example, if we want to know whether the difference in medians between two data sets (each of size 30) is significant at α = 0.05, we could use the following approach. Take a random sample (with replacement) of size 30 from data set 1 and calculate its median, then do the same for data set 2. Subtract the sample 2 median from the sample 1 median and store the value. Repeat this until you have 1,000 differences. Rank the differences. If the interval between the 25th and 975th ranked differences (the central 95%) does not contain zero, the difference in medians is statistically significant.
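A minimal Python sketch of this test is given below. The function name bootstrap_median_difference, its defaults, and the fixed seed are assumptions made for illustration. The two data sets are resampled independently, and the difference is judged significant at level alpha if the central (1 − alpha) share of the ranked differences excludes zero.

```python
import random
import statistics


def bootstrap_median_difference(sample1, sample2, n_resamples=1000,
                                alpha=0.05, seed=0):
    """Bootstrap interval for median(sample1) - median(sample2) (sketch)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    differences = []
    for _ in range(n_resamples):
        # Resample each data set separately, with replacement.
        m1 = statistics.median(rng.choices(sample1, k=len(sample1)))
        m2 = statistics.median(rng.choices(sample2, k=len(sample2)))
        differences.append(m1 - m2)
    differences.sort()

    # Cut alpha/2 of the ranked differences from each tail.
    k = round(alpha / 2.0 * n_resamples)
    lower = differences[k]
    upper = differences[n_resamples - 1 - k]

    # Significant if zero lies outside the central interval.
    significant = not (lower <= 0 <= upper)
    return lower, upper, significant
```

The function returns the interval limits together with the verdict, so the size of the difference can be reported alongside its significance.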

This general method can be applied to many common statistical problems, and can be shown to have good power (often better than a parametric technique if an underlying assumption of the parametric technique is even slightly violated). It can be used for both quantitative and qualitative (e.g. categorical) data, for example for testing the robustness of phylogenetic trees derived from nucleotide or amino acid sequence alignments, and is also useful as an independent method to check the results of statistical testing using other techniques. It does require either some programming skills or use of a statistical package that implements resampling techniques. 

If one writes a program, three parts are required. The first creates a sample by extracting objects from the original data set, based on their position in the data file, using a random number generator. As a simple example, if there are five values, a random number generator (sampling with replacement) might select the values in positions (4, 3, 3, 2, 4). Note that some positions are repeated and others are missing. That is fine because this process will be repeated 10,000 times and, on average, all data values will have equal representation. The second part calculates the statistics of interest, for example the median, and is also run 10,000 times. More complicated statistics take longer to calculate, and that will affect how long the program takes to complete. The third part stores the results of the second part, and may be a vector of length 10,000 (or a matrix with 10,000 rows, if several statistics are calculated from each resampled data set). Finally, summary statistics or confidence intervals are created from the stored results. For example, if medians were calculated, one could calculate 90%, 95%, and 99% confidence intervals after ranking the medians and selecting the appropriate endpoints of the intervals. In general, 10,000 resampled data sets are considered a minimum for published results, though 500 are usually adequate for preliminary work (and that number is also useful for estimating how long 10,000 will take to run).
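The three parts might be organised as in the following Python sketch. All function names are hypothetical, the data are simulated purely for illustration, and the median and mean are arbitrary choices of statistics. Note that Python counts positions from 0 rather than 1.

```python
import random
import statistics


def draw_positions(n, rng):
    """Part 1: choose n positions at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def compute_statistics(sample):
    """Part 2: statistics of interest for one resampled data set."""
    return (statistics.median(sample), statistics.fmean(sample))


def run_bootstrap(data, n_resamples=10_000, seed=0):
    """Part 3: store one row of statistics per resampled data set."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_resamples):
        positions = draw_positions(len(data), rng)
        sample = [data[i] for i in positions]
        results.append(compute_statistics(sample))
    return results


def percentile_interval(values, level):
    """Rank the stored values and select the interval endpoints."""
    ranked = sorted(values)
    k = round((1.0 - level) / 2.0 * len(ranked))
    return ranked[k], ranked[len(ranked) - 1 - k]


# Simulated data, for illustration only: 30 draws from a normal distribution.
data_rng = random.Random(1)
data = [data_rng.gauss(50, 10) for _ in range(30)]

results = run_bootstrap(data)
medians = [row[0] for row in results]
for level in (0.90, 0.95, 0.99):
    print(level, percentile_interval(medians, level))
```

Running run_bootstrap(data, n_resamples=500) first gives a quick preliminary result and an indication of how long the full 10,000 will take.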

All the major statistical software packages have resampling routines, and some rely almost exclusively on them (e.g. PASS, in the NCSS statistical software). We recommend the boot package in the R software, which is very flexible and allows one to estimate many of the quantities of interest to biologists (e.g. differences of means or medians, regression parameters). The classic book is Efron and Tibshirani (1993); Bradley Efron is the developer of the technique. A more recent, less technical book is Good (2013). A related technique is “jack-knifing”, in which one draws all possible subsamples without replacement, typically of size n – 1, where n is the original sample size.
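To make the jack-knife concrete, the sketch below forms the n subsamples of size n – 1 by leaving out one observation at a time and recalculating the statistic on each. The function names are hypothetical. The standard-error formula shown is the usual jack-knife estimate, included here as an illustration rather than as part of the text above; for the mean it reproduces the familiar standard deviation divided by the square root of n.

```python
import math
import statistics


def jackknife_estimates(data, statistic):
    """Recalculate the statistic on each subsample of size n - 1."""
    return [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]


def jackknife_standard_error(data, statistic):
    """Usual jack-knife standard error of a statistic (sketch)."""
    n = len(data)
    estimates = jackknife_estimates(data, statistic)
    centre = sum(estimates) / n
    squared_deviations = sum((e - centre) ** 2 for e in estimates)
    return math.sqrt((n - 1) / n * squared_deviations)
```

Here data is expected to be a list, and statistic can be any function of a list, such as statistics.mean.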