5.5. Robust statistics

Robust statistics were developed because empirical data sets assumed to be samples from normal distributions often display clearly non-normal characteristics, which invalidates analyses that assume normality. Robust statistics are usually introduced early in discussions of measures of central tendency. For example, medians are far more resistant than means to the influence of outliers (observations deemed to deviate for reasons that may include measurement error, mistakes in data entry, etc.), so the former are considered more robust. A single sufficiently extreme outlier can distort a mean arbitrarily, whereas a median remains resistant until up to 50% of the observations are outliers. On the other hand, screening outliers for removal can be subjective, and it is difficult for highly structured data in which a response variable is functionally related to many independent variables. If “outliers” are removed, the resulting variance estimates are often too small, leading to overly liberal tests (i.e. p values that are too small).
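
As a minimal illustration in R (with simulated data; the particular numbers are arbitrary), the sketch below shows how a single gross outlier can drag the mean far from the bulk of the data while the median barely moves:

    ## Sensitivity of the mean versus the median to a single outlier
    set.seed(42)                        # for reproducibility
    x <- rnorm(20, mean = 10, sd = 2)   # 20 well-behaved observations
    x_bad <- c(x, 1000)                 # add one gross outlier (e.g. a data-entry error)
    mean(x);   mean(x_bad)              # the mean is dragged towards the outlier
    median(x); median(x_bad)            # the median is essentially unchanged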

What are the alternatives when one cannot assume that data are generated by typical parametric models (e.g. normal, Poisson, or binomial distributions)? Departures from these models may result from contamination (e.g. most of the data come from a normal distribution with mean μ and variance σ₁², but a small percentage come from a normal distribution with mean μ and variance σ₂², where σ₂² >> σ₁²), from a symmetric distribution with heavy tails (such as a t distribution with few degrees of freedom), or from a highly skewed distribution (especially common when there is a hard limit, such as no negative values, as is typical of count data and of quantities estimated by analytical procedures, e.g. titres). Robust statistics are generally applicable when the sampling distribution from which data are drawn is symmetric. “Non-parametric” statistics are typically based on ordering observations by magnitude and are thus more general, but they have lower power than either typical parametric models or robust statistical models. Nevertheless, robust statistics never “caught on” to any great degree in the biological sciences; they should be used far more often (perhaps in most cases where a normal distribution is assumed).
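
The contamination model just described is easy to simulate. The R sketch below (assuming the MASS package is installed) draws a contaminated-normal sample and a heavy-tailed t sample, then compares the ordinary mean and standard deviation with Huber M-estimates of location and scale:

    ## Contaminated-normal and heavy-tailed samples
    library(MASS)
    set.seed(1)
    n <- 200
    mu <- 5; s1 <- 1; s2 <- 10                 # sigma2 >> sigma1
    contaminated <- ifelse(runif(n) < 0.95,
                           rnorm(n, mu, s1),   # 95% "clean" observations
                           rnorm(n, mu, s2))   # 5% contamination
    heavy_tailed <- mu + rt(n, df = 3)         # symmetric but heavy-tailed alternative
    c(mean(contaminated), sd(contaminated))    # scale estimate inflated by contamination
    huber(contaminated)                        # robust Huber estimates of location and scale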

Most statistical packages include some procedures based on robust statistics; R is particularly well served (e.g. by the MASS package). All typical statistical models (e.g. regression, ANOVA, multivariate procedures) have robust counterparts. Estimating these models was once considered difficult (involving iterative solutions, maximisation, etc.), but they are now estimated quickly. The generalised linear model (GLM) class overlaps somewhat with robust statistics, because in some software one can base models on, e.g., heavy-tailed distributions, but the approach is different. In general, robust statistics try to diminish the effects of “influential” observations (i.e. outliers), whereas a GLM, once a sampling distribution is specified (theoretical sampling distributions include highly skewed and heavy-tailed ones, though what is actually available depends on the software package), treats all observations as legitimate samples from that distribution. We recommend analysing data in several different ways where possible. If the analyses all agree, one might choose the one that best matches the theory of how the data arose (i.e. the sampling distribution best reflecting our knowledge of the underlying process). When methods disagree, one must determine why they differ and make an educated choice about which to use. For example, if assuming a normal distribution yields confidence limits around means that differ from those obtained using robust statistics, there is likely serious contamination from outliers that the normality assumption ignores. A recent reference on robust statistics is Maronna et al. (2006); the classic one is Huber (1981).
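
As one minimal illustration of analysing the same data in more than one way, the R sketch below (again assuming MASS is installed; the data are simulated with two artificial outliers) fits an ordinary least-squares regression alongside a robust M-estimation counterpart (MASS::rlm). Where the two sets of coefficients diverge noticeably, contamination is a likely explanation:

    ## Ordinary least squares versus a robust fit on contaminated data
    library(MASS)
    set.seed(7)
    x <- 1:30
    y <- 2 + 0.5 * x + rnorm(30, sd = 1)   # true intercept 2, true slope 0.5
    y[c(5, 20)] <- y[c(5, 20)] + 15        # two gross outliers
    ols    <- lm(y ~ x)                    # ordinary least squares
    robust <- rlm(y ~ x)                   # Huber-type M-estimation (the rlm default)
    coef(ols)                              # coefficients distorted by the outliers
    coef(robust)                           # much closer to the true values
    summary(robust)                        # coefficients with robust standard errors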