1. Introduction

Bees are animals and, as such, are inherently variable at the molecular, individual, and population levels. This intrinsic variability means that a researcher needs to separate the various sources of variability contained in the measurements, whether obtained by observational or experimental research, into signal and noise. The former may be due to treatments received, bee age, or innate differences in resistance. The latter is largely due to the genetic background, and its phenotypic expression, that characterises individual living organisms. Statistics is the branch of mathematics we use to isolate and quantify the signal and to determine its importance relative to the inherent noise. With an eye toward the statistical analysis to come, and before data collection starts, the researcher should ask:

1) Which variables (VIM, 2008) am I going to measure and what kind of data will those variables generate?

2) What degree of accuracy do I want to achieve and what is the corresponding sample size required?

3) Which statistical analysis will help me to answer my research question? This is related to the question: what kind of underlying process produces data like those I will be collecting?

4) From what population do I want to sample? (What is the statistical population/statistical universe?) For example, do I want to make inferences about the local, national, continental, or worldwide population?
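As an illustration of question 2, the following is a minimal sketch of how a desired accuracy translates into a sample size when the variable of interest is a proportion (for example, the fraction of bees carrying a pathogen). It uses the standard normal-approximation formula n = z^2 p(1 - p) / d^2. The function name, the guessed prevalence of 20 %, and the margin of 5 percentage points are hypothetical planning values chosen for this example. The sketch assumes simple random sampling of independent bees and ignores clustering within colonies, which would usually increase the number required.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(expected_p, margin, confidence=0.95):
    """Number of bees needed to estimate a proportion to within +/- margin.

    Normal-approximation formula n = z^2 * p * (1 - p) / margin^2.
    Assumes simple random sampling of independent bees (no colony clustering).
    """
    if not 0.0 < expected_p < 1.0:
        raise ValueError("expected_p must lie strictly between 0 and 1")
    if not 0.0 < margin < 1.0:
        raise ValueError("margin must lie strictly between 0 and 1")
    # Two-sided critical value of the standard normal (about 1.96 for 95 %).
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = z ** 2 * expected_p * (1.0 - expected_p) / margin ** 2
    # Round up, because a fraction of a bee cannot be sampled.
    return math.ceil(n)


# Hypothetical planning values: prevalence guessed at 20 %,
# to be estimated to within 5 percentage points at 95 % confidence.
n_bees = sample_size_for_proportion(expected_p=0.20, margin=0.05)
print(n_bees)
```

When no prior guess of the proportion is available, setting expected_p to 0.5 gives the most conservative (largest) sample size for a given margin.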

One function of statistics is to summarise information to make it more usable and easier to grasp. A second is inductive, where one makes generalisations based on a subset of a population or on repeated observations (through replication or repetition over time). For example, if 50 workers randomly sampled from 20 colonies all produce 10-hydroxydecanoic acid (10-HDAA, one of the major components in the mandibular gland, especially in workers; Crewe, 1982; Pirk et al., 2011), one could infer that all workers produce 10-HDAA. An example of inferring a general pattern from repeated observations would be: if an experiment is repeated 5 times and yields the same result each time, one makes a generalisation based on this limited number of experiments. One should keep in mind that, if one is measuring a quantitative variable, each experimental unit/replicate produces a unique data value, irrespective of how precise the measuring instruments are. A third function of statistics is based on deductive reasoning and might involve statistical modelling, in the classical or Bayesian paradigm, to understand the basic processes that produced the measurements, possibly by incorporating prior information (e.g. predicting species distributions or phylogenetic relationships/trees; see Kaeker and Jones, 2003).

In this article we will cover, albeit incompletely, all three functions of statistics. We have largely focused on research with bee pathogens, in part because these are of intense practical and theoretical interest, and in part because of our own backgrounds. However, bee biology rightly includes a much greater spectrum of research, and for much of it there are specialised statistical tools. Some of the ones we discuss are broadly applicable but, by necessity, this section can only provide an uneven treatment of current statistical methods that might be used in bee research.
In particular, we do not discuss multivariate methods (other than principal components) or Bayesian approaches, and we touch only lightly on simulation and resampling methods; all of these are current fields of investigation in statistics. Molecular and, in particular, genomic research has spawned substantial new statistical methodology, which is also not covered here.

Furthermore, we restrict ourselves here to providing guidelines on statistics for certain kinds of honey bee research, as mentioned above, with referrals to more detailed sources of information. Fortunately, there are excellent statistical tools available, the most important of which is a good statistician. 

The statistics we describe can be roughly grouped into two main areas, one having to do with sampling to estimate population characteristics (e.g. pathogen prevalence, the proportion of infected bees in an apiary or a colony), and the other having to do with experiments (e.g. comparing treatments, one of which may be a control). Because of the complex social structure of a bee hive, and the peculiar developmental and environmental aspects of bee biology, sampling in this discipline has more components to consider than in most biological fields. Some statistical topics, such as sample size and power, are relevant to both sampling and experimental studies. Others are primarily of concern for just one of the areas. For example, when sampling for pathogen prevalence, primary issues include representativeness and how or when to sample. For experiments, they include hypothesis formulation and the development of appropriate statistical models for the processes (which includes testing the assumptions of those models). Of course, good experiments require representative samples and a good understanding of sampling. Both areas are important for data acquisition and analysis. We start with statistical issues related to sampling.
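To make the sampling side concrete, the following minimal sketch estimates pathogen prevalence, as defined above, from a sample of bees and attaches a Wilson score confidence interval to the estimate. The function name and the counts (12 infected bees out of 60 sampled) are hypothetical and serve only as an illustration. The calculation assumes that the sampled bees are an independent random sample from the population of interest; bees drawn from the same colony are usually not independent, so this interval would tend to be too narrow for clustered samples.

```python
import math
from statistics import NormalDist


def wilson_interval(infected, sampled, confidence=0.95):
    """Point estimate and Wilson score interval for a prevalence.

    Returns (estimate, lower, upper). Assumes independent, randomly
    sampled bees; within-colony clustering is not accounted for.
    """
    if sampled <= 0 or not 0 <= infected <= sampled:
        raise ValueError("need sampled > 0 and 0 <= infected <= sampled")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    p_hat = infected / sampled
    denom = 1.0 + z ** 2 / sampled
    centre = (p_hat + z ** 2 / (2.0 * sampled)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1.0 - p_hat) / sampled + z ** 2 / (4.0 * sampled ** 2))
        / denom
    )
    # Clip to [0, 1] to guard against floating-point overshoot at the bounds.
    return p_hat, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical data: 12 infected bees among 60 sampled.
estimate, lower, upper = wilson_interval(infected=12, sampled=60)
print(f"prevalence = {estimate:.2f}, 95 % CI = ({lower:.3f}, {upper:.3f})")
```

The Wilson interval is used here rather than the simpler Wald interval because it behaves better when the sample is small or the prevalence is close to 0 or 1; other interval methods would also be defensible.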

1.2. Confidence level, Type I and Type II errors, and Power