# 2.2.3. Extrapolating from sample to colony

A confidence interval for a statistical population parameter, for example the mean detection rate in brood or the prevalence in the population/colony, can be estimated in a variety of ways (Reiczigel, 2003), most of which can be found in modern statistical software. We do not recommend using the (asymptotic) normal approximation to the binomial; it gives unreasonable results for low and high prevalence (the limits can fall below 0 or above 1). We show here Wilson’s score method (Reiczigel, 2003), defined as:

Equation V.

$$\frac{2N\hat{p} + z^{2} \pm z\sqrt{z^{2} + 4N\hat{p}(1 - \hat{p})}}{2(N + z^{2})},$$

where *N* is the sample size; $\hat{p}$ is the observed proportion, written with a hat as in Reiczigel (2003) to indicate that it is an estimated quantity; and *z* is the 1 – α/2 quantile of the standard normal distribution, which serves as the critical value/threshold (*z* ≈ 1.96 for a 95% confidence interval).

A shortcoming of all the methods, not only Wilson’s method, is that they assume bees in a sample are independent of each other (i.e. there is no over-dispersion, discussed below in section 5.2.), which is typically not true, especially given the transmission routes of bee parasites and pathogens (for a detailed discussion of the shortcomings of all methods of confidence interval calculation, see Reiczigel (2003)). If the degree of over-dispersion can be estimated, it can be used to adjust the confidence limits, most easily by replacing the actual sample size with the effective sample size (if bees are not independent, then the effective sample size is smaller than the actual sample size). One calculates the effective sample size by dividing the actual sample size by the over-dispersion parameter (see section 5.2.3., design effect or *deff*, and see Madden and Hughes (1999) for a complete explanation).

The over-dispersion parameter can be estimated by assuming the data are beta-binomial distributed, but it is more easily estimated in software by assuming the distribution is quasi-binomial. The beta-binomial distribution is a true statistical distribution and the quasi-binomial is not, but the theoretical differences are probably of less importance to practitioners than the practical differences when using software. Fitting a model based on a beta-binomial distribution means simultaneously estimating the linear predictor (such as regression-type and treatment-type effects) and the other parameters characterising the distribution, which is typically difficult in today’s software. On the other hand, there are standard algorithms for estimating these quantities if one assumes the data are generated by a quasi-binomial distribution. Essentially, the latter includes a multiplier (not a true parameter) that brings the theoretical variance, as determined by a function of the linear predictor, to the observed variance.
This multiplier may be labelled the over-dispersion parameter in software output. The quasi-binomial distribution is typically found in the part of the software that estimates generalised linear models. It requires bees to be grouped in logical categories (e.g. based on age or location in a colony), and there must be replication (e.g. two groups that get treatment A, two that get treatment B, etc.). In this kind of analysis, the dependent variable is specified as the number of positive bees and the total number of bees for each category (for some software, e.g. R, one gives the number of positive bees and the number of negative bees for each category).
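The adjustment described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular package: the function names `wilson_interval` and `pearson_dispersion` and the counts in `groups` are hypothetical, and the multiplier is estimated as the Pearson chi-square divided by the residual degrees of freedom, which is a common choice in quasi-binomial software but is an assumption here. The model has treatment as its only factor, so the fitted proportion for each treatment is simply its pooled proportion.

```python
import math


def wilson_interval(p_hat, n, z=1.959964):
    """Wilson score limits (Equation V) for an observed proportion p_hat
    and sample size n; n may be a non-integer effective sample size."""
    centre = 2 * n * p_hat + z ** 2
    half = z * math.sqrt(z ** 2 + 4 * n * p_hat * (1 - p_hat))
    denom = 2 * (n + z ** 2)
    return (centre - half) / denom, (centre + half) / denom


def pearson_dispersion(groups):
    """Over-dispersion multiplier for a one-factor design.

    groups maps a treatment label to a list of (positive, total) counts,
    one pair per replicate group of bees. Returns the Pearson chi-square
    divided by the residual degrees of freedom.
    """
    chi2 = 0.0
    resid_df = 0
    for replicates in groups.values():
        positives = sum(y for y, _ in replicates)
        total = sum(n for _, n in replicates)
        p = positives / total  # fitted proportion for this treatment
        if p in (0.0, 1.0):
            raise ValueError("fitted proportion of 0 or 1: variance is zero")
        for y, n in replicates:
            chi2 += (y - n * p) ** 2 / (n * p * (1 - p))
        resid_df += len(replicates) - 1
    return chi2 / resid_df


# Hypothetical data: two replicate groups of 30 bees per treatment.
groups = {
    "A": [(3, 30), (12, 30)],
    "B": [(10, 30), (20, 30)],
}

phi = pearson_dispersion(groups)

# Treatment A pooled: 15 positive bees out of 60.
n_actual = 60
p_hat = 15 / 60
n_eff = n_actual / phi  # effective sample size

unadjusted = wilson_interval(p_hat, n_actual)
adjusted = wilson_interval(p_hat, n_eff)

print(f"over-dispersion multiplier: {phi:.2f}")
print(f"effective sample size:      {n_eff:.1f}")
print(f"unadjusted 95% limits:      {unadjusted[0]:.3f}, {unadjusted[1]:.3f}")
print(f"adjusted 95% limits:        {adjusted[0]:.3f}, {adjusted[1]:.3f}")
```

Because the replicates within each treatment differ more than binomial sampling alone would predict, the multiplier exceeds 1, the effective sample size falls below the actual one, and the adjusted interval is wider than the unadjusted interval.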

Prevalence ($\hat{p}$, the estimated proportion positive in the population, as in sections 2.2.1. and 2.2.2.) and a 95% confidence interval based on Wilson’s score method are given in Fig. 4 for sample sizes (*N*) of 15, 30, and 60 bees. Note that, for the usual sample size of 30, there is still considerable uncertainty about the true infection prevalence (close to 30% if half the bees are estimated to be infected).
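As a rough check by hand, Equation V can be evaluated for *N* = 30 with half the bees positive, writing $\hat{p}$ = 0.5 for the observed proportion and taking *z* ≈ 1.96 for a 95% interval (the rounding is ours):

$$\frac{2(30)(0.5) + 1.96^{2} \pm 1.96\sqrt{1.96^{2} + 4(30)(0.5)(1 - 0.5)}}{2(30 + 1.96^{2})} = \frac{33.84 \pm 1.96\sqrt{33.84}}{67.68} \approx \frac{33.84 \pm 11.40}{67.68}$$

This gives limits of roughly 0.33 and 0.67, so the interval covers about a third of the full 0 to 1 range.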

**Fig. 4.** Estimated proportion of infected bees in a population as a function of the number of bees diagnosed as positive, for various sample sizes (*N* = 15, 30, 60). Lower and upper limits for a 95% confidence interval are based on Wilson’s score method.