# 10.4.1. Logistic regression

Elementary analysis of the answers to the essential questions regarding colony losses can yield an estimate of the overall loss rate for observations (beekeepers or operations) grouped by a single factor, such as country or involvement in commercial pollination. Comparing these loss rate estimates and their confidence intervals can indicate differences between the groups and hence potential risk factors for colony loss. The overall loss rate is, however, a problematic estimator when the contribution of multiple factors to the risk of loss has to be determined, since factor responses may be associated rather than independent of each other. For example, commercial pollination is more common in certain countries than in others. In addition, larger scale beekeepers contribute more to the overall loss rate than smaller scale beekeepers.
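The influence of operation size on the overall loss rate can be illustrated with a minimal sketch. The records below are hypothetical, invented purely for illustration, and the function names are assumptions rather than part of any survey toolkit. The sketch compares the overall loss rate (total lost divided by total at risk) with the unweighted mean of the individual loss rates.

```python
# Hypothetical survey records: (colonies at risk, colonies lost) per beekeeper.
operations = [
    (10, 1),     # small-scale beekeeper, 10% loss
    (20, 2),     # small-scale beekeeper, 10% loss
    (500, 200),  # large-scale operation, 40% loss
]


def overall_loss_rate(records):
    """Total colonies lost divided by total colonies at risk."""
    at_risk = sum(n for n, _ in records)
    lost = sum(x for _, x in records)
    return lost / at_risk


def mean_individual_loss_rate(records):
    """Unweighted average of the per-beekeeper loss rates."""
    return sum(x / n for n, x in records) / len(records)


overall = overall_loss_rate(operations)
average = mean_individual_loss_rate(operations)
print(f"overall loss rate: {overall:.3f}")
print(f"mean of individual loss rates: {average:.3f}")
```

With these invented numbers the single large operation dominates the overall loss rate, which lies far closer to its 40% than to the 10% of the two small beekeepers.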

A statistical approach that deals with the difficulties of overall loss
rate and enables conclusions on how factors (bee race, pollination practices,
size of operation, honey yield, location etc.) influence colony losses is
regression analysis (see Zuur *et al.*, 2009 and Pirk *et al.*, 2013). In
regression analysis, the numerical outcome of the essential questions (number
of colonies lost, number of colonies alive or the calculated population at
risk) is linked to the factors through a linear model. In the analysis of bee
colony losses, many of the response variables of interest are positively skewed
(having a long tail to the right) and so generalized linear regression models
(GZLMs) are appropriate. These models assume that the observations $y_i$ arise independently from a specified family of probability distributions, and independent variables or factors $x_{j,i}$, $j=1,\dots,k$, are used to provide a set of linear predictors

$$\eta_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i}$$

such that $g(\mu_i) = \eta_i$, where $\mu_i$ is the mean of $y_i$, and the $\beta_j$ are model coefficients to be estimated. Using GZLMs requires the specification of an appropriate probability distribution for the response variable *y* and also an appropriate form for the link function $g$ (Krzanowski, 1998; McCullagh and Nelder, 1983).

The dependent variable of interest, the loss rate, is binary in the
nature of its components (the number of lost colonies divided by the number of
colonies at risk makes up the loss rate). This property leads to models that
use a binomial distribution for the dependent variable. Each colony can be
regarded as a “Bernoulli trial” resulting in no
loss or a loss (0 or 1 respectively), and the number of lost colonies for a
beekeeper can be regarded as a “binomial trial” of a certain size *n* (total number of colonies at risk, or
number alive before the winter rest period) with a certain probability (*p*) of any one colony being lost after
winter (an “event”) and probability *1-p*
of the colony being alive after winter (a “non-event”). If *x* is the number of events per beekeeper, then the binomial
probability distribution describing the probability of *x* events has the formula

$$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, \dots, n,$$

with the mean value of *x* given by *np* and
variance of *x* by *np(1-p)*.
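As a small numerical check of the binomial model, the following sketch evaluates the probability of each possible number of lost colonies and compares the resulting mean and variance with *np* and *np(1-p)*. The values of *n* and *p* are hypothetical and chosen only for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x lost colonies out of n colonies at risk."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical beekeeper: 20 colonies at risk, loss probability 0.15 per colony.
n, p = 20, 0.15

# Probability of every possible outcome x = 0, 1, ..., n.
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * pr for x, pr in enumerate(probabilities))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probabilities))

print(f"P(X = 3) = {binomial_pmf(3, n, p):.4f}")
print(f"mean = {mean:.4f}, np = {n * p:.4f}")
print(f"variance = {variance:.4f}, np(1-p) = {n * p * (1 - p):.4f}")
```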

Groups of beekeepers or operations can be seen as series of binomial
trials which vary in size, and also with different probabilities of an event, *p*. Hence it is of interest to model the
probability of loss for (groups of) beekeepers or operations characterized by
different values of the risk factors involved, such as country or operation
size or migratory practice.

Probabilities cannot be used directly as a response variable in a
classical linear regression model, as probabilities can only have values
ranging from 0 to 1, whereas continuous response variables can have any value.
The solution to this problem is to move from the probability to the “odds” (*p/(1-p)*) and then to calculate the logarithm
of the odds, the “logit”, to be used as the dependent variable. The first step,
taking the odds, removes the upper boundary of 1, as the odds can take any positive
value, while the second step, taking the logit, removes the lower boundary of 0, as
the logarithm can be negative (for odds less than 1). A probability of 50% corresponds to
odds of 1 and a logit of 0, with negative and positive logits corresponding
to probabilities of less than and more than 50% respectively.
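The two-step transformation from probability to odds to logit can be sketched in a few lines. The function names `odds` and `logit` are chosen here for illustration, and the probabilities used are arbitrary examples.

```python
from math import log


def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


def logit(p):
    """Natural logarithm of the odds (requires 0 < p < 1)."""
    return log(odds(p))


# Arbitrary example probabilities below, at, and above 50%.
for p in (0.1, 0.5, 0.9):
    print(f"p = {p:.1f}  odds = {odds(p):.3f}  logit = {logit(p):+.3f}")
```

The logit is symmetric around a probability of 50%: probabilities of 0.1 and 0.9 give logits of equal size and opposite sign.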

Generalized linear models of this nature are called *logistic regression models*, and can be
expressed in the form

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i},$$

where the $\beta_j$, $j=0,\dots,k$, are model coefficients to be
estimated and $x_{j,i}$, $j=1,\dots,k$,
are the values of the *k* independent
variables or factors used in the model for prediction of the log odds of loss
for case *i*.
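To make the estimation of the coefficients concrete, the following is a minimal sketch, not the procedure of any particular statistical package. It assumes a model with an intercept and a single binary factor, so that the log odds of loss equal `b0 + b1 * x`, and maximizes the binomial likelihood by Newton-Raphson iteration. The data, the factor (migratory practice) and all names are hypothetical. In practice, established statistical software would be used.

```python
from math import exp

# Hypothetical records: (colonies at risk, colonies lost, migratory: 1 = yes, 0 = no).
data = [
    (40, 4, 0),
    (25, 3, 0),
    (60, 5, 0),
    (80, 20, 1),
    (120, 36, 1),
]


def inverse_logit(eta):
    """Convert log odds back to a probability."""
    return 1.0 / (1.0 + exp(-eta))


def fit_logistic(records, iterations=25):
    """Maximum likelihood fit of logit(p) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for n, lost, x in records:
            p = inverse_logit(b0 + b1 * x)
            resid = lost - n * p      # observed minus expected losses
            w = n * p * (1 - p)       # binomial variance of the count
            u0 += resid               # score for the intercept
            u1 += x * resid           # score for the factor coefficient
            i00 += w                  # entries of the information matrix
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * delta = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1


b0, b1 = fit_logistic(data)
print(f"intercept (log odds of loss, non-migratory): {b0:.4f}")
print(f"coefficient for migratory practice: {b1:.4f}")
print(f"odds ratio, migratory vs non-migratory: {exp(b1):.3f}")
print(f"fitted loss probability, non-migratory: {inverse_logit(b0):.3f}")
print(f"fitted loss probability, migratory: {inverse_logit(b0 + b1):.3f}")
```

With a single binary factor the fitted probabilities equal the pooled loss rates of the two groups (12/125 and 56/200 in these invented data), which gives a simple check on the fit. With several associated factors no such shortcut exists, and the coefficients have to be estimated jointly.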

Substituting the values and the estimated parameters into the right
hand side of the equation enables prediction of the log odds of an event for
that beekeeper or operation or group of operations. If this gives a value *y*, then taking the inverse logit $e^{y}/(1+e^{y})$ gives
the prediction of the probability $p_i$ itself.

Kleinbaum and Klein (2002), Hosmer and Lemeshow (2000) and Agresti (2002) give an in-depth explanation of the principles of logistic regression models, their interpretation, and the construction of best-fitting models.
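The prediction step described above, substituting values into the fitted equation and applying the inverse logit, can be sketched as follows. The coefficients and the two explanatory variables (migratory practice and operation size) are hypothetical and do not come from any fitted model. The function names are likewise assumptions made for illustration.

```python
from math import exp


def inverse_logit(y):
    """Inverse logit e^y / (1 + e^y), mapping log odds to a probability."""
    return exp(y) / (1 + exp(y))


def predict_loss_probability(coefficients, values):
    """Predicted probability of loss from an intercept plus k coefficients."""
    intercept, *slopes = coefficients
    if len(slopes) != len(values):
        raise ValueError("one value is needed for each non-intercept coefficient")
    log_odds = intercept + sum(b * x for b, x in zip(slopes, values))
    return inverse_logit(log_odds)


# Hypothetical estimates: intercept, migratory practice (0/1),
# operation size in hundreds of colonies.
coefficients = [-2.0, 0.8, 0.1]

# Hypothetical case: a migratory operation with 250 colonies.
p = predict_loss_probability(coefficients, [1, 2.5])
print(f"predicted probability of loss: {p:.3f}")
```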

When honey bee loss data are involved in the analysis, several specific characteristics of these data and their analysis have to be addressed, as are now described.