10.4.1. Logistic regression

Elementary analysis of the answers to the essential questions regarding colony losses can yield an estimate of the overall loss rate for observations (beekeepers or operations) grouped by a single factor, such as country or involvement in commercial pollination. Comparing these loss rate estimates and their confidence intervals can indicate differences between the groups and hence potential risk factors for colony loss. The overall loss rate is a problematic estimator, however, when the contribution of multiple factors to the risk of loss has to be determined, since factor responses may be associated rather than independent of each other. For example, commercial pollination is more common in certain countries than in others. In addition, larger scale beekeepers contribute more to the overall loss rate than smaller scale beekeepers.
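The weighting effect of operation size can be illustrated with a minimal sketch. The three operations and their colony counts are invented purely for illustration, and the function names are hypothetical. The sketch contrasts the overall (pooled) loss rate with the unweighted mean of the per-operation loss rates.

```python
# Hypothetical data: (colonies at risk, colonies lost) per operation.
operations = [
    (10, 1),     # small operation, 10% loss
    (20, 2),     # small operation, 10% loss
    (500, 200),  # large operation, 40% loss
]

def overall_loss_rate(ops):
    """Total colonies lost divided by total colonies at risk."""
    total_at_risk = sum(n for n, _ in ops)
    total_lost = sum(lost for _, lost in ops)
    return total_lost / total_at_risk

def mean_operation_loss_rate(ops):
    """Unweighted average of the individual operations' loss rates."""
    return sum(lost / n for n, lost in ops) / len(ops)

print(f"overall loss rate:       {overall_loss_rate(operations):.3f}")
print(f"mean per-operation rate: {mean_operation_loss_rate(operations):.3f}")
```

In this invented example the overall loss rate is pulled towards the rate of the single large operation, whereas the unweighted mean treats all three operations equally.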

A statistical approach that deals with the difficulties of the overall loss rate and enables conclusions on how factors (bee race, pollination practices, size of operation, honey yield, location, etc.) influence colony losses is regression analysis (see Zuur et al., 2009 and Pirk et al., 2013). In regression analysis, the numerical outcome of the essential questions (number of colonies lost, number of colonies alive or the calculated population at risk) is linked to the factors through a linear model. In the analysis of bee colony losses, many of the response variables of interest are positively skewed (having a long tail to the right) and so generalized linear regression models (GZLMs) are appropriate. These models assume that the observations $y_i$ arise independently from a specified family of probability distributions, and independent variables or factors $x_{j,i}$, $j = 1, \ldots, k$, are used to provide a set of linear predictors

$$\eta_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{j,i}$$

such that $g(\mu_i) = \eta_i$, where $\mu_i$ is the mean of $y_i$, and the $\beta_j$ are model coefficients to be estimated. Using GZLMs requires the specification of an appropriate probability distribution for the response variable $y$ and also an appropriate form for the link function $g$ (Krzanowski, 1998; McCullagh and Nelder, 1983).

The dependent variable of interest, the loss rate, is built from binary components: it is the number of lost colonies divided by the number of colonies at risk, and each colony is either lost or not. This property leads to models that use a binomial distribution for the dependent variable. Each colony can be regarded as a “Bernoulli trial” resulting in no loss or a loss (0 or 1 respectively), and the number of lost colonies for a beekeeper can be regarded as the outcome of a “binomial trial” of a certain size n (total number of colonies at risk, or number alive before the winter rest period) with a certain probability p of any one colony being lost after winter (an “event”) and probability 1-p of the colony being alive after winter (a “non-event”). If x is the number of events per beekeeper, then the binomial probability distribution describing the probability of x events has the formula

$$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n$$

with the mean value of x given by np and variance of x by np(1-p).
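As a quick numerical check of these properties, the following minimal sketch evaluates the binomial probabilities for a single beekeeper and compares the mean and variance obtained by summation with np and np(1-p). The values n = 20 and p = 0.15 are arbitrary illustrative assumptions, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x lost colonies out of n, each lost with probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical beekeeper: 20 colonies at risk, 15% chance of losing any one colony.
n, p = 20, 0.15

# Probabilities for every possible number of losses, 0..n.
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(f"mean     = {mean:.4f}, np       = {n * p:.4f}")
print(f"variance = {variance:.4f}, np(1-p) = {n * p * (1 - p):.4f}")
```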

Groups of beekeepers or operations can be seen as series of binomial trials which vary in size, and also with different probabilities of an event, p. Hence it is of interest to model the probability of loss for (groups of) beekeepers or operations characterized by different values of the risk factors involved, such as country or operation size or migratory practice.

Probabilities cannot be used directly as a response variable in a classical linear regression model, as probabilities can only have values ranging from 0 to 1, whereas continuous response variables can have any value. The solution for this problem is moving from the probability to the “odds” (p/(1-p)) and calculating the logarithm of the odds, the “logit”, to be used as the dependent variable. The first step, taking the odds, removes the boundary of 1 as the odds can have any positive value, while taking the logit in the second step removes the boundary of 0 as the logarithm can be negative (for odds less than 1). A probability of 50% has an odds of 1 and a logit of 0, with negative and positive logits corresponding to probabilities of less than and more than 50% respectively.
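The two-step transformation can be sketched in a few lines. The function names `odds` and `logit` are chosen here for illustration, and the probabilities are arbitrary example values.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1); any positive value."""
    return p / (1 - p)

def logit(p):
    """Natural logarithm of the odds; any real value."""
    return math.log(odds(p))

# Arbitrary example probabilities on either side of 50%.
for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(f"p = {p:.2f}  odds = {odds(p):.3f}  logit = {logit(p):+.3f}")
```

The printed logits are symmetric around 0: probabilities p and 1-p give logits of equal size and opposite sign.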

Generalized linear models of this nature are called logistic regression models, and can be expressed in the form

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{j,i}$$

where the $\beta_j$ are model coefficients to be estimated and $x_{j,i}$, $j = 1, \ldots, k$, are the values of the k independent variables or factors used in the model for prediction of the log odds of loss for case i.
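To make the estimation step concrete, the sketch below fits a logistic regression model with an intercept and a single binary factor by maximum likelihood, using Newton-Raphson iterations on grouped binomial data. This is an illustrative assumption about how the coefficients might be estimated, not a method prescribed by the text; in practice a statistical package would normally be used. The data rows, the factor coding (1 = migratory, 0 = not) and the function name `fit_logistic` are all hypothetical.

```python
import math

# Hypothetical grouped data: (x, colonies lost, colonies at risk),
# where x = 1 for migratory operations and x = 0 otherwise.
data = [
    (0, 10, 80),
    (0, 20, 120),
    (1, 25, 100),
    (1, 35, 140),
]

def fit_logistic(rows, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on grouped binomial data."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the score vector (g) and information matrix (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, lost, n in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = lost - n * p          # observed minus expected losses
            w = n * p * (1.0 - p)         # binomial variance at current fit
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system h * delta = g explicitly.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit_logistic(data)
print(f"intercept = {b0_hat:.4f}, factor coefficient = {b1_hat:.4f}")
print(f"odds ratio for the factor = {math.exp(b1_hat):.4f}")
```

With a single binary factor the fitted intercept equals the logit of the pooled loss rate in the reference group, and the factor coefficient equals the difference between the logits of the two groups, which gives a simple way to check the result.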

Substituting the values and the estimated parameters into the right-hand side of the equation enables prediction of the log odds of an event for that beekeeper or operation or group of operations. If this gives a value $y$, then taking the inverse logit $e^{y}/(1+e^{y})$ gives the prediction of the probability $p_i$ itself.
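The prediction step can be sketched as follows. The coefficient values and the two binary factors (migratory practice and large operation size) are invented for illustration and are not estimates from any survey; the function names are likewise hypothetical.

```python
import math

def inverse_logit(y):
    """Convert a log odds value y back to a probability."""
    return math.exp(y) / (1 + math.exp(y))

def predict_loss_probability(intercept, coefficients, values):
    """Evaluate the linear predictor (log odds), then apply the inverse logit."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    return inverse_logit(log_odds)

# Hypothetical estimated parameters (assumptions, not real estimates).
beta_0 = -1.7
betas = [0.6, 0.3]  # coefficients for: migratory (1/0), large operation (1/0)

p_baseline = predict_loss_probability(beta_0, betas, [0, 0])
p_migratory = predict_loss_probability(beta_0, betas, [1, 0])
p_both = predict_loss_probability(beta_0, betas, [1, 1])

print(f"baseline:                 {p_baseline:.3f}")
print(f"migratory:                {p_migratory:.3f}")
print(f"migratory and large:      {p_both:.3f}")
```

Because the coefficients act on the log odds scale, a positive coefficient always raises the predicted probability, but the size of the increase in probability depends on the values of the other factors.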

Kleinbaum and Klein (2002), Hosmer and Lemeshow (2000) and Agresti (2002) give in-depth explanations of the principles of logistic regression, the interpretation of the resulting models, and the construction of best-fitting models.

When honey bee loss data are involved in the analysis, several specific characteristics of these data and their analysis have to be addressed, as are now described.