
# 9. Choice of sample size

In a probability-based sample, the sample size can be calculated statistically in order to achieve a required level of precision of estimates from the data collected, where these estimates have been identified in advance as being of interest. The formulae required depend on the sampling scheme to be used. Schaeffer et al. (1990) give details.

For example, in a simple random sample, to estimate a mean (e.g. the average number of colonies kept per beekeeper) to within an error bound $B$ of the true value with approximately 95% confidence, the required sample size is $n = \dfrac{N\sigma^2}{(N-1)D + \sigma^2}$, where $D = B^2/4$, $\sigma^2$ is the population variance of the quantity of interest (e.g. the number of colonies kept), and $N$ is the population size. For a very large population of beekeepers, where $N$ is not known exactly, the sample size is approximately $n \approx \sigma^2/D = 4\sigma^2/B^2$. The population variance may be estimated from the variance calculated from data in a previous survey of the same population, or from a pilot survey. To estimate a total (as the population size times the sample mean) with the same precision, the same formula is used but with $D = B^2/(4N^2)$. Box 10 provides an example of the calculations.

Box 10. Sample size calculation for a survey to estimate a mean or a total. For example, using a simple random sampling approach, to estimate the average number of colonies kept to within a margin of error of 10% (B=0.10) of the true value with an approximate confidence level of 95%, the sample size is calculated as follows. We use the formula $n = \dfrac{N\sigma^2}{(N-1)D + \sigma^2}$, where $D = B^2/4 = 0.10^2/4 = 0.0025$. Assuming that the total number of beekeepers in the population is $N = 1500$, and that a recent previous survey suggests the variance $\sigma^2$ of the number of colonies per beekeeper is about 4, we should sample $n = \dfrac{1500 \times 4}{1499 \times 0.0025 + 4} \approx 774.4$, i.e. 775 beekeepers, rounding up to the nearest integer. If we wished instead to estimate the total number of colonies kept, say to within 200 of the actual total with the same level of confidence, then using the same information we calculate $D = B^2/(4N^2) = 200^2/(4 \times 1500^2) \approx 0.00444$, which (using the unrounded value of $D$) gives $n = \dfrac{1500 \times 4}{1499 \times 0.00444\ldots + 4} \approx 562.7$, i.e. 563 beekeepers to be sampled.

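
The arithmetic in Box 10 can be checked with a short script. This is only a minimal sketch of the simple random sampling formulae quoted above; the function names are illustrative and not part of any survey package.

```python
import math

def sample_size_mean(N, variance, B):
    """Sample size to estimate a mean to within B (approx. 95% confidence)."""
    D = B**2 / 4
    return math.ceil(N * variance / ((N - 1) * D + variance))

def sample_size_total(N, variance, B):
    """Sample size to estimate a total to within B (approx. 95% confidence)."""
    D = B**2 / (4 * N**2)
    return math.ceil(N * variance / ((N - 1) * D + variance))

# Values taken from Box 10: N = 1500 beekeepers, variance of about 4.
n_mean = sample_size_mean(N=1500, variance=4, B=0.10)
n_total = sample_size_total(N=1500, variance=4, B=200)
print(n_mean, n_total)  # 775 563
```
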
To estimate a proportion $p$ to within an error bound $B$ of the true value with approximately 95% confidence, the same exact and approximate formulae are used as for estimating a mean, but with $\sigma^2 = p(1-p)$, so in the large population case $n \approx 4p(1-p)/B^2$. These formulae require an approximate value for $p$ based on prior experience, or else substitution of the conservative value $p = 0.5$, which maximises $p(1-p)$ and hence the required sample size. Box 11 shows the calculations.

Box 11. Sample size calculation for a survey to estimate a proportion. For example, using a simple random sampling approach, to estimate an overall proportion of losses which was 20% last year (so p=0.20 approximately), to within a margin of error of 5% (B=0.05) of the true value with an approximate confidence level of 95%, the sample size is calculated as follows. The population size is assumed large, but is unknown. So we use the large population version of the sample size formula for estimation of a proportion, given by $n = 4p(1-p)/B^2$. Here this gives $n = 4 \times 0.20 \times 0.80 / 0.05^2 = 0.64/0.0025$, giving $n = 256$ exactly. So the sample should be composed of at least 256 individuals to achieve the required level of precision.
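
The Box 11 calculation, and the conservative choice of p = 0.5, can be sketched as follows. The function name and the optional `N` argument are assumptions made for illustration; when `N` is given, the finite population formula for a mean is applied with the variance replaced by p(1 - p).

```python
import math

def sample_size_proportion(p, B, N=None):
    """Sample size to estimate a proportion to within B (approx. 95% confidence)."""
    variance = p * (1 - p)
    D = B**2 / 4
    if N is None:
        n = variance / D                            # large population approximation
    else:
        n = N * variance / ((N - 1) * D + variance)
    # Round away floating point noise before rounding up, so that an exact
    # answer such as 256 is not pushed up to 257.
    return math.ceil(round(n, 9))

n_box11 = sample_size_proportion(p=0.20, B=0.05)         # values from Box 11
n_conservative = sample_size_proportion(p=0.50, B=0.05)  # no prior knowledge of p
print(n_box11, n_conservative)  # 256 400
```

With no prior information about the proportion, the conservative value therefore raises the requirement from 256 to 400 respondents for the same error bound.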

If there is more than one quantity to be estimated, as there will be in surveys of beekeepers, the largest of the relevant calculated sample sizes can be used, where this is feasible. Alternatively, it can be decided to focus on the single most important estimate, e.g. the proportion of beekeepers experiencing winter colony loss or the proportion experiencing CDS losses. It is then accepted that any other quantities requiring a larger sample size will be estimated with lower precision than is desirable.

For a stratified sample, which takes simple random samples from each stratum, similar calculations may be done to obtain the overall sample size required to estimate the mean or total or proportion to within an error bound B of the true value with approximately 95% confidence. See Schaeffer et al. (1990), for example, for details.

Various approaches are possible to divide the chosen sample size $n$ between the strata, including the proportional method, which takes the sample size $n_i$ in the $i$th stratum proportional to $N_i/N$, where $N_i$ is the size of the $i$th stratum and $N$ is the population size. This means taking $n_i = nW_i$, where $W_i = N_i/N$ is the $i$th stratum weight, i.e. the proportion of the population belonging to stratum $i$.
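
As a rough illustration of proportional allocation, the sketch below splits a total sample across three strata. The stratum sizes and the total sample size are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def proportional_allocation(n, stratum_sizes):
    """Split a total sample size n across strata in proportion to stratum size."""
    N = sum(stratum_sizes)
    return [n * N_i / N for N_i in stratum_sizes]

# Hypothetical population of 1500 beekeepers in three strata (e.g. regions).
stratum_sizes = [900, 450, 150]
allocation = proportional_allocation(300, stratum_sizes)
print(allocation)  # [180.0, 90.0, 30.0]
```

In practice the allocations are rarely whole numbers and need rounding, so the rounded stratum sample sizes may not sum exactly to the intended total.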

Neyman allocation is a more complex method which splits the sample between strata in order to minimise the variance of the unbiased estimator of the population mean (given by $\bar{y}_{st} = \sum_i W_i \bar{y}_i$, where $W_i = N_i/N$ and $\bar{y}_i$ is the mean of the sample from stratum $i$) or of the total (taken as $N$ times the estimator for the mean). It does so by taking the $i$th stratum sample size $n_i$ proportional to $N_i\sigma_i$ or, equivalently, $W_i\sigma_i$, where $\sigma_i^2$ is the variance within stratum $i$ and $\sigma_i$ is the standard deviation within stratum $i$. So

$$n_i = n\,\frac{N_i\sigma_i}{\sum_k N_k\sigma_k}.$$

The within stratum variances may be estimated from previous experience or a pilot survey.
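
A minimal sketch of Neyman allocation for a mean or total follows. The stratum sizes and standard deviations are hypothetical, chosen so that the small but highly variable stratum receives the same sample size as the large, homogeneous one.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split n across strata in proportion to N_i * sigma_i."""
    weights = [N_i * sd_i for N_i, sd_i in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: sizes and within-stratum standard deviations.
stratum_sizes = [900, 450, 150]
stratum_sds = [1.0, 2.0, 6.0]
allocation = neyman_allocation(300, stratum_sizes, stratum_sds)
print(allocation)  # [100.0, 100.0, 100.0]
```

With the same hypothetical strata, proportional allocation would give 180, 90 and 30, so Neyman allocation moves sampling effort towards the strata where the quantity of interest varies most.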

To estimate a proportion (by $\hat{p}_{st} = \sum_i W_i \hat{p}_i$, where $\hat{p}_i$ is the sample proportion in stratum $i$), the same formula can be used for allocation as for estimating a mean, but $\sigma_i$ is replaced by $\sqrt{p_i(1-p_i)}$, where $p_i$ is the value of the population proportion in stratum $i$ (in practice an estimate of this is used).
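
For proportions, the same allocation can be sketched by substituting the binomial standard deviation for each stratum. The stratum sizes and anticipated proportions below are hypothetical.

```python
import math

def neyman_allocation_proportions(n, stratum_sizes, stratum_props):
    """Split n across strata in proportion to N_i * sqrt(p_i * (1 - p_i))."""
    sds = [math.sqrt(p * (1 - p)) for p in stratum_props]
    weights = [N_i * sd for N_i, sd in zip(stratum_sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical: three equally sized strata with anticipated loss proportions.
allocation = neyman_allocation_proportions(300, [500, 500, 500], [0.1, 0.2, 0.5])
print([round(a, 1) for a in allocation])  # approximately 75, 100 and 125
```

Strata whose anticipated proportion is closer to 0.5 receive more of the sample, because p(1 - p) is largest there.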

The Neyman approach can also be modified, if required, to incorporate different sampling costs for each stratum. More complex modified Neyman allocation schemes are also possible (Särndal et al., 1992).
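
For orientation, the usual textbook form of the cost-adjusted allocation is sketched below. The symbol $c_i$, denoting the cost of obtaining one observation in stratum $i$, is introduced here for illustration and does not appear elsewhere in this section.

```latex
n_i = n \, \frac{N_i \sigma_i / \sqrt{c_i}}{\sum_k N_k \sigma_k / \sqrt{c_k}}
```

When all the $c_i$ are equal this reduces to the Neyman allocation given earlier, and strata that are more expensive to sample receive proportionally fewer observations.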

More generally it may be decided, in order to achieve a suitable coverage of the population, that a fixed percentage of the population should be sampled. For some of the COLOSS surveys, a guideline for acceptable coverage has been that, where possible, at least 5% of beekeepers should be surveyed. This is a simple way to choose sample size, especially in a non-probability sample for which sample size calculations are not valid.

Another concern in a smaller population which may be surveyed repeatedly is not to overburden individuals, but to maintain goodwill. This may mean taking a smaller sample than is ideal. Data processing concerns may also limit the sample size.

If the level of non-response can be anticipated, for example from recent experience, the calculated or chosen sample size can be increased accordingly, so that the achieved sample is still of the required size. The adjusted size is $n' = n/(1-r)$, where $n$ is the original sample size, $n'$ is the new size, and $r$ is the expected non-response rate expressed as a proportion between 0 and 1.
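
The non-response adjustment can be sketched in a few lines. The non-response rate of 0.25 used here is a hypothetical value, and the starting sample size of 256 is borrowed from Box 11 purely for illustration.

```python
import math

def inflate_for_nonresponse(n, r):
    """Return the number to approach so that about n responses are obtained."""
    if not 0 <= r < 1:
        raise ValueError("non-response rate r must satisfy 0 <= r < 1")
    # Round away floating point noise before rounding up to a whole person.
    return math.ceil(round(n / (1 - r), 9))

n_adjusted = inflate_for_nonresponse(256, 0.25)
print(n_adjusted)  # 342
```
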

Obtaining standard errors of estimates, or confidence intervals, as part of the data analysis indicates how precisely the various quantities of interest have been estimated (see sections 4.1.2. and 10.).