Classification and regression tree (CART) analysis

This analysis is useful for modelling diseases with multiple contributing factors when the data available for quantifying possible risk factors in the diseased and disease-free populations are incomplete. CART analysis is a non-linear, non-parametric method fitted by binary recursive partitioning of the multidimensional covariate space (Breiman et al., 1984; Saegerman et al., 2004; Speybroeck et al., 2004). Using CART 6.0 software (Salford Systems; San Diego, USA), the analysis successively splits the data set into increasingly homogeneous subsets until the data set is stratified and the specified criteria are met. The Gini index is normally used as the splitting criterion, and ten-fold cross-validation is used to test the predictive capacity of the trees obtained. CART performs cross-validation by growing maximal trees on subsets of the data and then calculating error rates on the unused portions of the data set.
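
As an illustration of the splitting step described above, the following minimal Python sketch computes the Gini index of a set of class labels and searches a single numeric covariate for the binary split that gives the lowest weighted impurity. It is not the CART 6.0 implementation; the function names (`gini`, `best_split`) and the toy data (`age`, `status`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split of the form x <= threshold.

    If no split improves on the unsplit node, the threshold is None.
    """
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: one covariate and a binary disease status.
age = [1, 2, 3, 6, 7, 8]
status = ["free", "free", "free", "disease", "disease", "disease"]
threshold, impurity = best_split(age, status)
```

A full tree would apply the same search to every covariate at each node and then recurse into the two resulting subsets.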

This process yields fairly reliable estimates of the independent predictive accuracy of the tree, even when data for some independent variables are incomplete and/or comparatively scarce. Further details about CART are presented in previously published articles (Saegerman et al., 2011).
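
The cross-validated error estimate can be sketched in Python as follows, assuming a generic `fit`/`predict` pair. For brevity, a single-threshold classifier (`fit_stump`, `predict_stump`) stands in for the maximal tree that CART would grow on each training subset. All names and the toy data are hypothetical, and the sketch is not the procedure implemented in CART 6.0.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(X, y, fit, predict, k=10, seed=0):
    """Fit on k-1 folds, count errors on the held-out fold, and pool over folds."""
    errors = 0
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in held_out)
    return errors / len(y)


def fit_stump(xs, ys):
    """Stand-in model (assumption): threshold midway between the two class means."""
    pos = [x for x, label in zip(xs, ys) if label == 1]
    neg = [x for x, label in zip(xs, ys) if label == 0]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    return (mean_pos + mean_neg) / 2, mean_pos > mean_neg


def predict_stump(model, x):
    """Predict class 1 on the side of the threshold where class 1 had the higher mean."""
    threshold, positive_is_high = model
    return int((x > threshold) == positive_is_high)


# Hypothetical, well-separated toy data: 10 disease-free (0) and 10 diseased (1) animals.
marker = list(range(10)) + list(range(100, 110))
disease = [0] * 10 + [1] * 10
error_rate = cross_validated_error(marker, disease, fit_stump, predict_stump, k=10)
```

Because every observation is held out exactly once, the pooled error rate is computed only from data that the corresponding model did not see during fitting.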