Statistical modelling for falls count data
Introduction
Falls are common among older people and can have serious consequences (Robertson et al., 2005). With an ageing population, the rise in the number of falls and the cost of their treatment is predicted to place a huge burden on the individual and the community (Moller, 2005). Falls epidemiology data describing the magnitude of, and trends in, the problem have largely been descriptive in nature (Boufous et al., 2006, Boufous et al., 2004). It is important that good statistical models are used to generate accurate and reliable information to guide policy decisions in relation to priority setting and intervention investments to tackle the fall injury problem. As with other areas of public health, there has been an increased interest in statistical modelling of injury count data, including falls outcomes, in recent years (Chin and Quddus, 2003, Lord et al., 2004, Lord et al., 2005, Robertson et al., 2005).
Datasets of the number of falls and fall-related injuries have the form of discrete count data characterized by a large proportion of zero counts, with the remaining values being highly skewed toward the right. This is because fall incidents are relatively rare and most people will not sustain a serious injury if they do fall. Moreover, falls can also be recurrent events, in that over a period of time an individual may experience one or more falls (Williamson et al., 1996, Stalenhoef et al., 2002), and this recurrence aspect needs to be incorporated into appropriate statistical models of fall counts. In a very recent systematic review (Donaldson et al., 2009), fewer than one-third of the 83 reviewed papers used appropriate statistical methods to analyse falls as a recurrent event.
To further progress falls epidemiology, there is a need for a unified and justified approach to the use of appropriate statistical models for these data, taking into account the large proportion of zero counts and the possibility of recurrent falls. A number of published studies have incorrectly assumed a normal distribution when modelling falls count data and used Student's t-test, linear regression, or analysis of variance, as has been highlighted elsewhere (Robertson et al., 2005). Other analysts have argued that falls count data does not meet the usual normality assumption required of many standard statistical tests and have therefore relied on a transformation to induce normality (Slymen et al., 2006). This can be problematic in that transformations often do not yield normally distributed data and can make the interpretation of regression coefficients difficult because they are not estimated on the original scale (Byers et al., 2003).
An alternative, more common, approach has been to assume a Poisson (P) model, which is better suited to fall count processes and has become quite widespread in public health to model the number of events or rates (Mwalili et al., 2008), especially when there are few incidents and hence many observed zeros (Shankar et al., 1997). However, the P model requires that the mean is equal to the variance, and this key feature of the P structure is violated if the number of observed zeros far exceeds the number of zeros expected under the model. Often, falls count data exhibit more variability than the nominal variance under the P model, a condition called over-dispersion (in that the sample variance exceeds the mean). Such over-dispersion in count data can occur because of excess zeros, unexplained heterogeneity, or temporal dependency (Cameron and Trivedi, 1998). With regard to recurrent events, the P model assumes that such events occur independently of each other. This assumption is violated for fall outcomes, as a major risk factor for a subsequent fall is a previous fall (Donaldson et al., 2009, Hill et al., 1999).
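As a rough illustration of the two diagnostics described above, the following minimal sketch compares the sample variance with the mean and the observed number of zeros with the number expected under a P model with the same mean. The function name and the fall counts are hypothetical and are not taken from the study data.

```python
import math

def poisson_zero_check(counts):
    """Compare mean with variance, and observed with expected zeros, under a Poisson model."""
    n = len(counts)
    mean = sum(counts) / n
    # Unbiased sample variance
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    observed_zeros = sum(1 for c in counts if c == 0)
    # Under Poisson(mean), P(Y = 0) = exp(-mean)
    expected_zeros = n * math.exp(-mean)
    return {
        "mean": mean,
        "variance": variance,
        "dispersion_ratio": variance / mean,  # values well above 1 suggest over-dispersion
        "observed_zeros": observed_zeros,
        "expected_zeros": expected_zeros,
    }

# Hypothetical fall counts for 20 people (illustrative only)
falls = [0] * 14 + [1, 1, 2, 3, 5, 8]
summary = poisson_zero_check(falls)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

In this made-up sample the variance is several times the mean and there are roughly twice as many zeros as a P model would predict, which is the pattern the text describes.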
The negative binomial (NB) model has a built-in dispersion parameter that can account for situations where the variance is greater than the mean (Chin and Quddus, 2003). A number of studies have therefore argued for the NB model as an alternative to the P model when count data are over-dispersed in relation to the mean (Bliss and Fisher, 2003, Byers et al., 2003, White and Bennetts, 1996). Such a modelling approach can also be appropriate when count data are recurrent (Glynn and Buring, 1996). The NB model explicitly accounts for the heterogeneity by modelling the Poisson mean as a Gamma random variable and introducing an extra dispersion parameter (Johnson et al., 2005, Lord, 2006).
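The effect of the extra dispersion parameter can be checked numerically. The sketch below assumes the common parameterisation in which the NB distribution has mean μ and variance μ + μ²/k; this is an assumption made for illustration, as other sources parameterise the dispersion as 1/k. The function name and parameter values are hypothetical.

```python
import math

def nb_pmf(y, mu, k):
    """NB probability of count y with mean mu and dispersion k (variance mu + mu**2 / k)."""
    log_p = (
        math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
        + k * math.log(k / (k + mu))
        + y * math.log(mu / (k + mu))
    )
    return math.exp(log_p)

mu, k = 1.0, 0.5  # illustrative values only
# Sum over enough counts that the remaining tail mass is negligible
probs = [nb_pmf(y, mu, k) for y in range(500)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.6f}, variance: {variance:.6f}")
print(f"P(0) under NB: {nb_pmf(0, mu, k):.4f}, under Poisson: {math.exp(-mu):.4f}")
```

With the same mean, the NB distribution in this sketch has a larger variance than the P model and places more probability on zero.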
Although P and NB models have been the most common choices to date, it is possible that they could still fail to fit a set of data with a lot of zeros because of zero-inflation, over-dispersion, or both (Deng and Paul, 2005). As an extension of standard P and NB models, zero-inflated count models have gained considerable recognition as an alternative means of handling count data with a preponderance of zeros (Lambert, 1992, Gupta et al., 1996, Li et al., 1999, Lord et al., 2004, Lord, 2006). For this type of count data, more zeros are observed than would be predicted by a standard P or NB process (Park and Lord, 2009, Lord et al., 2007, Warton, 2005). It is generally believed that data with excess zeros come from two sources or two distinct distributions, hence the aptly named dual-state process. The underlying assumption of this two-state process gives a simple two-component mixture distribution, with the first state having only zeros, while the other state leads to a standard P or NB count model. In general, the zeros from the first state are called structural zeros and those from the P or NB models are called sampling zeros or non-structural zeros.
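The two-state mixture described above can be written down directly for the zero-inflated Poisson case. In the minimal sketch below, phi is assumed to be the probability of the zero-only state and mu the mean of the P state. The function names and parameter values are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zip_pmf(y, mu, phi):
    """Zero-inflated Poisson: zero-only state with probability phi, otherwise Poisson(mu)."""
    from_count_state = (1.0 - phi) * poisson_pmf(y, mu)
    # Only zeros can come from the first (zero-only) state
    return phi + from_count_state if y == 0 else from_count_state

mu, phi = 2.0, 0.4  # illustrative values only
structural_zero = phi                       # zeros from the zero-only state
sampling_zero = (1.0 - phi) * math.exp(-mu)  # zeros from the Poisson state
p_zero = zip_pmf(0, mu, phi)
total = sum(zip_pmf(y, mu, phi) for y in range(100))

print(f"structural zeros: {structural_zero:.4f}")
print(f"sampling zeros:   {sampling_zero:.4f}")
print(f"P(Y = 0):         {p_zero:.4f}")
```

The probability of a zero is the sum of the structural and sampling components, and replacing `poisson_pmf` with an NB probability function would give the corresponding zero-inflated NB form.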
In recent years, there has been considerable interest in regression models based on zero-inflated count models. Much of this interest stems from the seminal paper of Lambert (1992) though this type of model appears to have originated in the econometrics literature. Mullahy (1986) first formulated the zero-inflated Poisson (ZIP) regression model and such models have since been applied in many topic areas: the number of defects in a manufacturing process (Lambert, 1992); the abundance of rare species (Welsh et al., 1996); road accident frequencies (Shankar et al., 1997, Shankar et al., 2003, Qin et al., 2004, Kumara and Chin, 2003, Lee and Mannering, 2002); dental caries epidemiology (Bohning et al., 1999); pharmaceutical utilization and expenditure (Street et al., 1999); early growth and motor development (Cheung, 2002); and physical activity (Slymen et al., 2006).
In addition to zero-inflated models, there are many further extensions to the classical P and NB models, such as finite mixture models. These finite mixture models are particularly useful for heterogeneous populations that incorporate a combination of counts and continuous representation of population heterogeneity. For a mathematical derivation and discussion of the application of finite mixture models, readers are referred to McLachlan and Peel (2000). Most recently, Park and Lord (2009) have proposed finite mixtures of P and NB models for analyzing motor vehicle crash data.
The modelling considerations raised above have significant implications for the description of falls data, and published studies have used a variety of statistical approaches. To our knowledge, the full range of P and modified P (i.e. NB and zero-inflated) models has not been formally compared in terms of applicability to falls data. Although Robertson et al. (2005) used the NB model in their consideration of statistical models for falls intervention trials, they compared it to two survival analysis models (the Andersen-Gill and marginal Cox regression) and not directly to other count distributions.
The aim of this paper is therefore to compare the applicability of statistical count distributions to falls count data and to provide a clear rationale for future falls distribution-modelling approaches. In doing so, this study provides defensible guidance on how to appropriately model falls data in studies aiming to describe trends in injury numbers and rates. The paper has five objectives: to (1) overview the rationale for, and use of, P, NB, ZIP and zero-inflated negative binomial (ZINB) models, (2) apply the four models to real-world falls count data and compare how well the various models approximate these data, (3) formally compare the four models, (4) report a statistical simulation experiment as a means of assessing the size and power of the model fit, and (5) compare the NB model with finite mixtures of P or NB estimated using the same data.
Section snippets
Methods
A description of the data used in the example is first presented, so that the relevant features of the four regression models can be later described in the specific context of these data.
Model estimation framework
For each of the four model types, the maximum likelihood estimation (MLE) method was used to estimate the μ, k and ϕ parameters, as relevant, and their corresponding standard errors and confidence limits for the falls count data. MLE was chosen over other estimators because it has the properties of consistency, asymptotic normality and minimum variance for large samples. The MLE method was used to fit the falls data by applying a generalised linear model from underlying P, NB or
Model accuracy
The most common criterion for evaluating the performance of a statistical model is its accuracy in terms of fitting the data. Let $f_i$ denote the observed frequency of the ith fall and $\hat{f}_i$ denote the fitted frequency. The error is defined as $e_i = f_i - \hat{f}_i$ and the percentage error is $p_i = 100 e_i / f_i$. Percentage errors have the advantage of being scale independent, so they are frequently used to compare model performance between different data series (Hyndman and Koehler, 2006). The most widely used measures
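The error and percentage error can be computed as in the following minimal sketch, which assumes the error is the observed frequency minus the fitted frequency. The function name and the frequencies are hypothetical, and the percentage error is left undefined when an observed frequency is zero.

```python
def percentage_errors(observed, fitted):
    """Return (error, percentage error) pairs for observed and fitted frequencies."""
    results = []
    for f_obs, f_fit in zip(observed, fitted):
        error = f_obs - f_fit
        # The percentage error is undefined when the observed frequency is zero
        pct = 100.0 * error / f_obs if f_obs != 0 else None
        results.append((error, pct))
    return results

# Hypothetical frequencies of people with 0, 1, 2 and 3 falls (illustrative only)
observed = [50, 30, 15, 5]
fitted = [45.0, 36.0, 14.0, 5.0]
errors = percentage_errors(observed, fitted)
for n_falls, (error, pct) in enumerate(errors):
    print(n_falls, error, pct)
```

Because each percentage error is divided by its own observed frequency, a small absolute error in a sparse cell can produce a large percentage error.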
Comparing models
Four criteria were used to compare and select among the considered models: the likelihood ratio test, F-test, Vuong statistic and bootstrap test. The likelihood ratio test is well understood and is not discussed further. The basic criterion of the F and bootstrap tests is to compare two models where one model is nested within the other (i.e. when one model is an extension of the other). For example, the P model is nested within the NB model and there is therefore a need to test if there is
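For the nested P and NB pair, a likelihood ratio comparison can be sketched as follows for the simplest case with no covariates. This is an illustration under stated assumptions rather than the procedure used in the study: the NB variance is taken to be μ + μ²/k, k is found by a crude grid search, the p-value uses an unadjusted chi-square reference distribution with one degree of freedom, and the fall counts are hypothetical.

```python
import math

def poisson_loglik(counts, mu):
    return sum(-mu + y * math.log(mu) - math.lgamma(y + 1) for y in counts)

def nb_loglik(counts, mu, k):
    return sum(
        math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
        + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu))
        for y in counts
    )

def lr_test_poisson_vs_nb(counts):
    # Without covariates, the sample mean maximises the likelihood for mu in both models
    mu = sum(counts) / len(counts)
    ll_p = poisson_loglik(counts, mu)
    # Crude grid search for k on a log scale from 0.01 to 1000
    grid = [10 ** (-2 + 5 * i / 500) for i in range(501)]
    k_hat = max(grid, key=lambda k: nb_loglik(counts, mu, k))
    ll_nb = nb_loglik(counts, mu, k_hat)
    lr = max(0.0, 2.0 * (ll_nb - ll_p))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return {"k_hat": k_hat, "ll_poisson": ll_p, "ll_nb": ll_nb, "lr": lr, "p_value": p_value}

# Hypothetical fall counts for 20 people (illustrative only)
falls = [0] * 14 + [1, 1, 2, 3, 5, 8]
result = lr_test_poisson_vs_nb(falls)
print(result)
```

Because the P model corresponds to the boundary of the NB parameter space (k tending to infinity), the unadjusted p-value in this sketch is expected to be conservative.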
Simulation framework
Simulation studies are increasingly being used in the public health literature for a wide variety of situations (Vaeth and Skovlund, 2004). There are several advantages of simulations compared with collecting and/or analyzing real data (Burton et al., 2006, Demirtas, 2007). Firstly, a large number of samples of representative falls data can be created rather than being restricted to using only one (or just a few) dataset and this enables the distributions of statistical parameters to be
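A simulation of this kind can be sketched by repeatedly generating NB-distributed counts and recording a statistic from each simulated dataset. The sketch below assumes counts are drawn as a P variable whose mean follows a Gamma distribution with shape k and scale μ/k. All names, parameter values and the choice of statistic (the proportion of zeros) are hypothetical and do not reproduce the simulation design of the study.

```python
import math
import random

def draw_poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def draw_nb(mu, k, rng):
    """Draw one NB count as a Poisson count with a Gamma-distributed mean."""
    lam = rng.gammavariate(k, mu / k)  # shape k, scale mu / k, so the mean is mu
    return draw_poisson(lam, rng)

def simulate_zero_proportions(mu, k, n_people, n_datasets, seed=1):
    """Proportion of zero counts in each of n_datasets simulated samples."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_datasets):
        sample = [draw_nb(mu, k, rng) for _ in range(n_people)]
        proportions.append(sample.count(0) / n_people)
    return proportions

props = simulate_zero_proportions(mu=1.0, k=0.5, n_people=200, n_datasets=200)
mean_zero_prop = sum(props) / len(props)
theoretical = (0.5 / (0.5 + 1.0)) ** 0.5  # NB probability of zero: (k / (k + mu)) ** k
print(f"simulated: {mean_zero_prop:.4f}, theoretical: {theoretical:.4f}")
```

The spread of `props` across the simulated datasets gives an empirical sampling distribution for the statistic, which cannot be obtained from a single observed dataset.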
Comparison of finite mixture models with standard and zero-inflated P and NB models
The Poisson and NB mixture models with a fixed number of components (K = 2, 3) were estimated with the expectation-maximization (EM) algorithm within a maximum likelihood framework and with Markov Chain Monte Carlo (MCMC) sampling within a Bayesian framework (Stasinopoulos and Rigby, 2007, Leisch, 2004). Models were compared using a penalised-likelihood approach for model selection: Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) (Park and Lord, 2009, Warton, 2005
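Both criteria penalise the maximised log-likelihood by the number of estimated parameters, as in the following minimal sketch. The log-likelihood values, parameter counts and sample size are hypothetical and are not results from the study.

```python
import math

def aic(loglik, n_params):
    """Akaike's information criterion: 2p - 2 log L."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: p log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * loglik

# Hypothetical (maximised log-likelihood, number of parameters) for two candidate models
candidates = {"P": (-37.9, 1), "NB": (-24.8, 2)}
n_obs = 20

scores = {
    name: {"aic": aic(ll, p), "bic": bic(ll, p, n_obs)}
    for name, (ll, p) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name]["aic"])
best_by_bic = min(scores, key=lambda name: scores[name]["bic"])
print(scores, best_by_aic, best_by_bic)
```

Smaller values indicate a better trade-off between fit and complexity, and BIC applies the heavier penalty per parameter once the sample size exceeds about seven observations.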
Conclusions
There are several well-developed potential statistical models for analyzing falls count data but, to date, there has been little guidance on which is the most appropriate approach to use, and there are many published studies that have used incorrect statistical models for analyzing over-dispersion and recurrent fall events (Donaldson et al., 2009). Robertson et al. (2005) compared the NB model to two survival analysis models using two datasets, and concluded that the NB model was as appropriate
Acknowledgements
Project work supported by a grant from the Australian Government Department of Health and Ageing to undertake falls modelling research provided the impetus for this paper. John Campbell and Clare Robertson, Department of Medical and Surgical Sciences, Dunedin School of Medicine, University of Otago, New Zealand provided the falls data from the New Zealand trial used in this study. Dominique Lord, Zachry Department of Civil Engineering, Texas A&M University and Byung-Jung Park, Texas
References (58)
- et al. The epidemiology of hospitalised wrist fractures in older people, New South Wales, Australia. Bone (2006)
- et al. Application of negative binomial modelling for discrete outcomes: a case study in ageing research. Journal of Clinical Epidemiology (2003)
- et al. The power of bootstrap and asymptotic tests. Journal of Econometrics (2006)
- et al. Analysis of zero-adjusted count data. Computational Statistics and Data Analysis (1996)
- et al. Falls among healthy, community-dwelling, older women: a prospective study of frequency, circumstances, consequences and prediction accuracy. Australian and New Zealand Journal of Public Health (1999)
- et al. Another look at measures of forecast accuracy. International Journal of Forecasting (2006)
- et al. Impact of roadside features on the frequency and severity of run-off-roadway accidents: An empirical analysis. Accident Analysis and Prevention (2002)
- Modeling motor vehicle crashes using Poisson-gamma models: examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accident Analysis and Prevention (2006)
- et al. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis and Prevention (2005)
- et al. Further notes on the application of zero-inflated models in highway safety. Accident Analysis and Prevention (2007)
- The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis and Prevention
- Current costing models: are they suitable for allocating health resources? The example of fall injury prevention in Australia. Accident Analysis and Prevention
- Specification and testing of some modified count data models. Journal of Econometrics
- Application of finite mixture models for vehicle crash data analysis. Accident Analysis and Prevention
- Selecting exposure measures in crash rate prediction for two-lane highway segments. Accident Analysis and Prevention
- Modeling accident frequencies as zero-altered probability processes: an empirical inquiry. Accident Analysis and Prevention
- Modeling crashes involving pedestrians and motorized traffic. Safety Science
- A risk model for the prediction of recurrent falls in community-dwelling elderly: a prospective cohort study. Journal of Clinical Epidemiology
- Cost-sharing and pharmaceutical utilisation and expenditure in Russia. Journal of Health Economics
- Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling
- Repeated measures analysis of binary outcomes: applications to injury research. Accident Analysis and Prevention
- Fitting the negative binomial distribution to biological data. Biometrics
- The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society A
- Incidence of hip fracture in New South Wales: are our efforts having an effect? The Medical Journal of Australia
- The design of simulation studies in medical statistics. Statistics in Medicine
- Regression Analysis of Count Data
- Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine
- Modeling count data with excess zeroes: an empirical application to traffic accidents. Sociological Methods and Research
- Letter to the Editor re: the design of simulation studies in medical statistics. Statistics in Medicine