Multivariate poisson-lognormal model for modeling related factors in crash frequency by severity
Mehdi Tazhibi1, Iraj Kazemi2, Somaye Momenyan1, Hossein Haghshenas3
1 Department of Biostatistics and Epidemiology, School of Public Health, Isfahan University of Medical Sciences, Isfahan, Iran
2 Department of Statistics, University of Isfahan, Isfahan, Iran
3 Department of Transportation, School of Transportation, Isfahan University, Isfahan, Iran
|Date of Web Publication||31-Jan-2013|
Department of Statistics, University of Isfahan, Isfahan
Source of Support: None, Conflict of Interest: None
Aims: Traditionally, roadway safety analyses have used univariate distributions to model crash data for each level of severity separately. This paper uses the multivariate Poisson lognormal (MVPLN) models to estimate the expected crash frequency by two levels of severity and then compares those estimates with the univariate Poisson-lognormal (UVPLN) and the univariate Poisson (UVP) models. Materials and Methods: The parameters estimation is done by Bayesian method for crash data at two levels of severity at the intersection of Isfahan city for 6 months. Results: The results showed that there was over-dispersion issue in data. The UVP model is not able to overcome this problem while the MVPLN model can account for over-dispersion. Also, the estimates of the extra Poisson variation parameters in the MVPLN model were smaller than the UVPLN model that causes improvement in the precision of the MNPLN model. Hence, the MVPLN model is better fitted to the data set. Also, results showed effect of the total Average annual daily traffic (AADT) on the property damage only crash was significant in the all of models but effect of the total left turn AADT on the injuries and fatalities crash was significant just in the UVP model. Hence, holding all other factors fixed more property damage only crashes were expected on more the total AADT. For example, under MVPLN model an increase of 1000 vehicles in (average) the total AADT was predicted to result in 31% more property damage only crash. Conclusion: Hence, reduction of total AADT was predicted to be highly cost-effective, in terms of the crash cost reductions over the long run.
Keywords: Bayesian method estimation, crash severity, markov chain monte carlo simulation, multivariate lognormal distribution
|How to cite this article:|
Tazhibi M, Kazemi I, Momenyan S, Haghshenas H. Multivariate poisson-lognormal model for modeling related factors in crash frequency by severity. Int J Env Health Eng 2013;2:30
|How to cite this URL:|
Tazhibi M, Kazemi I, Momenyan S, Haghshenas H. Multivariate poisson-lognormal model for modeling related factors in crash frequency by severity. Int J Env Health Eng [serial online] 2013 [cited 2019 Dec 11];2:30. Available from: http://www.ijehe.org/text.asp?2013/2/1/30/114193
| Introduction|| |
Deaths and injuries related to the road crash are main worry of many governments. Globally, the number of people killed in the road traffic accidents annually is estimated almost 1.2 million, while the number of people injured could be as large as 50 million.  Due to these problems there has been extensive investigation on the prediction models and traffic crash analysis. ,,,,, The accident data are usually collected by severity (e.g., fatal, injury and property damage only).
Severity levels are correlated because the unobservable or omitted variables are expected to be common at all severity levels for a particular segment of roadway or for one intersection. Hence, crash data by severity are basically multivariate in nature. ,,,
Traditionally, analyses have used univariate distributions to model crash data at different levels of severity separately such as Poisson and negative binomial model. ,,,, The application of the univariate model ignores the correlation that exists across crash rates at different levels of severity. This issue leads to the inaccurate estimates for effects of factors. 
Crash data usually have excess zeroes rather than those expected with a Poisson model. , In other words, if the corresponding variance is greater than its expectation in a Poisson model then the over-dispersion problem occurs. To takes into account this extra Poisson variation a random effect term with the gamma or the lognormal distribution enters in the model. Karlis, Ma and Tsionas used the multivariate Poisson (MVP) model to evaluate the effects of different covariates on the collision counts at different levels of severity. Karlis, Ma and Tsionas ,, However, the assumption of non negative covariance component and the over-dispersion problem still exist in the MVP model. The multivariate negative binomial (MVNB) model may solve the over-dispersion problem  for the non negative correlations. While the multivariate Poisson-lognormal (MVPLN) model accounts for the over-dispersion and also for several correlation structures. ,,,,
Then, MVPLN model is more ideal than the MVP and MVNB models, since (i) it accounts for over-dispersion (extra Poisson variation) and (ii) it takes into consideration general correlation structures.  The execution of MVPLN models is not straightforward. The parameter estimation is done within the Bayesian model, using a Markov chain Monte Carlo (MCMC) simulation method.  In this paper, we use the MVPLN model for the analysis of crash data by two levels of severity (Property damage only crash and Injuries plus fatalities crash). The parameters estimation is done in a Bayesian perspective by the Gibbs Sampler and the Metropolis-Hastings (M-H) algorithms that can be carried out using freely available Open BUGS software version 3.2.1. The data collected at the intersection in Isfahan for 6 months.
This paper is organized as follows. First, there is a statistical methodology is reviewed and then data set is described. The paper concludes with a results, discussion and recommendations for future research.
| Methodology|| |
Let yik (i = 1,...,n;k = 1,...,K) denotes the number of collision at the ith intersection and kth category. Assume that observations are uncorrelated across intersections but correlated over categories.
Let further Y be an n × K matrix and their elements Yik follow distributed as Poisson distribution with the mean rate parameter λik .
Assume that the Poisson rates are modeled by a function of covariates following a lognormal distribution, that is
where x′i = (xil,…,x ij)indicates the J-dimensional row vectors of covariate. Let we define X′ = (x l ,…,xn) the βk be the J-dimensional vectors β′k = (β lk,…,βjk) and B = (β l ,…,β K ) be the matrix of regression coefficients.
We model the correlation between yi1 ,…, yiK by incorporating the unobserved heterogeneity components εi 1 ,…,εiK in the model and assume that the vector εi is distributed as K-variate normal with mean zero and variance-covariance matrix ∑, so that ∑ is matrix with general structure to accommodate the correlation among the εik .
This model specification implies that the vector λi follows a K-variate lognormal distribution and the response vector Yi follows a K-variable Poisson-lognormal distribution.
The probability density function of Yi is given by
where ɸk is the K-variate normal distribution. The marginal distribution of the crash counts yi cannot be obtained by direct computation. Obtaining the marginal distribution requires the evaluation of a K-variate integral of the Poisson distribution with respect to the distribution of εi .
The expectation and the variance of the marginal joint distribution of yi and the covariance between the counts can be derived directly. Chib and Winkelmann  show that:
Hence this model allows for the over-dispersion because the variance yik excess the expectation as long as σkk > 0 for k = 1,…, K and the correlation structure is unrestricted because sign of σkj may be positive or negative. In the UVP model for testing over-dispersion we may use the Pearson's Chi-Square statistic divided by the degrees of freedom, i.e., . The value of χ2 greater than 1 indicates the over-dispersion problem.
The Markov Chain Monte Carlo (MCMC) method under the Bayesian framework can be used to compute the integrals and subsequently to utilize the parameter estimation.
At level one, we suppose that the parameters (B,∑) independently have the prior distributions:
where R is scale matrix and k usually assumed to be equal to the rank of matrix R.
In the second level according to Bayes' theorem (posterior ˜ prior × likelihood) the posterior distribution of parameters is obtained. Now the posterior distributions can be obtained using MCMC sampling.
The ∑−1 can be sampled using a Gibbs sampler. In contrast, the posterior distributions of the B and the εi are not in standard forms and thus the Metropolis-Hastings algorithm is used. The MCMC techniques are available in Open BUGS version 3.2.1. MCMC approach generates values of the parameters from approximate distributions and which are distributions that converge to the target posterior distributions. After converging, we can do those Bayesian inferences of the parameters based on the sample mean and sample standard deviation and also based on the sample 2.5 th percentile and the 97.5 th percentile
The interval between the 2.5 th percentile and the 97.5 th percentile are considered as the 95% credible intervals.
The Deviance Information Criteria (DIC) is a standard criterion in the Bayesian models comparison. As a goodness-of-fit measure, DIC is a Bayesian generalization of Akaike's Information Criteria (AIC) that penalizes complex models with larger parameter.
Here, we fit the MVPLN model to the crash data that are classified into K = 2 groups: The property damage only, k = 1, and the injuries plus fatalities, k = 2. This data set collected from 65 intersections in Isfahan city. Although, the original data contained the crash data from 280 intersections, we only used those where the traffic volume data were available. Thus, the sampling method is simple random sampling and size of our sample is 65 intersections. The traffic volume and crash data were provided for a 6-month period from calendar years 2011.
The term "crash", as defined by traffic police of Iran, refers to reportable on-street crash that includes at least one motor vehicle, and results in injury, at least $1,000 in property damage, or both. The data set does not include non-vehicular crash (e.g., cyclist hitting a pedestrian) and includes crash that is forwarded to the police service within a specified period. Total AADT is defined as entering traffic flow for total legs of intersection and total left turn AADT is defined as entering traffic flow for total curb left turn of intersection.
In all of models, crash counts at two levels of severity are considered as dependent variables and total AADT and total left turn AADT variables as independent variables.
There were 182 accidents of property damage only and 66 accidents of injuries plus fatalities at those 65 intersections for 6 month. [Table 1] contains summary statistics of variables of interest.
| Results|| |
We emphasize that while the MVPLN can handle more than two collision categories, here we use only two severity levels leading to a bivariate Poisson-lognormal.
Posterior distributions of parameters were estimated using the OpenBUGS software version 3.2.1. For the underlying models, 10,000 iterations were discarded as burn-in sample to eliminate the influence of starting values and then 100,000 iterations followed to obtain Bayes estimates (posterior means and standard deviations) of the regression coefficients β. Convergence was assessed by visual inspection of the Markov chain for all parameters. Trace plots of the model parameters and Mont Carlo errors was checked. As a rule of thumb, ratios of the Monte Carlo errors relative to the respective standard deviations of the estimates should be less than 0.05.
Results of [Table 2], [Table 3] and [Table 4] show that there are small differences between estimation of the parameters under three models: MVPLN, UVPLN and UNP. Also, they show that some of the parameter estimates are significant in the 95% credible interval. For example the effect of total AADT on the property damage only crash is significant in the all of models. Also, the effect of total left turn AADT on the injuries plus fatalities crash is significant just in the UNP model. Hence, holding all other factors fixed more property damage only crashes are expected on more total AADT. For example, under MVPLN model an increase of 1000 vehicles in (average) total AADT (rising from 4053 to 5053 vehicles) is predicted to result in 31% more property damage only crash, and under the UNP model an increase of 100 vehicles in (average) the total left turn AADT (rising from 1205 to 2205 vehicles) is predicted to increase injuries plus fatalities crash count by 10%.
Results of [Table 2] show that the value of Pearson's Chi-Square divided by degrees of freedom for two levels of severity is also greater than 1. The UNP model is not able to overcome the problem while the MVPLN model can take in to account it. [Table 4] shows that the estimates of the extra Poisson variation parameters in the MVPLN model are greater than zero and as Chib and Winkelmann (1995) show when those parameters are positive the variance will be greater than the expectation. Hence, the MVPLN model can capture over-dispersion.
It is known that the over-dispersion makes the standard errors to be underestimated. In which case standard errors are not accurate estimates of true uncertainties and the confidence interval estimates will not capture the true parameter values. We emphasize that for the unbiased estimates the small standard errors it is mean more precise estimates. This is particularly crucial in cases that involve the most severe crashes: Those that involve fatalities or major injuries. Because the final objective of any highway safety analysis is to reduce the frequency and severity of crashes, it follows that obtaining the most precise estimates of crash frequency at the highest severity levels is highly desirable. By comparing the standard deviation of parameters in [Table 2], [Table 3] and [Table 4], we can see that under UVP model the values of standard deviation of parameters are underestimated and that is why more variables got significant under the UVP rather than the MVPLN and the UVPLN model.
Also, the estimates of the extra Poisson variation parameters in the MVPLN model were smaller than the UVPLN model. As Basyouny and Sayed  showed:
Since the second term in the above equation dominates the first, we obtain
where are the estimates of the extra Poisson variation parameters under the MVPLN and the UVPLN, respectively. As precision is inversely proportional to the variance of expected collision frequency, it is estimated that the MVPLN model is more precise than the UVPLN model. This is because the MVPLN model takes into account the correlation among two levels of severity.
The correlation was estimated 0.712, which is significant. In real meaning, the higher property damage only crash rates are related to the higher injuries plus fatalities crash rates that even after controlling for the covariates, there is extra correlation between these two levels of severity that is not explained and is due to same deficiency in intersection design and/or other unobserved variables. In terms of goodness-of-fit, the Deviance Information Criteria (DIC) quantifies the relative goodness-of-fit of the models; therefore, it is useful for comparing models. As are presented in [Table 2], [Table 3] and [Table 4] the MVPLN model provides a better fit over the UVPLN and the UVP model as DIC for the MVPLN model is smaller than the sum of DICs for two levels of severity under the UVPLN and the UVP model.
| Discussion|| |
Traditionally, univariate models are used to analyze crash data for different levels of severity separately. This research used a MVPLN model to jointly analyze a sample of crash counts classified by two levels of severity at 65 intersections in Isfahan city. Furthermore, for comparison the UVPLN and the UVP models were fitted for each of severity levels. Results showed that the MVPLN model preferred rather than the UVPLN and the UVP model. Because univariate models neglect the correlation of crash counts at different levels of severity while the MVPLN model allows for a general correlation structure among different levels of severity, as well as handing over-dispersion. Parameters estimates of all models are achieved within the Bayesian framework that is implemented by using the OpenBUGS software.
Both, model parameter estimates and their variances were discussed. As expected, the results provide several recommendations for urban intersection safety treatments. For example, reduction of total AADT is predicted to be highly cost-effective, in terms of the crash cost reductions over the long run. Findings of this paper confirm with those already illustrated by other authors. For example in the study that is implemented in Washington State on rural two-lane Roadways by Ma et al., (2006) an increase of 1000 vehicles in average annual daily traffic level (rising from 3757 to 4757 vehicles) was predicted to increase total crash count by 16.4% and the disabling crash count by 40% at MVPLN model.  Park and Lord (2007) applied three models of the MVPLN, the UVP and the UVNB on the crash count data of five different severity levels collected from unsignalized intersections in California. They showed that effects of minor and major AADT of the intersection are significant at the five severity levels for three models. In their study the variables such that the Painted Left Turn and the Curb Med Left Turn were checked just two variables minor and major AADT were significant at two levels of severity. Thus, the reduction of traffic volume at the intersection is more considerable than other variables to reduce accident at two levels of severity. They also illustrated the over-dispersion problem in their data and thus fitting the UVP model leads to underestimation of standard deviations.  Also, Aguero-Valverde (2012) showed that effect of AADT on crash for rural two-lane segments that located in Central Pennsylvania is significant when fitting Poisson gamma, Poisson lognormal, and zero inflated random effects models.  In another study, Aguero-Valverde applied the MVPLN model for five levels of severity. He examined the ADT variable simultaneously with variables such that the Shoulder width, the Lane width and the Speed limit. Among all variables only ADT was significant at all levels of severity. El-Basyouny and Sayed (2009) used the MVPLN and the UNPLN for the purpose of developing collision prediction models relating the safety of urban 4-leg intersections to their traffic flows. They showed the effects of minor and major AADT are significant at two levels of severity (Property damage only and Injuries and fatalities crash) under two models MVPLN and UVPLN. they further showed the MVPLN model offers more precise estimates than the UVPLN model. 
The correlation among two levels of severity was estimated 0.71, which this correlations may be caused by omitted variables (such as painted left turn, curb med left turn, Speed limit and surrounding land use), which can influence crash occurrence at all levels of severity. Essentially, higher crash rates of one type are associated with higher crash rates of other types. The MVPLN model accounts this correlation. That is the reason why it offers more precise estimates than the UVPLN model. results showed the estimates of the extra Poisson variation parameters in the MVPLN model were smaller than the UVPLN model. Since variance of expected collision frequency is function of the extra Poisson variation parameters and the precision is inversely proportional to the variance of expected collision frequency then precision of the MVPLN model is larger than the UVPLN model.
In terms of goodness-of-fit, the MVPLN model provides a better fit over the two other models as the DIC for the MVPLN model is less than the sum of DICs for the UVPLN and the UVP models. This paper is based on a small dataset and for a short period but despite the limitations of these data results of this paper conform to those in literatures. Nevertheless, there are challenges for further research, for example crash data usually have low sample mean values, small sample size, and statistical models become very unstable when they are estimated using this kind of data. , Hence, stability of models should be investigated along with the sensitivity analysis of models for different hyper-prior specifications. Furthermore, the changes in the signs of coefficients between different levels of severity need to be investigated. AADT models suffer from omitted variables bias but such models are well accepted in the traffic safety literature.
In this paper, two severity levels were investigated. However, different levels of severity could be included (e.g., fatality, severe injury, light injury, PDO, etc.), different crash types could be considered (e.g., angle, head-on, rear-end, sideswipe, etc.), or geometric design feature that are not available in our dataset could be incorporated in the model. By means of improvement in statistical tools, which allows for a more precise analysis and the appropriate dataset transportation engineers are able to better understand the relationship between crash counts and related factors in its occurrence.
| Acknowledgments|| |
The authors would like to thank the contribution of the traffic control and traffic police center of Isfahan, which provided the data for this analysis. (Project Number: 390592).
| References|| |
|1.||Eksler V, Lassarre S, Thomas I. Regional analysis of road mortality in Europe. Public Health 2008;122:826-37. |
|2.||Kim DG, Lee Y, Washington S, Choi K. Modeling crash outcome probabilities at rural intersections: Application of hierarchical binomial logistic models. Accid Anal Prev 2007;39:125-34. |
|3.||Lord D, Manar A, Vizioli A. Modeling crash-flow-density and crash-flow-V/C ratio relationships for rural and urban freeway segments. Accid Anal Prev 2005;37:185-99. |
|4.||Miaou SP, Lord D. Modelingtraffic crash-flow relationships for intersections: Dispersion parameter, functional form, and Bayes versus empirical Bayes methods. Transp Res Rec 2003;1840:31-40. |
|5.||Miaou SP, Song JJ. Bayesian ranking of sites for engineering safety improvements: Decision parameter, treatability concept, statistical criterion, and spatial dependence. Accid Anal Prev 2005;37:699-720. |
|6.||Oh J, Lyon C, Washington S, Persaud B, Bared J. Validation of FHWAcrash models for rural intersections: Lessons learned. Transp Res Rec 2003:1840:41-9. |
|7.||Oh J, Washington S, Choi K. Development of accident prediction models for rural highway intersections. Transp Res Rec 2004;1897:18-27. |
|8.||Aguero-Valverde J, Jovanis PP. Bayesian multivariate Poisson lognormal models for crash severity modeling and site ranking. Transp Res Rec 2009;2136:82-91. |
|9.||Ma J, Kockelman KM, Damien P. A multivariate Poisson-lognormal regression model for prediction of crash counts by severity, using Bayesian methods. Accid Anal Prev 2008;40:964-75. |
|10.||Park ES, Lord D. Multivariate Poisson-lognormal models for jointly modeling crash frequency by severity. Transp Res Rec 2007;2019:1-6. |
|11.||El-Basyouny K, Sayed T. Collision prediction models using multivariate Poisson-lognormal regression. Accid Anal Prev 2009;41:820-8. |
|12.||Balkin S, Ord JK. Assessing the impact of speed-limit increases on fatal interstate crashes. J Transp Stat 2001;4:1-26. |
|13.||Pernia J, Lu JJ, Peng H. Safety Issues Related to Two-way Left-turn Lanes; Transportation Research Board of the National Academies (Transp Res Rec): 2004. |
|14.||Vogt A, Crash models for rural intersections: Four-lane by two-lane stop-controlled and two-lane by two-lane signalized; Transportation Research Board of the National Academies (Transp Res Rec): 1999. |
|15.||Vogt A, Bared J. Accident models for two-lane rural segments and intersections. Transp Res Rec 1998;1635:18-29. |
|16.||Zegeer CV, Stewart JR, Huang HH, Lagerwey PA. Safety Effects of marked vs. unmarked crosswalks at uncontrolled locations: Executive summary and recommended guidelines; Transportation Research Board of the National Academies (Transp Res Rec): 2002. |
|17.||Lord D, Washington S, Ivan JN. Further notes on the application of zero-inflated models in highway safety. Accid Anal Prev 2007;39:53-7. |
|18.||Lord D, Washington SP, Ivan JN. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: Balancing statistical fit and theory. Accid Anal Prev 2005;37:35-46. |
|19.||Karlis D. An EM algorithm for multivariate Poisson distribution and related models. J Appl Stat 2003;30:63-77. |
|20.||Ma J, Kockelman KM. Bayesian multivariate Poisson regression for models of injury count, by severity. Transp Res Rec 2006;1950:24-34. |
|21.||Tsionas EG. Bayesian multivariate Poisson regression. Commun Stat Theory Methods 2001;30:243-55. |
|22.||Ladron de Guevara F, Washington S.P, Oh J. Forecasting crashes at the planning level: Simultaneous negative binomial crash model applied in Tucson, Arizona. Transp Res Rec 2004;1897:191-9. |
|23.||Tunaru R. Hierarchical Bayesian models for multiple count data. Aust J Soil Res 2002;31:221-9. |
|24.||Chib S, Winkelmann R. Markov chain Monte Carlo analysis of correlated count data. J Bus Econ Stat 2001;19:428-35. |
|25.||Aguero-Valverde J. Full Bayes Poisson gamma, Poisson lognormal, and zero inflated random effects models: Comparing the precision of crash frequency estimates.Accid Anal Prev 2013;50:289-97. |
|26.||Lord D. Modeling motor vehicle crashes using Poisson-gamma models: Examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accid Anal Prev 2006;38:751-66. |
|27.||Lord D, Miranda-Moreno LF. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: A Bayesian perspective. Saf Sci 2008;46:751-70. |
[Table 1], [Table 2], [Table 3], [Table 4]