High-resolution traffic accident prediction in Berlin

Problem

Traffic accidents are one of the leading causes of death and injuries globally, and in Germany they cause over 3,000 deaths and 300,000 injuries yearly. Studying traffic accidents and where and why they happen can help policy makers make our roads safer.

There are at least two key issues to tackle when predicting traffic accidents. The first is granularity: the spatial precision at which accidents are predicted. One approach is to focus on a few select roads that are covered by sophisticated measurement instruments such as loop detectors and cameras. This allows researchers to precisely measure important features such as traffic flow and lane occupancy. Another approach is to predict accidents, or their severity, on datasets with millions of accidents aggregated at the country level. This approach helps uncover patterns that would be difficult to find in smaller datasets, but it lacks the precision that could help policy makers improve traffic safety at the level of a community or a city.

The second key issue is computational. Although traffic accidents have serious consequences, they remain, luckily, rare events. However, many machine learning models perform poorly when predicting rare events, in part because they minimise the overall error rather than focus on detecting the rare class. Fortunately, computational approaches and models that tackle imbalanced datasets do exist and seem to be successful in predicting rare events.

We wanted to build on previous work in two main ways. First, inspired by a recent paper (Hébert et al. 2019), we applied a novel approach to traffic accident forecasting that aims to predict the risk of an accident on a particular road segment within a city road network on a particular day and hour. We believe that this approach is particularly useful for policy makers interested in improving road safety. Second, we explore and provide new evidence on the performance of different approaches to rare event prediction in the field of traffic accident forecasting.

Data

We conducted this study on accidents in Berlin, using the following data sources:

Records of traffic accidents in Berlin: Open Data Berlin shares records of all traffic accidents that happened between 2018 and 2020; a total of 38,851 occurrences. For this study we used data on GPS-coordinates, year, month, day of the week, and hour of the accident.
Berlin road segment data: We used two datasets provided by the Open Data Informationsstelle (ODIS):
- To match locations of traffic accidents (Figure 1a) to road segments in Berlin (Figure 1b) we used the existing geometric information dataset on road segments in Berlin. From here we also used data on the length of the road segment.
- We used the road segment surface dataset to extract information on whether a road segment is a main road or a side street.

Figure 1: (a) Collision points (above), (b) Road Network (below)

Weather data: Using the Wetterdienst API we collected data on temperature, humidity, precipitation duration, precipitation height, and visibility for every accident location, day of the week, and hour of the day between 2018 and 2020 from 5 Berlin weather stations.
Sun elevation: We used the Python API PySolar to collect data on sun elevation angles per date and location in Berlin.

Method

The workflow of this study can be seen in Figure 2.

Figure 2: Study workflow

Generation of negative examples

After collecting the raw data, we first generated negative examples, that is, data points for times and places where no accident happened, using Apache Spark with PySpark. In traffic accident forecasting only positive examples (that is, occurrences of accidents) are recorded. This makes sense, particularly as most such records come from a traffic authority responding to an accident. However, it also means that available datasets contain no negative examples at all, so they have to be generated. Luckily, having an accurate database of accidents (recorded by the Berlin police) means that we know when accidents did not happen: at all other possible locations and times that were not recorded as accidents! Of course, it is not feasible to analyse every example where an accident did not happen, as that would be computationally too expensive. What we did instead, following the usual approach, was to calculate all possible combinations of times and locations for negative events and then draw a sample from them. We tried different accident/non-accident ratios, as we describe later, but in the end we worked with 5 non-accidents for every accident.
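
The sampling idea can be illustrated with a minimal plain-Python sketch. It is not the PySpark job used in the study: the function name `sample_negatives`, the toy segment ids, and the tuple layout (segment, year, month, weekday, hour) are assumptions made for illustration, and enumerating every combination in memory only works for toy-sized inputs.

```python
import itertools
import random


def sample_negatives(segment_ids, years, months, weekdays, hours,
                     positives, ratio=5, seed=0):
    """Draw `ratio` non-accident examples for every recorded accident.

    `positives` is a set of (segment_id, year, month, weekday, hour) tuples.
    """
    # Every possible combination of location and time ...
    candidates = itertools.product(segment_ids, years, months, weekdays, hours)
    # ... minus the combinations that were recorded as accidents.
    negatives = [c for c in candidates if c not in positives]
    n_wanted = ratio * len(positives)
    if n_wanted > len(negatives):
        raise ValueError("not enough negative candidates for this ratio")
    rng = random.Random(seed)
    return rng.sample(negatives, n_wanted)


# Toy data (hypothetical): 3 segments, 1 year, 2 months, 7 weekdays, 24 hours.
positives = {(1, 2018, 1, 3, 8), (2, 2018, 2, 5, 17)}
negatives = sample_negatives(
    segment_ids=[1, 2, 3], years=[2018], months=[1, 2],
    weekdays=range(1, 8), hours=range(24), positives=positives)
print(len(negatives))  # 10
```

With 2 accidents and a ratio of 5, the sketch returns 10 distinct non-accident combinations, none of which coincides with a recorded accident.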

Data pre-processing

The major part of data pre-processing was matching the accident GPS locations to existing road segments in Berlin with geopandas. This was done by creating a spatial join function and specifying a buffer. The matching function finds which road segments intersect the circular area around a collision point and matches these segments to the collision point. Increasing (decreasing) the buffer value makes the circle larger (smaller), thereby increasing (decreasing) the number of road segments matched to a collision point. For example, setting the buffer to 2m (Figure 3a) identifies fewer road segments than setting the buffer to 20m (Figure 3b). A bigger buffer was also more reliable at detecting accidents that occurred at road intersections. We used different buffer values to make sure that all accident and non-accident GPS locations were matched to corresponding segments.

Figure 3: (a) Matching 2m buffer (left), (b) Matching 20m buffer (right)
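
To make the effect of the buffer concrete, here is a small plain-Python sketch of the matching rule. It is not the geopandas spatial join used in the study: it assumes planar coordinates in metres, treats each road segment as a single straight line, and the names `point_segment_distance` and `match_segments` as well as the toy coordinates are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the straight segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_segments(point, segments, buffer_m):
    """Return ids of all segments that intersect the circle around `point`."""
    return [seg_id for seg_id, (a, b) in segments.items()
            if point_segment_distance(point, a, b) <= buffer_m]


# Hypothetical layout: roads A and B meet at (0, 0), road C runs further east.
segments = {
    "A": ((-50.0, 0.0), (0.0, 0.0)),
    "B": ((0.0, 0.0), (0.0, 50.0)),
    "C": ((30.0, -40.0), (30.0, 40.0)),
}
collision = (-5.0, 1.0)
print(match_segments(collision, segments, buffer_m=2))   # ['A']
print(match_segments(collision, segments, buffer_m=20))  # ['A', 'B']
```

The collision lies 5 m from the intersection, so the 2 m buffer matches only road A, while the 20 m buffer also picks up the crossing road B.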

We then matched all other data to the midpoints of the road segments. For weather features, we used the data from the closest weather station and, if that data was missing, from the second closest. Since the Berlin accident dataset does not report the exact day of the month when an accident happened, we computed the monthly average per hour and used that information instead. Time features were recoded cyclically using sine and cosine functions to account for the fact that the extreme values (e.g. hours 1 and 23) have a similar meaning. The features we included in the model were:

month_cos and month_sin: cosine and sine of the cyclically encoded month
hour_cos and hour_sin: cosine and sine of the cyclically encoded hour
weekday: day of the week (Monday = 1, Sunday = 7)
collision_cnt: number of accidents at a road segment in the previous year/s
length_m: length of a road segment in meters
side_strt: road segment is a side street (1) or a main road (0)
temperature (°C)
humidity (%)
visibility (m)
prec_height: average mm per 10-minute interval in the hour before the event
prec_duration: average minutes per 10-minute interval in the hour before the event
sun_elevation_angle: solar elevation angle in degrees
collision: the target we are predicting; accident occurred (1) vs. not (0)
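
The cyclical encoding of the time features can be sketched as follows. The helper name `cyclical_encode` and the periods used here (12 for months, 24 for hours) are assumptions for illustration, not necessarily the exact values used in the study.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (cos, sin)."""
    angle = 2 * math.pi * value / period
    return math.cos(angle), math.sin(angle)


# November with an assumed 12-month period.
month_cos, month_sin = cyclical_encode(11, 12)
print(round(month_cos, 3), round(month_sin, 3))  # 0.866 -0.5

# With an assumed 24-hour period, hours 23 and 1 end up close together
# on the circle, while hours 1 and 12 are far apart.
h23 = cyclical_encode(23, 24)
h1 = cyclical_encode(1, 24)
h12 = cyclical_encode(12, 24)
print(math.dist(h23, h1) < math.dist(h1, h12))  # True
```

Each time feature becomes two columns, which is why the feature list contains a cosine and a sine column for both month and hour.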

We then split the dataset into a training set (years 2018 and 2019) and a test set (year 2020). The final training dataset had a total of 52,118 road segments where accidents occurred and 260,702 where they didn’t. Here’s a snippet of the final dataset we were working with:

X	segment_id	year	month_cos	month_sin	weekday	hour_cos	hour_sin	collision_cnt	side_strt	sun_elevation_angle	humidity	temparature	visibility	prec_height	prec_duration
0	42796	2019	0.866	-0.5	6	0.963	0.270	0	1	-53.638	86.187	4.081	20049.46	0.009	2.242
1	34322	2020	0.866	0.5	6	0.460	0.888	0	1	-36.877	88.226	3.458	23919.35	0.003	0.876
2	29497	2018	-1.000	0.0	2	-0.776	0.631	1	1	35.339	55.533	20.693	34946.67	0.002	0.431
3	39002	2020	0.000	-1.0	6	-0.335	0.942	0	1	1.920	76.000	14.227	36333.33	0.006	0.911
4	32881	2019	-0.866	-0.5	2	1.000	0.000	0	1	-14.045	72.806	17.171	40709.68	0.001	0.113
5	41584	2019	0.000	-1.0	5	-0.335	0.942	0	0	2.486	78.138	13.845	33616.67	0.005	0.928

And here’s what the variables’ distributions look like:

Figure 4: Histograms of features for the training dataset

Finally, we can already get a glimpse of whether some road segments are more dangerous than others. As Figure 5 below shows, it does seem that some long main road segments, those that lead around or through the city centre, are more dangerous than small side road segments. But to see how accurate such a prediction is, we need to train classification models.

Figure 5: Road segments with the highest (above, n = 1300) and lowest (below, n = 1300, random selection) number of accidents

Selected models

We implemented 7 machine learning models:

Logistic Regression (Logit)
Standard Random Forest (SRF)
Balanced Random Forest (BRF)
Standard Random Forest with SMOTE and RUS (SRF_SMOTE_RUS)
Balanced Bagging (BB)
Support Vector Classifier with SMOTE and RUS (SVC_SMOTE_RUS)
Extreme Gradient Boosting (XGBoost)

The models we selected are widely used for various problems, including the prediction of rare events. SMOTE and RUS deserve a short note, however. Synthetic Minority Over-sampling Technique (SMOTE) and Random Under-Sampling (RUS) are re-sampling techniques that have been found to work well for imbalanced datasets. A discussion of these techniques is beyond the scope of this article, but details can be found elsewhere (Chawla et al. 2002). For this study, we applied SMOTE and RUS using 30% over-sampling and 50% under-sampling. On a training dataset containing 300,000 observations with an imbalance factor of 5 (250,000 negative : 50,000 positive), this means that we first over-sample the minority class until it has 30% of the number of examples in the majority class (75,000), and then use random under-sampling to reduce the majority class to 150,000 examples, so that the minority class is 50% of its size.
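
The arithmetic of this two-step re-sampling can be checked with a few lines of Python. This is only a sketch of the class counts, not of SMOTE or RUS themselves; the function name `resampled_counts` is hypothetical, and both ratios are interpreted as the target value of minority / majority after the respective step, which is an assumption consistent with the numbers above.

```python
def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Class sizes after over-sampling the minority and under-sampling the majority.

    Both ratios are target values of minority / majority.
    """
    # Over-sampling: grow the minority class until minority / majority == over_ratio.
    minority_after = max(n_minority, round(over_ratio * n_majority))
    # Under-sampling: shrink the majority class until minority / majority == under_ratio.
    majority_after = min(n_majority, round(minority_after / under_ratio))
    return majority_after, minority_after


print(resampled_counts(250_000, 50_000, over_ratio=0.3, under_ratio=0.5))
# (150000, 75000)
```

After both steps the imbalance factor drops from 5 to 2, with 150,000 negative and 75,000 positive examples.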

Hyperparameter tuning

We used random search and grid search with 5-fold cross-validation to tune our hyperparameters. As a scoring metric we chose the Area Under the ROC curve. The search space covered the hyperparameters specific to each model architecture; for instance, in Random Forest models we optimized the number of trees, the maximum depth, and the minimum number of samples required at a terminal node, among others. For the models that use SMOTE and RUS, we also tuned the sampling strategies, that is, the intended class imbalance factor after re-sampling.
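
The structure of a grid search with 5-fold cross-validation is sketched below in plain Python. In practice a library routine (for example scikit-learn's GridSearchCV) would normally do this work; the function names, the `score_fn` callback, the toy scorer, and the grid values are all hypothetical and only illustrate the control flow.

```python
import itertools
import statistics


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def grid_search(param_grid, n_samples, score_fn, k=5):
    """Return the parameter combination with the best mean validation score.

    `score_fn(params, train_idx, valid_idx)` is a hypothetical callback that
    fits a model on the training fold and returns its validation AUC-ROC.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = statistics.mean(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Stand-in scorer so the sketch runs without a model: it simply prefers
# deeper trees and 200 trees. A real scorer would train and evaluate a model.
def toy_score(params, train_idx, valid_idx):
    return 0.8 + 0.01 * params["max_depth"] - abs(params["n_estimators"] - 200) / 10_000


grid = {"n_estimators": [100, 200, 400], "max_depth": [4, 8]}
best_params, best_score = grid_search(grid, n_samples=100, score_fn=toy_score)
print(best_params)  # {'max_depth': 8, 'n_estimators': 200}
```

Random search follows the same loop but draws a fixed number of parameter combinations at random instead of enumerating the full grid.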

Evaluation approach

We evaluated our models by calculating the Area under the ROC curve (AUC-ROC), which is considered a good general measure for classification problems. Additionally, we aimed for a model with a high Recall, accepting that this also implies a relatively higher false positive rate (FPR) and a lower Precision. As traffic accident prediction is a type of rare event prediction, we prioritised Recall because false positives can also correspond to high-risk circumstances, which we also want to detect and, ultimately, avoid. An overview of these performance metrics and the corresponding confusion matrix can be seen in Figure 6.

Figure 6: Calculation of Precision, Recall and F1-score in the Confusion Matrix
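
As a short illustration of these metrics, the sketch below computes Precision, Recall and F1-score from confusion-matrix counts. The function name and the counts are made up for the example and are not results from the study.

```python
def classification_metrics(tp, fp, fn):
    """Precision, Recall and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up counts: 100 real accidents, 90 of them detected, 60 false alarms.
precision, recall, f1 = classification_metrics(tp=90, fp=60, fn=10)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.60 0.90 0.72
```

The example shows the trade-off discussed above: catching 90% of the accidents comes at the price of many false alarms, which lowers Precision and, through the harmonic mean, the F1-score.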

Results

Before training the models on our dataset, we conducted an experiment to compare the performance of baseline models with different ratios of negative to positive events (non-accidents vs. accidents). We compared results from a logistic regression model and a random forest model with two different imbalance ratios that were previously described in the literature: around 2:1 (Santos et al. 2021) and 17:1 (Hébert et al. 2019). For this comparison we used only time features (year, month, weekday and hour) and the number of accidents per road segment.

Our results indicated that the random forest model yielded the highest Recall for both imbalance ratios (70.4%), but that Precision was substantially reduced by a high imbalance ratio. The logit model showed a similar trend, but with much worse evaluation metrics. AUC_ROC scores did not change much with the imbalance ratio. Therefore, we concluded that our models did not seem very sensitive to the imbalance factor of our data. This supported our approach of drawing a uniform random sample of non-accidents.

Finally, we trained our models on the training set and compared their performance on the test set. The results in Table 1 and Figure 7 show the following:

BRF distinguished between positive and negative classes the best, with an AUC_ROC of 88.7%. BRF also yielded the highest Recall (94.5%), which implies that it predicts most of the relevant results corresponding to high-risk situations, and still had an acceptable Precision (50.4%).
Similar to BRF, combining SRF with over-sampling and under-sampling of our imbalanced data resulted in a high AUC_ROC, a Recall above 80%, and a Precision of almost 70%.
Without any re-sampling, SRF also had a high AUC_ROC of 84%, with both Precision and Recall above 70%.
Most models performed similarly at a high Recall; the exception was BB, which had a low Precision. The ROC curves also suggest that BB performed worse than the other models.
Moreover, the Logit model actually performed quite well, with 86.8% AUC_ROC, 80.6% Recall, and 67.6% Precision.
SVC_SMOTE_RUS yielded the highest Precision (89.3%) but had the lowest AUC_ROC (below 80%). Although BB had an AUC_ROC and a Recall above 80%, its Precision was only 40%.

Therefore, we concluded that the best-performing model was the Balanced Random Forest (BRF), which detected 94.5% of the traffic accidents in our Berlin test set.

Table 1: Comparison of Classification Models
Model	AUC_ROC	Precision	Recall	F1_score
Logit_baseline	88.3%	58.1%	88.3%	70.0%
SRF_baseline	83.6%	73.3%	72.0%	72.3%
Logit	86.8%	67.6%	80.6%	73.6%
SRF	84.0%	77.5%	71.9%	74.6%
BRF	88.7%	50.4%	94.5%	65.7%
BB	80.5%	40.4%	83.7%	54.5%
SRF_SMOTE_RUS	87.4%	68.1%	81.8%	74.3%
SVC_SMOTE_RUS	75.4%	89.3%	52.0%	65.7%
XGBoost	85.4%	75.4%	75.4%	75.4%

Figure 7: Precision-Recall (above) and ROC Curves (below) of the 7 Classification Models

Analysis

Figure 8 below shows the most important features in our final model.

Figure 8: Feature importance in the Balanced Random Forest model

Our results echo those of previous studies: the number of previous accidents at a road segment was by far the most predictive feature. Some other road infrastructure features closely related to the number of accidents, such as segment length and whether a road is a side street, were also important. Among the features not related to infrastructure, sun elevation angle was the most important one. Unlike in previous studies, weather and time features were somewhat less important here. One possible reason is that the Berlin dataset does not record the exact day of the year when an accident happened, only the day of the week. This also meant that the weather features had to be averaged, which of course reduced their precision.

With our approach of combining re-sampling techniques (SMOTE and RUS) with a diverse set of models (linear, tree-based, SVC, XGBoost, Balanced Bagging), we achieved good performance with almost all models. However, we could not improve on established models such as the Balanced Random Forest, which was the best-performing model overall in terms of our chosen evaluation metrics.

Conclusion

Our study used several publicly available datasets to build high-resolution traffic accident prediction models with 7 classification algorithms. Our best-performing model, the Balanced Random Forest, achieved a Recall of 94.5% with a Precision of 50% when predicting road accidents in Berlin on a given road segment on a given day and hour. For future research, we suggest incorporating new features that describe the traffic flow and urban infrastructure (e.g., number of schools, proximity to central business districts, number of traffic lights) around the area where the accidents occurred, since these seem to be largely neglected in existing traffic prediction studies.

Ultimately, we hope that our project can help policy makers identify the most dangerous road segments per hour of the day and, thus, serve as a basis for action to reduce the risk of accidents. Our work also shows that providing administrative data, through initiatives such as Open Data Berlin, can create value for the general public.

Chawla, Nitesh V, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57.

Hébert, Antoine, Timothée Guédon, Tristan Glatard, and Brigitte Jaumard. 2019. “High-Resolution Road Vehicle Collision Prediction for the City of Montreal.” In 2019 IEEE International Conference on Big Data (Big Data), 1804–13. IEEE.

Santos, Daniel, José Saias, Paulo Quaresma, and Vítor Beires Nogueira. 2021. “Machine Learning Approaches to Traffic Accident Analysis and Hotspot Prediction.” Computers 10 (12): 157.