<<<<<<< HEAD ======= >>>>>>> 225f9230b6ecdac949d736864ca72fc4f59d762b

High-resolution traffic accident prediction in Berlin

Machine Learning

We trained several machine learning models to predict the risk of occurrence of a traffic accident on a road segment for a given hour and day in Berlin.

Ma. Adelle Gia Arbo 216132@hertie-school.org , Helena Bakic bakic.helena@outlook.com , Benedikt Ströbl 213712@hertie-school.org
2022-05-11

Problem

Traffic accidents are one of the leading causes of death and injuries globally, and in Germany they cause over 3,000 deaths and 300,000 injuries yearly. Studying traffic accidents and where and why they happen can help policy makers make our roads safer.

There are at least two key issues to tackle when predicting traffic accidents. First decision to make is on the granularity, or spatial precision, to which we predict accidents. One approach could be focusing only on a few select roads that are covered by sophisticated measurement instruments like loop detectors and cameras. This allows researchers to precisely measure important features such as traffic flow and lane occupancy. Another approach could be predicting accidents, or their severity, on datasets with millions of accidents aggregated at the level of the country. This approach helps uncover patterns that would be difficult to find on smaller datasets, but lacks precision that could help policy makers improve traffic safety at the level of a community or a city.

Second key issue with traffic accident prediction is a computational one. Although traffic accidents have serious consequences they remain, luckily, rare events. However, many machine learning models do not do well when predicting rare events – in part due to the focus on minimisation of the overall error rather than the detection of the rare class. Fortunately, computational approaches and models that tackle imbalanced datasets do exist and seem to be successful in predicting rare events.

We wanted to build on previous work in two main ways. First, inspired by a recent paper (Hébert et al. 2019) we applied a novel approach in traffic accident forecasting that aims to predict the risk of an accident on a particular road segment withing a city road network at a particular day and hour. We believe that this approach is particularly useful for policy makers interested in road-safety improvement. Second, we explore and provide new evidence on the performance of different approaches for rare event prediction in the field of traffic accident forecasting.

Data

We conducted this study using the example of accidents in Berlin using the following data sources:

(a) Collision points (above), (b) Road Network (below)(a) Collision points (above), (b) Road Network (below)

Figure 1: (a) Collision points (above), (b) Road Network (below)

Method

The workflow of this study can be seen in Figure 2.

Study workflow

Figure 2: Study workflow

Generation of negative examples

After we collected raw data, we first generated negative examples, that is, data points when accidents did not happen using Apache Spark with Pyspark. In traffic accident forecasting only positive examples (that is, occurrence of the accidents) are recorded. This makes sense, particularly as most of such records come from a traffic authority responding to an accident. However, that also means that available datasets have no records of negative examples at all – so they have to be generated. Luckily, having an accurate database of accidents (recorded by the Berlin police), means that we know when accidents did not happen – and that is at all possible other locations and times that were not recorded as accidents! Of course, it is not feasible to analyse all examples where accidents did not happen, as that would be computationally difficult. What we did instead, following the usual approach, was to calculate all possible combinations of time and locations for negative events and then drew a sample from it. We tried different accident/non-accident rations, as we will describe later, but in the end the one we worked with was 5 non-accidents for every accident.

Data pre-processing

The major part of data pre-processing included matching the accident GPS-locations to existing road segments in Berlin with geopandas. This was done by creating a spatial join function and specifying a buffer. The matching function finds which road segments intersect the collision point’s circle area and match these segments to the collision point. Increasing (decreasing) the buffer value makes the circle area larger (smaller), thereby increasing (decreasing) the number of road segments matched to a collision point. For example, setting the buffer to 2m (Figure 3a) identifies fewer road segments than setting the buffer to 20m (Figure 3b). A bigger buffer was also more correct in detecting accidents that occurred on road intersections. Different buffer values were used to make sure that we match all accident and non-accident GPS-locations to corresponding segments.

(a) Matching 2m buffer (left), (b) Matching 20m buffer (right)(a) Matching 2m buffer (left), (b) Matching 20m buffer (right)

Figure 3: (a) Matching 2m buffer (left), (b) Matching 20m buffer (right)

We then matched all other data to road segments’ midpoints. For weather features, we used the data for the closest weather station, and if that data was missing, second closest. Since Berlin accident dataset does not report the exact day of the month when the accident happened, we computed the monthly average per hour and used that information instead. Time features were recoded cyclically using sine and cosine functions to account for the fact that the extreme values (e.g. hour 1 and 23) have a similar meaning. The features we included in the model were:

  1. month_cos and month_sin: cosine and sine of cyclical encoded month
  2. hour_cos and hour_sin: cosine and sine of cyclical encoded hour
  3. weekday: day of the week (Monday = 1, Sunday = 7)
  4. collision_cnt: number of accidents at a road segment in the previous year/s
  5. length_m: lenght of a road segment in meters
  6. side_strt: road segment is a side street (1) or a main road (0)
  7. temperature (°C)
  8. humidity (%)
  9. visibility (m)
  10. prec_height: average mm per 10 minutes intervals in the hour before event
  11. prec_duration: average minutes per 10 minutes intervals in the hour before the event
  12. sun_elevation_angle: solar elevation angle in degrees
  13. collision: the target we are predicting; accident occurred (1) vs. not (0)

We then split the dataset into training (years 2018 and 2019) and test (year 2020). The final training dataset had a total of 52,118 road segments where accidents occurred and 260,702 where they didn’t. Here’s a snippet of the final dataset we were working with:


X segment_id year month_cos month_sin weekday hour_cos hour_sin collision_cnt side_strt sun_elevation_angle humidity temparature visibility prec_height prec_duration collision
0 42796 2019 0.866 -0.5 6 0.963 0.270 0 1 -53.638 86.187 4.081 20049.46 0.009 2.242 0
1 34322 2020 0.866 0.5 6 0.460 0.888 0 1 -36.877 88.226 3.458 23919.35 0.003 0.876 0
2 29497 2018 -1.000 0.0 2 -0.776 0.631 1 1 35.339 55.533 20.693 34946.67 0.002 0.431 0
3 39002 2020 0.000 -1.0 6 -0.335 0.942 0 1 1.920 76.000 14.227 36333.33 0.006 0.911 0
4 32881 2019 -0.866 -0.5 2 1.000 0.000 0 1 -14.045 72.806 17.171 40709.68 0.001 0.113 0
5 41584 2019 0.000 -1.0 5 -0.335 0.942 0 0 2.486 78.138 13.845 33616.67 0.005 0.928 0


And here’s what the variables’ distributions look like:

Histograms of features for the training dataset

Figure 4: Histograms of features for the training dataset

Finally, we can already get a glimpse of whether or not some road segments are more dangerous than others. As we can see from the Figure 5 below, it does seem that some main, long road segments – that lead around the city centre or in it – are more dangerous than small, side road segments. But, to see how correct we are when we make this prediction, we need to train classification models.

<<<<<<< HEAD Road segments with the highest (above, n = 1300) and lowest (below, n = 1300, random selection) number of accidents ======= Road segments with the highest (above) and lowest (below) number of accidents >>>>>>> 225f9230b6ecdac949d736864ca72fc4f59d762b

Figure 5: Road segments with the highest (above, n = 1300) and lowest (below, n = 1300, random selection) number of accidents

Selected models

We implemented 7 machine learning models:

  1. Logistic Regression (Logit)
  2. Standard Random Forrest (SRF)
  3. Balanced Random Forrest (BRF)
  4. Standard Random Forrest with SMOTE and RUS (SRF_SMOTE_RUS)
  5. Balanced Bagging (BB)
  6. Support Vector Classifier with SMOTE and RUS (SCV_SMOTE_RUS)
  7. Extreme Gradient Boosting (XGBoost)

The models we selected are widely used for various problems and also in prediction of rare events. However, a small note on SMOTE and RUS techniques will be made. Synthetic Minority Over-sampling Technique or SMOTE and Random Under-sampling or RUS are re-sampling techniques that have been found to work well for imbalanced datasets. A discussion of these techniques is beyond the scope of this article, but a detailed discussion can be found elsewhere (Chawla et al. 2002). For this study, we applied SMOTE and RUS using 30% over-sampling and 50% under-sampling. In this case that means that on a training dataset containing 300,000 observations with imbalance factor of 5 (250,000 negative:50,000 positive), we first re-sample the minority class to have 30% the number of the majority class (75,000), then use random under-sampling to reduce the number of examples in the majority class to have 50% more than the minority class (150,000).

Hyperparameter tuning

We used random- and grid- search with 5-fold cross validation to tune our hyperparameters. As a scoring metric we chose the Area Under the ROC curve. We parametrized our search space by different hyperparameters unique to each model architecture; for instance, in Random Forest models, the number of trees, maximum depth, and minimum samples required to be at a terminal node, among others, were optimized. Moreover, we parametrized the sampling strategies - the intended class imbalance factor after re-sampling - that is applied to re-sample the dataset for our model implementations that use SMOTE and RUS.

Evaluation approach

We evaluated our models by calculating the Area under the ROC curve (AUC-ROC) that is considered a good, general measure for classification problems. Additionally, we aimed for a model with a high Recall, which also implies a relatively higher false positive rate (FPR), and a lower prediction. As traffic accident prediction is a type of rare event prediction, our goal was to train a model with a high Recall since false positives can also correspond to high-risk circumstances which we also want to detect and, ultimately, avoid. An overview of these performance metrics and the corresponding confusion matrix can be seen in Figure 6.

Calculation of Precision, Recall and F1-score in the Confusion Matrix

Figure 6: Calculation of Precision, Recall and F1-score in the Confusion Matrix

Results

Before training the models on our dataset, we conducted an experiment to compare the performance of the baseline model with different ratios of positive and negative events (accidents vs. non-accidents). We compared results from a logistic regression model and a random forest model with two different imbalance ratios that were previously described in the literature: around around 2:1 (Santos et al. 2021) and 17:1 (Hébert et al. 2019). For this comparison we used only time features (year, month, weekday and hour) and the number of accidents per road segment.

Our results indicated that the random forest model yielded the highest Recall for both imbalance ratios (70.4%), but that Precision was largely reduced by a high imbalance ratio. Similar trend of results, but with much worse evaluation metrics, was obtained by the logit model. AUC_ROC scores did not change much with the imbalance ratio. Therefore, we concluded that our models did not seem very sensitive to the imbalance factor of our data. This corroborated our approach of drawing a uniform random sample of non-accidents.

Finally, we trained our models on the training set and compared their performance on the test set. The results in Table 1 and the Figure 4 show the following:

Therefore, we concluded that the best performing model was the Balanced Random Forest (BRF) that predicted 94.5% of traffic accidents in Berlin.


Table 1: Comparison of Classification Models
Model AUC_ROC Precision Recall F1_score
Logit_baseline 88.3% 58.1% 88.3% 70.0%
SFR_baseline 83.6% 73.3% 72.0% 72.3%
Logit 86.8% 67.6% 80.6% 73.6%
SRF 84.0% 77.5% 71.9% 74.6%
BRF 88.7% 50.4% 94.5% 65.7%
BB 80.5% 40.4% 83.7% 54.5%
SRF_SMOTE_RUS 87.4% 68.1% 81.8% 74.3%
SVC_SMOTE_RUS 75.4% 89.3% 52.0% 65.7%
XGBoost 85.4% 75.4% 75.4% 75.4%


Precision-Recall (above) and ROC Curves (below) of the 7 Classification ModelsPrecision-Recall (above) and ROC Curves (below) of the 7 Classification Models

Figure 7: Precision-Recall (above) and ROC Curves (below) of the 7 Classification Models

Analysis

Figure 8 below shows the best features in our final model.

Feature importance in the Balanced Random Forest model

Figure 8: Feature importance in the Balanced Random Forest model

Our results echo those of the previous studies: previous number of accidents at a road segment was by far the most predictive feature. Some other road infrastructure features closely related to the number of accidents, such as segment length and whether a road is a side street, were also important. From non-infrastructure related features, sun elevation angle was the most important one. Unlike in previous studies, weather features and time features were a bit less important here. One reason that might be is that Berlin dataset does not record an exact day of the year when the accident happened, only the day of the week. This also meant that the weather features had to be averaged, which of course reduced their precision.

With respect to our approach of applying combined re-sampling approaches - SMOTE and RUS - and a diverse set of models - linear, tree-based, SVC, XGBoost, Balanced Bagging - we achieved good performance with almost all models. However, we could not improve performance to established models such as Balanced Random Forest, which has been the overall best-performing model in terms of chosen evaluation metrics.

Conclusion

Our study used several publicly available datasets to build a high-resolution traffic accident prediction models using 7 classification models. Our best-performing model, the Balanced Random Forest, yielded a 94.5% Recall at predicting road accidents in Berlin on a given road at a given date and hour with a 50% Precision. For future research, we suggest incorporating new features to describe the traffic flow and urban infrastructure (i.e., number of schools, proximity to central business districts, number of traffic lights, etc.) around the area where the accidents occurred, since these seem to be largely neglected in existing traffic prediction studies.

Ultimately, we hope that our project can be useful to policy makers for identifying the most dangerous road segments per hour of the day and, thus, be a basis for taking action on reducing risk of accidents. Our work also shows that providing administrative data - through initiatives such as Open Data Berlin - can create value to the general public.

Chawla, Nitesh V, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57.
Hébert, Antoine, Timothée Guédon, Tristan Glatard, and Brigitte Jaumard. 2019. “High-Resolution Road Vehicle Collision Prediction for the City of Montreal.” In 2019 IEEE International Conference on Big Data (Big Data), 1804–13. IEEE.
Santos, Daniel, José Saias, Paulo Quaresma, and Vı́tor Beires Nogueira. 2021. “Machine Learning Approaches to Traffic Accident Analysis and Hotspot Prediction.” Computers 10 (12): 157.

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/benediktstroebl/Machine-Learning-Project-Group-F, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".