Predicting Drug Consumption Through Personality

Machine Learning
Five Factor Model

“We present a machine learning approach to predict the consumtion of five categories of drug using personality traits.”

Dominik Cramer https://github.com/DominikCramer , Francesca Giacco https://github.com/francescagiacco , Lorenzo Gini https://github.com/zazzooo
2022-05-11

Abstract

This project aims to predict consumption of various groups of drugs through personality data. The intuition behind this idea is that a certain type of personality might look for certain effects more than a specific drug. The underlying dataset contains data on the consumption of 18 legal and illegal substances, collected mainly in the US and the UK.

Introduction / Background

Every year, tens of thousands of deaths are caused by alcohol and drug abuse in the US and UK (Ritchie & Roser, 2019). In each of these countries efforts to reduce these numbers cost 20 to 25 billion Euro per year (Department of Health and Social Care, 2021). At the same time, drug policies appear to be rather expensive and ineffective in most western societies since they mostly target drug consumers in an advanced state of their addiction. Nevertheless, costs of drug prevention and education are often difficult to justify due to the prevention paradox. Being able to predict which individuals are more susceptible to which drugs would make them easier to justify while also allowing policy makers to cut expenses by targeting prevention measures only to those who need it. Personality traits are some of the most interesting features for predicting drug consumption because they are highly influential and out of reach of any policy. Indeed, there have been several studies linking personality traits to drug (ab)use (inter alia (Navlani, 2018) (Turiano et al., 2012)). This project draws primarily on the paper of Fehrman et al. 2017 on the relationship of personality traits and the drug consumption risk. Personality traits were measured with the five factor model (Roccas et al., 2002) complemented by an impulsiveness score and a sensation-seeking score. As part of their study, Fehrman et al. collected the dataset on drug consumption in the United States and United Kingdom used for this project. Even though several papers on the topic were published, there is no second comparable dataset publicly available. The correlation between the different substances can be seen in the correlation plot.

Correlations between substances

Figure 1: Correlations between substances

What is new about this project is that not single drugs are predicted but categories of them according to their main effect.

Methodology

The drugs were classified into five categories according to their primary effect namely depressants, hallucinogens, stimulants and two additional variants: depressant without alcohol, and hallucinogens without cannabis. These categories are not mutually exclusive but instead used to account for different motives behind their consumption. The latter two categories exclude the most popular drugs that have much higher accessibility, so they might be less comparable with the other substances.

We distinguished between heavy users from the rest as this should become the target population of drug prevention policy. It was selected drug-specifically thresholds based on the drug’s potential for addiction and overall physical and psychological harm. Finally, columns containing the predicted value were created: an individual is a heavy user of no drug in one of the categories, they are not considered an addict. (value = 0) an individual is a heavy user in exactly one drug in one of the categories, they are considered an addict. (value = 1) an individual is classified as a heavy user of at least two drugs in one of the five categories they are considered a serious addict. (value = 2)

This distinction was used in all the models predicting three classes. For the models predicting only two classes, the differentiation between serious addicts and addicts was omitted.

Imbalance of classes

Figure 2: Imbalance of classes

Imbalance of classes

Figure 3: Imbalance of classes

These classes were highly imbalanced. Since the dataset as a whole is rather small, we decided to address this imbalance generating synthetic data for the minority class using the Synthetic Minority Oversampling Technique (SMOTE). As our baseline model, we chose a k-neighbor classification with nine neighbors. We ran more than 100 different models. Other than try different classification method, all models were built including cross validation, thereby making the results more robust. Apart from the baseline model, logistic regression, XGBoost, decision tree, and random forest classifiers were deployed. Moreover, support vector machines (SVM) with different kernels were run (linear, polynomial, radial basis function, sigmoid function).

Experiments

Data

The dataset contains demographic information including age, ethnicity, country, gender and educational level. Then, three indexes measuring personality data are included. These are an impulsiveness score, a sensation-seeking score (ss-score), and the five factor model. The latter measures personality according to five factors, namely Neuroticism (n-score), Extraversion (e-score), Openness to experience (o-score), Agreeableness (a-score) and Conscientiousness (c-score). These 7 indicators of personality are the predictors we are more interested in. Finally, we have data on the last instance of consumption of several drugs. These take values ranging from 0 to 6, which correspond respectively to ”never used”, ”used over a decade ago”, ”used in the last decade”, ”used in the last year”, ”used in the last month”, ”used in the last week”, and ”used on the last day.” The data were collected via snowball sampling. This is not unusual in the area of research. Unfortunately, snowball sampling comes with several negative implications on data quality. Most importantly, this causes community and self-selection bias. Also, drug consumption was measured as a dichotomous variable, meaning that the quantity of the drugs consumed is unknown. With just under 2,000 observations, the dataset is relatively small compared to other machine learning datasets. Nevertheless, snowball sampling is a useful and legitimate tool in the area of interest since some social systems are beyond researchers’ ability to recruit randomly. In the area of interest, it is even inevitable due to the illegality of consumption of some of the drugs in question. This also explains the size of the dataset.

ROC curve

Figure 4: ROC curve

Evaluation method

To account for both sensitivity and specificity and the persisting issue of imbalanced classes, the Matthews Correlation Coefficient (MCC) is best suited. Since personality traits do not hold the complete predicting power regarding drug addiction, it would be concerning to see an MCC score very close to one. Moreover, the small size of the dataset does not allow for exhaustive training. Accordingly, we regard an MCC score above 0.5 as a satisfactory result (Fig. 5).

Results

We identified the best models for each category of drug and tuned their parameters via grid search. For stimulants, these models are the random forest predicting two classes (MCC: 0.541) and the baseline model (MCC: 0.478). After the grid search, the former reached an MCC of 0.5732 while the latter scored 0.586. For hallucinogens, the logistic regression excluding race reached an MCC of 0.595 before the grid search and 0.627 afterwards. When excluding cannabis from this category, the two best performing models are the random forest (0.56) and the logistic regression excluding race (0.62). After tuning, the former scored an MCC of 0.578 and the latter reached 0.696. For depressants, the MCC of the logistic regression jumped from 0.515 to 0.558 after tuning. Removing alcohol from the depressant category creates a very different setup. The SVM and the random forest computed excluding race score the highest. Our grid search for the SVM consists of trying all the different kernels. The best MCC score is reached when the kernel used is ‘rbf’ (radial basis function). This model reaches a MCC of 0.654. The random forest scores 0.629.

Comment on quantitative results

Our original idea was to produce a three classes model for each of our three categories (stimulants, depressants, hallucinogens) which then became five (addition of depressants without alcohol and hallucinogens without cannabis). Our intuition was that being able to distinguish between moderate and serious addiction would have been more insightful and possibly more useful for application in policy making. However, after running 60 different models with binary classification and 60 models with three classes, we realized that the two classes-models produced systematically better results. Indeed, by using the three-classes models, the MCC of our models drops by 0.17 on average. This result came with no surprise, as we were aware of the limited size of the dataset and that the number of people not consuming drugs always exceeded the number of “addicts”, leading to very imbalanced classes. For this reason, we only conducted the grid search and the final best-model selection on the binary classifiers. We found that the predictions for depressants consumption were significantly more precise when excluding alcohol (MCC was 0.139 higher on average). By contrast, the predicting power of our models decreased slightly but systematically when excluding cannabis from the group of hallucinogens (MCC was 0.072 lower on average).

Analysis

As the logistic regression was always among the best models, it served as a point of comparison for every category of drugs. Stimulants A random forest performed best in this category, but the logistic regression achieved similar results (MCC: 0.587). While tuning the parameters, l1 regularization produces the best results. Some personality scores shrink to 0, as well as most nationality data. The consumption of stimulants is positively correlated with the sensation-seeking score (0.55), and the o-score (0.26) and negatively correlated with the c-score (-0.28) and age ( -0.71).

Hallucinogens

For the prediction of hallucinogen’s consumption, a Logistic Regression performed best. It is positively correlated with the ss score (0.32) and the o-score (0.61)and negatively correlated with age (-0.62) and education (-0.36).

Hallucinogens without cannabis

Even using a l1 regularization, none of the personality traits scores shrunk to 0, showing that personality clearly has an impact on hallucinogens consumption. As before, the o-score and ss-score have the highest impact (0.56 and 0.44). An interesting finding is that the negative correlation with age increases (-1.19). This means that the consumption of hallucinogens is mainly driven by young people.

Depressants

Here personality traits seem to have conflicting effects. o-score and ss-score continue having a positive impact but this time with lower magnitude (0.38 and 0.25). This time however, a-score and e-score are negatively correlated to consumption (-0.12 and -0.15) as well as age and education (-0.12, -0.12).

Depressants without alcohol

O-score and ss-score correlate positively (0.57 and 0.23). Removing alcohol leads to a higher correlation between the o-score and consumption. For the first time, impulsiveness has a non-neglectable positive effect (0.13), while the e-score (-0.15), age (-0.30) and education (-0.38) correlate negatively.

Evaluation

Removing the most popular drugs increased the magnitude of the negative impact of age and education. The o-score and ss-score impact consumption positively for all types of drugs. Interestingly, the e-score seems to only matter for depressants while impulsiveness only has an impact on depressant’s consumption when excluding alcohol. The a-score only affects depressant’s consumption when including alcohol, indicating that a high a-score might have a direct effect on alcohol consumption.

Conclusion

Personality traits have significant predicting power concerning drug consumption. Openness to new experiences and sensation-seeking are especially informative. Apart from that, age, gender and nationality hold high predictive power. This project shows that predicting drug categories instead of single substances is beneficial. Unsurprisingly, the central limitations are found in the data itself. Larger and more diverse datasets on drug consumption and possible predicting factors are urgently needed to refine the promising models deployed in this project up to a satisfactory level. Unfortunately, the current drug policies in most countries severely inhibit the collection of reliable and representative data. Apart from improvements in data quantity, quality, and the resulting model performance, more research that infers causality is needed to validate these models and shape an effective policy plan.

References

Department of Health and Social Care. (2021, December 6). Largest ever increase in funding for drug treatment. GOV.UK. Retrieved May 11, 2022, from https://www.gov.uk/government/news/largest-ever-increase-in-funding-for-drug-treatment

Navlani, A. (2018, August 1). KNN Classification Tutorial using Sklearn Python. DataCamp. Retrieved May 11, 2022, from https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

Ritchie, H., & Roser, M. (2019, February). Causes of death. Our World in Data. Retrieved May 11, 2022, from https://ourworldindata.org/causes-of-death

Roccas, S., Sagiv, L., Schwartz, S., & Knafo, A. (2002, June 1). The Big Five Personality Factors and Personal Values. https://journals.sagepub.com/doi/10.1177/0146167202289008

Turiano, N., Whiteman, S., Hampson, S., Roberts, B., & Mroczek, D. (2012, June 1). Personality and Substance Use in Midlife: Conscientiousness as a Moderator and the Effects of Trait Change. Journal of research in personality, 46(3), 295–305.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/francescagiacco/ML-project-personality-and-drug-consumption, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".