Abstract

This project devises a method for detecting hate speech on Twitter using deep-learning algorithms. In this regard, the project proposes the Bidirectional Encoder Representations from Transformers (BERT) model, the state-of-the-art method for the majority of NLP tasks. We contrast the results of our proposed BERT architecture with the more traditional Support Vector Machines (SVM) approach, highlighting the improved accuracy offered by deep-learning algorithms. Additionally, we optimise the SVM using comprehensive pre-processing methods, which lead to accuracy levels similar to those of the unoptimised BERT algorithm. This allows us to ultimately see whether BERT’s improved accuracy remains significant even after the pre-processing.

Introduction

Background

Twitter is one of the most popular social-media platforms currently used in the English-speaking world, and it has significantly expanded the capacity of ordinary people to communicate and share opinions, ideas, and sources of information. Nevertheless, given the low levels of active moderation, the very high number of users, and the democratic nature of speech transmission, Twitter has come to be exploited by malicious individuals. The dissemination of offensive, aggressive, and even hateful content has become a staple of Twitter, with policymakers trying to combat this wave of digital aggression.

This untamed distribution of hateful content is indeed one of the biggest problems concerning the digital world. It has the potential to negatively affect disenfranchised groups, who already suffer due to societal stereotypes. However, regardless of the policies that need to be adopted to prevent these developments, and regardless of the ethical debates on the limits of free speech, existing hate speech on Twitter needs to be detected in order for any measure to be functional. Given the scale of the phenomenon on social media, the best solution remains the automatic detection of hate speech.

Once again, this process is not straightforward. The automatic detection of hate speech is a convoluted and challenging task due to disagreements over the definition of hate speech. Content published on Twitter might therefore be hateful to some individuals and not to others, depending on the definition of hate speech they employ. Several academic papers and other research projects relying on traditional machine-learning approaches (such as bag-of-words and word and character n-grams) have been published in the last decade. Recently, however, the preferred methodological setups have involved deep-learning algorithms, such as LSTM-based RNNs or BERT.

Research scope

In this project we deploy deep-learning methods for detecting hate speech on Twitter, focusing on the English language. We make use of an existing implementation of BERT, an architecture based on the Transformer model. To precisely characterise the efficiency and effectiveness of our implementation of BERT, we use numerous automatic metrics that allow us to see both whether the model identifies hate speech, and whether it avoids conflating hate speech with non-hateful speech. In this sense, our project is interested in both false positives and false negatives, an approach rarely seen in the field.

In addition to our implementation of BERT, we also develop a more traditional “shallow” approach, in the form of SVMs. This allows us to actually measure the improvement gained by using the state of the art in NLP.

One novel aspect of our project is that, in the case of the SVM, we use complex and comprehensive pre-processing methods that improve the predictive power of the model as much as possible. As such, we are able not only to compare SVMs with BERT, but to compare the best possible SVMs with the deep-learning architecture.

Proposed Method

SVMs

As a baseline method, we chose Support Vector Machines (SVMs). Given a training dataset in which each observation is labeled as belonging to one of two categories, an SVM algorithm builds a model that assigns every observation in a testing dataset to one category or the other, making it a non-probabilistic binary linear classifier.
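A minimal sketch of such a baseline in scikit-learn follows; the toy texts, labels, and default hyperparameters here are purely illustrative, not the project’s actual configuration:

```python
# Illustrative baseline: a linear SVM over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (0 = neutral, 1 = hateful) -- placeholder examples only.
train_texts = [
    "you are wonderful",
    "those people disgust me",
    "lovely weather today",
    "I despise that group",
]
train_labels = [0, 1, 0, 1]

# TF-IDF turns raw text into numeric vectors; LinearSVC is the classifier.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

pred = clf.predict(["what a wonderful day"])  # returns an array of 0/1 labels
```

In practice the vectorizer and the SVM’s regularisation would be tuned on a held-out set rather than left at their defaults.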

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based machine-learning technique for natural language processing developed by Google in 2018. Transformers, in turn, are encoder-decoder architectures introduced in 2017. The encoder consists of a set of encoding layers that process the input iteratively, one layer after another, and the decoder consists of a set of decoding layers that do the same to the encoder’s output. The introduction of Transformers, and specifically BERT, has led to the replacement of previous models such as LSTM-based RNNs.

As opposed to the majority of architectures that preceded BERT, this algorithm does not use directional models, which read the text input sequentially, either left-to-right or right-to-left. BERT’s encoder reads the entire sequence of words at once, making it bidirectional: simultaneously left-to-right and right-to-left. This particular construction allows the BERT model to learn the context of a word from all of its surroundings, regardless of position relative to the word.
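The mechanism behind this bidirectionality is self-attention. The toy NumPy computation below (a single unscaled-down attention step, not BERT itself) shows that each token’s output representation mixes information from all positions, both left and right of it:

```python
# Toy illustration of bidirectional self-attention (not BERT itself).
import numpy as np

np.random.seed(0)
seq_len, dim = 4, 8                      # 4 toy "tokens", 8-dim embeddings
x = np.random.randn(seq_len, dim)

# Attention scores between every pair of positions (scaled dot product).
scores = x @ x.T / np.sqrt(dim)

# Softmax over each row: how much each token attends to every position.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each output row is a weighted mix of ALL positions, left and right --
# unlike a left-to-right reader, which could only mix earlier positions.
out = weights @ x
```

Since every softmax weight is strictly positive, no token is blind to the words that follow it, which is the property the paragraph above describes.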

Experiments

Data: Given the scope of our study, we made use of the dataset provided by Davidson et al. (2017), which contains tweets characterized as hateful, offensive, or neither. As hate speech is a contested term, the authors define it as language that is used to express hatred towards a targeted group, or is intended to be derogatory, to humiliate, or to insult the members of the group. This definition avoids a narrow focus on only the most extreme usage of the English language, while accepting that there are instances of offensive language in which people use highly offensive terms in a qualitatively different manner.

The dataset comes already annotated, a process performed manually by CrowdFlower workers. Three or more people coded each tweet, and the intercoder-agreement score provided by CrowdFlower is 92%. Davidson et al. (2017) assigned labels to each tweet based on the majority decision, which, given the high intercoder agreement, was easily achieved. The dataset we used is composed of 24,802 labeled tweets. Given the focus of our study on the automatic detection of hate speech, rather than offensive language, we dropped tweets labelled as offensive but not hateful. This yields a final dataset consisting of 1,430 tweets labeled as hate speech and 4,163 tweets labeled as neither hateful nor offensive.
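The filtering step can be sketched as follows. It assumes the commonly distributed layout of the Davidson et al. (2017) CSV, with a `class` column encoding 0 = hate, 1 = offensive, 2 = neither; if the release at hand uses different column names or encodings, the sketch would need adjusting:

```python
# Sketch of dropping offensive-but-not-hateful tweets (assumed encoding:
# class 0 = hate, 1 = offensive, 2 = neither).
import pandas as pd

# Stand-in for pd.read_csv("labeled_data.csv")
df = pd.DataFrame({
    "tweet": ["hateful example", "offensive example", "neutral example"],
    "class": [0, 1, 2],
})

# Keep only hate (0) and neither (2); drop offensive (1).
binary = df[df["class"] != 1].copy()

# Recode to a binary target: 1 = hate speech, 0 = neither.
binary["label"] = (binary["class"] == 0).astype(int)
```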

The large number of unnecessary features, as well as the idiosyncratic way language is used in tweets, implies the need for a serious and comprehensive pre-processing stage. Pre-processing yields clean textual data that is easily transformed into numeric arrays. As a corollary, text pre-processing offers the opportunity to test how close the model is to optimal, and it further allows us to improve the accuracy of “shallow” machine-learning algorithms.
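One plausible version of such a pipeline is sketched below; the exact steps used in the project are not enumerated here, so the operations chosen (lowercasing, stripping URLs and @-mentions, unwrapping hashtags, collapsing whitespace) are illustrative:

```python
# Illustrative tweet pre-processing pipeline (steps are assumptions,
# not the project's exact recipe).
import re

def preprocess(tweet: str) -> str:
    tweet = tweet.lower()                        # normalise case
    tweet = re.sub(r"https?://\S+", "", tweet)   # strip URLs
    tweet = re.sub(r"@\w+", "", tweet)           # strip @-mentions
    tweet = tweet.replace("#", "")               # keep hashtag words
    return re.sub(r"\s+", " ", tweet).strip()    # collapse whitespace

cleaned = preprocess("@user Check this!! https://t.co/xyz #NotOkay")
# cleaned == "check this!! notokay"
```

Further steps such as stop-word removal, stemming, or emoji handling could be layered on in the same way.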

Evaluation method: The evaluation of our models, both the SVMs and BERT, was achieved through a diverse set of automatic metrics. First and foremost, in both cases we computed the accuracy of the model, namely the proportion of tweets in the testing dataset correctly labelled by our proposed models. This is the primary evaluation metric for this type of NLP task, and it allowed us to compare our results with those already existing in the literature. However, given the additional task we performed, of designing an optimal “shallow” model in order to see how well it performs compared with deep-learning architectures, we also used additional metrics focused on the SVMs.
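Computing these metrics can be sketched as below. Accuracy is the primary score, while precision and recall on the hateful class separate false positives from false negatives; the label vectors here are invented for illustration:

```python
# Sketch of the evaluation metrics (toy labels, not project results).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = hate speech, 0 = neutral
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model outputs

acc = accuracy_score(y_true, y_pred)
# precision/recall/F1 for the positive (hateful) class:
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
# here acc, prec, rec, and f1 all come out to 0.75
```

Recall on the hateful class tracks false negatives (missed hate speech), while precision tracks false positives (neutral tweets flagged as hateful), which is exactly the two-sided view described above.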

Results:

The table above shows the results of our multiple models. We report the overall accuracy of the SVMs (both with and without the pre-processing mechanisms described in the steps above) and of BERT. One general observation is that BERT achieves results close to the best in the literature, while the versions of the SVM that used pre-processing achieve an accuracy above that of similar “shallow” architectures we identified.

Comment on quantitative results: First and foremost, we can infer that deep-learning methods such as BERT outperform even the best attempts made using “shallow” machine-learning algorithms. The differences in accuracy are large, and when translated into policy, would imply significant differences in ease of implementation.

Secondly, the results of the SVMs using complex pre-processing methods are significantly higher than expected. This also implies that the gain in accuracy when contrasting BERT with the pre-processed SVMs is significantly smaller than when comparing BERT with basic SVMs. While the results of the SVMs are nowhere near BERT’s state-of-the-art results, this comparison holds for English. When trying to automatically detect hateful content in other languages, in the absence of more capable pre-trained Transformer models or of significant GPU resources, it might initially be sufficient to use basic SVMs, as long as the pre-processing step has been thorough.

Thirdly, the results of the BERT model are rather static and independent of the pre-processing methods used. We ran multiple experiments using differently pre-processed versions of our original dataset, and the differences in accuracy were not significant. This is to be expected, but it is important to acknowledge in practice, as it can prevent practitioners from wasting time on steps that do not improve the automatic detection of hateful speech on social media.

Analysis

In order to analyse the efficiency and effectiveness of our approach to automatic hate speech detection on Twitter, we need to properly understand the baseline model and how its results compare with our expectations. Looking at the figure below, we see that our baseline method correctly predicted about 77 percent of tweets related to hate speech. On the other hand, 98 percent of tweets that did not relate to hate speech (classified as neutral) were predicted correctly. In this light, about 23 percent of tweets related to hate speech were classified as neutral, while about 2 percent of neutral tweets were classified as hate speech. Ultimately, this means that the problem with the SVMs (both with and without pre-processing) was serious trouble in correctly identifying and labelling hateful tweets. While this is not absurd, and was to be expected based on the literature review, it is still problematic that the prediction power of the baseline model lacks balance.
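This per-class breakdown is what a row-normalised confusion matrix reports: each row gives the fraction of a true class predicted as each label. A small sketch, with invented counts rather than the project’s actual results:

```python
# Sketch of a row-normalised confusion matrix (toy labels, illustrative
# numbers -- not the 77%/98% figures reported in the analysis).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = hate, 0 = neutral
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]   # hypothetical predictions

# Rows/columns ordered as [hate, neutral].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])

# Divide each row by its total: per-true-class prediction rates.
rates = cm / cm.sum(axis=1, keepdims=True)
# rates[0] = [0.75, 0.25]: 75% of hateful tweets caught, 25% missed.
```

Reading the diagonal of `rates` gives the per-class recall values the paragraph above discusses, making any imbalance between the two classes immediately visible.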

This is all the more reason to use BERT. In the case of BERT, such high accuracy implies that, regardless of our optimisation strategy, we have a lower chance of reaching an unbalanced prediction pattern. By using the latest versions of BERT, and building on previous results, we achieve an accuracy of 91.6%, distributed uniformly across the categories studied. This held both when balancing the original dataset and when leaving it unbalanced, proving once again that the BERT architecture is remarkably flexible and undemanding.

The figure above shows that accuracy cannot increase indefinitely; after a number of epochs the improvements become very small, if any. For better performance we would need a more expansive dataset, or another pre-trained version of the model better suited to Twitter data. Both extensions are worth discussing in future work.

Conclusion(s)

Our project shows that while the automatic detection of hateful content on social media remains a complicated task, and current models are imperfect, we can confidently tackle this challenge. By using the most advanced deep-learning methods, we are able to design models that reach accuracies above 90%, in line with the best results in the literature. Additionally, we show that deep learning is superior to other approaches that might be considered by researchers, such as SVMs. Nevertheless, pre-processing can increase the accuracy of the SVM algorithm, which could be useful for researchers working in languages other than English.