We present a deep learning approach to classifying labeled texts and phrases in party manifestos, using the coding scheme and documents from the Manifesto Project Corpus.
Hand-labeled political texts are often required in empirical studies on party systems, coalition building, agenda setting, and many other areas in political science research. While hand-labeling remains the standard procedure for analyzing political texts, it can be slow and expensive, and subject to human error and disagreement. Recent studies in the field have leveraged supervised machine learning techniques to automate the labeling process of electoral programs, debate motions, and other relevant documents. We build on current approaches to label shorter texts and phrases in party manifestos using a pre-existing coding scheme developed by political scientists for classifying texts by policy domain and policy preference. Using labels and data compiled by the Manifesto Project, we make use of the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) in conjunction with Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU) to seek out the best model architecture for policy domain and policy preference classification. We find that our proposed BERT-CNN model outperforms other approaches for the task of classifying statements from English language party manifestos by major policy domain.
During campaigns, political actors communicate their position on a range of key issues to signal campaign promises and gain favor with constituents. Whilst identifying the political positions of political actors provides no certainty as to whether they act upon their policy preferences, it remains essential to understanding their intended political actions. Quantitative methods, especially from the field of natural language processing, have enabled the development of more scalable approaches to predicting policy preferences. These advancements have enabled political scientists to analyze political texts and estimate the positions they express over time (Nanni et al. 2016; Zirn et al. 2016). To better understand the political positions of political actors, many social science researchers have turned to hand-labeling political documents, such as parliamentary debate motions and party manifestos (Abercrombie and Batista-Navarro 2018; Gilardi, Füglister, and Luyet 2009; Krause 2011; Simmons and Elkins 2004). Yet, the analysis of political documents in this field stands to benefit from automating the coding of texts using supervised machine learning. Most recently, neural networks and deep language representation models have been employed in state-of-the-art approaches to the automatic labeling of political texts by policy preference.
In this article, we present a deep learning approach to classifying labeled texts and phrases in party manifestos, using the coding scheme and documents from the Manifesto Project (Volkens et al. 2019). We use English-language texts from the Manifesto Project Corpus, which divides party manifestos into statements—or quasi-sentences—that do not span more than one grammatical sentence. Building on state-of-the-art deep learning methods for text classification, we propose using Bidirectional Encoder Representations from Transformers (BERT) combined with neural networks to automate the task of labeling political texts. We compare our models, which combine BERT and neural networks, against previous experiments with similar architectures, and show that our proposed method outperforms approaches commonly used in natural language processing research at assigning the correct policy domain and policy preference. We also identify differences in performance across policy domains, paving the way for future work on improving deep learning models for classifying political texts.
Several studies have concentrated on building scaling models that identify the political position of texts (Glavaš, Nanni, and Ponzetto 2017; Laver, Benoit, and Garry 2003; Nanni et al. 2019; Proksch and Slapin 2010). Most of the seminal work in this area has overlooked the task of classifying texts by topic or policy area prior to detecting the policy preferences associated with each topic. Over the past couple of years, several studies have addressed this gap in opinion-topic identification by classifying text data from political speeches, manifestos, and other documents by topic before predicting policy preference. Perhaps most relevant to our research is the paper by Zirn et al. (2016), in which the authors trained and validated an approach to classifying manifestos from the United States into seven policy domains, using binary classifiers to predict whether adjacent sentences belong to the same topic1. Their proposed approach of optimizing predictions using a Markov Logic framework yielded an average micro-F1 score of .749. Glavaš, Nanni, and Ponzetto (2017) introduced a multi-lingual classifier for automatically labeling texts by policy domain. For the classification of 20,196 English-language manifesto sentences by policy domain, their CNN models yielded an average micro-F1 score of .59.
More recently, studies have employed neural networks and deep language representation models to address the computationally intensive task of classifying political texts into over thirty categories. One such study incorporated contextual information about individual quasi-sentences, specifically the political party and the preceding sentence within a manifesto, into multi-scale convolutional neural networks with word embeddings. Their best performing model for classifying 86,500 quasi-sentences from the Manifesto Project Corpus into the seven major policy domains yielded an F1 score of .6532, and their best performing model for classifying quasi-sentences by policy preference yielded an F1 score of .4273. Another study proposes a hierarchical sequential deep model that captures information from within manifestos as well as contextual information across manifestos to predict the political position of texts. Their best performing hierarchical modeling approach for classifying 86,603 English language quasi-sentences yielded an F1 score of .50.
Abercrombie et al. (2019) used deep language representation models to detect the policy positions of Members of Parliament in the United Kingdom. Using motions and manifestos as data sources, the authors employed a variety of methods to predict the policy and domain labels of texts. They propose utilizing BERT for this task, fine-tuned on party manifestos and the motions themselves. In addition to a final softmax layer, the authors added a CNN with max-pooling layers preceding the softmax layer. They found that supervised pipelines using BERT demonstrated state-of-the-art performance on both manifestos and motions, with a Macro-F1 score of 0.69 for their best performing model. Our work builds on some of the methods proposed in their paper, leveraging neural networks and deep language representation models for classifying political texts.
The Manifesto Project Corpus2 provides information on the policy preferences of political parties from seven different countries, based on a coding scheme of 8 policy domains under which 58 policy preference codes are manually coded3. The Manifesto Project offers data that divides party manifestos into quasi-sentences, or individual statements that do not span more than one grammatical sentence. Quasi-sentences are then individually assigned to categories pertaining to policy domain and preference. The 58 policy preference codes, one of which is “not categorized,” refer to the position—positive or negative—of a party regarding a particular policy area. These policy preference codes fall into a macro-level coding scheme comprising 8 policy domain categories. In political science research, the Manifesto Project Corpus is particularly useful for studying party competition, the responsiveness of political parties to constituent preferences, and the ideological positions of political elites. While the official classification of manifestos in this dataset has primarily relied on human coders, investigating methods for automatically detecting the policy positions of these texts is valuable for scaling up the classification of the large volumes of political text available for analysis.
Our final subset of all English-language manifestos comprises 99,681 quasi-sentences. Figures 1 and 2 illustrate the distribution of English-language manifestos across policy domains and countries. To ensure that the ratio between policy domains remains consistent across all categories when running our models, we applied a 70/15/15 split between training, validation, and test sets separately for all policy domains (major categories) and policy preferences (minor categories).
Figure 1: Quasi-sentences (QSs) from English language manifestos by policy domain
Figure 2: Quasi-sentences (QSs) from English language manifestos by country
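The 70/15/15 split described above can be sketched with scikit-learn as follows; the two-step stratified split preserves the ratio between categories in each set. The data frame and column names are hypothetical placeholders rather than the Manifesto Project's actual variable names.

```python
# A minimal sketch of the 70/15/15 split described above; the data frame and
# column name ("label") are hypothetical placeholders, not the Manifesto
# Project's actual variable names.
from sklearn.model_selection import train_test_split

def stratified_split(df, label_col="label", seed=42):
    # Hold out 30% of the data, stratified on the category labels...
    train, rest = train_test_split(
        df, test_size=0.30, stratify=df[label_col], random_state=seed
    )
    # ...then split that 30% evenly into validation and test sets.
    val, test = train_test_split(
        rest, test_size=0.50, stratify=rest[label_col], random_state=seed
    )
    return train, val, test

# Usage: train_df, val_df, test_df = stratified_split(quasi_sentences, "major_label")
```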
Bidirectional Encoder Representations from Transformers (BERT) have proven successful in prior attempts to classify phrases and short texts (Devlin et al. 2018). BERT’s key innovation lies in applying bidirectional training of transformers to language modeling. This state-of-the-art deep language representation model uses a “masked language model” pre-training objective, enabling it to overcome the unidirectional constraint of earlier language models.
Our experiments use the standard pre-trained BERT transformers as the embedding layer in our model. Since BERT is trained on sequences with a maximum length of 512 tokens, all quasi-sentences with more than 510 words were trimmed to fit this requirement (the remaining two positions are reserved for the special [CLS] and [SEP] tokens). The pre-trained embeddings were frozen and not trained for the base models. We test two variants of BERT—one incorporating a bidirectional GRU model, and another incorporating CNNs. Model specifications and training times for our neural networks and deep language representation models are shown in Table 1 and Figure 3.
Table 1: Model specifications for our neural networks and deep language representation models

| Models | Text Representation | Layers | Epochs |
|---|---|---|---|
| CNN | GloVe Wikipedia word embeddings | 2 convolutional layers (1 per filter size), 2 max pooling layers, 1 dropout layer, 1 linear layer | 100 |
| BERT-CNN | Base BERT (uncased) | 2 convolutional layers (1 per filter size), 2 max pooling layers, 1 dropout layer, 1 linear layer | 10 |
| BERT-GRU | Base BERT (uncased) | 1 bidirectional GRU RNN layer, 1 dropout layer, 1 linear layer | 10 |
Figure 3: Training time for neural networks and deep language representation models for classifying political texts by major and minor policy domain
First proposed by Cho et al. (2014) as part of the RNN Encoder-Decoder model, Gated Recurrent Units use update gates and reset gates to mitigate the vanishing gradient problem often encountered in applications of recurrent neural networks (Kanai, Fujiwara, and Iwamura 2017). The update gate determines the extent to which past information is carried forward in the model, whilst the reset gate determines which information is removed from the hidden state (Chung et al. 2014). By retaining relevant information and passing it on to subsequent computed states rather than discarding new input entirely, the GRU alleviates the vanishing gradient problem. In our analysis, we employ a bidirectional GRU model from PyTorch4 (Table 1). The results are subject to a dropout layer prior to classification via a linear layer.
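The following is a minimal sketch of the BERT-GRU architecture under these specifications, assuming the Hugging Face `transformers` implementation of BERT; the hidden size, dropout rate, and single GRU layer are illustrative choices rather than the exact hyperparameters used in our experiments.

```python
# A minimal sketch of the BERT-GRU classifier; hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel

class BERTGRUClassifier(nn.Module):
    def __init__(self, n_classes, hidden_dim=256, dropout=0.25):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Base models keep the pre-trained embeddings frozen.
        for param in self.bert.parameters():
            param.requires_grad = False
        self.gru = nn.GRU(
            input_size=self.bert.config.hidden_size,  # 768 for BERT base
            hidden_size=hidden_dim,
            bidirectional=True,
            batch_first=True,
        )
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim * 2, n_classes)

    def forward(self, input_ids, attention_mask):
        # Token-level BERT representations: [batch, seq_len, 768]
        embedded = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        _, hidden = self.gru(embedded)
        # Concatenate the final forward and backward hidden states,
        # then apply dropout before the linear classification layer.
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        return self.out(self.dropout(hidden))
```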
We incorporate CNNs with BERT using the same CNN architecture as our baselines (Table 1). The model uses the aforementioned BERT base, uncased tokenizer, with convolutional filters of sizes 2 and 3 applied with a ReLU activation function. We use a 1D max-pooling layer, a dropout layer (\(p = 0.5\)) to prevent overfitting, and a cross-entropy loss function. We employ the model to classify policy domains (\(N = 8\)) and policy preferences (\(N = 58\)), each of which includes a category for quasi-sentences that do not fall into the classification scheme. Hereafter, we refer to these classifications as ‘major’ and ‘minor’ categories, respectively. A graphical representation of our model is shown in Figure 4.
Figure 4: Graphical representation of the base BERT-CNN model to predict major policy domains
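Below is a minimal sketch of this BERT-CNN model, again assuming the Hugging Face `transformers` implementation of BERT; the number of filters is an illustrative assumption, while the filter sizes (2 and 3), the dropout rate (0.5), and the frozen base embeddings follow the specification above.

```python
# A minimal sketch of the BERT-CNN classifier; the number of filters is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class BERTCNNClassifier(nn.Module):
    def __init__(self, n_classes, n_filters=100, filter_sizes=(2, 3), dropout=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for param in self.bert.parameters():
            param.requires_grad = False  # frozen in the base model
        emb_dim = self.bert.config.hidden_size  # 768 for BERT base
        # One convolutional layer per filter size (2 and 3).
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=fs) for fs in filter_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(n_filters * len(filter_sizes), n_classes)

    def forward(self, input_ids, attention_mask):
        # [batch, seq_len, 768] -> [batch, 768, seq_len] for 1D convolution
        embedded = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        embedded = embedded.permute(0, 2, 1)
        conved = [F.relu(conv(embedded)) for conv in self.convs]
        # 1D max pooling over the full sequence dimension for each filter size.
        pooled = [F.max_pool1d(c, c.shape[2]).squeeze(2) for c in conved]
        return self.out(self.dropout(torch.cat(pooled, dim=1)))

# Training minimizes nn.CrossEntropyLoss over the 8 major or 58 minor categories.
```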
We evaluate the performance of our proposed method against several baselines:
Multinomial Naive Bayes (Su, Shirab, and Matwin 2011): This algorithm, commonly used in text classification, operates on the bag-of-words assumption and the assumption of conditional independence between features.
Support Vector Machines (Tong and Koller 2001): We used this traditional binary classifier to calculate baselines with the SVC classifier from scikit-learn5, employing a “one-against-one” approach for multi-class classification (a minimal sketch of these two baselines follows this list).
Convolutional Neural Networks (CNN) (Kim 2014; LeCun et al. 1998): To run this deep learning model, originally designed for image classification, we first made use of pre-trained GloVe word vectors; GloVe is an unsupervised learning algorithm for obtaining vector representations of words (Pennington, Socher, and Manning 2014).
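As a concrete illustration, here is a minimal sketch of the first two baselines with scikit-learn; the TF-IDF vectorizer and linear kernel are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# A minimal sketch of the Multinomial Naive Bayes and SVM baselines; the
# TF-IDF vectorizer and linear kernel are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

mnb_baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
# scikit-learn's SVC handles multi-class problems with a "one-against-one" scheme.
svm_baseline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# Usage:
#   mnb_baseline.fit(train_texts, train_labels)
#   predictions = mnb_baseline.predict(test_texts)
```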
To evaluate model fit, we utilized accuracy and loss as key metrics to compare performance of our CNN baseline and our proposed models (BERT-CNN, BERT-GRU). We calculated the F1-score for each model that we ran. In our results, we present both the Macro-F1 score and Micro-F1 score6.
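For reference, the accuracy and F1 metrics we report can be computed with scikit-learn as follows; the helper function name is a placeholder of our own.

```python
# A minimal sketch of the evaluation metrics reported below; the function
# name is a placeholder of our own.
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Micro-F1 aggregates counts globally across all labels.
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        # Macro-F1 averages per-label F1 scores with equal weight.
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```

For single-label multi-class classification, the micro-averaged F1 score coincides with accuracy, which is why these two columns track each other closely in our results tables.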
We tested different modifications of the CNN and BERT models. For the CNN models, we compared the following modifications:
Stemming and Lemmatization: We tested whether stemming or lemmatizing text in the pre-processing steps improves predictions using quasi-sentences from the Manifesto Project Corpus.
Dropout rates: We decreased the dropout rate from 0.5 to 0.25 to determine whether fine-tuning dropout rates yield differences in performance. This is because we initially found that our models were overfitting.
Additional linear layer: An additional linear layer was added prior to the final categorization linear layer to establish whether “deeper” neural networks generate improved predictions.
Removal of uncategorized quasi-sentences: We removed quasi-sentences that were labeled as “not categorized” to investigate whether predictions improve with an altered classification scheme of 7 specified policy domains and 57 policy preference codes7.
For the BERT models, we compared the following modifications:
Training Embeddings: For our base BERT models, the pre-trained embeddings were frozen. In this modification, we enable training of the embeddings to establish how fine-tuning them contributes to the performance of deep language representation models on this classification task (see the sketch following this list).
Training models based on recurrent runs: We trialed training the BERT models sequentially with different learning rates (LR = 0.001, 0.0005, and 0.0001) for 10 epochs each, for a total of 30 epochs, aiming to improve the performance of our neural networks and deep language representation models.
Large, cased tokenizer: The BERT Large cased tokenizer was used instead of the BERT BASE uncased tokenizer employed in our base models.
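The following is a minimal sketch of the first two modifications; the optimizer choice (Adam) is an assumption not specified above, and `train_for_epochs` stands in for a hypothetical training-loop helper rather than part of our actual code.

```python
# A minimal sketch of two BERT modifications: unfreezing the pre-trained
# embeddings and training in successive runs with decreasing learning rates.
# The Adam optimizer is an assumption; `train_for_epochs` is a hypothetical
# training-loop helper.
import torch

def unfreeze_bert(model):
    # Modification: allow gradients to flow into the pre-trained BERT weights.
    for param in model.bert.parameters():
        param.requires_grad = True

def recurrent_runs(model, train_for_epochs, lrs=(1e-3, 5e-4, 1e-4)):
    # Modification: three successive 10-epoch runs with decreasing learning
    # rates (0.001, 0.0005, 0.0001), 30 epochs in total.
    for lr in lrs:
        optimizer = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=lr
        )
        train_for_epochs(model, optimizer, epochs=10)
```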
As shown in Table 2, the BERT-CNN model performed best for predicting both major and minor categories compared to the BERT-GRU model and the CNN baseline. However, our SVM baseline outperformed the neural network models for predicting minor categories. We believe that the shortcomings of our neural networks and deep language representation models on this text classification task stem from the limited number of training epochs. We also observed overfitting in our models. For instance, with our CNN model, validation loss increased with each additional epoch after a certain point. As shown in Figure 5, training accuracy of this model also increased at the cost of validation accuracy. However, this was not the case for deep language representation models classifying texts by minor categories. Overall, our results show that, between the two BERT models, the BERT-CNN model demonstrates superior performance against bag-of-words approaches and other models that utilize neural networks.
Table 2: Test performance of baselines and proposed models for classifying quasi-sentences by major and minor category

| Category | Model | Test Loss | Test Accuracy | Micro-F1 | Macro-F1 |
|---|---|---|---|---|---|
| Major | MNB | — | 0.553 | 0.553 | 0.398 |
| | SVM | — | 0.578 | 0.578 | 0.460 |
| | CNN | 1.177 | 0.589 | 0.589 | 0.466 |
| | BERT-GRU | 1.166 | 0.594 | 0.593 | 0.479 |
| | BERT-CNN | 1.152 | 0.591 | 0.591 | 0.473 |
| Minor | MNB | — | 0.385 | 0.385 | 0.154 |
| | SVM | — | 0.463 | 0.463 | 0.299 |
| | CNN | 2.136 | 0.454 | 0.454 | 0.273 |
| | BERT-GRU | 2.216 | 0.432 | 0.432 | 0.239 |
| | BERT-CNN | 2.098 | 0.448 | 0.448 | 0.260 |
Figure 5: An illustration of overfitting in our CNN model for classifying manifesto QSs by major policy domain
Comparing modifications to our CNN models, our results suggest that the base model outperforms most alternative model specifications. As outlined in Table 3, reducing the dropout rate to 0.25 marginally improved the model on some indicators. As expected, the removal of uncategorized quasi-sentences yielded improvements in predictions, with a substantially higher Macro-F1 score compared to other model specifications. Based on these results, future work should focus on how model predictions of uncategorized quasi-sentences can be improved, given their heterogeneous nature.
Table 3: Test performance of modifications to our CNN and BERT models for classifying quasi-sentences by major policy domain

| Model | Modification | Test Loss | Test Accuracy | Micro-F1 | Macro-F1 | Epochs |
|---|---|---|---|---|---|---|
| CNN | Base model | 1.177 | 0.589 | 0.589 | 0.466 | 100 |
| | Lemmatized text | 1.174 | 0.585 | 0.585 | 0.460 | 100 |
| | Stemmed text | 1.213 | 0.577 | 0.576 | 0.448 | 100 |
| | Dropout = 0.25 | 1.177 | 0.589 | 0.588 | 0.467 | 100 |
| | Additional layer | 1.180 | 0.586 | 0.586 | 0.462 | 100 |
| | Removing uncategorized QSs | 1.136 | 0.596 | 0.595 | 0.535 | 100 |
| BERT-GRU | Base model | 1.152 | 0.594 | 0.593 | 0.479 | 10 |
| | Training emb | 1.163 | 0.592 | 0.592 | 0.479 | 10 |
| | Recurrent runs, training | 1.234 | 0.582 | 0.581 | 0.459 | 30 |
| | Large, uncased | 1.172 | 0.592 | 0.591 | 0.469 | 10 |
| BERT-CNN | Base model | 1.166 | 0.591 | 0.591 | 0.473 | 10 |
| | Training emb | 1.167 | 0.587 | 0.587 | 0.458 | 10 |
| | Recurrent runs, training | 1.157 | 0.589 | 0.589 | 0.468 | 30 |
| | Large, uncased | 1.192 | 0.580 | 0.580 | 0.450 | 10 |
While we observed some improvements with modifications to the CNN model, we find that our base BERT models performed best compared to other fine-tuned modifications to the model architecture. The results of our base BERT models and alternative model specifications are shown in Table 3. Even though it is possible that our base BERT models are simply best suited to this classification task, our results could also indicate over-fitting or insufficient training given the low number of epochs.
As shown in Figure 5, we observed overfitting with our major policy domain classification models. Despite modifications to our models, including varied dropout rates, architecture fine-tuning, and different learning rates, we did not find any variants of the models employed in our analysis that yielded significant improvements in performance. We posit that these issues could be addressed by employing transfer learning and augmenting our sample of English-language manifestos with other political documents, such as debate transcripts.
In contrast, as shown in Figure 6, we observed little over-fitting in our minor policy domain classification models. Our classifier could benefit from employing transfer learning and augmenting our sample of manifesto quasi-sentences with other political texts, especially for policy domains with relatively few quasi-sentences to train on. It is also important to note that, compared to the more computationally intensive neural networks and deep language representation models, our Multinomial Naive Bayes and SVM baselines did not perform significantly worse. In fact, for the minor categories, the SVM yielded superior performance on some metrics compared to the neural network models. Notwithstanding the limited training of certain models, this suggests that increasing model complexity, and consequently the computational power required, does not necessarily lead to increased model performance.
Figure 6: Training and validation metrics for the BERT-CNN model for classifying English language manifestos by policy preference
Substantially lower Macro-F1 scores across all models point to mixed performance in classification by category. As shown in Figure 7, we observe high variation in the performance of our classifiers between categories. In particular, we observe poor performance in classifying quasi-sentences that do not belong to one of the major policy domains. For our BERT-CNN model, the easiest categories to predict were “welfare and quality of life,” “economy,” and “freedom and democracy.” The strong performance in predicting the first two categories is not particularly surprising, as a substantial number of quasi-sentences in our sample of English-language party manifestos are attributed to these topics. As shown in Figure 1, 30,750 quasi-sentences are attributed to the “welfare and quality of life” category and 24,757 quasi-sentences to the “economy” domain.
Figure 7: Average precision, recall, and Macro-F1 scores by major policy domain across all models
In contrast, the relatively strong performance in predicting the “freedom and democracy” category is surprising. Out of our total sample of \(n_{\mathrm{sentences}}=99,681\), only 4,700 quasi-sentences are attributed to the “freedom and democracy” category. The performance of our classifier on this underrepresented policy domain could be attributed to a variety of possible explanations. One is the presence of distinct features, such as topic-specific vocabulary, that do not exist in other categories. Future work on the classification of political documents that fall under this category would benefit from looking into features that might distinguish this policy domain from others.
In this paper, we trained two variants of BERT—one incorporating a bidirectional GRU model, and another incorporating CNNs. We demonstrate the superior performance of deep language representation models combined with neural networks for classifying policy domains and preferences in the Manifesto Project Corpus. Our proposed method of incorporating BERT with neural networks for classifying English language manifestos addresses issues of reproducibility and scalability in labeling large volumes of political texts. As far as we know, this is the most comprehensive application of deep language representation models and neural networks for classifying statements from political manifestos.
We find that using BERT in conjunction with CNNs yields the best predictions for classifying English language statements parsed from party manifestos. However, our proposed BERT-CNN model requires further fine-tuning before it can reliably improve on less computationally intensive methods and replace human annotation of fine-grained policy positions. As expected, our proposed approach and baselines perform better at classifying major policy domains than minor categories. We also observe differences in performance between categories. Among the major policy domains, the categories that performed best include “welfare and quality of life,” “economy,” and “freedom and democracy.” The strong performance of the latter category is surprising because it makes up the smallest proportion of quasi-sentences in the Manifesto Project Corpus.
There are several avenues for future work on neural networks and deep language representation models for automatically labeling political texts. For instance, investigating the features of individual categories that demonstrate superior performance would shed light on how we can incorporate additional features of texts to improve model performance. This area of research would also benefit from better understanding how we can filter out texts that do not fall into a particular classification scheme. Knowledge on how these issues could be resolved to improve model performance would allow for extensions in the application of deep learning models for classifying political texts.
The data used in their analysis comprises statements from six Democratic and Republican election manifestos from the 2004, 2008 and 2012 elections in the United States.↩︎
The classification scheme of 8 policy domains and 58 policy preferences each include a category of quasi-sentences that are not considered part of any meaningful category.↩︎
The micro score calculates metrics globally, whilst the macro score calculates metrics for each label and reports the unweighted mean.↩︎
This modification is motivated by the results from our base models, which yield lower Macro-F1 scores due to the difficulty of correctly identifying the quasi-sentences that were “not categorized.”↩︎