This project uses a machine learning framework to identify the strongest predictors of reading scores for boys and girls from the complete 2018 PISA dataset.
The underlying data for this ML project comes from the most recent 2018 iteration of PISA, considered one of the most comprehensive standardized assessments in education and a yardstick for national education strategies and policy. The administered student questionnaire assesses the academic performance, attitudes and learning environments of 15-year-olds in 79 countries around the world. Most countries assessed between 4,000 and 8,000 students, resulting in a data frame with a total of 612,004 observations of individual students across 1,120 columns.
In order to work with the data, a random sample of 100,000 observations was taken, and feature selection was conducted as a first step in line with a literature review of previous research on literacy using PISA data (Brow 2019). To retain as much of the original data as possible, missing values in the independent variables were imputed using the MissForest imputer. Categorical variables were one-hot encoded, and each ordinal and continuous feature was normalized to a value between 0 and 1 with sklearn’s MinMaxScaler. In the MissForest model, missing values are first filled in using median/mode imputation; in subsequent iterations, they are re-predicted by a Random Forest model trained on the observed values (stage 2 in fig. 2). The final result was a dataset of 205 features with high internal validity, which were used as independent variables to predict the continuous variable reading score, with a mean of 500 points and a standard deviation of approximately 100 points. Due to the large number of observations, the project reduced the training share of the train-test split from 80% to 70%, with 10% testing data and two validation sets of 10% each (stage 3).
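The preprocessing steps above can be sketched as follows. This is a minimal illustration on toy data, not the project’s actual code: the column names are invented, and scikit-learn’s IterativeImputer with Random Forest estimators is used here as a stand-in approximation of the MissForest procedure.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# Toy stand-in for the sampled PISA frame (column names are illustrative).
df = pd.DataFrame({
    "escs": [0.3, np.nan, -1.2, 0.8, 0.1, np.nan],
    "books_at_home": [2, 4, np.nan, 1, 3, 2],
    "gender": ["female", "male", "female", "male", "female", "male"],
    "read_score": [512, 478, 430, 555, 501, 466],
})

# Impute numeric features; IterativeImputer starts from simple initial
# fills and iteratively re-predicts missing values with the estimator,
# which approximates MissForest when Random Forests are used.
num_cols = ["escs", "books_at_home"]
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    random_state=0,
)
df[num_cols] = imputer.fit_transform(df[num_cols])

# One-hot encode categoricals; min-max scale numeric features to [0, 1].
X_cat = OneHotEncoder().fit_transform(df[["gender"]]).toarray()
X_num = MinMaxScaler().fit_transform(df[num_cols])
X = np.hstack([X_num, X_cat])
y = df["read_score"].to_numpy()

# First cut of the 70/10/10/10 split: 70% training, 30% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0)
# X_rest would then be split again into the test set and two
# validation sets of 10% each.
```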
All models were evaluated using Root Mean Square Error (RMSE) as the metric. Figure 3 visualizes the model selection process. The baseline model scored an RMSE of 68.52. The only improvement on that score was achieved by ridge regression. Initially, after 10-fold cross-validation, the RMSE was 69.15. In the second step, the model was fine-tuned using scikit-learn’s GridSearchCV. The final hyperparameters found through grid search were alpha=5, fit_intercept=True and solver=‘sag’. Through these changes, the RMSE slightly improved to 68.00.
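A grid search over these ridge hyperparameters might look as follows. This is a sketch on synthetic data; the grid values other than alpha=5 and solver='sag' are illustrative guesses, not the project’s actual search space.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# Grid over the hyperparameters named in the text (other values illustrative).
param_grid = {
    "alpha": [0.1, 1, 5, 10],
    "fit_intercept": [True, False],
    "solver": ["auto", "sag", "saga"],
}
search = GridSearchCV(
    Ridge(max_iter=5000),
    param_grid,
    cv=10,  # 10-fold cross-validation, as in the text
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV refits the best parameter combination on the full data, so `search.best_estimator_` can be used directly for prediction afterwards.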
Neither ridge regression on polynomial-transformed data nor any of the tree-based methods yielded improved scores. After fine-tuning, the RMSE of the ridge regression with polynomial data was around 80.67. The best-performing Random Forest also scored a few points above the baseline (72.62). The Extra Trees model (70.29) improved on Random Forest but still performed worse than the linear model. Thus, to build an ensemble, we chose the tuned ridge regression model (RMSE 68.00) and the best Extra Trees model (RMSE 70.29). The ensemble (RMSE 66.40) outperformed the baseline.
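One common way to combine a linear and a tree-based model is to average their predictions, e.g. with scikit-learn’s VotingRegressor. The sketch below assumes simple averaging on synthetic data; the text does not specify exactly how the project’s ensemble was built, so this is one plausible construction, not the original code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import ExtraTreesRegressor, VotingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data with both linear and non-linear structure.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
y = X[:, 0] * 4 + np.sin(X[:, 1]) * 3 + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Average the predictions of the tuned ridge and the Extra Trees model.
ensemble = VotingRegressor([
    ("ridge", Ridge(alpha=5)),
    ("extra_trees", ExtraTreesRegressor(n_estimators=100, random_state=1)),
])
ensemble.fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, ensemble.predict(X_te)) ** 0.5
print(f"ensemble RMSE: {rmse:.2f}")
```

Averaging tends to help when the component models make uncorrelated errors, which is plausible for a linear model paired with a tree ensemble.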
The model choice was based on a comparison of the RMSE of all models and considerations of interpretability. Although the ensemble performed best, it could not be chosen for the subsequent feature analysis due to a lack of appropriate interpretation methods regarding feature importance. Thus, the fine-tuned ridge regression was chosen to examine the most important features for boys and girls.
Figure 4 provides an overview of the 10 most important predictors for girls and boys, including the coefficients derived from the gender-specific models (yellow numbers represent the female subset, green numbers the male subset). Depending on the direction of its sign, this relative score highlights the features that are most relevant for predicting the reading score, either positively or negatively. A comparison of the most important predictors for both groups shows that they vary in magnitude as well as in order of importance, which supports our research question. On the one hand, certain teaching strategies, such as short summaries of previous lessons, have a much larger coefficient for boys than for girls (30.7 vs. 20.6, not included in the top ten predictors for girls), while other factors, such as the expectation of a Bachelor’s degree, correlate strongly with reading performance for boys (24.62) but do not appear among the most important predictors for girls at all. Other predictors, on the other hand, rank similarly in magnitude, such as perseverance (“If I am not good at something, I would rather keep struggling to master it than move on to something I may be good at”) or the impact of grade repetition.
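Ranking features by the absolute value of their ridge coefficients, while keeping the sign for the direction of the effect, can be done as in the sketch below. The feature names and coefficient values here are placeholders on synthetic data, not the PISA features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic data: 20 features, 3 of which actually drive the outcome.
rng = np.random.default_rng(2)
feature_names = [f"feature_{i}" for i in range(20)]  # placeholder names
X = rng.normal(size=(300, 20))
true_coef = np.zeros(20)
true_coef[[0, 3, 7]] = [30.0, -25.0, 18.0]
y = X @ true_coef + rng.normal(scale=1.0, size=300)

model = Ridge(alpha=5).fit(X, y)

# Rank features by absolute coefficient size; the retained sign shows
# whether a feature predicts the score positively or negatively.
coefs = pd.Series(model.coef_, index=feature_names)
top10 = coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10)
print(top10)
```

Fitting this once on the female subset and once on the male subset yields the two gender-specific rankings compared in Figure 4.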
The scatterplot in Fig. 5 shows the model’s predictions of individual students’ reading scores. Perfect alignment with the red line would represent a perfect prediction of the actual scores; deviations from the red line represent larger prediction errors. Judging from the shape of the scatterplot, the model predicts quite uniformly across different reading scores; however, it shows a slight tendency to underestimate the scores of low-performing students and to overestimate the scores of high-performing students. The gender differences in predictors suggest that boys and girls respond to their learning environment differently. Designing gender-specific interventions, and having an increased awareness of how these factors are expected to impact reading scores differently, may be a key strategy for boosting the reading scores of low-performing boys in particular.
The workflow of data preprocessing, modeling and fine tuning presented in this project can be replicated relatively easily and customized to specific subsets of the data. It could therefore serve as a blueprint for future research in certain geographical areas or comparative research between countries of the PISA sample. Since dividing the dataset by gender produces pronounced differences, it may be worthwhile to compare subsets of the data by other factors, such as high- and low-performing students or demographic variables, such as income.
This project demonstrated an attempt at combining sophisticated machine learning approaches and algorithms with an established field of education research. While this project’s research question and sampling strategy were intentionally global and broad, future research could wed domain-specific knowledge and focused research questions with strategies from the field of causal inference. Together, these two domains hold the potential to reasonably suggest policy interventions for populations of interest.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/AnnaWeronikaMatysiak/PISA_Revisited, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".