Environmental Links to Health Outcomes in Rwanda and Tanzania

We present a machine learning approach to predicting maternal and infant health outcomes in Tanzania and Rwanda, using data on the external-environment and built-environment of the individual.

Vishali Sairam https://github.com/abiola1864/ML4Development, Jean Pierre Salendres
2021-05-09

Abstract

Our machine learning project aims to predict maternal and child health outcomes in Rwanda and Tanzania. We use anemia and body mass index (BMI, a proxy for malnutrition) as measures of maternal and infant health outcomes, respectively. We employ datasets from the Demographic and Health Surveys (DHS) that describe an individual’s household characteristics (their built-environment) and the household’s external-environment to classify propensities to these adverse health outcomes. In most remote places, carrying out a full-blown demographic survey and medical screening would be prohibitively time-consuming and expensive, and ML prediction algorithms could help target time and resources. The study’s main question is the following:

For predicting maternal and infant health outcomes, which machine learning method works best and why?

We test the strength of four machine learning methods in measuring the relationship between the built-environment and external-environment and our two binary health outcomes. The four machine learning models are Logistic Regression, Decision Trees, Support Vector Machines (SVM), and Random Forest. We find that the environmental factors have meaningful predictive power for anaemia in mothers but low predictive power for BMI deviations in children.

Introduction / Background

Countries across emerging markets are experiencing unprecedented growth in urban settlements. Much of this growth belies the traditional urban-rural dichotomy prevalent in Western thought, which assumes a marked difference between urban centers and their rural counterparts. In these emerging markets, we observe the growth of peri-urban areas, small towns, and medium-sized villages that vary in infrastructure development. There is more to the relationship than a simple link between city size and development level as traditionally conceived in industrialized nations. In this paper, we explore this non-dichotomous relationship and seek new variables or characteristics that enhance our understanding of household environments. This will in turn improve our ability to predict health outcomes and channel policy and program resources using new analytic models.

We employ DHS datasets as they provide granular characteristics of the individual’s household. These surveys cover two main types of household environment characteristics. The first is a set of descriptive data on the homes of individuals: the distribution of rooms in the house, access to clean water and sewage systems, construction material of the home, and type of roof, among other characteristics. We describe this set of data points as the surveyed individual’s built-environment. The second set of data points relates to the external-environment and includes measures of the weather, aridity of the land, propensity for cultivation, and night-time luminosity, among other variables. By assembling these two types of variables, we seek to find how much they can add to the understanding and prediction of health outcomes in a context where the rural-urban dichotomy is insufficient to predict health outcome propensities. The observed variation in the built-environment and external-environment variables will enhance our understanding of the continuum in population center types to better describe and classify target groups.

Multiple sources of literature have characterized the varied nature of urban transformation in developing countries. Dijkstra et al (2018) developed a new measure of urbanization using spatial data and differential measures of a population grid. They show that the population share in rural areas as defined by the degree of urbanisation is similar to the share reported under national definitions in most countries in the Americas and Europe, but radically different in many African and Asian countries. A similar methodology is followed by Galdo, Li and Rama (2020) in the case of India.

Moreover, satellite sensing has been used extensively to study economic well-being and poverty in Africa. Georganos et al (2019) employ satellite-derived VHR land-use/land-cover (LULC) datasets and couple them with the DHS Wealth Index (WI) to provide a city-scale wealth index of Dakar. Night-time luminosity data has been linked to poverty and economic development in several studies, including Chen and Nordhaus (2011) and Bruederle (2018).

Lastly, we place our study in the increasingly researched literature linking environmental change and urbanization to specific health outcomes across Africa and Asia. Montgomery (2008) uses data from urban samples of 85 Demographic and Health Surveys to find that the neighborhoods of relatively poor households are more heterogeneous than is often asserted. However, we are not aware of existing literature going beyond economic or health metrics using broad categories and diving into changes in specific health outcomes, such as maternal anemia and infant malnutrition. More importantly, the literature provides examples of global or continental analyses rather than more focused country-studies that may exploit context-specific variation in our urbanization and health outcomes data.

Proposed Method

Given that we have a total of 23,249 observations spread across both countries and outcomes, we use four widely used binary classification techniques to build models that predict the outcomes. The algorithms we chose are Logistic Regression (as the baseline), Decision Trees, Support Vector Machines and one example of Ensemble Learning (Random Forests). All four models are supervised learning algorithms, since our data is labelled and our focus is on predicting the outcome. In order to observe trends and contributions of different household characteristics, we also apply dimensionality reduction to the dataset. The idea is that this overview may uncover the relationships between observations and variables, and among the variables.

Since we include all the environment variables in the initial models, the high dimension size makes it more difficult to accurately predict outcomes. In order to address this, we apply an initial feature selection. Specifically, we do univariate feature selection and feature importance analysis to understand which variables are most predictive.
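The two selection steps named above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the study's code: the scoring function (`f_classif`), the forest size, and the choice of k = 20 (matching the "top 20 features" mentioned in the results) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the merged DHS feature matrix (not the study data).
X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

# Univariate selection: score each feature against the outcome on its own.
selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)
univariate_top = np.argsort(selector.scores_)[::-1][:20]

# Feature importance: impurity-based importances from a fitted forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importance_top = np.argsort(forest.feature_importances_)[::-1][:20]

# Reduced matrix holding only the 20 best univariate features.
X_top20 = selector.transform(X)
```

Comparing `univariate_top` with `importance_top` shows which variables both approaches agree on.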

Performing feature selection before modeling has several benefits, among them reducing overfitting: with less redundancy in the data, there are fewer chances of making predictions based on noise. It can also improve accuracy by removing uninformative variables. Lastly, feature selection reduces training time in all models, since each model has less data to process. Feature importance scores are helpful in prediction problems generally, including the binary outcomes in this study.

Logistic regression (Model 1, Baseline) has typically been used to analyze binary data and is commonly used as an inferential tool and binary classification model in population health research. Since it has generally been the go-to method for binary classification problems like ours, we use it as the baseline model. We use the liblinear solver, which applies automatic parameter selection. This is recommended for high-dimensional datasets and large-scale classification problems like ours. Related studies such as Beluzo (2020) use logistic regression as the baseline in their models to predict infant mortality. To varying degrees, these studies conclude that logistic regression models are quite efficient in classifying mortality.
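A baseline of this kind might look like the sketch below. The data is synthetic, and the scaling step, the 75/25 split, and the default regularization strength are assumptions rather than the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary outcome standing in for anemia / malnutrition labels.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Baseline: standardize features, then fit logistic regression with liblinear.
baseline = make_pipeline(StandardScaler(),
                         LogisticRegression(solver="liblinear"))
baseline.fit(X_train, y_train)

# Share of correct predictions on the held-out set.
baseline_accuracy = baseline.score(X_test, y_test)
```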

Model 2 is a categorical variable decision tree model. Decision Trees use a splitting criterion to decide how to split a node into two or more sub-nodes that are more homogeneous. A decision tree may overfit the training data, and it is for this reason that we use grid search parameter tuning with “recall” scoring. However, Decision Trees generally work best with categorical variables, and hence might be less suited to our case, which consists of both categorical and continuous variables.
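Grid search tuning with recall scoring could be set up as in the sketch below. The parameter grid (`max_depth`, `min_samples_leaf`) and the 5-fold cross-validation are illustrative assumptions; the source does not state which hyperparameters were searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the study data).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Hypothetical grid: both parameters limit tree growth to curb overfitting.
param_grid = {"max_depth": [3, 5, 10, None],
              "min_samples_leaf": [1, 10, 50]}

# scoring="recall" picks the setting with the highest true positive rate.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="recall", cv=5)
search.fit(X, y)

best_tree = search.best_estimator_
best_recall = search.best_score_
```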

Model 3 uses a Support Vector Machine classification. An SVM model finds an optimal boundary between the possible outputs by transforming the data. Model 4 is a Random Forest, which is commonly used in machine learning because it is highly flexible and often provides better predictive performance than a single decision tree. The RF model repeatedly draws bootstrap samples from the training data set and grows a decision tree on each, using a random subset of predictor variables, then combines the trees' predictions.
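The two models can be sketched side by side as follows. The RBF kernel, the number of trees, and `max_features="sqrt"` are assumptions chosen for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

models = {
    # SVM: the kernel implicitly transforms the data before finding a boundary.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    # Random forest: bootstrap samples of rows, random feature subsets per split.
    "random_forest": RandomForestClassifier(
        n_estimators=200, max_features="sqrt", bootstrap=True, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
```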

Figure 1: Models employed

Experiments

Figure 2: Visualization of this study’s data processes

Data: We combine transnational data from two East African countries: Rwanda and Tanzania. They constitute two of the six countries in the East African Community (EAC). The main data source is the Demographic and Health Surveys (DHS) Program, which collects representative national and sub-national data on population, health, HIV, and nutrition through more than 400 surveys in over 90 countries. The DHS data includes household, geographical and geo-spatial data that we match by their cluster (district) ID. We focus on cluster-level data for a granular representation of urbanization in the DHS data from 2015. There is a data collection round at least once every two to three years.
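Matching household records to cluster-level geospatial covariates amounts to a many-to-one join on the cluster ID. The sketch below uses toy values; the key column names (`hv001` in the household file, `DHSCLUST` in the geospatial file) and the covariate names `rainfall` and `nightlights` are assumptions for illustration, not taken from the study's code.

```python
import pandas as pd

# Toy household records; hv001 is assumed to hold the cluster number.
households = pd.DataFrame({
    "hv001": [1, 1, 2, 3],
    "hv204": [10, 30, 5, 60],   # time to nearest water source
    "hv216": [2, 1, 3, 1],      # rooms used for sleeping
})

# Toy cluster-level covariates (hypothetical column names and values).
geo = pd.DataFrame({
    "DHSCLUST": [1, 2, 3],
    "rainfall": [900.0, 1200.0, 750.0],
    "nightlights": [0.1, 2.4, 0.0],
})

# Many households map to one cluster; validate guards against duplicate keys.
merged = households.merge(geo, left_on="hv001", right_on="DHSCLUST",
                          how="left", validate="many_to_one")
```

Using a left join keeps every household even when a cluster has no geospatial record, which makes unmatched rows easy to count afterwards.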

This project selects two dependent variables: one used to predict maternal health and one to predict infant health. We expect heterogeneity across rural/urban residences and we seek to determine which built-environment and external-environment variables help us predict health outcomes.

The maternal health variable, taken from DHS, is anemia (ha57), which is measured through a medical check performed on the mother of the household during house surveys. By choosing this outcome variable, we expect a reduction in observations, since only a percentage of the surveys perform this health check. However, the remaining number of observations is large enough to carry on with anemia as our maternal health outcome variable. Moreover, the medical test provides a robust and reliable measure of anemia, unlike questions on past health episodes, which do not rely on a medical test and increase the uncertainty of the measure. The variable is coded in 4 categories: severe, moderate, mild, and not anemic. We recode it into a binary variable by collapsing severe, moderate, and mild as anemic (1) and the rest as not anemic (0).
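The recoding described above can be sketched as follows. The numeric codes (1 = severe, 2 = moderate, 3 = mild, 4 = not anemic) are an assumption about how ha57 is stored and should be checked against the survey's recode documentation; the values are toy data.

```python
import pandas as pd

# Toy ha57 values; None marks a respondent who was not tested.
ha57 = pd.Series([1, 2, 3, 4, 4, None], name="ha57")

# Keep only respondents with a test result.
tested = ha57.dropna().astype(int)

# Assumed coding: 1 severe, 2 moderate, 3 mild -> anemic (1); 4 -> not anemic (0).
anemic = tested.isin([1, 2, 3]).astype(int)
```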

The infant health variable, also taken from DHS datasets, is a standard deviation combined measure for weight and height (hc11). We recoded this variable into a binary variable that focuses on the negative side of the scale, that is, on being underweight or under-sized. We assume that this will accurately measure levels of malnutrition in the population. We select a range of underweight and under-size that we consider normal (-30 units, or -1 S.D.). Measures under this are coded as malnutrition (1); the remaining observations, within the normal range or above, are coded as healthy (0).
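A minimal sketch of this thresholding, using the cut-off of -30 units stated above on toy values. How missing or flagged measurements were handled in the study is not stated, so dropping them here is an assumption.

```python
import pandas as pd

# Toy hc11 values (weight-for-height score); None marks an unmeasured child.
hc11 = pd.Series([-250, -31, -30, 0, 120, None], name="hc11")

THRESHOLD = -30  # cut-off taken from the text

# Assumption: children without a measurement are dropped.
measured = hc11.dropna()

# Strictly below the cut-off -> malnutrition (1); otherwise healthy (0).
malnourished = (measured < THRESHOLD).astype(int)
```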

Figure 3: Feature Selection Maternal Health Outcome

As we can observe from Table I above, for maternal health outcomes, there is a substantial difference in predictive capacity between our two sets of independent variables. External-environment variables (rainfall, temperatures, wet days, etc.) have lower predictive capacity for our maternal health outcome (anemia). In comparison, the built-environment variables (hv–, representing household rooms, type of toilet facilities, etc.) have a much higher predictive capacity. Specifically, the time to nearest water source (hv204), the number of rooms used for sleeping (hv216), and the number of mosquito bed nets (hml1) show around 40-60% higher predictive power.

In parallel, for child health outcomes, we see a similar breakdown in the feature selection chart (Table II). The top 15 of 20 variables are household variables from the built-environment dataset. Once again, we see that time to nearest water source (hv204), number of rooms used for sleeping (hv216), and number of mosquito bed nets (hml1) are the top predictor variables.

Figure 4: Feature Selection Infant Health Outcome

Software: The Python programming language (version 3.6.0) and the scikit-learn package (Kuhn, 2020) were used to perform the data processing and analysis. We also used the R programming language (version 3.6.0) for obtaining the data, merging datasets and initial cleaning.

Evaluation method: For our binary classification models, we use three evaluation metrics:

Accuracy Rate: The percentage of correct predictions of the given outcome.

False Negative Rate: False Negatives / Total Positives - The percentage of truly positive cases that the model classifies as negative (Type II error). A lower False Negative Rate is better.

Recall / True Positive Rate: True Positives / Total Positives - The percentage of those correctly identified as having malnutrition out of all people having malnutrition. A higher Recall is better. The grid search algorithm uses recall as the scoring metric for hyperparameter tuning.
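The three metrics can be computed from a confusion matrix as in the sketch below. The labels are made-up toy values, chosen only so the arithmetic is easy to follow.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)      # (tp + tn) / all cases
recall = recall_score(y_true, y_pred)          # tp / (tp + fn)
false_negative_rate = fn / (fn + tp)           # equals 1 - recall
```

Because recall and the false negative rate share the same denominator (all actual positives), they always sum to one, so tuning for higher recall is the same as tuning for a lower false negative rate.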

Results:

Figure 5: Model Results

Comment on quantitative results: We performed the analysis using three different configurations. The first uses all features with default parameters, the second uses all features with grid search parameters, and the third contains only the top 20 features selected through a feature selection method. We evaluate the model predictions using accuracy rate (since the classes are not highly imbalanced in our case) and false negative rate (false negatives / (false negatives + true positives)). The latter gives the percentage of positive cases that were identified as negative. The accuracy rate is the total percentage of correct predictions. In the case of anaemia, we see that logistic regression and random forests perform best on accuracy, albeit with high false negative rates. Decision trees perform best in terms of false negative rates. In the case of malnourishment, all the models perform worse than in the case of anaemia. Moreover, while there is little difference between accuracy scores across the models, a grid search with all the features has a lower false negative rate, indicating that it is better to use all the variables for malnourishment. In the case of anaemia, the models with varying features give broadly similar false negative rates.

Here, all features with default parameters are our base. For both BMI and anaemia, the models achieve moderate accuracy rates while achieving lower false negative rates. The best performing model for anaemia is a decision tree with selected features. Even with all the features using a grid search, the decision tree model seems to be a better fit, with 62 percent accuracy and a 0.57 false negative rate. However, in all cases the false negative rates are very high: roughly half of the positive cases are classified as not anemic. In the case of malnutrition, the results are quite similar.

Analysis

Outcome Variable: A key aspect that determines the performance of a model is the selection of outcome variables. We chose weight-height standard deviation as our measure and, since it is a continuous measure, we converted it to a binary variable. We tried multiple thresholds before settling on less than 30 sd as severely underweight. Our analysis is sensitive to this choice.

Binary Classification Methods: Our results do not show a meaningful difference between the baseline model and the other models in terms of accuracy rates, while there are differences across false negative rates. While decision trees are a better choice for anaemia, random forests seem to be a better fit for malnutrition. Random forests leverage the power of multiple decision trees and provide higher accuracy rates in both cases. However, the presence of high false negative rates, especially in the case of anaemia, could mean that there is over-fitting.

Feature Importance: Our results suggest that household built environment variables are important predictors of health outcomes. We contrast this with the model with all features, where the accuracy and false negative rates remain similar.

Tradeoff between higher accuracy and false negative rates: While none of the models achieves more than 65% accuracy, this exercise led us to understand that a model with higher accuracy is not always better in our case, since it may still have a higher false negative rate.

Precision Recall Curve / AUC: The above may be an instance of the class imbalance problem, and the area under the precision-recall curve, which shows the tradeoff between precision and recall, may be a better measure for interpreting the models (see Appendix I). Precision is the proportion of cases predicted as positive that are truly positive, while recall is the proportion of truly positive cases that the model identifies. We see that the decision tree does seem to be a better model in the case of anaemia, since the curves cross at a lower threshold.
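Precision and recall as functions of the decision threshold, and the point where the two curves cross, can be obtained roughly as below. The data is synthetic and imbalanced by construction; the decision tree depth and the use of average precision as the area summary are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 30% positives (not the study data).
X, y = make_classification(n_samples=800, n_features=20, weights=[0.7, 0.3],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
proba = tree.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# precision and recall each have one more entry than thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Threshold at which the two curves are closest to crossing.
crossing = thresholds[np.argmin(np.abs(precision[:-1] - recall[:-1]))]

# Average precision summarizes the area under the precision-recall curve.
ap = average_precision_score(y_test, proba)
```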

PCA: We also ran the same models on a feature-selected dataset transformed with PCA. The results are displayed below. Using PCA does not significantly change the predictive power of any of the models.
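A PCA step placed in front of a classifier might be sketched as follows. Keeping enough components to explain 95% of the variance is an assumption; the source does not state how many components were retained.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Standardize, project onto principal components, then classify.
# n_components=0.95 keeps the fewest components explaining 95% of variance.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LogisticRegression(solver="liblinear"),
)

pca_accuracy = cross_val_score(pca_model, X, y, cv=5).mean()

# Fit once on all data to inspect how many components were kept.
pca_model.fit(X, y)
n_kept = pca_model.named_steps["pca"].n_components_
```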

Figure 6: Precision Recall as function of Decision Threshold: Anaemia

Conclusions

Some of the primary limitations of the work involve the dependent variables: maternal anemia and child malnutrition. A large number of observations and much survey data are lost when homing in on one specific health outcome proxy variable. For maternal health, the anemia test is not applied across all surveys and therefore limits the quantity of matched data. Similarly, the number of observations where surveyors weighed the child and measured the height of children in the household sets an upper limit on the observations. Overall, the choice of proxy variable for the health outcome leads to reductions in the sample size. It is difficult to overcome this obstacle using labelled data methods. Unsupervised machine learning models may provide a possible research avenue, since they would not be limited to labelled data, as is the case for our project.

In addition, as potential future work, we recommend incorporating external satellite imagery. This could involve neural network models that capture factors outside the reach of classic surveys and better describe the granular variation of households along the urban-rural spectrum, aiding the prediction of specific health outcomes. Satellite imagery can be costly or difficult to access and clean, but it may produce more accurate and potentially real-time assessments of health outcome probabilities across regions based on environmental factors.

Future Research: One way to obtain more predictive power is to tackle the class imbalance problem in the anaemia dataset. This could be done by oversampling the positive class, by undersampling the negative class, or by adding class weights. For the malnutrition data, creating an index of malnutrition by including other related variables from the dataset could lead to a potentially more accurate model.
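Two of the options listed above, class weights and oversampling the positive class, could look like the sketch below. The data is synthetic with about 20% positives, and the classifier choice is an assumption; none of this was run on the study data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 80% negatives, 20% positives.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Option 1: class weights inversely proportional to class frequencies.
weighted = LogisticRegression(solver="liblinear", class_weight="balanced")
weighted.fit(X_train, y_train)

# Option 2: oversample positives (with replacement) in the training set only.
neg = X_train[y_train == 0]
pos = X_train[y_train == 1]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
X_bal = np.vstack([neg, pos_up])
y_bal = np.concatenate([np.zeros(len(neg), dtype=int),
                        np.ones(len(pos_up), dtype=int)])
oversampled = LogisticRegression(solver="liblinear").fit(X_bal, y_bal)
```

Resampling only the training split keeps the test set at its original class balance, so the reported false negative rate still reflects the real distribution.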

Main References

Beluzo, Carlos Eduardo, et al. Machine Learning to Predict Neonatal Mortality Using Public Health Data from São Paulo - Brazil. https://doi.org/10.1101/2020.06.19.20112953

Bruederle, A., Hodler, R. (2018). Nighttime lights as a proxy for human development at the local level. PloS one, 13(9), e0202231. https://doi.org/10.1371/journal.pone.0202231

Lee, Jennifer, et al. “Predicting Mortality Risk for Preterm Infants Using Random Forest.” Scientific Reports, vol. 11, no. 1, Mar. 2021, p. 7308. www.nature.com, doi:10.1038/s41598-021-86748-4.

Appendix I

Figure 7: Precision Recall as function of Decision Threshold: Anaemia

Figure 8: Precision Recall as function of Decision Threshold: Malnutrition

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/allisonkoh/distill-template/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".