Guest post, Lauren Weiss, European Evaluation Society
As you may be aware, the European Evaluation Society’s biennial conference has been postponed to September 2021, due to the COVID-19 pandemic.
In the meantime, EES is continuing to work for you, and we are excited to announce the launch of two new initiatives.
First, our new podcast series, EvalEdge, is now available! It focuses on the role of evaluation in shaping how new and emerging technologies can be adapted in international development and in society at large. It explores the latest technological developments, from big data and geospatial analysis to blockchain and the Internet of Things (IoT).
Our first episode features MERL Tech’s co-founder Linda Raftree, who discusses innovative examples of using big data, the ethical considerations to be aware of, and much more! Check it out here!
Building on this momentum, EES is also launching a webinar series titled “Emerging Data Landscapes in M&E.” In partnership with Dev CAFÉ, MERL Tech, and the World Bank IEG, this series is devoted to discussing the use of innovative technologies in the world of evaluation.
This interactive and free webinar will provide concrete examples of using geospatial and location data to improve our M&E practices. It will also discuss the barriers to using such technologies and brainstorm on ways to overcome them, by inviting feedback and questions from the online audience.
It will include speakers from the World Bank IEG, the European Commission’s DEVCO/ESS, and the Global Environment Facility. You can find more information on our website.
In many developing country environments, it is difficult or impossible to obtain recent, reliable estimates of human development. Nationally representative household surveys, which are the standard instrument for determining development policy and priorities, are typically too expensive to collect with any regularity.
Recently, however, researchers have shown the potential for remote sensing technologies to provide a possible solution to this data constraint. In particular, recent work indicates that satellite imagery can be processed with deep neural networks to accurately estimate the sub-regional distribution of wealth in sub-Saharan Africa.
Testing Neural Networks to Process Satellite Imagery
In the paper, Can Human Development be Measured with Satellite Imagery?, Andrew Head, Mélanie Manguin, Nhat Tran, and Joshua Blumenstock explore the extent to which the same approach – of using convolutional neural networks to process satellite imagery – can be used to measure a broader set of human development indicators, in a broader range of geographic contexts.
Their analysis produces three main results:
They successfully replicate prior work showing that satellite images can accurately infer a wealth-based index of poverty in sub-Saharan Africa.
They show that this approach can generalize to predicting poverty in other countries and continents, but that the performance is sensitive to the hyperparameters used to tune the learning algorithm.
They find that this approach does not trivially generalize to predicting other measures of development such as educational attainment, access to drinking water, and a variety of health-related indicators.
This paper shows that while satellite imagery and machine learning may provide a powerful paradigm for estimating the wealth of small regions in sub-Saharan Africa, the same approach does not trivially generalize to other geographical contexts or to other measures of human development.
In this assessment, it is important to emphasize what they mean by “trivially,” because in truth the point they are making is somewhat circumspect. Specifically, what they have shown is that the exact framework—of retraining a deep neural network on night-lights data, and then using those features to predict the wealth of small regions in sub-Saharan Africa—cannot be directly applied to predicting arbitrary indicators in any country with uniformly good results.
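To make the second stage of that framework concrete, here is a minimal sketch in Python, not the authors' code. It assumes the image features have already been extracted by a network; random numbers stand in for them, and the names `features`, `wealth`, and `unrelated` are hypothetical. The same features are used to predict two synthetic targets, one that depends on the features and one that does not, and each model is scored with cross-validated R².

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def cross_validated_r2(X, y, alpha=1.0, k=5, seed=0):
    """Mean out-of-sample R^2 over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = ridge_fit(X[train], y[train], alpha)
        scores.append(r_squared(y[test], X[test] @ w + b))
    return float(np.mean(scores))

# Synthetic stand-ins (assumptions, not real data):
rng = np.random.default_rng(42)
features = rng.normal(size=(300, 20))   # one row of image features per survey cluster
wealth = features @ rng.normal(size=20) + rng.normal(size=300)  # depends on the features
unrelated = rng.normal(size=300)        # an indicator the features say nothing about

r2_wealth = cross_validated_r2(features, wealth)
r2_unrelated = cross_validated_r2(features, unrelated)
print(f"wealth-like target:  R^2 = {r2_wealth:.2f}")
print(f"unrelated target:    R^2 = {r2_unrelated:.2f}")
```

The pipeline is identical for both targets; only held-out validation reveals that it works for one and not the other, which is why the same recipe cannot be assumed to transfer across indicators.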
This is an important point to make because absent empirical evidence to the contrary, it is likely that policymakers eager to gain quick access to micro-regional measurements of development might be tempted to do exactly what they have done in this paper, without paying careful attention to the thorny issues of generalizability that they have uncovered in this analysis.
It is not the researchers’ intent to impugn the potential for related approaches to provide important new methods for measuring development, but rather to say that such efforts should proceed with caution, and with careful validation.
Why Satellite Imagery Might Fail to Predict Development
The results showed that while some indicators like wealth and education can be predicted reasonably well in many countries, other development indicators are much more brittle, exhibiting high variance between and within countries, and others perform poorly everywhere.
Thus it is useful to distinguish between two possible reasons why the current approach may have failed to generalize to these measures of development.
It may be that this exercise is fundamentally not possible, and that no amount of additional work would yield qualitatively different results.
It is quite possible that their investigation to date has not been sufficiently thorough, and that more concerted efforts could significantly improve the performance of these models.
Insufficient “signal” in the satellite imagery.
The researchers’ overarching goal is to use information in satellite images to measure different aspects of human development. The premise of such an approach is that the original satellite imagery must contain useful information about the development indicator of interest. Absent such a signal, the model is destined to fail, no matter how sophisticated it is.
The fact that wealth specifically can be measured from satellite imagery is quite intuitive. For instance, there are visual features one might expect to correlate with wealth: large buildings, metal roofs, nicely paved roads, and so forth.
It may be the case that other measures of human development cannot be seen from above. For instance, it may be fundamentally difficult to infer the prevalence of malnutrition from satellite imagery if regions with high and low rates of malnutrition appear similar, even though the researchers hypothesize that such indices should correlate with the wealth index.
They were, however, surprised by the relative under-performance of models designed to predict access to drinking water, as they expected the satellite-based features to capture proximity to bodies of water, which in turn might affect access to drinking water.
(Over-) reliance on night-lights may not generalize.
Their reliance on night lights might help explain why some indicators were predicted less successfully in some countries than others. One example from their study is Nepal, where the accuracy in predicting access to electricity was much lower (R² = 0.24) than in the other countries (R² = 0.69, 0.44, and 0.54 in Rwanda, Nigeria, and Haiti, respectively).
This may be partly due to the fact that Nepal has a very low population density (half as dense as Haiti and Rwanda) and very high levels of electrification (twice as high as Haiti, Rwanda, and Nigeria).
If the links between electrification, night-lights, and daytime imagery are broken in Nepal, they would expect their modeling approach to fail. More generally, they expect that when a development indicator does not clearly relate to the presence of nighttime lights, it may be unreasonable to expect good performance from the transfer learning process as a whole.
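For readers less familiar with the metric, the accuracy figures quoted above can be read against the standard definition of the coefficient of determination. This is an assumption about how the scores were computed; the summary here does not say whether the paper reports this quantity or a squared correlation. With observed values y_i, model predictions ŷ_i, and the sample mean ȳ:

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

Under that definition, R² = 0.24 for Nepal would mean the model's squared error is still 76% of what one would get by predicting the mean everywhere (1 - 0.24 = 0.76), compared with 31% for Rwanda (1 - 0.69 = 0.31).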
Deep learning vs. supervised feature engineering.
In this paper, the researchers focused explicitly on using the deep/transfer learning approach to extract information from satellite images. While this approach is powerful, it is possible that other approaches to feature engineering might be more successful than the brute-force approach of the convolutional neural network.
For instance, Gros and Tiecke have recently shown how hand-labeled features from satellites, and specifically information about the types of buildings that are present in each image, can be quite effective in predicting population density. Labeling images in this manner is resource intensive, and the researchers did not have the opportunity to test such approaches.
However, they believe that careful encoding of the relevant information from satellite imagery would likely bolster the performance of specific prediction tasks.
Neural Networks Can Still Process Satellite Imagery
Broadly, the researchers remain optimistic that future work using novel sources of data and new computational algorithms can engender significant advances in the measurement of human development.
However, it is imperative that such work proceed carefully, with appropriate benchmarking and external calibration. Promising new tools for measurement have the potential to be implemented widely, possibly by individuals who do not have extensive expertise in the underlying algorithms.
Applied blindly, these algorithms have the potential to skew subsequent policy in unpredictable and undesirable ways. They view the results of this study as a cautionary example of how a promising algorithm should not be expected to work “off the shelf” in a context that is significantly different from the one in which it was originally developed.