At MERL Tech London 2018, we invited Michael Bamberger and Rick Davies to debate whether the enthusiasm for Big Data in evaluation is warranted. In a formal debate, skillfully managed by Shawna Hoffman from The Rockefeller Foundation, they discussed whether Big Data and evaluation would eventually converge, whether one would dominate the other, how they can and should relate to each other, and what risks and opportunities there are in this relationship.
Following the debate, Michael and Rick wanted to continue the discussion, this time exploring the issues in a more conversational mode on the MERL Tech Blog, because in practice both of them see more than one side to the issue.
So, what do Rick and Michael think — will big data integrate with evaluation — or is it all just hype?
Rick: In the MERL Tech debate I put a lot of emphasis on the possibility that evaluation, as a field, would be overwhelmed by big data / data science rhetoric. But since then I have been thinking about a countervailing development, which is that evaluative thinking is pushing back against unthinking enthusiasm for the use of data science algorithms. I emphasise “evaluative thinking” rather than “evaluators” as a category of people, because a lot of this pushback is coming from people who would not identify themselves as evaluators. There are different strands to this evaluative response.
One is a social justice perspective, reflected in recent books such as “Weapons of Math Destruction”, “Automating Inequality”, and “Algorithms of Oppression”, which emphasise the human cost of poorly designed and/or poorly supervised use of algorithms that draw on large amounts of data to improve welfare and justice administration. Another strand is more like a form of exploratory philosophy, and has focused on how it might be possible to define “fairness” when designing and evaluating algorithms that have consequences for human welfare [See 1, 2, 3, 4]. A third strand is perhaps more technical in focus, but still has a value concern. This is the literature on algorithmic transparency. Without transparency it is difficult to assess fairness [See 5, 6]. Neural networks are often seen as a particular challenge. Associated with this are discussions about “the right to explanation” and what this means in practice [1].
In parallel there is also some infiltration of data science thinking into mainstream evaluation practice. DFID is funding the latest call for “nimble evaluations” from the World Bank’s Strategic Impact Evaluation Fund (SIEF). These are described as rapid and low-cost, and are likely to take the form of RCTs, but ones focused on improving implementation rather than assessing overall impact. This type of RCT is directly equivalent to the A/B testing used by the internet giants to improve the way their platforms engage with their users. Hopefully these nimble approaches will bring more immediate benefits to people’s lives than RCTs that have tried to assess the impact of a whole project and then inform the design of subsequent projects.
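For readers less familiar with A/B testing, the comparison at its core can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the example figures are hypothetical, and the two-proportion z-test used here is one common way of comparing two variants, not a method specified by SIEF or by any particular platform.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Compare the success rates of two variants (A and B).

    Returns (difference in rates, z statistic, two-sided p-value).
    Assumes both groups are large enough for the normal approximation.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value


# Hypothetical example: two ways of delivering a reminder message,
# each tried with 200 households; "success" is attending a clinic visit.
difference, z, p_value = two_proportion_z_test(40, 200, 60, 200)
print(f"difference={difference:.3f}, z={z:.2f}, p={p_value:.3f}")
```

In a nimble evaluation the two variants would be alternative ways of implementing the same intervention, and the outcome would be measured quickly enough to adjust implementation while the project is still running.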
Another recent development is the World Bank’s Data Science competition, where participants are being challenged to develop predictive models of household poverty status, based on World Bank Household Survey data. The intention is that they should provide a cheaper means of identifying poor households than simply relying on what can be very expensive and time-consuming nationwide household surveys. At present the focus on the supporting website is very technical. As far as I can see there is no discussion of how the winning prediction model will be used and how any risks of adverse effects might be monitored and managed. Yet as I suggested at MERL Tech London, most algorithms used for prediction modelling will have errors. The propensity to generate False Positives and False Negatives is machine learning’s equivalent of original sin. It is to be expected, so it should be planned for. Plans should include systematic monitoring of errors and a public policy for correction, redress and compensation.
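To make the idea of systematic error monitoring concrete, here is a minimal Python sketch of how False Positives and False Negatives could be counted once a model's predictions are checked against verified household status. The function name and the example data are hypothetical and are not drawn from the World Bank competition; a real monitoring plan would also need to say how verified status is obtained and how often the check is repeated.

```python
def poverty_prediction_errors(actual_poor, predicted_poor):
    """Count prediction errors for a household poverty classifier.

    actual_poor and predicted_poor are equal-length sequences of booleans,
    one entry per household. A false positive is a non-poor household
    predicted to be poor; a false negative is a poor household that the
    model misses.
    """
    if len(actual_poor) != len(predicted_poor):
        raise ValueError("Both sequences must cover the same households")

    tp = fp = fn = tn = 0
    for actual, predicted in zip(actual_poor, predicted_poor):
        if actual and predicted:
            tp += 1
        elif not actual and predicted:
            fp += 1
        elif actual and not predicted:
            fn += 1
        else:
            tn += 1

    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "true_negatives": tn,
        # Share of non-poor households wrongly flagged as poor.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        # Share of poor households the model fails to identify.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


# Hypothetical verification sample of eight households.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, False, True, True, False, False, False, False]
print(poverty_prediction_errors(actual, predicted))
```

The two error rates matter in different ways. A false negative means a poor household is excluded from support, while a false positive means resources go to a household that was not the intended target, so a public policy for correction and redress would probably need to treat them separately.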
Michael: These are both important points, and it is interesting to think about what conclusions we can draw for the question before us. Concerning the important issue of algorithmic transparency (AT), Rick notes that a number of widely discussed books and articles have pointed out the risk that the lack of AT poses for democracy, and particularly for poor and vulnerable groups. Virginia Eubanks, one of the authors cited by Rick, talks about the “digital poorhouse” and how unregulated algorithms can help perpetuate an underclass. However, I think we should examine more carefully how evaluators are contributing to this discussion. My impression, based on very limited evidence, is that evaluators are not at the center (or perhaps even the periphery) of this discussion. Much of the concern about these issues is being generated by journalists, public administration specialists or legal specialists. I argued in an earlier MERL Tech post that many evaluators are not very familiar with big data and data analytics and are often not very involved in these debates. This is a hypothesis that we hope readers can help us to test.
Rick’s second point, about the infiltration of data science into evaluation is obviously very central to our discussion. I would agree that the World Bank is one of the leaders in the promotion of data science, and the example of “nimble evaluation” may be a good example of convergence between data science and evaluation. However, there are other examples where the Bank is on the cutting edge of promoting new information technology, but where potential opportunities to integrate technology and evaluation do not seem to have been taken up. An example would be the Bank’s very interesting Big Data Innovation Challenge, which produced many exciting new applications of big data to development (e.g. climate smart agriculture, promoting financial inclusion, securing property rights through geospatial data, and mapping poverty through satellites). The use of data science to strengthen evaluation of the effectiveness of these interventions, however, was not mentioned as one of the objectives or outputs of this very exciting program.
It would also be interesting to explore to what extent the World Bank Data Science competition that Rick mentions resulted in the convergence of data science and evaluation, or whether it was simply testing new applications of data science.
Finally, I would like to mention two interesting chapters in “Cybersociety, Big Data and Evaluation”, edited by Petersson and Breul (2017, Transaction Publishers). One chapter (by Hojlund et al.) reports on a survey which found that only 50% of professional evaluators claimed to be familiar with the basic concepts of big data, and only about 10% reported having used big data in an evaluation. In another chapter, Forss and Noren reviewed a sample of 25 Terms of Reference (TOR) for evaluations conducted by different development agencies, and found that none of them specifically required the evaluators to incorporate big data into their evaluation design.
It is difficult to find hard evidence on the extent to which evaluators are familiar with, sympathetic to, or using big data in their evaluations, but the examples mentioned above show that there are important questions about the progress made towards the convergence of evaluation and big data.
We invite readers to share their experiences, both of how the two professions are starting to converge and of the challenges that slow down, or even constrain, the process of convergence.
Or sign up for Michael’s full-day workshop on Big Data and Evaluation in Washington, DC, on September 5th, 2018!