Tag Archives: survey

Integrating Big Data into Evaluation: a conversation with Michael Bamberger and Rick Davies

At MERL Tech London, 2018, we invited Michael Bamberger and Rick Davies to debate the question of whether the enthusiasm for Big Data in Evaluation is warranted. At their session, through a formal debate (skillfully managed by Shawna Hoffman from The Rockefeller Foundation), they discussed whether Big Data and Evaluation would eventually converge, whether one would dominate the other, how they can and should relate to each other, and what risks and opportunities there are in this relationship.

Following the debate, Michael and Rick wanted to continue the discussion, this time exploring the issues in a more conversational mode on the MERL Tech Blog, because in practice both of them see more than one side to the issue.

So, what do Rick and Michael think — will big data integrate with evaluation — or is it all just hype?

Rick: In the MERL Tech debate I put a lot of emphasis on the possibility that evaluation, as a field, would be overwhelmed by big data / data science rhetoric. But since then I have been thinking about a countervailing development, which is that evaluative thinking is pushing back against unthinking enthusiasm for the use of data science algorithms. I emphasise “evaluative thinking” rather than “evaluators” as a category of people, because a lot of this pushback is coming from people who would not identify themselves as evaluators. There are different strands to this evaluative response.

One is a social justice perspective, reflected in recent books such as "Weapons of Math Destruction", "Automating Inequality", and "Algorithms of Oppression", which emphasise the human cost of poorly designed and/or poorly supervised algorithms that use large amounts of data with the aim of improving welfare and justice administration. Another strand is more like a form of exploratory philosophy, and has focused on how it might be possible to define "fairness" when designing and evaluating algorithms that have consequences for human welfare [See 1, 2, 3, 4]. A third strand is perhaps more technical in focus, but still has a value concern: the literature on algorithmic transparency. Without transparency it is difficult to assess fairness [See 5, 6]. Neural networks are often seen as a particular challenge here. Associated with this are discussions about "the right to explanation" and what this means in practice [1].

In parallel there is also some infiltration of data science thinking into mainstream evaluation practice. DFID is funding the World Bank's Strategic Impact Evaluation Fund's (SIEF) latest call for "nimble evaluations" [7]. These are described as rapid and low cost, likely to take the form of an RCT, but one focused on improving implementation rather than assessing overall impact [8]. This type of RCT is directly equivalent to the A/B testing used by the internet giants to improve the way their platforms engage with their users. Hopefully these nimble approaches will bring more immediate benefits to people's lives than RCTs which have tried to assess the impact of a whole project and then inform the design of subsequent projects.

Another recent development is the World Bank's Data Science competition [9], where participants are challenged to develop predictive models of household poverty status based on World Bank Household Survey data. The intention is that these should provide a cheaper means of identifying poor households than relying solely on what can be very expensive and time-consuming nationwide household surveys. At present the focus on the supporting website is very technical. As far as I can see, there is no discussion of how the winning prediction model will be used, or how any risks of adverse effects might be monitored and managed. Yet as I suggested at MERL Tech London, most algorithms used for prediction modelling will have errors. The propensity to generate false positives and false negatives is machine learning's equivalent of original sin. It is to be expected, so it should be planned for. Plans should include systematic monitoring of errors and a public policy for correction, redress and compensation.
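To make the point about planning for errors concrete, here is a minimal Python sketch of the kind of routine error monitoring a deployed poverty-prediction model would need. The data, labels and function names are hypothetical and are not part of the World Bank competition; they simply illustrate tracking false positives and false negatives over time.

```python
# Minimal sketch, assuming 1 = classified/verified as poor and 0 = non-poor.
# 'actual' would come from follow-up verification; 'predicted' from the model.

def error_report(actual, predicted):
    """Tally false positives/negatives so error rates can be tracked over time."""
    fp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if p == 0 and a == 1)
    tp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if p == 0 and a == 0)
    return {
        # non-poor households wrongly flagged as poor
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # poor households the model missed
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }

print(error_report(actual=[1, 0, 1, 0, 1, 0], predicted=[1, 1, 0, 0, 1, 0]))
# {'false_positive_rate': 0.333..., 'false_negative_rate': 0.333...}
```

Numbers like these only become useful if they feed into the kind of correction, redress and compensation policy described above.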

Michael: These are both important points, and it is interesting to think what conclusions we can draw for the question before us. Concerning the important issue of algorithmic transparency (AT), Rick points out that a number of widely discussed books and articles have highlighted the risk that the lack of AT poses for democracy, and particularly for poor and vulnerable groups. Virginia Eubanks, one of the authors cited by Rick, talks about the "digital poorhouse" and how unregulated algorithms can help perpetuate an underclass. However, I think we should examine more carefully how evaluators are contributing to this discussion. My impression, based on very limited evidence, is that evaluators are not at the center, or even perhaps the periphery, of this discussion. Much of the concern about these issues is being generated by journalists, public administration specialists or legal specialists. I argued in an earlier MERL Tech post that many evaluators are not very familiar with big data and data analytics and are often not very involved in these debates. This is a hypothesis that we hope readers can help us to test.

Rick's second point, about the infiltration of data science into evaluation, is obviously very central to our discussion. I would agree that the World Bank is one of the leaders in the promotion of data science, and "nimble evaluation" may be a good example of convergence between data science and evaluation. However, there are other cases where the Bank is on the cutting edge of promoting new information technology, but where potential opportunities to integrate technology and evaluation do not seem to have been taken up. An example is the Bank's very interesting Big Data Innovation Challenge, which produced many exciting new applications of big data to development (e.g. climate-smart agriculture, promoting financial inclusion, securing property rights through geospatial data, and mapping poverty through satellites). The use of data science to strengthen evaluation of the effectiveness of these interventions, however, was not mentioned as one of the objectives or outputs of this very exciting program.

It would also be interesting to explore to what extent the World Bank Data Science competition that Rick mentions resulted in the convergence of data science and evaluation, or whether it was simply testing new applications of data science.

Finally, I would like to mention two interesting chapters in Cybersociety, Big Data and Evaluation, edited by Petersson and Breul (2017, Transaction Publishers). One chapter (by Hojlund et al.) reports on a survey which found that only 50% of professional evaluators claimed to be familiar with the basic concepts of big data, and only about 10% reported having used big data in an evaluation. In another chapter, Forss and Noren reviewed a sample of terms of reference (TORs) for evaluations conducted by different development agencies, and found that none of the 25 TORs specifically required the evaluators to incorporate big data into their evaluation design.

It is difficult to find hard evidence on the extent to which evaluators are familiar with, sympathetic to, or actually using big data in their evaluations, but the examples mentioned above raise important questions about how much progress has been made towards the convergence of evaluation and big data.

We invite readers to share their experiences, both on how the two professions are starting to converge and on the challenges that slow down, or even constrain, the process of convergence.

Take our survey on Big Data and Evaluation!

Or sign up for Michael’s full-day workshop on Big Data and Evaluation in Washington, DC, on September 5th, 2018! 

What Are Your ICT4D Challenges? Take a DIAL Survey to Learn What Helps and Hurts Us All

By Laura Walker McDonald, founder of BetterLab.io. Originally posted on ICT Works on March 26, 2018.

DIAL ICT4D Survey

When it comes to the impact and practice of our ICT4D work, we're long on stories and short on evidence. My previous organization, SIMLab, developed Frameworks on Context Analysis and Monitoring and Evaluation of technology projects to try and tackle the challenge at that micro level.

But we also have little aggregated data about the macro trends and challenges of our growing sector. That’s led the Digital Impact Alliance (DIAL) to conduct an entirely new kind of data-gathering exercise, and one that would add real quantitative data to what we know about what it’s like to implement projects and develop platforms.

Please help us gather new insights from more voices

Please take our survey on the reality of delivering services to vulnerable populations in emerging markets using digital tools. We’re looking for experiences from all of DIAL’s major stakeholder groups:

  • NGO leaders from the project site to the boardroom;
  • Technology experts;
  • Platform providers and mobile network operators;
  • Governments and donors.

We're complementing this survey with findings from in-depth interviews with 50 people from across those groups.

Please forward this survey!

We want to hear from those whose voices aren’t usually heard by global consultation and research processes. We know that the most innovative work in our space happens in projects and collaborations in the Global South – closest to the underserved communities who are our highest priority.

Please forward this survey so we can hear from those innovators: from the NGOs, government ministries, service providers and field offices who are doing the important work of delivering digitally enabled services to communities, every day.

It’s particularly important that we hear from colleagues in government, who may be supporting digital development projects in ways far removed from the usual digital development conversation.

Why should I take and share the survey?

We'll use the data to help measure the impact of what we do – this will be a baseline for indicators of interest to DIAL. But it will also provide an opportunity for you to help us build a unique snapshot of the challenges and opportunities you face in your work, whether in funding, designing, or delivering these services.

You’ll be answering questions we don’t believe are asked enough – about your partnerships, about how you cover your costs, and about the technical choices you’re making, specific to the work you do – whether you’re a businessperson, NGO worker, technologist, donor, or government employee.

How do I participate?

Please take the survey here. It will take 15-20 minutes to complete, and you’ll be answering questions, among others, about how you design and procure digital projects; how easy and how cost-effective they are to undertake; and what you see as key barriers. Your response can be anonymous.

To thank you for your time, if you leave us your email, we’ll share our findings with you and invite you into the conversation about the results. We’ll also be sharing our summary findings with the community.

We hope you’ll help us – and share this link with others.

Please help us get the word out about our survey, and help us gather more and better data about how our ecosystem really works.

Data quality in the age of lean data

by Daniel Ramirez-Raftree, MERL Tech support team.

Evolving data collection methods call for evolving quality assurance methods. In their session titled Data Quality in the Age of Lean Data, Sam Schueth of Intermedia, Woubedle Alemayehu of Oxford Policy Management, Julie Peachey of the Progress out of Poverty Index, and Christina Villella of MEASURE Evaluation discussed problems, solutions, and ethics related to digital data collection methods. [Bios and background materials here]

Sam opened the conversation by comparing the quality assurance and control challenges in paper-assisted personal interviewing (PAPI) to those in digital-assisted personal interviewing (DAPI). Across both methods, the fundamental problem is that the delivered data is a black box: it comes in, it's turned into numbers, and it's disseminated, but in this process alone there is no easily apparent information about what actually happened on the ground.

During the age of PAPI, this was dealt with by sending independent quality control teams to the field to review the paper questionnaire that was administered and perform spot checks by visiting random homes to validate data accuracy. Under DAPI, the quality control process becomes remote. Survey administrators can now schedule survey sessions to be recorded automatically and without the interviewer’s knowledge, thus effectively gathering a random sample of interviews that can give them a sense of how well the sessions were conducted. Additionally, it is now possible to use GPS to track the interviewers’ movements and verify the range of households visited. The key point here is that with some creativity, new technological capacities can be used to ensure higher data quality.
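As an illustration of what this remote quality control can look like in practice, here is a minimal Python sketch. The field names, thresholds, and session structure are hypothetical and are not tied to any particular survey platform: sessions are randomly flagged for recording, and GPS fixes are checked against the assigned survey cluster.

```python
import random

def sample_sessions_for_recording(session_ids, fraction=0.1, seed=7):
    """Randomly flag a fraction of interview sessions for automatic audio
    recording, without telling the interviewer which ones in advance."""
    rng = random.Random(seed)
    k = max(1, round(len(session_ids) * fraction))
    return set(rng.sample(session_ids, k))

def gps_within_cluster(session, cluster_lat, cluster_lon, tolerance_deg=0.01):
    """Rough check that a session's GPS fix falls near its assigned cluster
    (tolerance in degrees; 0.01 degrees is roughly 1 km at the equator)."""
    return (abs(session["lat"] - cluster_lat) <= tolerance_deg
            and abs(session["lon"] - cluster_lon) <= tolerance_deg)

to_record = sample_sessions_for_recording(list(range(1, 101)), fraction=0.05)
print(to_record)
print(gps_within_cluster({"lat": -1.292, "lon": 36.821}, -1.29, 36.82))
```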

Woubedle presented next and elaborated on the theme of quality control for DAPI. She made the point that data quality checks can be automated, but that this requires pre-implementation decisions about which indicators to monitor and how to manage the data. The amount of work put into programming this upfront design has a direct bearing on the ultimate data quality.

One useful tool is a progress indicator. Here, one collects information on trends such as the number of surveys attempted compared to those completed. Processing this data could lead to further questions about whether there is a pattern in the populations that did or did not complete the survey, thus alerting researchers to potential bias. Additionally, one can calculate the average time taken to complete a survey and use it to identify outliers that took too little or too much time to finish. Another good practice is to embed consistency checks in the survey itself; for example, making certain questions required, or including two questions that, if answered in a particular way, would be logically contradictory, thus signaling a problem in either the question design or the survey responses. One more practice is to apply constraints to survey responses, depending on the households one is working with.
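As a concrete illustration, here is a minimal Python sketch of the kinds of automated checks described above. The record structure, field names, and thresholds are hypothetical, not drawn from any specific survey tool.

```python
from statistics import mean, stdev

def completion_rate(records):
    """Share of attempted surveys that were completed."""
    return sum(1 for r in records if r["completed"]) / len(records)

def duration_outliers(records, z_threshold=2.0):
    """Flag completed surveys that were unusually fast or slow to finish."""
    durations = [r["duration_minutes"] for r in records if r["completed"]]
    mu, sigma = mean(durations), stdev(durations)
    return [r for r in records
            if r["completed"] and abs(r["duration_minutes"] - mu) > z_threshold * sigma]

def inconsistent(record):
    """Example consistency check: 'no children' but children reported in school."""
    return record.get("num_children") == 0 and record.get("children_in_school", 0) > 0
```

Checks like these are cheap to run routinely against incoming data, which is exactly the kind of upfront design decision Woubedle highlighted.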

After this discussion, Julie spoke about research that was done to assess the quality of different methods for measuring the Progress out of Poverty Index (PPI). She began by explaining that the PPI is a household-level poverty measurement tool unique to each country. To create it, the answers to 10 questions about a household's characteristics and asset ownership are scored to compute the likelihood that the household is living below the poverty line. It is a simple, yet effective method to evaluate household-level poverty. The research project Julie described set out to determine if the process of collecting data to create the PPI could be made less expensive by using SMS, IVR or phone calls.
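For readers unfamiliar with the mechanics, here is a minimal Python sketch of how a PPI-style score is turned into a poverty likelihood. The questions, point values, and likelihood bands below are invented for illustration; real scorecards and lookup tables are country-specific.

```python
# Illustrative only: not a real country scorecard or likelihood table.
SCORECARD = {
    "roof_material": {"thatch": 0, "iron_sheet": 5, "tile": 9},
    "owns_television": {"no": 0, "yes": 7},
    # ...a real PPI scorecard has 10 questions in total
}

# Total score bands mapped to the likelihood of living below the poverty line.
LIKELIHOOD_TABLE = {range(0, 20): 0.80, range(20, 50): 0.45, range(50, 101): 0.10}

def poverty_likelihood(answers):
    score = sum(SCORECARD[question][answer] for question, answer in answers.items())
    for band, likelihood in LIKELIHOOD_TABLE.items():
        if score in band:
            return likelihood

print(poverty_likelihood({"roof_material": "iron_sheet", "owns_television": "yes"}))
# 0.8 with these invented values (score = 12)
```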

Grameen Foundation conducted the study and tested four survey methods for gathering data: 1) in-person and at home, 2) in-person and away from home, 3) in-person and over the phone, and 4) automated and over the phone. Further, it randomized key aspects of the study, including the interview method and the enumerator.

Ultimately, Grameen Foundation determined that the interview method does affect completion rates, responses to questions, and the resulting estimated poverty rates. However, the differences in estimated poverty rates were likely not due to the method itself, but rather to completion rates (which were affected by the method). Thus, as long as completion rates don't differ significantly, neither will the results. Given that the in-person at home and in-person away from home surveys had similar completion rates (84% and 91% respectively), either could feasibly be used with little deviation in output. On the other hand, in-person over the phone surveys had a 60% completion rate and automated over the phone surveys had a 12% completion rate, making both methods fairly problematic. With this understanding, developers of the PPI have an evidence-based sense of the quality of their data.

This case study illustrates the possibility of testing data quality before any changes are made to collection methods, which is a powerful strategy for minimizing the use of low-quality data.

Christina closed the session with a presentation on ethics in data collection. She spoke about digital health data ethics in particular, which is the intersection of public health ethics, clinical ethics, and information systems security. She grounded her discussion in MEASURE Evaluation’s experience thinking through ethical problems, which include: the vulnerability of devices where data is collected and stored, the privacy and confidentiality of the data on these devices, the effect of interoperability on privacy, data loss if the device is damaged, and the possibility of wastefully collecting unnecessary data.

To explore these issues, MEASURE conducted a landscape assessment in Kenya and Tanzania and analyzed peer reviewed research to identify key themes for ethics. Five themes emerged: 1) legal frameworks and the need for laws, 2) institutional structures to oversee implementation and enforcement, 3) information systems security knowledge (especially for countries that may not have the expertise), 4) knowledge of the context and users (are clients comfortable with their data being used?), and 5) incorporating tools and standard operating procedures.

Based on this framework, MEASURE has made progress towards rolling out tools that can help institute a stronger ethics infrastructure. They've been developing guidelines that countries can use to develop policies, building health informatics capacity through a university course, and working with countries to strengthen their health information systems governance structures.

Finally, Christina explained her take on how ethics are related to data quality. In her view, it comes down to trust. If a device is lost, this may lead to incomplete data. If the clients are mistrustful, this could lead to inaccurate data. If a health worker is unable to check or clean data, this could create a lack of confidence. Each of these risks can lead to the erosion of data integrity.

Register for MERL Tech London, March 19-20th 2018! Session ideas due November 10th.

Discrete choice experiment (DCE) to generate weights for a multidimensional index

In his MERL Tech Lightning Talk, Simone Lombardini, Global Impact Evaluation Adviser, Oxfam, discussed his experience with an innovative method for applying tech to help determine appropriate metrics for measuring concepts that escape easy definition. To frame his talk, he referenced Oxfam’s recent experience with using discrete choice experiments (DCE) to establish a strategy for measuring women’s empowerment.

Two methods already exist, Simone pointed out, for transforming soft concepts into hard metrics. First, the evaluator could assume full authority and responsibility over defining the metrics. Alternatively, the evaluator could design the evaluation so that relevant stakeholders are incorporated into the process and use their input to help define the metrics.

Though both methods are common, they are missing (for practical reasons) the level of mass input that could make them truly accurate reflections of the social perception of whatever concept is being considered. Tech has a role to play in scaling the quantity of input that can be collected. If used correctly, this could lead to better evaluation metrics.

Simone described this approach as “context-specific” and “multi-dimensional.” The process starts by defining the relevant characteristics (such as those found in empowered women) in their social context, then translating these characteristics into indicators, and finally combining indicators into one empowerment index for evaluating the project.

After the characteristics are defined, a discrete choice experiment can be used to determine the "weight" of each one in a particular social context. A discrete choice experiment (DCE) is a technique that has frequently been used in health economics and marketing, but not much in impact evaluation. To implement a DCE, researchers present different hypothetical scenarios to respondents and ask them to decide which one they consider best reflects the concept in question (i.e. women's empowerment). The responses are used to assess the indicators covered by the DCE, and these can then be used to develop an empowerment index.
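To illustrate the idea, here is a minimal Python sketch. This is not Oxfam's actual analysis, which would typically estimate weights with a conditional logit or similar choice model; instead, weights are crudely approximated from how often scenarios containing each indicator were chosen, then combined into a simple weighted index. All indicator names and choice data are hypothetical.

```python
from collections import Counter

def estimate_weights(choice_tasks):
    """choice_tasks: list of (chosen_scenario, rejected_scenario) pairs, where
    each scenario is the set of indicators present in that hypothetical profile.
    Weights are the normalised share of times each indicator appeared in the
    chosen scenario (a crude stand-in for a proper choice model)."""
    chosen, shown = Counter(), Counter()
    for winner, loser in choice_tasks:
        for indicator in winner | loser:
            shown[indicator] += 1
        for indicator in winner:
            chosen[indicator] += 1
    raw = {i: chosen[i] / shown[i] for i in shown}
    total = sum(raw.values())
    return {i: r / total for i, r in raw.items()}

def empowerment_index(respondent_indicators, weights):
    """Weighted share of indicators met by one respondent (0 to 1)."""
    return sum(w for i, w in weights.items() if i in respondent_indicators)

tasks = [({"owns_assets", "mobility"}, {"mobility"}),
         ({"decision_making"}, {"owns_assets"})]
weights = estimate_weights(tasks)
print(empowerment_index({"owns_assets", "decision_making"}, weights))  # ~0.6
```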

This process was integrated into the data collection process, adding about 10 minutes at the end of a one-hour survey, and was made practicable by the ubiquity of smartphones. The results from Oxfam's trial run using this method are still being analyzed. For more on this, watch Lombardini's video below!