MERL Tech News

Mobile survey completion rates: Correlation versus causation

by Kim Rodgers, Software Engineer at Echo Mobile. Original post appeared on Medium.

Introduction

We hear the terms “correlation” and “causation” a lot, but what do they actually mean?

Correlation: describes how two variables change in relation to each other. When one variable increases, the other may increase, decrease or remain the same. For example, when it rains more, people tend to buy more umbrellas.

Causation: implies that one variable causes another variable to change. For example, we can confidently conclude that more rain causes more people to acquire umbrellas.

In this post, I will explore the meaning of these terms and explain a way of deciding how they relate, using a real-world example.

Survey completion rate correlations

Echo Mobile helps organizations in Africa engage, influence, and understand their target audience via mobile channels. Our core product is a web-based SaaS platform that, among many other things, enables users to design, send and analyze the results of mobile surveys. Our users can deploy their surveys via SMS (Short Messaging Service), USSD (Unstructured Supplementary Service Data), IVR (Interactive Voice Response), and Android apps, but SMS is the most heavily used channel.

Surveys are key to our overall mission, as they give our users a tool to better understand their target audiences — usually their customers or beneficiaries. To optimize the effectiveness of this tool, one thing that we really wanted to do was identify key factors that lead to more people completing surveys sent by our users from the Echo platform. This would enable us to advise our users on how to get more value from our platform through better engagement and understanding of their audiences.

The completion rate of a survey is the percentage of people who complete a survey after being invited to take part in it. We came up with different factors that we thought could affect the completion rate of surveys:

  • post_incentive: The incentive (a small amount of money or airtime) offered after completing the survey
  • invite_day_of_month: The date of the month a respondent was invited to the survey
  • invite_day_of_the_week: The day of the week a respondent was asked to take part in the survey
  • invite_hour: The hour of the day the respondent was invited to the survey
  • num_questions: The number of questions in the survey
  • reminded: whether the respondent was reminded to complete the survey or not
  • channel: The manner in which the survey was done. These were either by use of SMS, USSD, IVR, web, or Android app. SMS is the most popular channel and accounts for over 90% of surveys
  • completion_rate: Of those invited to a survey, the percentage that completed

We used the performance of surveys deployed from the beginning of 2017 to August 2017 to look for correlations between the factors above. The correlations are shown in the table below. Since our main interest was how the completion rate relates to the other factors, I will focus on those relationships.

The bigger the correlation's magnitude, the stronger the relationship. A positive correlation indicates that when one factor increases, the other tends to increase as well. For a negative correlation, the relationship is inverse: when one increases, the other tends to decrease.

Correlations between different survey factors. completion_rate has the strongest correlation with invite_hour

The rows of the table are arranged in descending order of each factor's correlation with completion_rate. Looking at the table, invite_hour, with a positive correlation of 0.25, is the factor most strongly correlated with completion_rate. It is followed by reminded, while invite_day_of_month is the most negatively correlated with completion_rate. The correlation between any other pair of factors can also be read from the table; for example, the correlation between num_questions and reminded is 0.05.
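
To make the computation concrete, here is a minimal sketch of how such a correlation table can be produced with pandas. The column names mirror the factors listed above, but the values are invented for illustration; the Echo Mobile data is not reproduced here.

    import pandas as pd

    # Hypothetical survey-level data; the columns mirror the factors above,
    # but the numbers are invented purely for illustration.
    surveys = pd.DataFrame({
        "post_incentive":  [0, 50, 20, 0, 100, 30],
        "invite_hour":     [9, 14, 11, 20, 8, 16],
        "num_questions":   [5, 12, 8, 20, 6, 10],
        "reminded":        [1, 0, 1, 0, 1, 1],
        "completion_rate": [62.0, 48.5, 55.0, 30.0, 70.0, 58.0],
    })

    # Pairwise correlations between all factors (Pearson by default).
    corr = surveys.corr()

    # Order the other factors by their correlation with completion_rate,
    # mirroring the row ordering described for the table in the post.
    print(corr["completion_rate"].drop("completion_rate").sort_values(ascending=False))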

Survey completion causations?

The findings above can lead to incorrect conclusions if one is not careful. For example, one might conclude that invite_hour, with a correlation of 0.25, has the strongest causal influence on the completion_rate of a survey. As a result, you might start trying to find the right time to send out surveys in the hope of getting more of them completed, and eventually conclude that some particular hour is the optimal time to send a survey. But that would be to hold to the (incorrect) idea that correlation implies causation.

A high correlation may mean that one factor causes the other, that the two factors mutually influence each other, that both are caused by a separate third factor, or even that the correlation is the result of coincidence.

We can, therefore, see that correlation does not always imply causation. With careful investigation, however, it is possible to more confidently conclude whether correlation implies that one variable causes the other.

How can we verify if correlation might imply causation?

1. Use statistically sound techniques to determine the relationship.

Ensure that you use statistically legitimate methods to find the correlation. These include:

  • use variables that correctly quantify the relationship.
  • check for outliers and handle them appropriately.
  • ensure the sample is an appropriate representation of the population.
  • use a correlation coefficient appropriate to the scales of the variables involved (see the sketch after this list).
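
As a small illustration of the last point, the sketch below uses scipy with invented numbers. Pearson assumes roughly linear, interval-scaled data, while Spearman only assumes a monotonic relationship on ranks, which is safer for ordinal or skewed measures.

    from scipy import stats

    # Hypothetical paired observations for two survey factors.
    invite_hour = [8, 9, 11, 14, 20, 7, 13]
    completion_rate = [70.0, 62.0, 55.0, 48.5, 30.0, 66.0, 50.0]

    # Pearson: linear association on interval-scaled data.
    pearson_r, pearson_p = stats.pearsonr(invite_hour, completion_rate)
    # Spearman: monotonic association on ranks.
    spearman_rho, spearman_p = stats.spearmanr(invite_hour, completion_rate)

    print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.3f})")
    print(f"Spearman rho = {spearman_rho:.2f} (p = {spearman_p:.3f})")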

2. Explain the relationships found

  • check that exposure precedes the outcome: if A is supposed to cause B, A should always occur before B.
  • check if the relationship ties in with other existing theories.
  • check if the proposed relationship is similar to other relationships in related fields.
  • check that no other relationship can explain the one observed. In the classic example of people who sleep with their shoes on waking up with headaches, the better explanation for the headaches is drinking the night before, not the shoes.

3. Validate the relationships

  • The relationships identified in steps 1 and 2 above should be tested to determine whether they hold. The common methods of testing are experiments and checking for consistency of the relationship. An experiment usually requires a model of the relationship, a testable hypothesis based on the model, incorporation of variance control measures, collection of suitable metrics for the relationship, and an appropriate analysis. Experiments repeated several times should lead to consistent conclusions.

We have not yet carried out these tests on our completion rate correlations. So we don’t yet know, for example, whether particular invite hours cause higher completion rates — only whether they are correlated.

Conclusion

We need to be careful before concluding that a particular relationship implies causation. It is generally better not to have a conclusion than to land on an incorrect one which might lead to wrong actions being taken!


The original version of this post was written by Kim Rodgers. Kim works at Echo Mobile as a Software Engineer, is interested in data science, and enjoys writing.

Integrating MERL with program design is good program management

by Yaquta Fatehi, Program Manager of Performance Measurement at the William Davidson Institute at the University of Michigan; and Heather Esper, Senior Program Manager of Performance Measurement at the William Davidson Institute at the University of Michigan.

At MERL Tech DC 2018, we — Yaquta Fatehi and Heather Esper — led a session titled “Integrating MERL with program design: Presenting an approach to balance your MERL strategy with four principles.” The session focused on our experience of implementing this approach.

The challenge: There are a number of pressing tensions and challenges in development programs related to MERL implementation. These include project teams and MERL teams working in silos and, just as importantly, leadership's lack of understanding of and commitment to MERL (as leadership often views MERL only in terms of accountability). And while solutions have been developed to address some of these challenges, our consortium, the Balanced Design, Monitoring, Evaluation, Research, and Learning (BalanceD-MERL) consortium (under the U.S. Agency for International Development's (USAID's) MERLIN program), saw that there was still a strong need to integrate MERL into program design for good program management and adaptive management. We chose four principles – relevant, right-sized, responsible, and trustworthy – to guide this approach and enable sustainable integration of MERL with program design and adaptive management. Definitions of the principles can be found here.

How to integrate program design and MERL (a case example): Our consortium aimed to identify the benefits of such integration and of applying these principles in the Women + Water Global Development Alliance program. The Alliance is a five-year public-private partnership between USAID, Gap, Inc., and four other non-profit sector partners. It draws upon these organizations' complementary strengths to improve and sustain the health and well-being of women and communities touched by the apparel industry in India. Gap, Inc. had not partnered with USAID before and had limited experience with MERL on a complex program such as this one, which consisted of multiple individual activities or projects implemented by multiple partners. The BalanceD-MERL consortium's services were requested during the program design stage to develop a rigorous, program-wide, high-level MERL strategy. We proposed co-developing the MERL activities with the Women + Water partners as listed in the MERL Strategy Template (see Table 1 in the case study shared below), which was developed by our consortium partner, the Institute for Development Impact.

Our first step was to co-design the program's theory of change with the Women + Water partners to establish a shared understanding of the problem and of how the program would address it. We used the theory of change as a communication asset that helped build a shared understanding of the solution among partners, and we found that the process also identified gaps in the program design that could then be addressed, in turn making the program design stronger. Grounded by the theory of change in order to be relevant and trustworthy, we co-developed a risk matrix, which was one of the most useful exercises for Gap, Inc. because it helped them scrutinize their assumptions and identify risks that needed frequent monitoring. Following this, we co-identified the key performance indicators and associated metadata using the Performance Indicator Reference Sheets format. This exercise, done iteratively with all partners, helped them understand the tradeoffs between the trustworthy and right-sized principles; helped ensure the feasibility of data collection and that indicators were right-sized and relevant; verified that methods were responsible and did not place unnecessary burden on key stakeholders; and confirmed that data was trustworthy enough to provide insights on the activity's progress and changing context.

In order to integrate MERL with the program design, we closely co-created these key components with the partners. We also co-developed questions for a learning agenda and recommended adaptive management tasks such as quarterly pause and reflect sessions so that leadership and program managers could make necessary adaptations to the program based on performance data. The consortium was also tasked with developing the performance management information system.

Findings: Through this experience, we found that the theory of change can serve as a key tool for integrating MERL with program design, and that it can form the foundation on which to build the remaining MERL activities. We also confirmed that MERL can be compromised by an immature program design informed by an incomplete needs assessment. For all key takeaways from applying the approach and principles, as well as action items for program and MERL practitioners and key questions for leadership, please see the following case study.

All in all, it was an engaging session and we heard good questions and comments from our audience. To learn more, or if you have any questions about the approach, feel free to email us at wdi-performancemeasurement@umich.edu.

This publication was produced by the William Davidson Institute at the University of Michigan (WDI) in collaboration with World Vision (WV) under the BalanceD-MERL Program, Cooperative Agreement Number AID-OAA-A-15-00061, funded by the U.S. Agency for International Development (USAID). It is made possible by the generous support of the American people through USAID. The contents are the responsibility of the William Davidson Institute and World Vision and do not necessarily reflect the views of USAID or the United States Government.

Using Social Network Analysis and Feedback to Measure Systems Change

by Alexis Smart, Senior Technical Officer, and Alexis Banks, Technical Officer, at Root Change

As part of their session at MERL Tech DC 2018, Root Change launched Pando, an online platform that makes it possible to visualize, learn from, and engage with the systems where you work. Pando harnesses the power of network maps and feedback surveys to help organizations strengthen systems and improve their impact.

Decades of experience in the field of international development has taught our team that trust and relationships are at the heart of social change. Our research shows that achieving and sustaining development outcomes depends on the contributions of multiple actors embedded in thick webs of social relationships and interactions. However, traditional MERL approaches have failed to help us understand the complex dynamics within those relationships. Pando was created to enable organizations to measure trust, relationships, and accountability between development actors.

Relationship Management & Network Maps

Grounded in social network analysis, Pando uses web-based relationship surveys to identify diverse organizations within a system and track relationships in real time. The platform automatically generates a network map that visualizes the organizations and relationships within a system. Data filters and analysis tools help uncover key actors, areas of collaboration, and network structures and dynamics.
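
Pando's own implementation is not described in detail here, but the basic social network analysis idea can be sketched with the open-source networkx library. The organizations and relationships below are invented for illustration, and the two centrality measures are just common examples of the structural metrics used to surface key actors in a mapped system.

    import networkx as nx

    # Hypothetical relationship-survey responses: each tuple is
    # (reporting organization, named partner, type of collaboration).
    responses = [
        ("Org A", "Org B", "funding"),
        ("Org A", "Org C", "technical assistance"),
        ("Org B", "Org C", "data sharing"),
        ("Org D", "Org A", "funding"),
        ("Org E", "Org B", "advocacy"),
    ]

    # Build a directed graph of the reported relationships.
    g = nx.DiGraph()
    for source, target, relationship in responses:
        g.add_edge(source, target, relationship=relationship)

    # Simple structural measures help surface central actors in the system.
    in_degree = nx.in_degree_centrality(g)        # how often an actor is named by others
    betweenness = nx.betweenness_centrality(g)    # how often an actor bridges other actors

    for org in sorted(g.nodes, key=lambda o: in_degree[o], reverse=True):
        print(f"{org}: in-degree {in_degree[org]:.2f}, betweenness {betweenness[org]:.2f}")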

Feedback Surveys & Analysis

Pando is integrated with Keystone Accountability’s Feedback Commons, an online tool that gives map administrators the ability to collect and analyze feedback about levels of trust and relationship quality among map participants. The combined power of network maps and feedback surveys helps create a holistic understanding of the system of organizations that impact a social issue, facilitate dialogue, and track change over time as actors work together to strengthen the system.

Examples of Systems Analysis

During Root Change’s session, “Measuring Complexity: A Real-Time Systems Analysis Tool,” Root Change Co-Founder Evan Bloom and Senior Technical Officer Alexis Smart highlighted four examples from our work of using network analysis to create social change:

  • Evaluating Local Humanitarian Response Systems: We worked with the Harvard Humanitarian Institute (HHI) to evaluate the effect of local capacity development efforts on local ownership within humanitarian response networks in the Philippines, Kenya, Myanmar, and Ethiopia. Using social network analysis, Root Change and HHI assessed the roles of local and international organizations within each network to determine the degree to which each system was locally led.
  • Supporting Collective Impact in Nigeria: Network mapping has also been used in the USAID-funded Strengthening Advocacy and Civic Engagement (SACE) project in Nigeria. Over five years, more than 1,300 organizations and 2,000 relationships across 17 advocacy issue areas were identified and tracked. Nigerian organizations used the map to form meaningful partnerships, set common agendas, coordinate strategies, and hold the government accountable.
  • Informing Project Design in Kenya: Root Change and the Aga Khan Foundation (AKF) collected relationship data from hundreds of youth and organizations supporting youth opportunities in coastal Kenya. Analysis revealed gaps in expertise within the system, and opportunities to improve relationships among organizations and youth. These insights helped inform AKF’s program design, and ongoing mapping will be used to monitor system change.
  • Tracking Local Ownership: This year, under USAID Local Works, Root Change is working with USAID missions to measure local ownership of development initiatives using newly designed localization metrics on Pando. USAID Bosnia and Herzegovina (BiH) launched a national Local Works map, identifying over 1,000 organizations working together on community development. Root Change and USAID BiH are exploring a pilot to use this map to continue collecting data, track localization metrics, and train a local organization to support this process.

Join the MERL Tech DC Network Map

As part of the MERL Tech DC 2018 conference, Root Change launched a map of the MERL Tech community. Event participants were invited to join this collaborative mapping effort to identify and visualize the relationships between organizations working to design, fund, and implement technology that supports monitoring, evaluation, research, and learning (MERL) efforts in development.

It’s not too late to join! Email info@mypando.org for an invitation to join the MERL Tech DC map and a chance to explore Pando.

Learn more about Pando

Pando is the culmination of more than a decade of experience providing training and coaching on the use of social network analysis and feedback surveys to design, monitor, and evaluate systems change initiatives. Initial feedback from international and local NGOs, governments, community-based organizations, and more is promising. But don’t take our word for it. We want to hear from you about ways that Pando could be useful in your social impact work. Contact us to discuss ways Pando could be applied in your programs.

Blockchain for International Development: Using a Learning Agenda to Address Knowledge Gaps

Guest post by John Burg, Christine Murphy, and Jean Paul Pétraud, international development professionals who presented a one-hour session at the  MERL Tech DC 2018 conference on Sept. 7, 2018. Their presentation focused on the topic of creating a learning agenda to help MERL practitioners gauge the value of blockchain technology for development programming. Opinions and work expressed here are their own.

We attended the MERL Tech DC 2018 conference on Sept. 7, 2018, where we led a session on creating a learning agenda to help MERL practitioners gauge the value of blockchain technology for development programming.

As a trio of monitoring, evaluation, research, and learning, (MERL) practitioners in international development, we are keenly aware of the quickly growing interest in blockchain technology. Blockchain is a type of distributed database that creates a nearly unalterable record of cryptographically secure peer-to-peer transactions without a central, trusted administrator. While it was originally designed for digital financial transactions, it is also being applied to a wide variety of interventions, including land registries, humanitarian aid disbursement in refugee camps, and evidence-driven education subsidies. International development actors, including government agencies, multilateral organizations, and think tanks, are looking at blockchain to improve effectiveness or efficiency in their work.

Naturally, as MERL practitioners, we wanted to learn more. Could this radically transparent, shared database, managed by its users, have important benefits for data collection, management, and use? As MERL practice evolves to better suit adaptive management, what role might blockchain play? For example, one inherent feature of blockchain is the unbreakable and traceable linkage between blocks of data. How might such a feature improve the efficiency or effectiveness of data collection, management, and use? What are the advantages of blockchain over other, more commonly used technologies? To guide our learning, we started with an inquiry designed to help us determine if, and to what degree, the various features of blockchain add value to the practice of MERL. With our agenda established, we set out eagerly to find a blockchain case study to examine, with the goal of presenting our findings at the September 2018 MERL Tech DC conference.
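
As a minimal, hypothetical illustration of that linkage feature (a toy hash chain in Python, not any particular blockchain platform), each block below stores the hash of the previous block, so altering an earlier record is immediately detectable:

    import hashlib
    import json

    def block_hash(block: dict) -> str:
        """Deterministic SHA-256 hash of a block's contents."""
        return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

    # A toy chain: each block stores the hash of its predecessor,
    # so changing any earlier record changes every later hash.
    chain = []
    previous = "0" * 64  # placeholder hash for the first block
    for record in [{"amount": 10, "to": "clinic"}, {"amount": 5, "to": "school"}]:
        block = {"data": record, "prev_hash": previous}
        previous = block_hash(block)
        chain.append(block)

    def is_consistent(chain: list) -> bool:
        """Verify that each block still points at the hash of the block before it."""
        prev = "0" * 64
        for block in chain:
            if block["prev_hash"] != prev:
                return False
            prev = block_hash(block)
        return True

    print(is_consistent(chain))          # True
    chain[0]["data"]["amount"] = 1000    # tamper with an early record
    print(is_consistent(chain))          # False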

What we did

We documented 43 blockchain use-cases through internet searches, most of which were described with glowing claims like “operational costs… reduced up to 90%,” or with the assurance of “accurate and secure data capture and storage.” We found a proliferation of press releases, white papers, and persuasively written articles. However, we found no documentation or evidence of the results blockchain was purported to have achieved in these claims. We also did not find lessons learned or practical insights, as are available for other technologies in development.

We fared no better when we reached out directly to several blockchain firms, via email, phone, and in person. Not one was willing to share data on program results, MERL processes, or adaptive management for potential scale-up. Despite all the hype about how blockchain will bring unprecedented transparency to processes and operations in low-trust environments, the industry is itself opaque. From this, we determined that the lack of evidence supporting blockchain's value claims in the international development space is a critical gap for potential adopters.

What we learned

Blockchain firms supporting development pilots are not practicing what they preach — improving transparency — by sharing data and lessons learned about what is working, what isn’t working, and why. There are many generic decision trees and sales pitches available to convince development practitioners of the value blockchain will add to their work. But, there is a lack of detailed data about what happens when development interventions use blockchain technology.

Since the function of MERL is to bridge knowledge gaps and help decision-makers take action informed by evidence, we decided to explore the crucial questions MERL practitioners may ask before determining whether blockchain will add value to data collection, management, and use. More specifically, rather than a go/no-go decision tool, we propose using a learning agenda to probe the role of blockchain in data collection, data management, and data use at each stage of project implementation.

“Before you embark on that shiny blockchain project, you need to have a very clear idea of why you are using a blockchain.”

Gideon Greenspan, “Avoiding the Pointless Blockchain Project” (2015)

Typically, “A learning agenda is a set of questions, assembled by an organization or team, that identifies what needs to be learned before a project can be planned and implemented.” The process of developing and finding answers to learning questions is most useful when it’s employed continuously throughout the duration of project implementation, so that changes can be made based on what is learned about changes in the project’s context, and to support the process of applying evidence to decision-making in adaptive management.

We explored various learning agenda questions for data collection, management and use that should continue to be developed and answered throughout the project cycle. However, because the content of a learning agenda is highly context-dependent, we focused on general themes. Examples of questions that might be asked by beneficiaries, implementing partners, donors, and host-country governments, include:

  • What could each of a project’s stakeholder groups gain from the use of blockchain across the stages of design and implementation, and, would the benefits of blockchain incentivize them to participate?
  • Can blockchain resolve trust or transparency issues between disparate stakeholder groups, e.g. to ensure that data reported represent reality, or that they are of sufficient quality for decision-making?
  • Are there less expensive, more appropriate, or easier-to-execute existing technologies that already meet each group’s MERL needs?
  • Are there unaddressed MERL management needs blockchain could help address, or capabilities blockchain offers that might inspire new and innovative thinking about what is done, and how it gets done?

This approach resonated with other MERL for development practitioners

We presented this approach to a diverse group of professionals at MERL Tech DC, including other MERL practitioners and IT support professionals, representing organizations from multilateral development banks to US-based NGOs. Facilitated as a participatory roundtable, the session participants discussed how MERL professionals could use learning agendas to help their organizations both decide whether blockchain is appropriate for intervention design, as well as guide learning during implementation to strengthen adaptive management.

Questions and issues raised by the session participants ranged widely, from how blockchain works, to expressing doubt that organizational leaders would have the risk appetite required to pilot blockchain when time and costs (financial and human resource) were unknown. Session participants demonstrated an intense interest in this topic and our approach. Our session ran over time and side conversations continued into the corridors long after the session had ended.

Next Steps

Our approach, as it turns out, echoes others in the field who question whether the benefits of blockchain add value above and beyond existing technologies, or accrue to stakeholders beyond the donors that fund them. This trio of practitioners will continue to explore ways MERL professionals can help their teams learn about the benefits of blockchain technology for international development. But, in the end, it may turn out that the real value of blockchain wasn’t the application of the technology itself, but rather the impetus it provided to question what we do, why we do it, and how we could do it better.

Creative Commons License
Blockchain for International Development: Using a Learning Agenda to Address Knowledge Gaps by John Burg, Christine Murphy, and Jean-Paul Petraud is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Using Real-Time Data to Improve International Development Programming

by Erica Gendell, Program Analyst at USAID; and Rebecca Saxton-Fox, ICT Policy Advisor at USAID

Real-time data applications in international development

There are a wide range of applications of real-time data in international development programs, including:

  • Gathering demographic and assessment data following trainings, in order to improve outputs and outreach for future trainings;
  • Tracking migration flows following natural disasters to understand population locations and best locate relief efforts;
  • Analyzing real-time disease outbreak data to understand where medical resources will be most effectively deployed; and
  • Analyzing radio and social media to understand and adapt communication outreach.

Using digital tools (such as mobile phone based text messaging, web-based applications, social media platforms, etc.) or large digital datasets (such as satellite or cell phone tower data) for collecting real-time data helps programs and projects respond quickly to community needs or potentially changing circumstances on the ground. However, these digital tools and datasets are often not well understood or mapped into decision-making processes.

Real Example of Real-time Data

In USAID/Ghana’s ADVANCE II program, project staff implemented a smart card ID technology that collects and stores data in an effort to have more accurate monitoring and evaluation data on project beneficiaries. The ID cards allowed USAID and project officers to see real-time results and build more effective and targeted programming. ADVANCE II has been successful in providing unique beneficiary data for over 120,000 people who participated in 5,111 training sessions. This information enabled the project to increase the number of trainings tailored to female farmers, a previously underrepresented population in trainings. This is a great example of how to incorporate data use and digital tools into a project or activity.

Data to Action Framework

At MERL Tech DC, we presented the ADVANCE II project as a way to use the “Data to Action” Framework. This is one approach to map how information flows and how decisions are made across a set of stakeholders in a program. It can be used as a conversation tool to identify barriers to action. You can also use it to identify where digital tools could help move information to decision makers faster.

This framework is just one tool to start thinking about uses of real-time data to enable adaptive management in development programs.

USAID explores these and other topics in a newly released portfolio of research on Real-time Data for Adaptive Management (RTD4AM), which gives insight into the barriers to real-time data use in development. We look forward to continuing to build the community of practice around adaptive management within the MERL community.


We Wrote the Book on Evaluation Failures. Literally.

by Isaac D. Castillo, Director of Outcomes, Assessment, and Learning at Venture Philanthropy Partners.

Evaluators don’t make mistakes.

Or do they?

Well, actually, they do. In fact, I’ve got a number of fantastic failures under my belt that turned into important learning opportunities. So, when I was asked to share my experience at the MERL Tech DC 2018 session on failure, I jumped at the chance.

Part of the Problem

As someone of Mexican descent, I am keenly aware of the problems that can arise when culturally and linguistically inappropriate evaluation practices are used. However, as a young evaluator, I was often part of the problem.

Early in my evaluation career, I was tasked with collecting data to determine why teenage youth became involved in gangs. In addition to developing the interview guides, I was also responsible for leading all of the on-site interviews in cities with large Latinx populations. Since I am Latinx, I had a sufficient grasp of Spanish to prepare the interview guides and conduct the interviews. I felt confident that I would be sensitive to all of the cultural and linguistic challenges to ensure an effective data collection process. Unfortunately, I had forgotten an important tenet of effective culturally competent evaluation: cultures and languages are not monolithic. Differences in regional cultures or dialects can lead even experienced evaluators into embarrassment, scorn, or the worst outcome of all: inaccurate data.

Sentate, Por Favor

For example, when first interacting with the gang members, I introduced myself and, to start the interview, asked them to please sit down by saying “Siéntate, por favor.” What I did not know at the time is that a large portion of the gang members I was interviewing were born in El Salvador or were of Salvadoran descent, and the accurate way to say it in Salvadoran Spanish would have been “Sentate, por favor.”

Does one word make that much difference? In most cases it did not matter, but it caused several gang members to openly question my Spanish from the outset, which created an uncomfortable beginning to interviews about potentially sensitive subjects.

Amigo or Chero?

I next asked the gang members to think of their “friends.” In most dialects of Spanish, using amigos to ask about friends is accurate and proper. However, in the context of street slang, some gang members prefer the term chero, especially in informal contexts.

Again, was this a huge mistake? No. But it did lead to enough quizzical looks and requests for clarification that I started to doubt whether I was getting completely honest or accurate answers from some of the respondents. Unfortunately, this error did not surface until I had conducted nearly 30 interviews. I had not thought to test the wording of the questions in multiple Spanish-speaking communities across several states.

Would You Like a Concha?

Perhaps my most memorable mistake during this evaluation occurred after I had completed an interview with a gang leader outside of a bakery. After we were done, the gang leader called over the rest of his gang to meet me. As I was meeting everyone, I glanced inside the bakery and noticed a type of Mexican pastry that I enjoyed as a child. I asked the gang leader if he would like to go inside and join me for a concha, a round pastry that looks like a shell. Everyone (except me) began to laugh hysterically. The gang leader then let me in on the joke. He understood that I was asking about the pan dulce (sweet bread), but he informed me that in his dialect, concha was used as a vulgar reference to female genitalia. This taught me a valuable lesson about how even casual references or language choices can be interpreted in many different ways.

What did I learn from this?

While I can look back on these mistakes and laugh, I am also reminded of the important lessons learned that I carry with me to this day.

  • Translate with the local context in mind. When translating materials or preparing for field work, get a detailed sense of who you will be collecting data from, including what cultures and subgroups people represent and whether or not there are specific topics or words that should be avoided.
  • Translate with the local population in mind. When developing data collection tools (in any language, even if you are fluent in it), take the time to pre-test the language in the tools.

  • Be okay with your inevitable mistakes. Recognize that no matter how much preparation you do, you will make mistakes in your data collection related to culture and language issues. Remember that it is how you respond in those situations that is most important.

As far as failures like this go, it turns out I’m in good company. My story is one of 22 candid, real-life examples from seasoned evaluators that are included in Kylie Hutchinson’s new book, Evaluation Failures: 22 Tales of Mistakes Made and Lessons Learned. Entertaining and informative, I guarantee it will give you plenty of opportunities to reflect and learn.

3 Lessons Learned using Machine Learning to Measure Media Quality

by Samhir Vasdev, Technical Adviser for Digital Development at IREX’s Center for Applied Learning and Impact. The post 3 Lessons Learned using Machine Learning to Measure Media Quality appeared first on ICTworks.

Moving from hype to practice is an important but challenging step for ICT4D practitioners. As the technical adviser for digital development at IREX, a global development and education organization, I’ve been watching with cautious optimism as international development stakeholders begin to explore how artificial intelligence tools like machine learning can help them address problems and introduce efficiencies to amplify their impact.

So while USAID was developing their guide to making machine learning work for international development and TechChange rolled out their new course on Artificial Intelligence for International Development, we spent a few months this summer exploring whether we could put machine learning to work to measure media quality.

Of course, we didn’t turn to machine learning just for the sake of contributing to the “breathless commentary of ML proponents” (as USAID aptly puts it).

As we shared in a session with our artificial intelligence partner Lore at MERL Tech DC 2018, some of our programs face a very real set of problems that could be alleviated through smarter use of digital tools.

Our Machine Learning Experiment

In our USAID-funded Media Strengthening Program in Mozambique, for example, a small team of human evaluators manually score thousands of news articles based on 18 measures of media quality.

This process is time consuming (some evaluators spend up to four hours a day reading and evaluating articles), inefficient (when staff turns over, we need to reinvest resources to train up new hires), and inconsistent (even well-trained evaluators might score articles differently).

To test whether we can make the process of measuring media quality less resource-intensive, we spent a few months training software to automatically detect one of these 18 measures of media quality: whether journalists keep their own opinions out of their news articles. The results of this experiment are very compelling:

  • The software had 95% accuracy in recognizing sentences containing opinions within the dataset of 1,200 articles.
  • The software’s ability to “learn” was evident. Anecdotally, the evaluation team noticed a marked improvement in the accuracy of the software’s suggestions after showing it only twenty sentences that had opinions. The accuracy, precision, and recall results highlighted above were achieved after only sixteen rounds of training the software.
  • Accuracy and precision increased the more that the model was trained. There is a clear relationship between the number of times the evaluators trained the software and the accuracy and precision of the results. The recall results did not improve over time as consistently.

These results, although promising, simplify some numbers and calculations. Check out our full report for details.
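
For readers who have not worked with this kind of tooling, the sketch below shows one common way to frame opinion detection as supervised text classification, using scikit-learn. It is not Lore's system, and the labelled sentences are invented; it simply illustrates the shape of the training data and the type of model involved.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical labelled sentences: 1 = contains the journalist's opinion, 0 = does not.
    sentences = [
        "The minister announced the budget on Tuesday.",
        "In my view, this policy is a disaster for farmers.",
        "Officials said the road will reopen next month.",
        "Frankly, the council has failed its citizens.",
        "The report lists 14 affected districts.",
        "It is obvious that the mayor deserves no credit.",
    ]
    labels = [0, 1, 0, 1, 0, 1]

    # TF-IDF features plus a linear classifier: a common baseline for sentence-level labelling.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(sentences, labels)

    print(model.predict([
        "Residents believe the clinic opened too late.",
        "Honestly, this project was doomed from the start.",
    ]))

In practice, as the results above suggest, far larger labelled datasets and iterative rounds of human review are what drive the accuracy upward.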

What does this all mean? Let’s start with the good news. The results suggest that some parts of media quality—specifically, whether an article is impartial or whether it echoes its author’s opinions—can be automatically measured by machine learning.

The software also introduces the possibility of unprecedented scale, scanning thousands of articles in seconds for this specific indicator. These implications introduce ways for media support programs to spend their limited resources more efficiently.

3 Lessons Learned from using Machine Learning

Of course, the machine learning experience was not without problems. With any cutting-edge technology, there will be lessons we can learn and share to improve everyone’s experience. Here are our three lessons learned working with machine learning:

1. Forget about being tech-literate; we need to be more problem-literate.

Defining a coherent, specific, actionable problem statement was one of the most important steps of this experiment. This wasn’t easy. Hard trade-offs had to be made (Which of the 18 indicators should we focus on?), and we had to focus on things we could measure in order to demonstrate efficiency gains using this new approach (How much time do evaluators currently spend scoring articles?).

When planning your own machine learning project, devote plenty of time at the outset—together with your technology partner—to define the specific problem you’ll try to address. These conversations result in a deeper shared understanding of both the sector and the technology that will make the experiment more successful.

2. Take the time to communicate results effectively.

Since completing the experiment, people have asked me to explain how “accurate” the software is. But in practice, machine learning software uses different methods to define “accuracy”, which in turn can vary according to the specific model (the software we used deploys several models).

What starts off as a simple question (How accurate is our software?) can easily turn into a discussion of related concepts like precision, recall, false positives, and false negatives. We found that producing clean visuals (like this or this) became the most effective way to explain our results.
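
To see why those terms can diverge, here is a tiny worked example with invented labels. When opinion sentences are rare, a model can score well on accuracy while still missing half of the opinions it is supposed to catch:

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # Hypothetical gold labels and model predictions for 12 sentences
    # (1 = opinion, 0 = no opinion).
    y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

    print("accuracy :", accuracy_score(y_true, y_pred))   # share of all sentences labelled correctly (0.75)
    print("precision:", precision_score(y_true, y_pred))  # of sentences flagged as opinion, how many really were (0.67)
    print("recall   :", recall_score(y_true, y_pred))     # of true opinion sentences, how many were caught (0.50)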

3. Start small and manage expectations.

Stakeholders with even a passing awareness of machine learning will be aware of its hype. Even now, some colleagues ask me how we “automated the entire media quality assessment process”—even though we only used machine learning to identify one of 18 indicators of media quality. To help mitigate inflated expectations, we invested a small amount into this “minimum viable product” (MVP) to prove the fundamental concept before expanding on it later.

Approaching your first machine learning project this way might help to keep expectations in line with reality, minimize risks associated with experimentation, and provide air cover for you to adjust your scope as you discover limitations or adjacent opportunities during the process.

How does GlobalGiving tell whether it’s having an impact?

by Nick Hamlin, Data Scientist at GlobalGiving. This post was originally published here on October 1, 2018, under the title “How Can We Tell if GlobalGiving Is Making an Impact?” The full study can be found here.

Our team wanted to evaluate our impact, so we applied a new framework to find answers.


What We Tested

Every social organization, GlobalGiving included, needs to know if it’s having an impact on the communities it serves. For us, that means understanding the ways in which we are (or aren’t!) helping our nonprofit partners around the world improve their own effectiveness and capacity to create change, regardless of the type of work they do.

Why It Matters

Without this knowledge, social organizations can’t make informed decisions about the strategies to use to deliver their services. Unfortunately, this kind of rigorous impact evaluation is usually quite expensive and can take years to carry out. As a result, most organizations struggle to evaluate their impact.

We knew the challenges going into our own impact research would be substantial, but it was too important for us not to try.

The Big Question

Do organizations with access to GlobalGiving’s services improve their performance differently than organizations that don’t? Are there particular focus areas where GlobalGiving is having more of an impact than others?

Our Method

Ideally, we’d randomly assign certain organizations to receive the “treatment” of being part of GlobalGiving and then compare their performance with another randomly assigned control group. But, we can’t just tell random organizations that they aren’t allowed to be part of our community. So, instead we compared a treatment group—organizations that have completed the GlobalGiving vetting process and become full partners on the website—with a control group of organizations that have successfully passed the vetting process but haven’t joined the web community. Since we can’t choose these groups randomly, we had to ensure the organizations in each group are as similar as possible so that our results aren’t biased by underlying differences between the control and treatment groups.

To do this, we worked only with organizations based in India. We chose India because we have lots of relationships with organizations there, and we needed as large a sample size as possible to increase confidence that our conclusions are reliable. India is also well-suited for this study because it requires organizations to have special permission to receive funds from overseas under the Foreign Contribution Regulation Act (FCRA). Organizations must have strong operations in place to earn this permission. The fact that all participant organizations are established enough to earn both an FCRA certification and pass GlobalGiving’s own high vetting standards means that any differences in our results are unlikely to be caused by geographic or quality differences.

We also needed a way to measure nonprofit performance in a concrete way. For this, we used the “Organizational Performance Index” (OPI) framework created by Pact. The OPI provides a structured way to understand a nonprofit’s capacity along eight different categories, including its ability to deliver programs, the diversity of its funding sources, and its use of community feedback. The OPI scores organizations on a scale of 1 (lowest) to 4 (highest). With the help of a fantastic team of volunteers in India, we gathered two years of OPI data from both the treatment and control groups, then compared how their scores changed over time to get an initial indicator of GlobalGiving’s impact.
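
The study's actual statistical methodology is documented in the full write-up linked below. Purely to illustrate the mechanics of this kind of comparison, the sketch below uses invented OPI scores, and the Mann-Whitney U test is an assumption made for the example rather than necessarily the test the study used.

    import pandas as pd
    from scipy.stats import mannwhitneyu

    # Hypothetical OPI "community leadership" scores (1-4) at baseline and follow-up.
    data = pd.DataFrame({
        "group":    ["treatment"] * 6 + ["control"] * 6,
        "baseline": [2, 2, 3, 1, 2, 3, 2, 2, 3, 1, 2, 3],
        "followup": [3, 3, 3, 2, 3, 4, 2, 1, 3, 1, 3, 2],
    })
    data["change"] = data["followup"] - data["baseline"]

    # Transition table of baseline vs. follow-up scores, similar in spirit
    # to the score-shift charts discussed below.
    print(pd.crosstab(data["baseline"], data["followup"]))

    # Compare score changes between groups (illustrative test choice only).
    treat = data.loc[data["group"] == "treatment", "change"]
    ctrl = data.loc[data["group"] == "control", "change"]
    stat, p = mannwhitneyu(treat, ctrl, alternative="two-sided")
    print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")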

The Results

The most notable result we found was that organizations that were part of GlobalGiving demonstrated significantly more participatory planning and decision-making processes (what we call “community leadership”), and improved their use of stakeholder feedback to inform their work, in comparison to control group organizations. We did not see a similar significant result in the other seven categories that the OPI tracks. The easiest way to see this result is to visualize how organizations’ scores shifted over time. The chart below shows differences in target population scores—Pact’s wording for “community leadership and feedback.”

Differences in Target Population Score Changes


For example, look at the organizations that started out with a score of two in the control group on the left. Roughly one third of those increased their score to three, one third stayed the same, and one third had their scores drop to one. In contrast, in the treatment group on the right, nearly half the organizations increased their scores and about half stayed the same, while only a tiny fraction dropped. You can see a similar pattern across the two groups regardless of their starting score.

In contrast, here’s the same diagram for another OPI category where we didn’t see a statistically significant difference between the two groups. There’s not nearly as clear a pattern—both the treatment and control organizations change their scores about the same amount.

Differences in Delivery Score Changes


For more technical details about our research design process, our statistical methodology, and the conclusions we’ve drawn, please check out the full write-up of this work, which is available on the Social Science Research Network.

The Ultimate Outcome

GlobalGiving spends lots of time focusing on helping organizations use feedback to become more community-led, because we believe that’s what delivers greater impact.

Our initial finding—that our emphasis on feedback is having a measurable impact—is an encouraging sign.

On the other hand, we didn’t see that GlobalGiving was driving significant changes in any of the other seven OPI categories. Some of these categories, like adherence to national or international standards, aren’t areas where GlobalGiving focuses much. Others, like how well an organization learns over time, are closely related to what we do (Listen, Act, Learn. Repeat. is one of our core values). We’ll need to continue to explore why we’re not seeing results in these areas and, if necessary, make adjustments to our programs accordingly.

Make It Yours

Putting together an impact study, even a smaller one like this, is a major undertaking for any organization. Many organizations talk about applying a more scientific approach to their impact, but few nonprofits or funders take on the challenge of carrying out the research needed to do so. This study demonstrates how organizations can make meaningful progress towards rigorously measuring impact, even without a decade of work and an eight-figure budget.

If your organization is considering something similar, here are a few suggestions to keep in mind that we’ve learned as a result of this project:

1. If you can’t randomize, make sure you consider possible biases.

    •  Logistics, processes, and ethics are all reasons why an organization might not be able to randomly assign treatment groups. If that’s the case for you, think carefully about the rest of your design and how you’ll reduce the chance that a result you see can be attributed to a different cause.

2. Choose a measurement framework that aligns with your theory of change and is as precise as possible.

    •  We used the OPI because it was easy to understand, reliable, and well-accepted in the development sector. But, the OPI’s four-level scale made it difficult to make precise distinctions between organizations, and there were some categories that didn’t make sense in the context of how GlobalGiving works. These are areas we’ll look to improve in future versions of this work.

3. Get on the record. 

    • Creating a clear record of your study, both inside and outside your organization, is critical for avoiding “scope creep.” We used Git to keep track of all changes in our data, code, and written analysis, and shared our initial study design at the 2017 American Evaluation Association conference.

4. Enlist outside help. 

    •  This study would not have been possible without lots of extra help, from our volunteer team in India, to our friends at Pact, to the economists and data scientists who checked our math, particularly Alex Hughes at UC Berkeley and Ted Dunmire at Booz Allen Hamilton.

We’re pleased with what we’ve learned about GlobalGiving’s impact, where we can improve, and how we might build on this initial work, and we can’t wait to continue building on this progress in service of improved outcomes for our nonprofit partners worldwide.

Find the full study here.

MERL and the 4th Industrial Revolution: Submit your AfrEA abstract now!

by Dhashni Naidoo, Genesis Analytics

Digitization is everywhere! Digital technologies and data have changed the way we engage with each other and how we work. We cannot escape the effects of digitization, whether in our personal capacity (how our own data is being used) or in our professional capacity (understanding how to use data and technology). These changes are exciting! But we also need to consider the challenges they present to the MERL community and their impact on development.

The advent and proliferation of big data has the potential to change how evaluations are conducted. New skills are needed to process and analyse big data. Mathematics, statistics and analytical skills will be ever more important. As evaluators, we need to be discerning about the data we use. In a world of copious amounts of data, we need to ensure we have the ability to select the right data to answer our evaluation questions.

We also have an ethical and moral duty to manage data responsibly. We need new strategies and tools to guide the ways in which we collect, store, use and report data. Evaluators need to improve our skills in processing and analysing data. Evaluative thinking in the digital age is evolving, and we need to consider the technical and soft skills required to maintain the integrity of the data and of its interpretation.

Though technology can make data collection faster and cheaper, two important considerations are access to technology by vulnerable groups and data integrity. Women, girls and people in rural areas normally do not have the same levels of access to technology as men and boys. This limits our ability to rely solely on technology to collect data from these population groups, because we need to be aware of inclusion, bias and representativeness. Equally, we need to consider how to maintain the quality of data being collected through new technologies such as mobile phones, and to understand how the use of new devices might change or alter how people respond.

In a rapidly changing world where technologies such as AI, Blockchain, Internet of Things, drones and machine learning are on the horizon, evaluators need to be robust and agile in how we change and adapt.

For this reason, a new strand has been introduced at the African Evaluation Association (AfrEA) conference, taking place from 11 to 15 March 2019 in Abidjan, Cote d’Ivoire. This strand, The Fourth Industrial Revolution and its Impact on Development: Implications for Evaluation, will focus on five sub-themes:

  • Guide to Industry 4.0 and Next Generation Tech
  • Talent and Skills in Industry 4.0
  • Changing World of Work
  • Evaluating youth programmes in Industry 4.0
  • MERLTech

Genesis Analytics will be curating this strand.  We are excited to invite experts working in digital development and practitioners at the forefront of technological innovation for development and evaluation to submit abstracts for this strand.

The deadline for abstract submissions is 16 November 2018. For more information please visit the AfrEA Conference site!

Does your MERL Tech effort need innovation or maintenance?

by Stacey Berlow, Managing Partner at Project Balance and Jana Melpolder, MERL Tech DC Volunteer and Communications Manager at Inveneo. Find Jana on Twitter:  @JanaMelpolder

At MERL Tech DC 2018, Project Balance’s Stacey Berlow led a session titled “Application Maintenance Isn’t Sexy, But Critical to Success.” In her session and presentation, she outlined several reasons why software maintenance planning and funding is essential to the sustainability of an M&E software solution.

The problems that arise with software or applications go well beyond day-to-day care and management. A foundational study on software maintenance by B. P. Lientz and E. B. Swanson [1] looked at the activities of 487 IT organizations and found that maintenance activities can be broken down into four types:

  • Corrective (bug fixing),
  • Adaptive (impacts due to changes outside the system),
  • Perfective (enhancements), and
  • Preventive (monitoring and optimization)

The table below outlines the percentage of time IT departments spend on the different types of maintenance. Note that most of the time dedicated to maintenance is not defect fixing (corrective), but enhancing (perfecting) the tool or system.

Maintenance Type    Share of Effort    Breakdown
Corrective          21.7%              Emergency fixes: 12.4%; routine debugging: 9.3%
Adaptive            23.6%              Changes to data inputs and files: 17.4%; changes to hardware and system software: 6.2%
Perfective          51.3%              Customer enhancements: 41.8%; improvements to documentation: 5.5%; optimization: 4.0%
Other               3.4%               Various: 3.4%

The study also pointed out some of the most common maintenance problems:

  • Poor quality application system documentation
  • Excessive demand from customers
  • Competing demands for maintenance personnel time
  • Inadequate training of user personnel
  • Turnover in the user organizations

Does Your Project Need Innovations or Just Maintenance?

Organizations often prioritize innovation over maintenance. They have a list of enhancements or improvements they want to make, and they’ll start new projects when what they should really be focusing on is maintenance. International development organizations often want to develop new software with the latest technology — they want NEW software for their projects. In reality, what is usually needed is maintenance and enhancement of an existing product.

Moreover, when an organization is considering adopting a new piece of software, it’s absolutely vital that it think about the cost of maintenance in addition to the cost of development. Experts estimate that the cost of maintenance can range from 40% to 90% of the original build cost [2]. Maintenance costs a lot more than many organizations realize.

It’s also not easy to know beforehand or to estimate what the actual cost of maintenance will be. Creating a Service Level Agreement (SLA), which specifies the time required to respond to issues or deploy enhancements as part of a maintenance contract, is vital to having a handle on the human resources, price levels and estimated costs of maintenance.

As Stacey emphasizes, “Open Source does not mean ‘free’. Updates to DHIS2 versions, Open MRS, Open HIE, Drupal, WordPress, and more WILL require maintenance to custom code.”

It’s All About the Teamwork

Another point to consider when it comes to the cost of maintaining your app or software is the time and money spent on staff. Members of your team will not always be well-versed in a given type of software. Also, when transferring a software asset to a funder or a ministry/government entity, consider the skill level of the receiving team as well as the time availability of its members. Many software products cannot be well maintained by teams that were not involved in developing them; as a result, they often fall into disrepair and become unusable. A software vendor may be better equipped to monitor and respond to issues than the receiving team.

What Can You Do?

So what are effective ways to ensure the sustainability of software tools? There are a few strategies you can use. First, ensure that your IT staff members are involved in the planning of your project or your organization’s RFP process. They will give you valuable estimates of effort and cost up front, so that you can secure funding. Second, scale down the size of your project so that your tool budget matches your funds. Consider the minimum software functionality you need, and enhance the tools later. Third, invite the right stakeholders and IT staff members to meetings and conference calls as soon as the project begins. Having the right people on board early will make a huge difference in how you manage and transition software to country stakeholders at the end of the project!

The session at MERL Tech ended with a discussion of the tremendous need for, and value of, involving local skills and IT experts as part of the programming team. Local knowledge and IT expertise are among the most important pieces, if not the most important piece, of the application maintenance puzzle. One of the key ideas I learned was that application maintenance should start at the local level and grow from there. Local IT personnel will be able to answer many technical questions and address many maintenance issues. Furthermore, IT staff members from international development agencies will be able to learn from local IT experts as well, boosting the capacity of all staff members across the board.

Application maintenance may not be the most interesting part of an international development project, but it is certainly one of the most vital to help ensure the project’s success and ongoing sustainability.

Check out this great Software Maintenance/Monitoring Checklist to ensure you’ve considered everything you need when planning your next MERL Tech (or other) effort!

[1] B. P. Lientz and E. B. Swanson, Software Maintenance Management: A Study of the Maintenance of Computer Application Software in 487 Data Processing Organizations, Addison-Wesley, 1980

[2] Jeff Hanby, Software Maintenance: Understanding and Estimating Costs, https://bit.ly/2Ob3iOn