
3 Lessons Learned using Machine Learning to Measure Media Quality

by Samhir Vasdev, Technical Adviser for Digital Development at IREX’s Center for Applied Learning and Impact. The post 3 Lessons Learned using Machine Learning to Measure Media Quality appeared first on ICTworks.

Moving from hype to practice is an important but challenging step for ICT4D practitioners. As the technical adviser for digital development at IREX, a global development and education organization, I’ve been watching with cautious optimism as international development stakeholders begin to explore how artificial intelligence tools like machine learning can help them address problems and introduce efficiencies to amplify their impact.

So while USAID was developing their guide to making machine learning work for international development and TechChange rolled out their new course on Artificial Intelligence for International Development, we spent a few months this summer exploring whether we could put machine learning to work to measure media quality.

Of course, we didn’t turn to machine learning just for the sake of contributing to the “breathless commentary of ML proponents” (as USAID aptly puts it).

As we shared in a session with our artificial intelligence partner Lore at MERLTech DC 2018, some of our programs face a very real set of problems that could be alleviated through smarter use of digital tools.

Our Machine Learning Experiment

In our USAID-funded Media Strengthening Program in Mozambique, for example, a small team of human evaluators manually score thousands of news articles based on 18 measures of media quality.

This process is time consuming (some evaluators spend up to four hours a day reading and evaluating articles), inefficient (when staff turns over, we need to reinvest resources to train up new hires), and inconsistent (even well-trained evaluators might score articles differently).

To test whether we can make the process of measuring media quality less resource-intensive, we spent a few months training software to automatically detect one of these 18 measures of media quality: whether journalists keep their own opinions out of their news articles. The results of this experiment are very compelling:

  • The software had 95% accuracy in recognizing sentences containing opinions within the dataset of 1,200 articles.
  • The software’s ability to “learn” was evident. Anecdotally, the evaluation team noticed a marked improvement in the accuracy of the software’s suggestions after showing it only twenty sentences that had opinions. The accuracy, precision, and recall results highlighted above were achieved after only sixteen rounds of training the software.
  • Accuracy and precision increased the more that the model was trained. There is a clear relationship between the number of times the evaluators trained the software and the accuracy and precision of the results. The recall results did not improve over time as consistently.

These results, although promising, simplify some numbers and calculations. Check out our full report for details.

What does this all mean? Let’s start with the good news. The results suggest that some parts of media quality—specifically, whether an article is impartial or whether it echoes its author’s opinions—can be automatically measured by machine learning.

The software also introduces the possibility of unprecedented scale, scanning thousands of articles in seconds for this specific indicator. These implications introduce ways for media support programs to spend their limited resources more efficiently.

3 Lessons Learned from using Machine Learning

Of course, the machine learning experience was not without problems. With any cutting-edge technology, there will be lessons we can learn and share to improve everyone’s experience. Here are our three lessons learned working with machine learning:

1. Forget about being tech-literate; we need to be more problem-literate.

Defining a coherent, specific, actionable problem statement was one of the most important steps of this experiment. This wasn’t easy. Hard trade-offs had to be made (Which of 18 indicators should we focus on?), and we had to focus on things we could measure in order to demonstrate efficiency gains using this new approach (How much time do evaluators currently spend scoring articles?).

When planning your own machine learning project, devote plenty of time at the outset—together with your technology partner—to define the specific problem you’ll try to address. These conversations result in a deeper shared understanding of both the sector and the technology that will make the experiment more successful.

2. Take the time to communicate results effectively.

Since completing the experiment, people have asked me to explain how “accurate” the software is. But in practice, machine learning software uses different methods to define “accuracy”, which in turn can vary according to the specific model (the software we used deploys several models).

What starts off as a simple question (How accurate is our software?) can easily turn into a discussion of related concepts like precision, recall, false positives, and false negatives. We found that producing clean visuals (like this or this) became the most effective way to explain our results.
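To make those terms concrete, here is a minimal sketch of how accuracy, precision, and recall relate to false positives and false negatives. The labels are made up for illustration and are not results from our experiment; the function name is hypothetical, and the software we used may calculate its scores differently.

```python
def classification_metrics(actual, predicted):
    """Compute basic metrics for binary labels.

    Assumption for this sketch: 1 means "sentence contains an opinion",
    0 means it does not.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # opinions correctly flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # neutral sentences wrongly flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # opinions the software missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # neutral sentences correctly passed

    total = tp + fp + fn + tn
    return {
        # Share of all sentences that were labeled correctly
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of the sentences flagged as opinion, the share that really were
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real opinion sentences, the share that were flagged
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Made-up labels for ten sentences (not data from the experiment)
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this toy example the software is right on eight of ten sentences, yet it misses one of the three real opinions and wrongly flags one neutral sentence. That gap is why a single "accuracy" number rarely answers the question on its own.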

3. Start small and manage expectations.

Stakeholders with even a passing awareness of machine learning will be aware of its hype. Even now, some colleagues ask me how we “automated the entire media quality assessment process”—even though we only used machine learning to identify one of 18 indicators of media quality. To help mitigate inflated expectations, we invested a small amount into this “minimum viable product” (MVP) to prove the fundamental concept before expanding on it later.

Approaching your first machine learning project this way might help to keep expectations in line with reality, minimize risks associated with experimentation, and provide air cover for you to adjust your scope as you discover limitations or adjacent opportunities during the process.

How I Learned to Stop Worrying and Love Big Data

by Zach Tilton, a Peacebuilding Evaluation Consultant and a Doctoral Research Associate at the Interdisciplinary PhD in Evaluation program at Western Michigan University. 
 
In 2013 Dan Ariely quipped “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it….” In 2015 the metaphor was imported to the international development sector by Ben Ramalingam, in 2016 it became a MERL Tech DC lightning talk, and it has been ringing in our ears ever since. So, what about 2018? Well, unlike US national trends in teenage sex, there are some signals that big or at least ‘bigger’ data is continuing to make its way not only into the realm of digital development, but also evaluation. I recently attended the 2018 MERL Tech DC pre-conference workshop Big Data and Evaluation where participants were introduced to real ways practitioners are putting this trope to bed (sorry, not sorry). In this blog post I share some key conversations from the workshop framed against the ethics of using this new technology, but to do that let me first provide some background.
 
I entered the workshop on my heels. Given the recent spate of security breaches and revelations about micro-targeting, ‘Big Data’ has been somewhat of a bogeyman for me and others. I have taken some pains to limit my digital data-footprint, have written passionately about big data and surveillance capitalism, and have long been skeptical of big data applications for serving marginalized populations in digital development and peacebuilding. As I found my seat before the workshop started I thought, “Is it appropriate or ethical to use big data for development evaluation?” My mind caught hold of a 2008 Evaluation Café debate between evaluation giants Michael Scriven and Tom Cook on causal inference in evaluation and the ethics of Randomized Control Trials. After hearing Scriven’s concerns about the ethics of withholding interventions from control groups, Cook asks, “But what about the ethics of not doing randomized experiments?” He continues, “What about the ethics of having causal information that is in fact based on weaker evidence and is wrong? When this happens, you carry on for years and years with practices that don’t work whose warrant lies in studies that are logically weaker than experiments provide.”
 
While I sided with Scriven for most of that debate, this question haunted me. It reminded me of an explanation of structural violence by peace researcher Johan Galtung who writes, “If a person died from tuberculosis in the eighteenth century it would be hard to conceive of this as violence since it might have been quite unavoidable, but if he dies from it today, despite all the medical resources in the world, then violence is present according to our definition.” Galtung’s intellectual work on violence deals with the difference between potential and the actual realizations and what increases that difference. While there are real issues with data responsibility, algorithmic biases, and automated discrimination that need to be addressed, if there are actually existing technologies and resources not being used to address social and material inequities in the world today, is this unethical, even violent? “What about the ethics of not using big data?” I asked myself back. The following are highlights of the actually existing resources for using big data in the evaluation of social amelioration.
 

Actually Existing Data

 
During the workshop, Kerry Bruce from Social Impact shared with participants her personal mantra, “We need to do a better job of secondary data analysis before we collect any more primary data.” She challenged us to consider how to make use of the secondary data available to our organizations. She gave examples of potential big data sources such as satellite images, remote sensors, GPS location data, social media, internet searches, call-in radio programs, biometrics, administrative data and integrated data platforms that merge many secondary data files such as public records and social service agency and client files. The key here is that there are a ton of actually existing data, many of which are collected passively, digitally, and longitudinally. She noted real limitations to accessing existing secondary data, including donor reluctance to fund such work, limited training in appropriate methodologies in research teams, and differences in data availability between contexts. To underscore the potential of secondary data despite these limits, she shared a case study in which she led a team that used large amounts of indirect secondary data to identify ecosystems of modern-day slavery at a significantly lower cost than collecting the data first-hand. The outputs of this work will help pinpoint interventions and guide further research into the factors that may lead to predicting and prescribing what works well for stopping people from becoming victims of slavery.
 

Actually Existing Tech (and math)

 
Peter York from BCT Partners provided a primer on big data and data science including the reality-check that most of the work is the unsexy “ETL,” or the extraction, transformation, and loading of data. He contextualized the potential of the so-called big data revolution by reminding participants that the V’s of big data, Velocity, Volume, and Variety, are made possible by the technological and social infrastructure of increasingly networked populations and how these digital connections enable the monitoring, capturing, and tracking of ever increasing aspects of our lives in an unprecedented way. He shared, “A lot of what we’ve done in research were hacks because we couldn’t reach entire populations.” With advances in the tech stacks and infrastructure that connect people and their internet-connected devices with each other and the cloud, the utility of inferential statistics and experimental design lessens when entire populations of users are producing observational behavior data. When this occurs, evaluators can apply machine learning to discover the naturally occurring experiments in big data sets, what Peter terms ‘Data-driven Quasi-Experimental Design.’ This is exactly what Peter does when he builds causal models to predict and prescribe better programs for child welfare and juvenile justice to automate outcome evaluation, taking cues from precision medicine.
 
One example of a naturally occurring experiment was the 1854 Broad Street cholera outbreak in which physician John Snow used a dot map to identify a pattern that revealed the source of the outbreak, the Broad Street water pump. By finding patterns in the data John Snow was able to lay the groundwork for rejecting the false Miasma Theory and replacing it with a prototypical Germ Theory. And although he was already skeptical of miasma theory, by using the data to inform his theory-building he was also practicing a form of prototypical Grounded Theory. Grounded theory is simply building theory inductively, after data collection and analysis, not before, resulting in theory that is grounded in data. Peter explained, “Machine learning is Grounded Theory on steroids. Once we’ve built the theory, found the pattern by machine learning, we can go back and let the machine learning test the theory.” In effect, machine learning is like having a million John Snows to pore over data to find the naturally occurring experiments or patterns in the maps of reality that are big data.
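As a rough illustration of the kind of pattern a dot map surfaces, the sketch below assigns each recorded case to its nearest water source and counts the results. The coordinates, pump names, and the `nearest_pump` helper are all hypothetical; this is not Snow's data or method, and it is far simpler than the machine learning Peter described.

```python
from collections import Counter
from math import dist

# Hypothetical pump locations on an arbitrary grid (not historical data)
pumps = {
    "Pump A": (0.0, 0.0),
    "Pump B": (5.0, 0.0),
    "Pump C": (0.0, 5.0),
}

# Hypothetical locations of recorded cases on the same grid
cases = [(0.5, 0.2), (1.0, 0.5), (0.2, 1.1), (4.6, 0.3), (0.8, 0.9), (0.3, 4.2)]


def nearest_pump(case, pumps):
    """Return the name of the pump closest to a case (straight-line distance)."""
    return min(pumps, key=lambda name: dist(case, pumps[name]))


# Tally cases by nearest pump; a lopsided tally is a pattern worth investigating
counts = Counter(nearest_pump(case, pumps) for case in cases)
for name, count in counts.most_common():
    print(name, count)
```

A lopsided tally like this only points to where to look. It does not establish the cause by itself, which is why the theory still has to be tested afterward.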
 
A key aspect of the value of applying machine learning in big data is that patterns more readily present themselves in datasets that are ‘wide’ as opposed to ‘tall.’ Peter continued, “If you are used to datasets you are thinking in rows. However, traditional statistical models break down with more features, or more columns.” So, Peter and evaluators like him who are applying data science to their evaluative practice are evolving from traditional Frequentist to Bayesian statistical approaches. While there is more to the distinction here, the latter uses prior knowledge, or degrees of belief, to determine the probability of success, where the former does not. This distinction is significant for evaluators who want to move beyond predictive correlation to prescriptive evaluation. Peter expounded, “Prescriptive analytics is figuring out what will best work for each case or situation.” For example, with prediction, we can make statements that a foster child with certain attributes is 70% likely not to find a home. Using the same data points with prescriptive analytics we can find 30 children who are similar to that foster child and find out what they did to find a permanent home. In a way, only using predictive analytics can cause us to surrender while including prescriptive analytics can cause us to endeavor.
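The "find similar cases and see what worked for them" idea can be sketched with a simple nearest-neighbor lookup. Everything here is an assumption for illustration: the two numeric features, the intervention names, the outcomes, and the helper functions are made up, and this is not Peter's actual model. A real analysis would also scale the features and account for confounding before making any prescriptive claim.

```python
from collections import Counter
from math import dist

# Hypothetical records: (features, intervention received, positive outcome?)
# The two features are arbitrary numeric attributes of a case.
records = [
    ((10, 2), "mentoring", True),
    ((11, 2), "mentoring", True),
    ((10, 3), "family_outreach", False),
    ((11, 3), "mentoring", False),
    ((12, 2), "family_outreach", True),
    ((3, 0), "family_outreach", True),
    ((4, 1), "family_outreach", True),
    ((16, 6), "mentoring", False),
]


def similar_cases(target, records, k):
    """Return the k records whose features are closest to the target."""
    return sorted(records, key=lambda record: dist(target, record[0]))[:k]


def success_rate_by_intervention(cases):
    """Share of positive outcomes for each intervention among the given cases."""
    totals, wins = Counter(), Counter()
    for _, intervention, success in cases:
        totals[intervention] += 1
        wins[intervention] += int(success)
    return {name: wins[name] / totals[name] for name in totals}


target = (10, 2)  # the case we want a recommendation for
neighbors = similar_cases(target, records, k=5)

overall_rates = success_rate_by_intervention(records)
rates = success_rate_by_intervention(neighbors)
best = max(rates, key=rates.get)

print("All cases:", overall_rates)
print("Similar cases:", rates)
print("Best among similar cases:", best)
```

In this toy data, one intervention looks better across all cases while the other looks better among the cases most similar to the target. That is the gap between a population-level prediction and a case-level prescription.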
 

Existing Capacity

The last category of existing resources for applying big data for evaluation was mostly captured by the comments of independent evaluation consultant Michael Bamberger. He spoke of the latent capacity that exists in evaluation professionals and teams, and noted that evaluators are not yet taking full advantage of big data: “Big data is being used by development agencies, but less by evaluators in these agencies. Evaluators don’t use big data, so there is a big gap.”

He outlined two scenarios for the future of evaluation in this new wave of data analytics: a state of divergence where evaluators are replaced by big data analysts and a state of convergence where evaluators develop a literacy with the principles of big data for their evaluative practice. One problematic consideration with this hypothetical is that many data scientists are not interested in causation, as Peter York noted. To move toward the future of convergence, Michael shared how big data can enhance the evaluation cycle from appraisal and planning through monitoring, reporting and evaluating sustainability. He went on to share a series of caveats that include issues with extractive versus inclusive uses of big data, the fallacy of large numbers, data quality control, and different perspectives on theory, all of which could warrant their own blog posts for development evaluation.

While I deepened my basic understanding of data analytics, including the tools and techniques, benefits and challenges, and guidelines for big data and evaluation, my biggest takeaway is a reconsideration of big data for social good. It comes from weighing the ethical dilemma of not using existing data, tech, and capacity to improve development programs, possibly even prescribing specific interventions by identifying their probable efficacy through predictive models before they are deployed.

(Slides from the Big Data and Evaluation workshop are available here).
