Machine learning for survey data quality control, but without the actual survey data
Post by Christopher Robert. Chris is a researcher, a consultant, and Founder Emeritus at Dobility (SurveyCTO). A version of this post was first published on Chris’ LinkedIn page.
Four years ago, I introduced a machine learning roadmap for SurveyCTO. In it, I advanced the uncontroversial theory that machine learning could improve how we do quality control for survey data. That ML models could be trained to recognize data problems was frankly obvious, but I made a further claim that was far less so: I suggested that ML models could be trained to recognize data problems without actually having to see any of the data itself. It seemed like a crazy idea — but early results have been promising, and we’re now on the verge of establishing the technical limits of a radically confidentiality-protecting approach to ML4QC (machine learning for quality control).
If you’re collecting primary data with SurveyCTO or some other tool that can capture rich metadata or paradata, and you have a QC process designed to catch problems, you can potentially improve that QC process while, at the same time, helping to advance the science of ML4QC. Read on to find out how…
But why withhold survey data from ML models?
First, why not train ML models directly on the raw survey data itself? Surely they could better identify problems in the data if they could actually see it.
The answer is simple: confidentiality.
Survey data is often quite sensitive, including health, financial, and other personal information. The best way to safeguard the confidentiality of that data is to massively restrict who — and what systems — can access it. I’d obviously like to trust cloud ML services and data scientists, but the fact is that hack after hack after hack has demonstrated that nobody and nothing is safe. This is why end-to-end encryption has steadily gained in popularity and why it’s such a key feature of SurveyCTO.
And FWIW, it’s not just that I don’t trust others: I don’t even trust myself. Despite how obsessive I am about my own security practices, I know that the only sure way to prevent myself from being involved in a sensitive data breach is to avoid seeing or storing sensitive data to the greatest extent possible.
(Also: ML models trained on stuff that’s less survey-specific than the survey data itself might generalize better across settings, which could prove useful. But that’s a far lesser and more tenuous motivation than confidentiality.)
If they can’t see the actual survey data, what can ML models actually see?
The first thing ML4QC models need to be able to see is the result of human review: in order to train these models, there have to be actual people doing the training.
In SurveyCTO, this would typically be done via the review and correction workflow, where some subset of submissions are reviewed closely by human reviewers, either because they were flagged by automated checks or because they were randomly selected. For interviews, the review should ideally include listening to audio audits, so that the reviewer gets an accurate sense of how well the interview was conducted, how faithfully the interviewer followed interview protocols, and how accurately responses were recorded. While QC processes differ, the default is for each submission to then be accepted or rejected and tagged with a quality classification of GOOD, OKAY, POOR, or FAKE.
The ML model’s job, then, is to predict the outcome of this human review, in order to guide the quality team to focus attention on cases most likely to be problematic. But if not the survey data, what can the ML models use to predict these outcomes? Metadata and paradata.
During interviews, SurveyCTO can collect rich metadata and paradata using text audits, sensor metadata, audio audits, and more (see here for a list of good metadata and paradata fields). This includes the specific path through the survey, how much time was spent on each question, which questions were revisited when, the patterns of sound, movement, and light observed during the interview, and more. Together, this data provides a rich portrait of the interview process itself.
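To make that concrete, here is a minimal sketch of how such paradata might be reshaped into one row of features per submission, using pandas. The long-format layout and column names (submission_id, field, duration_seconds) are illustrative assumptions, not the exact schema of a SurveyCTO text audit export.

```python
# Minimal sketch: long-format text-audit records -> per-submission features.
# Column names are illustrative assumptions, not an exact SurveyCTO schema.
import pandas as pd

# One row per visit to a question: which submission, which field, how long.
audits = pd.DataFrame({
    "submission_id": ["s1", "s1", "s1", "s2", "s2"],
    "field": ["consent", "age", "age", "consent", "age"],
    "duration_seconds": [12.0, 8.5, 4.0, 2.0, 1.5],
})

# Time spent per question, pivoted wide: one row per submission, one column per field.
time_per_field = (
    audits.groupby(["submission_id", "field"])["duration_seconds"]
    .sum()
    .unstack(fill_value=0)
    .add_prefix("secs_")
)

# Interview-level summaries: total duration and how often questions were revisited.
summary = audits.groupby("submission_id").agg(
    total_seconds=("duration_seconds", "sum"),
    n_visits=("field", "size"),
    n_fields=("field", "nunique"),
)
summary["n_revisits"] = summary["n_visits"] - summary["n_fields"]

features = summary.join(time_per_field)
print(features)
```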
Using such portraits, the idea is that trained ML models can learn to distinguish between good and bad interviews. Such models won’t be able to detect a single typo in a response value, for example, because the assessment here is at the interview level rather than the question level — but they should be able to detect the difference between a rushed interview full of typos and a more careful, thorough interview, or between an interview where protocols were generally followed and one where they weren’t. (And, for self-response surveys, the idea would be to recognize when there’s a real human respondent paying attention and trying to answer accurately vs. a bot or somebody just going through the motions.)
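As a sketch of the prediction step itself, assuming a feature table like the one above plus a binary label from human review (1 = rejected), the snippet below trains a classifier and ranks submissions by predicted risk. The synthetic data and model choice are purely illustrative, not necessarily what the ml4qc project uses.

```python
# Sketch of predicting human review outcomes from paradata features and using
# the scores to rank submissions for review. Data here is a synthetic stand-in.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "total_seconds": rng.normal(1200, 300, n),
    "n_revisits": rng.poisson(3, n),
    "secs_consent": rng.normal(30, 10, n),
})
# Stand-in for the accept/reject outcome of human review (1 = rejected).
labels = (features["total_seconds"] < 900).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Out-of-fold probabilities, so each submission is scored by a model that never saw it.
risk = cross_val_predict(model, features, labels, cv=5, method="predict_proba")[:, 1]

# Reviewers start with the submissions predicted to be most problematic.
review_order = pd.Series(risk, index=features.index).sort_values(ascending=False)
print(review_order.head(10))
```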
How to try ML4QC in your surveys (past, present, or future)
With generous funding and additional support from Dobility, Orange Chair Labs is now supporting SurveyCTO partners and users who want to implement and evaluate ML4QC methods in their survey projects. All of the work is being open-sourced in real time, via the surveydata and ml4qc GitHub repositories, in order to help promote collaboration and advance the science.
In just the first proof-of-concept analysis involving a CATI survey in Afghanistan, ML algorithms were able to identify submissions with a 9x greater risk of being rejected by human review (relative to randomly selected submissions), and even greater performance looks likely depending on the setting, review process, and metadata collected.
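For intuition, a multiple like that can be read as lift: the rejection rate among submissions flagged by the model divided by the rejection rate among randomly selected submissions. The numbers below are made up purely to illustrate the arithmetic; they are not the Afghanistan results.

```python
# Illustrative arithmetic only, not the actual study numbers.
base_rejection_rate = 0.04     # share of randomly selected submissions rejected
flagged_rejection_rate = 0.36  # share of model-flagged submissions rejected

lift = flagged_rejection_rate / base_rejection_rate
print(f"Flagged submissions are {lift:.0f}x more likely to be rejected.")  # -> 9x
```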
If you’re interested in this technology, you can:
- Share non-PII metadata and paradata from a recent survey. If you reviewed at least 10% of interviews (ideally including audio recordings), we can evaluate how well ML4QC techniques predict the results of your reviews.
- Pilot ML4QC techniques in an upcoming survey. As long as you have resources available to review at least 10% of interviews, we can implement the ML4QC workflow and use it to increasingly direct your reviews toward outliers and interviews predicted to be problematic (see the sketch after this list). In the process, we can evaluate how well the models do relative to your traditional approach for directing review.
- Implement your own ML4QC workflow. If you have a data scientist on staff, the ml4qc Python package and example Jupyter workbooks should be enough to get them started.
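To give a flavor of what that increasingly-directed review loop could look like, here is a rough sketch that retrains on the human reviews completed so far and then fills the next review batch mostly with model-flagged submissions plus a random share. Everything here (function names, parameters, columns) is hypothetical and is not the ml4qc package’s API; keeping some random reviews in every batch is a design choice of the sketch, so the model’s blind spots still get audited.

```python
# Hypothetical sketch of picking the next review batch: mostly model-flagged,
# partly random. Not the ml4qc package's API; names and model are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def next_review_batch(features: pd.DataFrame, labels: pd.Series,
                      batch_size: int = 20, random_share: float = 0.3,
                      seed: int = 0) -> pd.Index:
    """Pick the next submissions to review; labels are NaN where not yet reviewed."""
    reviewed = labels.dropna().index
    pending = features.index.difference(reviewed)

    # Retrain on all human reviews so far, then score the unreviewed submissions.
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(features.loc[reviewed], labels.loc[reviewed].astype(int))
    risk = pd.Series(model.predict_proba(features.loc[pending])[:, 1], index=pending)

    # Fill most of the batch with the riskiest submissions, the rest at random.
    n_random = int(batch_size * random_share)
    flagged = risk.sort_values(ascending=False).head(batch_size - n_random).index
    rng = np.random.default_rng(seed)
    random_pick = pd.Index(rng.choice(pending.difference(flagged), size=n_random, replace=False))
    return flagged.union(random_pick)

# Example: 100 submissions with paradata features; the first 30 already reviewed,
# of which 5 were rejected (1 = rejected, 0 = accepted).
rng = np.random.default_rng(1)
features = pd.DataFrame(rng.normal(size=(100, 3)),
                        columns=["total_seconds", "n_revisits", "secs_consent"])
labels = pd.Series([np.nan] * 100)
labels.iloc[:30] = [0] * 25 + [1] * 5
print(next_review_batch(features, labels))
```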
We hope that you’ll join the effort and help us to make ML-powered survey quality control a reality. Please reach out to crobert@orangechairlabs.com if you’re interested in discussing it.
P.S. New tool for Python data science workflows
BTW, if you use Python for your data science workflows, the surveydata package makes it easy to work with SurveyCTO and ODK servers and data. With it, you can:
- Load SurveyCTO and ODK export data — including text audit data — into Pandas DataFrames, with automatic encoding of dates, times, and other data types
- Sync data directly from a SurveyCTO or ODK Central server into local files or cloud storage, including encrypted data (currently supported cloud storage: AWS S3 and DynamoDB, Google Cloud Storage, and Azure Blob Storage)
- Convert text audit data into a convenient wide format for analysis
- Submit comments and reviews via the SurveyCTO review and correction workflow, to supplement manual review with automated analysis
For details, see the reference documentation.