How can evaluators use NLP? 4 demos and a discussion at the AEA Conference
On October 12th at the American Evaluation Association (AEA) conference, members of the Natural Language Processing Community of Practice (NLP-CoP) ran a standing-room-only session on how Large Language Model (LLM) tools (e.g. ChatGPT, Bard, LLaMa) bring incredible opportunities to scale qualitative and text analysis in evaluations. The session focused on hands-on applications of LLMs to situations that evaluators regularly face in their work: in 5-minute demos, we walked participants through four such ‘evaluator-life’ scenarios before breaking into groups to discuss the biases, pitfalls, and practical difficulties evaluators might encounter when using these tools.
Stephanie Coker kicked us off with a presentation on NLP. She highlighted that, broadly speaking, NLP can help evaluators with three types of tasks: information retrieval, sentiment analysis, and conversational interaction via chatbots. She described how LLMs handle each type of task: a pre-trained transformer sits on an information conveyor belt, producing an answer by repeatedly predicting what the next word is likely to be. With these capabilities now available at an enormous scale, LLMs can aid the analytical process, acting as consolidated sources of insight and as thought partners. LLMs also simplify the capture, analysis, and reporting of numerous stories, opening avenues for further research exploration.
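To make the next-word-prediction idea concrete, here is a minimal sketch, not something shown in the session, that asks a small open model (GPT-2, our choice purely for illustration) to continue a sentence by predicting the most likely next words:

```python
# Minimal illustration of next-word prediction with a small open model.
# GPT-2 and the prompt below are our own choices for illustration only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The evaluation found that the program"
# The model extends the prompt by repeatedly predicting the most likely
# next token, which is the same mechanism chatbots build on at scale.
result = generator(prompt, max_new_tokens=12, do_sample=False)
print(result[0]["generated_text"])
```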
Stephanie noted, however, that like any machine, LLMs must be pre-trained on the right data and continually fine-tuned to yield the best results. Because of these requirements, LLMs are commonly subject to various biases, including representation, semantic, label, and historical biases. She concluded that it is therefore essential to approach LLMs critically, recognizing that they are not universal solutions, and that responsible use requires developing and adhering to common principles and guidelines for evaluation.
Following Stephanie’s introduction, we presented our audience with four real examples of how to apply NLP in evaluation work, along with cautions about the different approaches.
Demo 1. Using ChatGPT as a thought partner
The demo: Linda Raftree shared ideas on how to use ChatGPT-4 as a thought partner for writing tasks (see this document for details). Using a video transcript pulled from YouTube, she showed how to clean up and summarize the key points from the transcript. She then asked ChatGPT to write a session submission for the AEA. Linda emphasized the importance of not uploading personal or proprietary information to ChatGPT, as it’s unclear how the platform handles privacy. She also pointed out a possible instance of bias: the video she used for the demo was of Corey Whitmore, the AEA president, who is female; in its summary, ChatGPT assumed that Corey was male.
The discussion:
- Biases: Participants wondered how these tools handle intersectionality. Since the models are trained on the Internet and Google Books, most of which have been authored by White men, there is deep bias in the underlying data. Some creators of large language models are making an effort to build more neutral models, yet a lot of bias remains and the tools don’t generally produce equitable outputs. The tools currently work much better in English because the English-language datasets are larger.
- Objectivity: ChatGPT defaults to a centrist tone and refuses to provide opinions. Do we want it to be opinion-based, or to provide us with data on which humans can make decisions? It’s still hard to know what biases might lurk in the data we would use to make those decisions. Some say AI is more objective than humans, but algorithms are designed by humans and built on data that humans created and then tagged, so we are constantly inputting bias.
- Citations: The tools are not great at citing their sources. ChatGPT convincingly fabricates citations, for example. These fake citations are possible because these models are not search engines; they are correlational models. They predict the next word in a sentence using text from their training data, so they draw on existing real citations to create new, imaginary ones that seem related to the topic at hand.
- Validation of outputs: Claude, another large language model, has a feature that allows you to trace where it is drawing conclusions from; ChatGPT doesn’t, so it’s hard to know how it arrives at a particular output. When using ChatGPT, you can ask it how it arrived at its conclusions as a way to validate them; e.g., you can ask “Why did you come up with that?” or “Where did you find the conclusion?” and it will tell you what text it based its conclusion on.
- Importance of precise “prompt engineering”: To get better outputs, it’s important to be very specific about what you want the chatbot to do. For example, if aiming to conduct sentiment analysis about gender, the prompter needs to spell out what is meant by “traditional gender norms” in order for the chatbot to provide a useful sentiment analysis. Knowing how to frame a task for a chatbot is key (see the sketch after this list).
- Need for research and upskilling: There are lots of remaining questions about how these tools work and how evaluators might use them. More documented experimentation and good practices are needed. Institutional Review Boards might want to consider how to use AI to support their review processes; they will also need training and upskilling to address the ethical questions arising from the use of these tools. The World Bank has conducted side-by-side tests of different models doing the same tasks to determine how well they perform individually and comparatively, and how they stand up against humans. MERL Tech’s NLP-CoP focuses on unpacking these kinds of issues.
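To make the prompt-specificity point concrete, here is a minimal sketch, not taken from the session, that contrasts a vague prompt with a more precise one using the OpenAI Python client. The model name, the working definition of “traditional gender norms,” and the example excerpt are all our own placeholders:

```python
# Hypothetical illustration of vague vs. specific prompting for sentiment
# analysis; the model name, definition, and excerpt are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

text = "Respondents said that men should make all household financial decisions."

vague_prompt = f"Analyze the sentiment of this text: {text}"

specific_prompt = (
    "You are assisting an evaluation team. Classify the following interview "
    "excerpt as SUPPORTIVE, NEUTRAL, or CRITICAL of traditional gender norms, "
    "where 'traditional gender norms' means beliefs that men and women should "
    "hold distinct, unequal roles in household decision-making. "
    "Explain the classification in one sentence.\n\n"
    f"Excerpt: {text}"
)

for prompt in (vague_prompt, specific_prompt):
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content, "\n")
```

The point of the second prompt is that the evaluator, not the model, supplies the working definition the analysis depends on.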
Demo 2. Using ChatGPT as co-pilot for an LLM analysis pipeline in Python
The demo: A key problem with using chatbot tools like ChatGPT in evaluations right now is that interactions with them are not easily scalable: text needs to be copy-pasted or typed into the chatbot’s user interface manually. A way around this is to use publicly accessible LLMs (e.g. via Hugging Face) in a Python coding environment and apply them to full datasets of text. But what if you’re not an expert in Python? Use ChatGPT as your co-pilot!
Paul Jasper showed how ChatGPT can help write code in a Google Colab environment to access a summarisation model from Hugging Face, apply it to several rows of text in an Excel document, and then create a new column in that same Excel document that shows the summarized text as an output. All of this can be done with a limited understanding of Python and in under an hour. The crucial takeaways from this demo are that ChatGPT-type co-pilots have significantly lowered the threshold to coding, and that LLMs can easily be built into analysis workflows.
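For a flavour of what such a pipeline can look like, here is a minimal sketch, assuming a spreadsheet with a “text” column. It is our own illustration, not Paul’s actual code, and the file and model names are placeholders:

```python
# Sketch of an Excel summarisation pipeline; the file names, column name,
# and model choice are our own placeholders, not from the demo.
import pandas as pd
from transformers import pipeline

# Load an open summarisation model from Hugging Face.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

df = pd.read_excel("briefs.xlsx")  # placeholder file name

def summarize(text: str) -> str:
    # truncation=True guards against inputs longer than the model's window.
    result = summarizer(text, max_length=60, min_length=15, truncation=True)
    return result[0]["summary_text"]

df["summary"] = df["text"].astype(str).apply(summarize)  # "text" column assumed
df.to_excel("briefs_summarized.xlsx", index=False)
```

Because the model runs locally (or in your own Colab session), the text itself is never sent to a chatbot platform, which is relevant to the data-protection point below.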
The discussion:
- Building your own analysis pipeline can help evaluators get around data protection concerns: running a model from Hugging Face on your own machine means that no data needs to be shared with a cloud-based platform.
- Fine-tuning of these ‘off the shelf’ models is getting easier, but still requires resources.
- Human validation (validate, validate, validate) is still crucial.
- Finally, there was a discussion about how it would be helpful to collaborate on fine-tuning models for specific evaluation tasks.
Demo 3. Quantitative Analysis with R
The demo: In light of ongoing discussions about using LLMs for qualitative analysis, Stephanie was keen to explore LLM use for quantitative analytical methodologies, specifically those executable in R. Recognizing the versatility of LLMs as thought partners for data analysis, she posed several queries to ChatGPT, seeking guidance on implementing robust statistical techniques in R for analyzing variables in dummy datasets. Beyond immediate analytical tasks, she also showed how ChatGPT can assist with finding publicly available data for use in analysis and with comparing variables across different datasets.
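For a sense of the kind of analysis involved, here is a minimal robust-regression sketch on a synthetic dummy dataset. Stephanie’s demo used R; we show Python here purely for consistency with the other sketches in this post:

```python
# Robust regression on a dummy dataset; our own illustration in Python
# (the demo itself used R).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)
y[:10] += 8  # inject outliers that would distort an ordinary least squares fit

X = sm.add_constant(x)

# Huber's T norm downweights outliers instead of letting them drive the fit.
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
ols_fit = sm.OLS(y, X).fit()

print("Robust slope:", round(robust_fit.params[1], 3))
print("OLS slope:   ", round(ols_fit.params[1], 3))
```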
The discussion:
- Data privacy and governance are still issues, even for LLM versions that can be downloaded onto a hard drive. Some participants noted that they were not able to explore LLMs because of firewalls at larger private and public institutions, suggesting that LLMs are seen as a source of vulnerability for those systems.
- To fully account for biases, we need more information about how LLMs work beyond general descriptions of their function. Many evaluators (like other groups of researchers) also felt that certain standards need to be established to prevent potentially harmful use.
- There was also a sense that an evaluation organization or team might have to be fairly data-mature to effectively leverage LLM tools. Stephanie referenced the Data Maturity Assessment tool by data.org, which social sector organizations can use to assess their readiness to apply different technologies.
- There are other tools, such as NVivo, that do sentiment analysis specifically for research and analysis. However, translation and transcription are great assets that LLMs now add to the mix, making it possible to translate text quickly into lesser-known languages.
Demo 4. Summarizing Qualitative Data in ChatGPT without Python
The demo: Kerry Bruce showed how ChatGPT isn’t really ready for qualitative data analysis through cut-and-paste scenarios, except when using publicly available data and very small datasets (which might just as easily be analyzed by a human). The demo addressed a common evaluation task: analyzing a large volume of evaluation briefs. It showed how the data need to be prepared and cleaned for use with ChatGPT, and the limits of what can be done. A number of possible prompts were provided, and the results were shown in real time. (Try it yourself: the dataset is available here and a deck with details from the presentation is available here.)
Key takeaways from this demo were:
- Only relatively small files can be provided to ChatGPT to summarize directly in the chat window.
- Beyond small amounts of textual data, you need to use Python and an API to provide the data (see the sketch after this list).
- Data cleaning and preparation are essential; otherwise the machine will easily misinterpret the data.
- ChatGPT can be a great time-saving device, but it is probably only faster than manual analysis when you are working with large datasets.
- Be careful what data you are putting into the system: private data should not be shared.
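As a rough sketch of the Python-and-API route mentioned above (our own illustration, not Kerry’s code; the folder, model name, and prompt wording are all placeholders):

```python
# Sketch of summarising many briefs via the OpenAI API; the folder,
# model name, and prompt wording are placeholders, not from the demo.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

summaries = {}
for brief in Path("briefs").glob("*.txt"):  # hypothetical folder of briefs
    text = brief.read_text(encoding="utf-8")
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {
                "role": "user",
                "content": "Summarize this evaluation brief in three bullet "
                f"points:\n\n{text}",
            }
        ],
    )
    summaries[brief.name] = response.choices[0].message.content

for name, summary in summaries.items():
    print(f"--- {name} ---\n{summary}\n")
```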
The discussion:
- Social desirability bias: Evaluators need to be aware of ChatGPT’s social desirability bias when interpreting its responses. Generally, the LLM has been programmed to provide an answer and will do so, whether or not it actually knows the answer.
- Citations: There was a robust discussion around citations: both how ChatGPT generates false, non-existent citations when asked, and how and when we should cite ChatGPT as a source or thought partner in our own work. There is also a larger discussion about how ChatGPT (ab)uses intellectual property and who actually owns the products of its work.
In conclusion
As presenters, we were thrilled to have such a great turnout. The breakout groups contributed to rich discussions about how emerging AI tools are sure to affect our field and the kinds of cautions we should exercise as we begin to use them!