Event Recap: Tests of Large and Small Language Models on Common Evaluation Tasks


Guest post by Pedro Prieto Martin

On March 12, 2025, the NLP Community of Practice’s Sandbox Working Group hosted a webinar featuring Gerard Atkinson, Director at ARTD Consultants (Australia), who presented findings from his research comparing the performance of various language models on standard evaluation tasks, such as qualitative text analysis and the use of rubrics to assess documents.

During the webinar, Gerard shared a metaphor comparing LLMs to chef's knives: tools that can work very well in the hands of those skilled in using them, but that can also cause a lot of damage when mishandled. Image by Takafumi Yamashita via Unsplash.

Qualitative Text Analysis

Building on his 2023 research, Gerard presented an updated comparison of text analysis methods, ranging from traditional human coding (which involves no automation), through a series of semi-automated, statistical, and machine learning approaches (including QAD Coding, Latent Dirichlet Allocation, and BERTopic/KeyBERT), to a selection of Large and Small Language Models (encompassing Claude, GPT-4o, Llama 3.1, Phi-3, and Whyhive).

The research employed a systematic framework to evaluate both zero-shot approaches (where coders annotate text without using a pre-existing topic list) and guided approaches (where lists of topics are provided).
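
To make the distinction concrete, here is a minimal sketch (not the study's actual prompts or tooling) of how the two approaches might be prompted, assuming an OpenAI-style chat API; the model name, helper function, and topic list are illustrative only:

```python
# Illustrative sketch only: contrasting a zero-shot prompt (the model proposes
# its own topics) with a guided prompt that supplies a pre-existing topic list.
# The model name, topic list, and wording are assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

GUIDED_TOPICS = ["work", "family", "health", "leisure"]  # hypothetical topic list

def code_entry(text: str, topics: list[str] | None = None) -> str:
    """Tag a diary entry, with (guided) or without (zero-shot) a topic list."""
    if topics:
        instruction = (
            "Tag this diary entry with the most relevant topics from the list: "
            + ", ".join(topics)
        )
    else:
        instruction = "Identify the main topics of this diary entry in a few words."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{instruction}\n\nEntry: {text}"}],
    )
    return response.choices[0].message.content

# zero_shot = code_entry("Spent the afternoon gardening with my daughter.")
# guided = code_entry("Spent the afternoon gardening with my daughter.", GUIDED_TOPICS)
```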

The study used an existing dataset of diary entries with human-tagged emotions and topics, which was considered more representative of the qualitative data found in evaluation than other common training sets such as Twitter posts or IMDB reviews.

Key Findings on ML Text Analysis Methods

The performance of the different methods was measured using rubrics that were applied to six different dimensions:

  • Accuracy – success in matching original coding.
  • Speed – timeliness of results.
  • Automation – degree of independence from human intervention.
  • Ease of implementation – setup and execution complexity.
  • Efficiency of implementation – cost-effectiveness.
  • Efficiency of scale – marginal cost at scale.
Summary table of method ratings across these dimensions (Gerard Atkinson, 2025).
  • Accuracy: For guided classification, frontier models demonstrated performance consistent with human coding. Zero-shot classification remains more challenging, though some models performed surprisingly well. Notably, some small language models were approaching the performance of their larger counterparts, especially in guided scenarios.
  • Speed: As expected, automated methods significantly outpaced human coding. The frontier models (Claude and GPT-4o) demonstrated remarkable throughput at approximately 24,000 entries per hour. Even the slower SLMs outperformed human coding, which managed around 550-900 entries per hour, depending on the approach.
  • Automation: All ML approaches offered near-total automation, with only minor human oversight needed.
  • Ease of implementation: Significant variations existed, with web-based interfaces (GPT-4o, Claude) requiring minimal knowledge, while SLMs like Llama 3.1 and Phi-3 demanded moderate API and programming knowledge.
  • Cost-effectiveness: Implementation costs ranged from $20-40 for web-based LLMs to $80-180 for methods requiring programming expertise. Human coding remains quite cost-effective when it relies on entry-level analysts using topic lists to guide the tagging.
  • Scalability: Nearly all ML approaches demonstrated near-zero marginal costs at scale, while human coding showed almost no economies of scale.

An interesting observation was that while implementation complexity varied significantly across methods, the efficiency gains at scale were substantial for nearly all automated approaches.

Rubric Analysis Using LLMs

In an innovative extension of his research, Gerard tested whether LLMs could themselves function as evaluators using predefined rubrics to assess how well a document meets a series of criteria. This experiment involved three distinct datasets:

  • Resumes in text form (from LiveCareer.com)
  • Recipe datasets (from Poznan University of Technology)
  • Presentation abstracts (from AES 2023)

For each dataset, three evaluation dimensions were defined, ranging from low complexity (e.g., counting qualifications or ingredients) to high complexity (assessing grammar or technical difficulty of the recipe). The LLMs were prompted to evaluate content against these rubrics using a Low/Medium/High rating scale.
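
As a rough illustration of how such a rubric might be put to an LLM (the criteria below are invented for the example and are not the study's actual rubrics), the prompt could be assembled along these lines:

```python
# Hypothetical example: assembling a Low/Medium/High rubric prompt for one
# document. The criteria are invented placeholders, not those used in the study.
RUBRIC = {
    "completeness": "Does the abstract state its aims, methods, and findings?",
    "clarity": "Is the writing grammatically correct and easy to follow?",
    "citations": "Does the abstract reference relevant prior work?",
}

def build_rubric_prompt(document: str) -> str:
    """Return a prompt asking for a Low/Medium/High rating on each criterion."""
    criteria = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    return (
        "Rate the document below against each criterion as Low, Medium, or High.\n"
        f"{criteria}\n"
        "Return one line per criterion in the form 'criterion: rating'.\n\n"
        f"Document:\n{document}"
    )
```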

Results and Limitations

While the LLMs produced evaluations across all dimensions at much higher speeds than human raters, their accuracy was not yet at human levels for most criteria. The highest agreement rates were observed for more objective criteria (e.g., citations in abstracts, with a 90% match rate), while more subjective dimensions showed poor agreement (e.g., qualifications in resumes, with only a 10% match rate). Krippendorff's Alpha scores were consistently low across all dimensions, indicating limited reliability compared to human evaluators.

Table of match rates and Krippendorff's Alpha scores by dataset and evaluation dimension (Gerard Atkinson, 2025).
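
For readers who want to run the same kind of agreement check on their own data, a minimal sketch is shown below; it uses toy ratings rather than the study's data and assumes the open-source krippendorff Python package:

```python
# Toy example: percentage match and Krippendorff's Alpha between human and LLM
# ratings on an ordinal Low/Medium/High scale. The ratings are invented;
# requires `pip install krippendorff`.
import krippendorff

SCALE = {"Low": 0, "Medium": 1, "High": 2}
human = ["High", "Medium", "Low", "High", "Medium"]
llm = ["High", "Low", "Low", "High", "High"]

match_rate = sum(h == m for h, m in zip(human, llm)) / len(human)

alpha = krippendorff.alpha(
    reliability_data=[[SCALE[r] for r in human], [SCALE[r] for r in llm]],
    level_of_measurement="ordinal",
)
print(f"Match rate: {match_rate:.0%}, Krippendorff's Alpha: {alpha:.2f}")
```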

Gerard mentioned a series of factors impacting the matching metrics, including the specificity of the evaluation rubric, breadth and variety in the data, LLM hallucination tendencies, and inherent subjectivity in certain evaluation dimensions.

Implications for Evaluation Practice

The research suggests several important implications for evaluation professionals:

  1. Machine learning methods are approaching human quality for guided analysis and offer increasingly rich zero-shot analysis capabilities.
  2. Secure, offline approaches are reaching near-human accuracy levels but currently operate at slower speeds than cloud-based solutions.
  3. Third-party specialized solutions leveraging AI present competitive options for organizations with privacy or security constraints.

For rubric-based evaluation specifically, LLMs show promise as a starting point for identifying dimensions and scale points, and they can analyze information at much higher speeds than human raters. However, they have not yet reached human-level accuracy in applying evaluation rubrics consistently.

Gerard also highlighted research from Microsoft on AI’s impact on critical thinking, noting the importance of maintaining and developing evaluators’ critical thinking skills, information verification abilities, and task stewardship in an increasingly AI-assisted profession.

Discussion Highlights

During the Q&A session, several important points emerged:

  • Reliability concerns: Participants questioned how the probabilistic nature of LLMs affects the reliability of results with each new use on the same dataset. Atkinson acknowledged this limitation, noting that his research addressed this indirectly through human comparison benchmarks.
  • Context length limitations: A participant noted that SLMs typically have smaller context windows than LLMs, potentially creating an unfair comparison for longer documents. This remains a challenge for SLM deployment in real-world evaluation scenarios.
  • Back-translation validation: One suggestion from the audience was to explore “back translation” as an alternative accuracy testing method, which could provide additional validation beyond direct comparison to human coding.

Join the Community

To participate in future events and access recordings from this and other sessions, join the NLP Community of Practice at merltech.org/nlp-cop/. The CoP brings together over 1,000 development and humanitarian practitioners working at the intersection of AI, digital development, and MERL (Monitoring, Evaluation, Research, and Learning).


This event was organized by the NLP-CoP Sandbox Working Group, which works to identify, test, and compare tools and applications; collaborate on open-source NLP for MERL approaches; and explore the technical aspects of LLMs and GenAI applications.

1 comment

  1. By Tatek Deneke

This is good experimental research work, well done Gerard Atkinson, as this might lead to supportive technology for evaluation tasks.
    I have a question on the implementation part: what do you mean by third-party specialized solutions?

Furthermore, in most evaluation tasks (project endline and midterm evaluations), guided approaches (where lists of topics are provided) are used, and in some cases organizations demand specific criteria; for example, the OECD/DAC evaluation criteria provide a framework for assessing the effectiveness, efficiency, relevance, coherence, and sustainability of development assistance projects and programs. These criteria are widely used in development project evaluation and are considered comprehensive, covering all aspects of development interventions.

So in evaluation tasks where the client demands the OECD criteria, zero-shot approaches (where coders annotate text without using a pre-existing topic list) look less relevant for qualitative data analysis.

However, regardless of the above limitations, I argue that zero-shot approaches may have applications in baseline assessments for program evaluation and in basic exploratory research on new and emerging topics.

Regarding the size of the documents to be analyzed: in most program evaluation tasks, some organizations demand evaluation reports of at least 50 pages, so this might be a limitation of SLMs in practical evaluation tasks.

Finally, I appreciate Gerard Atkinson's research work, as it is very innovative.

The author could also suggest further research on the language models (LLM, SLM), for example using a quasi-experimental research design, or research based on the limitations of his study.

    Regards,

    Tatek Deneke
