Join us on March 12th: “Tests of Large and Small Language Models on common evaluation tasks”, a webinar with Gerard Atkinson
On March 12th, the Sandbox Working Group at the Natural Language Processing Community of Practice (NLP-CoP) is hosting a webinar with Gerard Atkinson on the performance of machine learning models on tasks including topic classification and rubric-driven analysis.
The last two years have seen the rise of large language models, with names like ChatGPT and Claude becoming part of everyday conversation. The evaluation community was not immune to this trend, and papers were published examining just how well machine learning approaches could perform against human evaluators on tasks such as qualitative analysis and evaluative judgement. The answer? Not as well as you might think (though you could get wrong answers faster than ever!). Since then, newer and more sophisticated tools have become available, including reasoning models that improve performance on problem-solving tasks. There have also been innovations in hybrid models, which combine the best features of different methods while minimising their weaknesses. Coupled with this is the growing field of standalone models that can run on a desktop computer yet produce responses that match or exceed those of cloud-based models, as well as models that can draw on rich contextual information (such as documentation or full interview transcripts) to make decisions.
To explore these developments and more, the Sandbox Working Group is inviting Gerard Atkinson for a webinar on the latest findings on how machine learning models perform on tasks including topic classification and rubric-driven analysis.
During this webinar, Gerard will present his findings in this space, followed by a Q&A and a discussion of emerging applications and issues, such as ethical considerations and managing stakeholder expectations around the use of AI in evaluation. You’re welcome to join us!
About our speaker
Gerard Atkinson is a Director with ARTD Consultants, an Australian policy and program evaluation firm. He is an experienced evaluator and project manager who has delivered work for clients in Australia, Europe, and the United States. Gerard has worked with AI and machine-learning approaches in the business, government and non-profit sectors for over 10 years and has delivered research on the applications of AI to evaluative approaches including qualitative data analysis, natural language processing, and rubric synthesis and application. This work with AI and rubrics intersects with his research on applying rubric approaches to program and policy evaluation, including as a method for characterising research impact across a range of projects.
He is a current member of the Australian Evaluation Society (AES), a Qualified Professional Researcher of The Research Society (TRS), and a Graduate of the Australian Institute of Company Directors (AICD). He is also a non-executive director of the Social Impact Measurement Network Australia (SIMNA). Gerard holds a Graduate Certificate in Autism from the University of Wollongong (2021), a Master of Business Administration and a Master of Arts (Arts Management) from Southern Methodist University (2015), and a Bachelor of Science (Hons) from the Australian National University (2006).
Register to join
Register here to join our webinar on March 12 at 11 am UTC. We’re looking forward to seeing you there!
Join the NLP-CoP: You’re welcome to become a member of the NLP-CoP, convened by the MERL Tech Initiative. As well as gaining access to resources and events at the intersection of MERL and NLP, you can join a wide range of working groups.