Universal and Transferable Adversarial Attacks on Aligned Language Models
Large language models (LLMs) are typically trained on massive text corpora scraped from the internet, which are known to contain a substantial amount of objectionable content. In an attempt to make AI systems better aligned with human values, these models are fine-tuned to avoid generating such content.