On Testing AI Capabilities
Evaluations, also known as “evals,” are a critical but nascent part of society’s approach to AI.
Image caption: Better testing of AI capabilities will be critical for national security and defense. (Source: Wikimedia.)
In November 2023, tech companies including OpenAI, Google DeepMind, Microsoft, and Meta signed voluntary commitments to have their latest models reviewed by the AI Safety Institute (AISI), a UK government agency. This was a world first. A key part of this agreement was the development of AI evaluations, known in the industry as evals: tests of an AI model’s intelligence, coding ability, autonomy, and other capabilities and traits.
This article introduces the current state of the art in AI evaluation and sets out some considerations for what evaluations should look like. I expect all major governments will have approaches to AI model evaluation in place in the next few years; indeed, the current US administration has already made this a priority with its October 2023 AI Executive Order.
What Are Evals?
AI evaluations can be thought of as tests. As with human tests such as the SAT or Myers-Briggs, many different facets of an AI model can be tested:
IQ: How intelligent is the model?
Example test: Ask the AI to solve some well-known sets of puzzles.
A real-world example is GPQA, a set of 448 extremely difficult (PhD-level) multiple-choice questions written by domain experts in biology, physics, and chemistry; a minimal sketch of how such a benchmark is scored appears after this list.
Capabilities: How capable is the model at performing various tasks (e.g., coding) autonomously?
Example test: Ask an internet-connected AI model to hack a user’s email account or set up its own Bitcoin wallet. How far can it get?
Real-world examples include coding benchmarks such as HumanEval, among many others.
Uplift: How much of an extra boost, or uplift, in performance does a person get from using the model for a certain task?
Example test: Assign a group of people the task of synthesizing a particular chemical and compare the performance of those who had access to the AI model with the performance of those who only had access to a normal internet search engine.
Bias: How biased is the model, in either direction? What “beliefs,” if any, does it seem to have?
Example test: There are multiple possibilities, depending on whether you are testing for political bias, gender bias, racial bias, and so forth.
Deceptiveness: How inclined is the model to lie, cheat, or deceive? Does the model exhibit any goals or drives? How easy is it to convince the model to do nefarious things?
Example test: Can the model be convinced to deceive a human in an “in-the-wild” user test? (GPT-4, for instance, succeeded at this.)
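To make the mechanics concrete, here is a minimal sketch of how a multiple-choice benchmark in the style of GPQA is typically scored: each question is formatted into a prompt, the model’s answer letter is extracted, and accuracy is reported. This is an illustrative sketch, not any lab’s actual harness; the question shown and the `ask_model` callable are placeholders for the real dataset and whichever model API the evaluator uses.

```python
# Minimal sketch of scoring a GPQA-style multiple-choice benchmark.
# `ask_model` is a placeholder for whichever model API the evaluator uses;
# real harnesses also handle choice shuffling, retries, and logging.
from typing import Callable

QUESTIONS = [
    {
        "question": "Which particle mediates the electromagnetic force?",
        "choices": {"A": "Gluon", "B": "Photon", "C": "W boson", "D": "Higgs boson"},
        "answer": "B",
    },
    # ... a real benchmark has hundreds of expert-written questions ...
]

def format_prompt(item: dict) -> str:
    choices = "\n".join(f"{label}. {text}" for label, text in item["choices"].items())
    return f"{item['question']}\n{choices}\nAnswer with the letter of the correct choice."

def score(ask_model: Callable[[str], str]) -> float:
    correct = 0
    for item in QUESTIONS:
        reply = ask_model(format_prompt(item)).strip().upper()
        # Treat the first A-D letter in the reply as the model's answer.
        predicted = next((ch for ch in reply if ch in item["choices"]), None)
        correct += predicted == item["answer"]
    return correct / len(QUESTIONS)

if __name__ == "__main__":
    # Trivial stand-in "model" that always answers B, just to show the interface.
    print(f"Accuracy: {score(lambda prompt: 'B'):.0%}")
```

Capability and uplift evaluations are harder to automate in this way, since they involve open-ended tasks or human studies rather than a fixed answer key, which is partly why they remain labor-intensive.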
When models are released by AI labs, they typically include a “model card” showing how well the model performed on a battery of tests against other models. Most of these metrics, however, are subject to Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” Companies train their models to score highly on the test, with the result that models look impressive on paper but aren’t flexibly intelligent in practice. Hence, most industry professionals still believe there is no substitute for spending some time using the model to gain a fuller picture of its capabilities.
Although nascent, evaluations are already used by private companies deploying AI models, because those companies need to make sure their models behave predictably. In an infamous recent example of deployment gone wrong, a Chevrolet dealership deployed an instance of ChatGPT for customer support; within hours, users had convinced it to sell them a car for $1. As it turned out, these deals weren’t legally binding (luckily for the dealership, which took the bot down). Better pre-deployment evaluations could have caught this.
In a national security context, automated evaluation of foreign AI capabilities acts as a kind of early warning system. Governments will need to rapidly understand how a new AI system developed by a rival state might enhance that rival’s capabilities. Is it powerful enough to automate a complex cybersecurity attack chain? Can it generate fake social media profiles, launch spear-phishing attacks, and so on? If not, how far might it get in the future?
Consumers, too, will place greater importance on evaluations, perhaps in the form of certifications, just as some consumers prefer organic vegetables or fair-trade coffee. One can even imagine consumers choosing AI models with particular political slants: we can already see early signs of these AI culture wars in the recent backlash against Google Gemini, whose image generator depicted people of various races in historically inaccurate and, to many users, offensive ways.
What Should We Do?
Evaluations will inform the common framework and language with which we compare AI models, but there will not be one “correct answer,” so a plurality of evaluation suites and approaches will serve us best. Since neutral evaluations are a public good, the government should play a significant role in growing the field.
The best analogy is cybersecurity: credible third-party assessors need to be formed, and funded via grants, to provide these evaluation suites. At least some of these organizations should be based outside Silicon Valley, to ensure diversity and critical distance: so far, AI companies have tended to take the lead in evaluating themselves, or to work closely with third parties to do so. The US House Committee on Science, Space, and Technology’s recent letter expressing concern about this issue is a good read in this regard. [1]
Some AI safety proponents have raised the question of whether these evaluations should be voluntary, as NIST’s Federal Information Processing Standards (FIPS), issued under the Federal Information Security Modernization Act (FISMA), are for the private sector, or legally imposed, as food safety regulations are. Legally imposing or requiring any evaluations would be a mistake this early: rapid algorithmic progress and Moore’s Law mean that more capable models become available every few months. Because we are still learning how best to measure “intelligence,” baking in a fixed set of evaluations too soon would hamper American competitiveness for no real gain. The best thing we can do at this stage is try several approaches and learn.
Evaluations also involve an intractable trade-off between transparency and effectiveness: if you publish your evaluation on the internet, it ends up in models’ training data, and it becomes easy for a model to learn how to “pass the test.” The most effective tests are therefore going to be secret. This is in tension with the goal of transparency; a report from the Financial Times, for example, describes early struggles with the UK evaluations, with companies chafing at a process they found slow and opaque. National security and cybersecurity evaluations might be kept private for strategic and safety reasons, but other, more consumer-facing evaluations, such as those related to bias, might be made more open so that consumers can be assured of what they’re getting.
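One way to see why publication matters is the contamination check that evaluators run before trusting a score: if a benchmark’s questions already appear in a model’s training data, the score largely measures memorization rather than capability. The sketch below is illustrative and assumes a small in-memory corpus; real checks use large-scale n-gram indexes over the full training set.

```python
# Rough sketch of a benchmark-contamination check: flag test questions whose
# word n-grams already appear in a training corpus. Real checks run over
# trillions of tokens with large-scale indexes; `documents` here is a stand-in.

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_questions: list, documents: list, n: int = 8) -> float:
    corpus_ngrams = set()
    for doc in documents:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for q in test_questions if ngrams(q, n) & corpus_ngrams)
    return flagged / len(test_questions) if test_questions else 0.0
```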
Speed in developing these standards also matters given how close we are to the US presidential election, with deepfakes and AI-generated bots already a live issue. The UK AISI’s notice on February 9, 2024 shows one example where researchers “asked an LLM to generate a synthetic social media persona for a simulated social network which could hypothetically be used to spread disinformation in a real-world setting. The model was able to produce a highly convincing persona, which could be scaled up to thousands of personas with minimal time and effort.” Identifying these attack vectors and developing defensive infrastructure will be a critical part of societal resilience.
In the future, as evaluating model outputs becomes more challenging, powerful AIs may also be able to assist in assessing other powerful AIs, a concept known as scalable oversight. This would help in cases where, for example, an AI produces a mathematical proof too complicated for a human evaluator to follow. Developing and testing such AI evaluators is another exciting area of work.
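As a sketch of the pattern, one simple form of scalable oversight is to use a second model as a judge that grades a candidate model’s output against a rubric. The `judge_model` callable and the rubric text below are illustrative assumptions, not any particular lab’s API or method.

```python
# Illustrative "model-as-judge" sketch: a second model grades a candidate
# model's answer against a rubric. `judge_model` is a placeholder callable
# for whatever judge API is used; real setups add cross-checks and audits.
from typing import Callable

RUBRIC = (
    "You are reviewing another AI's answer.\n"
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Is the reasoning valid and the conclusion correct? "
    "Reply PASS or FAIL, then one sentence of justification."
)

def judge(judge_model: Callable[[str], str], question: str, answer: str) -> bool:
    verdict = judge_model(RUBRIC.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```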
In an international context, the cryptographic idea of zero-knowledge proofs, in which one party can mathematically prove a statement to another without revealing any sensitive information, may be relevant for AI arms control. Imagine a set of nation-states agreeing to a standard whereby a given AI model is cryptographically guaranteed not to be malicious or harmful in particular ways, without having to give away any classified details about the model itself.
Overall, the earlier we start coalescing around some initial standards, the better. US leadership and speed in getting AI evaluations in place are critical, and setting the global standard for AI models should be an explicit policy goal.