Automated Capability Discovery (ACD) is an innovative tool designed to automatically identify surprising new capabilities and failure modes in foundation models. This is achieved through a process known as "self-exploration," wherein models explore their own abilities.
Leadership: ACD is spearheaded by @cong_ml and @shengranhu.
ACD generates a concise "Capability Report" that outlines the discovered capabilities and failure modes. This report facilitates quick inspection and easier dissemination of results, as well as the ability to flag issues prior to deployment.
ACD mimics the exploration conducted within the community by utilizing a dual-model approach: one model acts as the scientist while the other serves as the subject. ACD continuously generates tasks—written in code with automated scoring—that probe for new capabilities or weaknesses. These tasks range from simple string games to complex puzzles.
Before deployment, ACD assists developers in identifying areas where a model consistently fails or behaves unexpectedly. This information is crucial for building safer and more robust AI systems. For examples of the insights discovered by GPT-4o, please refer to the first tweet in this thread.
We have also investigated various scientist/subject pairings. Results obtained with Llama3-8B as the subject, in contrast to GPT-4o, have revealed unique failure modes and emerging skills, which provide intriguing insights into each model’s capabilities.
Different scientist models have shown the ability to uncover wildly creative and out-of-the-box probes. For instance, Claude Sonnet 3.5 discovered that GPT-4o can successfully design alien communication protocols.
Human surveys have confirmed that a majority of the tasks generated by ACD are clear and valid. Furthermore, the automated scoring system closely aligns with human judgment for all tasks, except for the most complex ones.
ACD represents a significant advancement in the field of AI, enabling more effective exploration and understanding of model capabilities and limitations.
Generated by tweet-to-markdown