@josherich
Created April 29, 2025 14:26
Automated Capability Discovery

Introducing Automated Capability Discovery (ACD)

Automated Capability Discovery (ACD) is a tool that automatically identifies surprising new capabilities and failure modes in foundation models. It does this through "self-exploration," in which models explore their own abilities.

ACD is led by @cong_ml and @shengranhu.

Capability Reporting

ACD generates a concise "Capability Report" that lists the discovered capabilities and failure modes. The report makes results quick to inspect and easy to share, and it lets developers flag issues before deployment.
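
As a rough illustration of what such a report could look like, the sketch below groups per-task scores into capabilities and failure modes. The function name `build_report`, the 0.5 threshold, and the example scores are all assumptions made for this example; the actual report format used by ACD is not described here.

```python
def build_report(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Split tasks into capabilities and failure modes by score.

    `scores` maps a task name to a success rate in [0, 1]. The threshold
    is an arbitrary choice for this sketch, not a value taken from ACD.
    """
    capabilities = sorted(name for name, s in scores.items() if s >= threshold)
    failures = sorted(name for name, s in scores.items() if s < threshold)

    lines = ["Capability Report", "Capabilities:"]
    lines += [f"  - {name}: {scores[name]:.0%}" for name in capabilities]
    lines.append("Failure modes:")
    lines += [f"  - {name}: {scores[name]:.0%}" for name in failures]
    return "\n".join(lines)


# Made-up scores for illustration only; these are not results from ACD.
example_scores = {"reverse_string": 1.0, "count_vowels": 0.25, "sort_numbers": 0.75}
report = build_report(example_scores)
print(report)
```

A plain-text summary like this is easy to skim and paste into a review, which is the kind of quick inspection the report is meant to support.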

Self-Exploration Mechanism

ACD mirrors the exploration the community carries out on new models, using two roles: one model acts as the scientist and the other as the subject. The scientist continually generates tasks, written in code with automated scoring, that probe the subject for new capabilities or weaknesses. These tasks range from simple string games to complex puzzles.
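
The loop described above can be sketched in a few lines. This is a minimal toy version, not ACD's implementation: the `Task` class, `toy_scientist`, `toy_subject`, and `explore` are hypothetical names, and both "models" are hard-coded stubs standing in for real foundation-model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """A task written in code, with its own automated scorer (hypothetical shape)."""
    name: str
    instructions: str
    score: Callable[[str], float]  # maps the subject's answer to a value in [0, 1]


def toy_scientist(seen: list[str]) -> Optional[Task]:
    """Stub scientist: proposes a task that is not already in the archive."""
    candidates = [
        Task("reverse_string", "Reverse the string 'abc'.",
             lambda ans: 1.0 if ans.strip() == "cba" else 0.0),
        Task("count_vowels", "How many vowels are in 'banana'? Answer with a digit.",
             lambda ans: 1.0 if ans.strip() == "3" else 0.0),
    ]
    for task in candidates:
        if task.name not in seen:
            return task
    return None  # nothing new to propose


def toy_subject(prompt: str) -> str:
    """Stub subject: solves the reversal task but miscounts the vowels."""
    return "cba" if "Reverse" in prompt else "2"


def explore(scientist, subject, steps: int) -> dict[str, float]:
    """Run the scientist/subject loop and archive each task's score."""
    archive: dict[str, float] = {}
    for _ in range(steps):
        task = scientist(list(archive))
        if task is None:
            break
        answer = subject(task.instructions)
        archive[task.name] = task.score(answer)
    return archive


results = explore(toy_scientist, toy_subject, steps=5)
print(results)
```

Passing the archive of already-seen task names back to the scientist is what pushes it toward new tasks rather than repeats; in this toy version that check is a simple name comparison.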

Pre-Deployment Insights

Before deployment, ACD helps developers identify areas where a model consistently fails or behaves unexpectedly. This information is crucial for building safer and more robust AI systems. For examples of the insights discovered by GPT-4o, see the first tweet in the thread.

Model Pairing and Insights

We have also investigated various scientist/subject pairings. Using Llama3-8B as the subject instead of GPT-4o revealed different failure modes and emerging skills, which gives insight into each model's capabilities.

Creative Probes

Different scientist models can come up with highly creative probes. For instance, Claude 3.5 Sonnet discovered that GPT-4o can successfully design alien communication protocols.

Human Validation

Human surveys confirm that most of the tasks generated by ACD are clear and valid. The automated scoring also closely matches human judgment on all but the most complex tasks.
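
One simple way to quantify how well automated scoring matches human judgment is a raw agreement rate over pass/fail labels. The sketch below assumes that framing; the function `agreement_rate` and the labels are invented for illustration and are not the survey methodology or data from ACD.

```python
def agreement_rate(auto: list[bool], human: list[bool]) -> float:
    """Fraction of tasks where the automated verdict equals the human verdict."""
    if len(auto) != len(human):
        raise ValueError("label lists must have the same length")
    if not auto:
        raise ValueError("need at least one labelled task")
    matches = sum(a == h for a, h in zip(auto, human))
    return matches / len(auto)


# Invented pass/fail labels for four tasks; not data from the ACD surveys.
auto_labels = [True, True, False, True]
human_labels = [True, False, False, True]
rate = agreement_rate(auto_labels, human_labels)
print(f"agreement: {rate:.0%}")
```

Computing this rate separately per difficulty level would show whether disagreement concentrates in the most complex tasks.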


ACD represents a significant advancement in the field of AI, enabling more effective exploration and understanding of model capabilities and limitations.

Generated by tweet-to-markdown
