This document outlines the findings of recent research focused on understanding the amount of information available in open-weights models, specifically the DeepSeek R1 weights, which amount to 1.2 TB. The central question of this research is: What can we learn from all those bits?
Our approach is a novel method that reverses the fine-tuning of large language models (LLMs) to recover training data from the weights alone.
To implement this method, two model checkpoints are required: the initial model and a fine-tuned version. This is feasible in practice, as open-weights releases often provide both checkpoints.
Rather than generating data from weights in a one-shot manner, we adopt a more sophisticated approach. We select data from the web using gradients that align with the model differences.
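The selection criterion above can be sketched as follows. This is a minimal illustration with synthetic vectors, not the paper's implementation: it assumes per-example gradients have already been computed and flattened into rows of `grads`, and the function name `select_by_alignment` and the parameter `k` are ours, chosen for clarity.

```python
import numpy as np

def select_by_alignment(grads, theta_init, theta_ft, k):
    """Score each web candidate by how well its gradient aligns with the
    direction fine-tuning moved the weights, and return the top-k indices."""
    delta = theta_ft - theta_init
    # Fine-tuning *descends* the loss, so an example that contributed to the
    # update should have its negative gradient pointing along delta;
    # we therefore score by -<grad, delta>.
    scores = -grads @ delta
    return np.argsort(scores)[::-1][:k]

# Toy usage with random weights and gradients.
rng = np.random.default_rng(0)
theta0 = rng.normal(size=8)
theta1 = theta0 + 0.1 * rng.normal(size=8)
grads = rng.normal(size=(100, 8))
idx = select_by_alignment(grads, theta0, theta1, k=10)
```

In practice the gradients are per-example LLM gradients rather than dense toy vectors, which is exactly why the efficiency tricks described next are needed.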
The algorithm we developed is intricate, primarily due to the challenges associated with computing per-example gradients at scale. To enhance efficiency, we implemented the following improvements:
- Per-example gradient computation vectorized with vmap
- Restricting to last-layer gradients (which are still high-dimensional for LLMs)
- Randomly projecting gradients down to a smaller dimension
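The last two tricks can be illustrated in a few lines. This sketch substitutes a plain softmax classifier for an LLM, which is our simplification: for such a last linear layer, the per-example cross-entropy gradient has the closed form (p - y) ⊗ x, so no vmap-based backprop is even needed, and a random Gaussian projection then compresses the gradients while roughly preserving inner products. Function names are illustrative.

```python
import numpy as np

def last_layer_grads(X, Y, W):
    """Per-example cross-entropy gradients w.r.t. a last linear layer W.
    X: (n, d) features, Y: (n,) integer labels, W: (c, d) weights."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(Y)), Y] -= 1.0                # p - y
    # grad_i = (p_i - y_i) outer x_i, flattened to one (c*d,) row per example
    return np.einsum("nc,nd->ncd", P, X).reshape(len(X), -1)

def project(grads, out_dim, seed=0):
    """Random Gaussian projection to a smaller dimension (a JL-style sketch)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(grads.shape[1], out_dim)) / np.sqrt(out_dim)
    return grads @ R

# Toy usage: 32 examples, 16 features, 4 classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 16))
Y = rng.integers(0, 4, size=32)
W = rng.normal(size=(4, 16))
G = last_layer_grads(X, Y, W)   # (32, 64) per-example gradients
Gp = project(G, out_dim=8)      # (32, 8) compressed gradients
```

For a full model one would instead obtain per-example gradients with something like `jax.vmap(jax.grad(loss))` or `torch.func.vmap`, but the projection step is the same.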
We evaluated our method, called SELECT, against several baselines, including likelihood-based data selection and top-k per-class selection. Remarkably, it matches these baselines while relying solely on model weights and web data.
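For comparison, the likelihood baseline can be sketched as a simple ratio test: rank candidates by how much more likely the fine-tuned model finds them than the base model. The function name and the assumption that per-example log-likelihoods are precomputed are ours.

```python
import numpy as np

def likelihood_ratio_select(logp_ft, logp_base, k):
    """Rank candidates by log p_ft(x) - log p_base(x) and keep the top k.
    logp_ft, logp_base: (n,) precomputed per-example log-likelihoods."""
    scores = logp_ft - logp_base
    return np.argsort(scores)[::-1][:k]
```

Unlike the gradient-alignment score, this baseline only uses the models as black-box scorers, with no access to the weight difference itself.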
Our research revealed several noteworthy characteristics of the SELECT method:
- It achieves good performance with minimal data.
- It successfully recovers embeddings that resemble the original dataset.
- When a portion of the original data is accessible, the method tends to prioritize that data for selection.
An interesting observation from our research is that the AdamW optimization algorithm appears to "obfuscate" weights more effectively than Adam or standard SGD. This phenomenon may be attributed to the nonlinear effects of weight decay during optimization, which reduces the amount of information available in the weights.
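The relevant difference between the two optimizers is small but structural. In the sketch below (bias correction omitted for brevity, hyperparameter names ours), Adam folds the L2 penalty into the gradient, so the decay is rescaled by the adaptive denominator; AdamW applies decoupled decay directly to the weights, bypassing that normalization. This decoupled term is the nonlinearity in the trajectory alluded to above.

```python
import numpy as np

def adam_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One Adam step with L2 regularization folded into the gradient."""
    g = g + wd * w                      # decay passes through the moments
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return w - lr * m / (np.sqrt(v) + eps), m, v

def adamw_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: decoupled weight decay applied directly to w."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return w - lr * (m / (np.sqrt(v) + eps) + wd * w), m, v

# Starting from identical state, the two updates diverge.
w = np.ones(4)
g = 0.5 * np.ones(4)
m = np.zeros(4)
v = np.zeros(4)
w_adam, _, _ = adam_step(w, g, m, v)
w_adamw, _, _ = adamw_step(w, g, m, v)
```

For positive weights and nonzero decay, the AdamW step shrinks the weights strictly more than the Adam step, since its decay term is not divided by the (large) adaptive denominator.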
I am thoroughly convinced that there exists a significantly greater amount of information within model weights than we currently recognize. Unfortunately, we lack the appropriate tools to extract this information effectively. In many respects, this information is highly compressed, and we are still in the early stages of developing a suitable decompression mechanism.
This research represents a modest yet crucial first step in this domain.
For those interested in this area, there is concurrent research exploring the related problem of inferring training distributions from LLMs.
Acknowledgments: Special thanks to @chhaviyadav_ and @kamalikac, among others, for their contributions to this field.