This is not just an NP-complete problem. It is worse: the absolute minimum description is, in general, uncomputable.
But between an ordinary ZIP file and God's unreachable archiver lies a huge territory that LLMs, program synthesis, and computational agents are already beginning to enter.
This article uses three clear labels.
- FACT — a mathematical theorem, an accepted physical theory, a published result, or a documented technological capability.
- INTERPRETATION — a meaningful engineering connection between facts. It may be useful, but it is not a separate proven theorem.
- SPECULATION — a philosophical or artistic model that does not yet have an accepted experimental test.
Without these labels, it is easy to make an unjustified jump:
“The brain compresses experience” → “consciousness creates matter” → “the Universe is a thought in the mind of God.”
The first sentence may be a useful scientific analogy. The last two are already metaphysics. They may be beautiful, but beauty is not a measuring instrument.
FACT.
Before Shannon, communication was studied as many separate engineering problems: telegraphy, telephones, noise, and signal coding. In his 1948 paper A Mathematical Theory of Communication, Shannon exposed the common structure behind them.
He proposed separating:
- the source, which chooses a message;
- the encoder;
- the channel, which may contain noise;
- the decoder;
- the receiver.
The meaning of the message is not necessary for the channel. A cable does not need to understand a declaration of love, a bank payment, or source code. It only needs to distinguish one possible signal sequence from another.
The entropy of a discrete source is defined as:
H(X) = -Σ_x p(x) · log_2 p(x)
Where:
- X is a random variable: the next message or symbol;
- x is one possible result;
- p(x) is the probability of that result;
- log_2 is the logarithm with base 2;
- H(X) is the average amount of information in bits.
If the source chooses one of N equally likely options, the formula becomes:
H(X) = log_2 N
Two equally likely options require one bit. Eight options require three bits.
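The two cases above can be checked numerically. The following is a minimal sketch; the function names are illustrative and do not come from Shannon's paper.

```typescript
// Shannon entropy in bits: the sum of -p * log2(p) over all outcomes.
function shannonEntropyBits(probabilities: number[]): number {
  let bits = 0;
  for (const p of probabilities) {
    if (p > 0) {
      bits -= p * Math.log2(p); // outcomes with zero probability contribute nothing
    }
  }
  return bits;
}

// N equally likely options: each one has probability 1 / N.
function uniformDistribution(n: number): number[] {
  return new Array<number>(n).fill(1 / n);
}

console.log(shannonEntropyBits(uniformDistribution(2))); // two options
console.log(shannonEntropyBits(uniformDistribution(8))); // eight options
console.log(shannonEntropyBits([0.9, 0.1]));             // a biased source needs less than one bit
```

The biased source shows why probabilities matter: it still has two options, but its entropy is about 0.47 bits rather than one.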
Shannon did not prove that “every file contains an objective amount of information.” His claim was more precise:
For a given probabilistic source, no uniquely decodable code can consistently use fewer bits on average than the entropy of that source.
The key words are: given source, probabilities, and on average.
Source: Claude Shannon, A Mathematical Theory of Communication (1948).
FACT.
A single string does not tell us what process created it.
Consider this sequence:
314159265358979323846264338327950288419716939937510...
It could be:
- a random sequence of digits;
- the beginning of the number π;
- part of an encrypted message;
- the result of damaged memory;
- a fragment of a table copied by hand.
The file is the same. The sources are different. Therefore, the probabilities used to calculate entropy are also different.
If a model treats the digits as independent and equally likely, each decimal digit requires:
log_2 10 ≈ 3.32 bits
But if the decoder already knows, “these are the first N digits of π,” the message can be replaced by a program that computes π and the number N.
Shannon entropy has not been broken. We changed the source model.
This is the first important blow to the naive picture:
Data does not carry a label saying how much information it truly contains. The amount depends on the set of possible messages, their probabilities, and what the decoder already knows.
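The effect of changing the source model can be put into numbers. This is a rough sketch under stated assumptions: `programBits` is a hypothetical size for a π-generating program that the decoder can run, chosen only for illustration.

```typescript
// Cost in bits of N decimal digits under a model that treats every digit
// as independent and uniform: N * log2(10).
function uniformDigitModelBits(digitCount: number): number {
  return digitCount * Math.log2(10);
}

// Cost when the message is "run the pi program for N digits":
// a fixed program size plus enough bits to write down N.
// programBits is a hypothetical figure, not a measured one.
function knownGeneratorBits(digitCount: number, programBits: number): number {
  const bitsForN = Math.ceil(Math.log2(digitCount + 1));
  return programBits + bitsForN;
}

const assumedProgramBits = 8000; // assumption: a generator of about 1 KB
for (const n of [100, 10_000, 1_000_000]) {
  console.log(n, uniformDigitModelBits(n), knownGeneratorBits(n, assumedProgramBits));
}
```

For short prefixes the generator costs more than the raw digits. The second model wins only once N is large enough to repay the program.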
FACT.
Because once the model is fixed, the result becomes strict.
Suppose the real source follows distribution p, while the archiver assumes distribution q. Then the ideal code length for a particular result x is approximately:
l_q(x) ≈ -log_2 q(x)
Where:
- q(x) is the probability assigned by the archiver;
- l_q(x) is the number of bits the archiver spends to encode result x.
The average coding cost is:
E_p[l_q(X)] ≈ H(p) + D_KL(p || q)
Where:
- p is the real distribution of the data;
- q is the archiver's model;
- H(p) is the entropy of the real source;
- D_KL(p || q) is the extra cost caused by model error;
- E means the average over messages actually produced by the source.
The Kullback–Leibler divergence D_KL is never negative. A bad model always pays a penalty.
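The split of the average cost into entropy plus penalty can be verified on a small example. This is a minimal sketch: the distributions are made up for illustration, and the code assumes q(x) > 0 wherever p(x) > 0.

```typescript
// Entropy H(p) in bits.
function entropyBits(p: number[]): number {
  return p.reduce((sum, pi) => (pi > 0 ? sum - pi * Math.log2(pi) : sum), 0);
}

// Average cost in bits when data from p is coded with lengths -log2 q(x).
function crossEntropyBits(p: number[], q: number[]): number {
  return p.reduce((sum, pi, i) => (pi > 0 ? sum - pi * Math.log2(q[i]) : sum), 0);
}

// Kullback-Leibler divergence D_KL(p || q) in bits.
function klDivergenceBits(p: number[], q: number[]): number {
  return p.reduce((sum, pi, i) => (pi > 0 ? sum + pi * Math.log2(pi / q[i]) : sum), 0);
}

const realSource = [0.5, 0.25, 0.25];     // illustrative p
const archiverModel = [1 / 3, 1 / 3, 1 / 3]; // illustrative q

console.log(entropyBits(realSource));                     // best achievable average
console.log(crossEntropyBits(realSource, archiverModel)); // what the archiver actually pays
console.log(klDivergenceBits(realSource, archiverModel)); // the penalty for the wrong model
```

Here the archiver pays about 1.585 bits per symbol instead of 1.5. The difference is exactly the divergence.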
This is the real role of arithmetic coding or ANS:
They do not discover the meaning of the text or invent a short description. They convert already calculated probabilities into a bit sequence with almost no extra loss.
More precisely:
- an LLM or another model predicts probabilities;
- an entropy coder records the observed symbols at a cost close to -log2(q);
- the improvement comes from the model, not from any magic inside the arithmetic coder.
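This division of labor can be sketched by holding the coder fixed and swapping the model. The code below does not implement arithmetic coding or ANS. It only adds up the ideal cost -log2(q) that such a coder would approach. The `SymbolModel` interface and both models are illustrative assumptions.

```typescript
// A model reports q(symbol) before the symbol is seen, then learns from it.
type SymbolModel = {
  probability(symbol: string): number;
  update(symbol: string): void;
};

// Ideal total cost in bits: the sum of -log2 q(symbol) over the text.
// A symbol outside the model's alphabet would get probability 0 and infinite cost.
function idealCodeLengthBits(text: string, model: SymbolModel): number {
  let bits = 0;
  for (const symbol of text) {
    bits += -Math.log2(model.probability(symbol));
    model.update(symbol);
  }
  return bits;
}

// Model 1: every symbol of the alphabet is equally likely, forever.
function uniformModel(alphabet: string[]): SymbolModel {
  return {
    probability: () => 1 / alphabet.length,
    update: () => {},
  };
}

// Model 2: adaptive counts with add-one smoothing.
function adaptiveCountModel(alphabet: string[]): SymbolModel {
  const counts = new Map<string, number>();
  for (const s of alphabet) counts.set(s, 1);
  let total = alphabet.length;
  return {
    probability: (symbol) => (counts.get(symbol) ?? 0) / total,
    update: (symbol) => {
      counts.set(symbol, (counts.get(symbol) ?? 0) + 1);
      total += 1;
    },
  };
}

const sample = "aaaaaaaaab";
console.log(idealCodeLengthBits(sample, uniformModel(["a", "b"])));
console.log(idealCodeLengthBits(sample, adaptiveCountModel(["a", "b"])));
```

The same accounting loop serves both models. The adaptive model pays fewer bits only because it assigns higher probability to what actually occurs.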
FACT.
In his 1951 paper Prediction and Entropy of Printed English, Shannon studied how well the next letter of English text can be predicted from earlier letters.
After q, the next letter is almost always u.
After the, a space or a noun is more likely than a random sequence of symbols.
After several sentences, the reader already knows the topic, style, and likely continuations.
The conditional entropy of the next symbol is:
H(X_n | X_1 ... X_(n-1))
Where:
- X_n is the next symbol;
- X_1 ... X_(n-1) is the known context;
- the vertical bar means “given that the context is known.”
The more useful context we have, the less uncertainty remains.
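The drop in uncertainty can be seen even with one symbol of context. The sketch below estimates entropies from raw counts in a sample string. This is a simple plug-in estimate for illustration, not Shannon's experimental procedure.

```typescript
// Entropy of single symbols, ignoring all context.
function empiricalEntropyBits(text: string): number {
  const chars = [...text];
  const counts = new Map<string, number>();
  for (const ch of chars) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let bits = 0;
  for (const count of counts.values()) {
    const p = count / chars.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Entropy of the next symbol given the previous one, estimated from adjacent pairs.
function empiricalConditionalEntropyBits(text: string): number {
  const chars = [...text];
  const followers = new Map<string, Map<string, number>>();
  for (let i = 1; i < chars.length; i++) {
    const row = followers.get(chars[i - 1]) ?? new Map<string, number>();
    row.set(chars[i], (row.get(chars[i]) ?? 0) + 1);
    followers.set(chars[i - 1], row);
  }
  const totalPairs = chars.length - 1;
  let bits = 0;
  for (const row of followers.values()) {
    let rowTotal = 0;
    for (const count of row.values()) rowTotal += count;
    for (const count of row.values()) {
      // weight by the joint frequency of the pair, charge -log2 p(next | previous)
      bits -= (count / totalPairs) * Math.log2(count / rowTotal);
    }
  }
  return bits;
}

const alternating = "abababab";
console.log(empiricalEntropyBits(alternating));            // one bit without context
console.log(empiricalConditionalEntropyBits(alternating)); // nothing left to guess with context
```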
Shannon used successive statistical approximations and experiments in which people guessed how a text would continue. He obtained low estimates for the entropy of English, roughly 0.6 to 1.3 bits per letter when about 100 preceding letters are known. This was an estimate for particular conditions and types of text, not a universal constant of the English language.
Source: Claude Shannon, Prediction and Entropy of Printed English (1951).
FACT + INTERPRETATION.
An autoregressive language model receives a sequence of tokens and estimates the probabilities of the next token:
P(x_(n+1) | x_1 ... x_n)
Where:
- x_1 ... x_n are the tokens already read;
- x_(n+1) is the next token;
- P is the probability distribution over possible continuations.
After a token is chosen, the process repeats.
A modern model considers much more than neighboring letters. Its parameters statistically capture grammar, program structures, text genres, connections between concepts, common facts, and patterns for solving problems.
Shannon did not foresee the Transformer, training on enormous datasets, or agent tools. But he formulated the problem that such models now solve at scale:
Predict the continuation as well as the structure of the previous context allows.
Sources:
- Vaswani et al., Attention Is All You Need (2017)
- Brown et al., Language Models are Few-Shot Learners (2020)
FACT.
Shannon does not answer this question:
What is the length of the shortest possible explanation of this particular object?
The Kolmogorov complexity of a string x, relative to a universal machine U, is:
K_U(x) = min { |p| : U(p) = x }
Where:
- U is a chosen universal computing machine or language;
- p is a program;
- U(p) = x means that the program prints string x and stops;
- |p| is the length of the program in bits;
- K_U(x) is the length of the shortest such program.
A billion zeroes can be produced by a short loop.
The first billion digits of π can be produced by a relatively short algorithm plus the number N.
A random billion-bit string almost certainly has no program much shorter than the string itself.
The choice of universal machine changes the complexity only by a constant amount:
|K_U(x) - K_V(x)| ≤ c_(U,V)
The constant c_(U,V) depends on the two languages U and V, but not on the string x.
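The idea of a short program versus a literal copy can be sketched with a toy description language. This language is deliberately tiny and is not a universal machine, so the lengths it reports are only upper bounds relative to this toy, not values of K_U(x). All names are illustrative.

```typescript
// A toy description: either the string itself or "repeat this unit N times".
type Description =
  | { kind: "literal"; text: string }
  | { kind: "repeat"; unit: string; count: number };

// The toy "machine": turns a description back into the string.
function run(description: Description): string {
  return description.kind === "literal"
    ? description.text
    : description.unit.repeat(description.count);
}

// Description length, measured here as characters of JSON (a stand-in for bits).
function descriptionLength(description: Description): number {
  return JSON.stringify(description).length;
}

// Tries a few candidates and keeps the shortest one that reproduces the target exactly.
function shortestKnownDescription(target: string): Description {
  let best: Description = { kind: "literal", text: target };
  for (let unitLength = 1; unitLength <= target.length / 2; unitLength++) {
    if (target.length % unitLength !== 0) continue;
    const candidate: Description = {
      kind: "repeat",
      unit: target.slice(0, unitLength),
      count: target.length / unitLength,
    };
    if (run(candidate) === target && descriptionLength(candidate) < descriptionLength(best)) {
      best = candidate;
    }
  }
  return best;
}

console.log(shortestKnownDescription("0".repeat(1000))); // a short loop
console.log(shortestKnownDescription("q8Zr1x"));         // nothing better than the literal
```

A string with no pattern that this toy language can express falls back to the literal copy, which mirrors the incompressible case.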
Source: A. N. Kolmogorov, Three Approaches to the Quantitative Definition of Information (1965).
INTERPRETATION.
Because the shortest program looks like the “true idea” of the object.
An image may contain a billion pixels, but its short explanation could be:
Draw a black circle with radius 100 on a white background.
A novel may contain millions of characters, but part of its structure can be explained by its language, genre, historical period, and the author's style.
The physical history of a region of space may be enormous, but perhaps short laws and a compact initial state can compress it.
This creates a temptation to treat the following as the same thing:
- a short program;
- a causal explanation;
- the essence of an object;
- “God's thought” about the object.
Mathematics guarantees only that a shortest description exists relative to a universal machine. It does not guarantee that the description will be understandable, unique, causal, beautiful, or physically fundamental.
FACT.
We can find a short program p that prints x and thereby prove an upper bound:
K_U(x) ≤ |p|
But in general, we cannot prove that the program is the shortest one.
To test every shorter program, we would need to know which programs will stop, which will never stop, and which will stop only after an unimaginable number of steps. There is no universal solution to the halting problem.
So exact Kolmogorov complexity is not merely very expensive. In general, it is uncomputable.
This is not just a matter of NP-style hardness. An NP-complete problem is still computable: with unlimited time, we can search through the possibilities. For Kolmogorov complexity, no algorithm can guarantee the exact answer for every input.
The absolute ideal of compression is mathematically defined, but no general archiver can guarantee that it has reached that ideal—or even know when it has reached it.
FACT.
No.
There are exactly 2^n possible input strings of length n. The number of bit strings shorter than n is:
2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1
For lossless compression, different inputs must have different compressed representations. There are not enough shorter outputs for every possible input.
Every archiver compresses some files, leaves some almost unchanged, and must make some files larger.
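The counting argument is small enough to run. The sketch below also computes a standard consequence of the same count that the text does not state explicitly: an upper limit on the fraction of n-bit inputs that any lossless scheme can shorten by at least k bits. Function names are illustrative.

```typescript
// Number of distinct bit strings of exactly n bits.
function countStringsOfLength(n: number): number {
  return 2 ** n;
}

// Number of distinct bit strings with fewer than n bits (including the empty string).
function countStringsShorterThan(n: number): number {
  let total = 0;
  for (let length = 0; length < n; length++) {
    total += 2 ** length;
  }
  return total;
}

// Upper limit on the fraction of n-bit inputs that can shrink by at least k bits:
// there are only 2^(n-k+1) - 1 outputs of length n - k or less.
function maxFractionCompressibleBy(n: number, k: number): number {
  return countStringsShorterThan(n - k + 1) / countStringsOfLength(n);
}

console.log(countStringsOfLength(3), countStringsShorterThan(3)); // one output is always missing
console.log(maxFractionCompressibleBy(20, 8)); // fewer than 1 input in 128
```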
INTERPRETATION, technically possible in a limited form.
Here is the real mind-bending idea.
A normal archiver has a fixed set of methods. An LLM-based meta-archiver could create its own restoration program for each individual file.
It would not only choose a compression level. It would ask:
What short executable object can produce exactly these bytes?
Let the original file be called x. The system searches for a program p and a small remainder r such that:
x = P(R(p), r)
Where:
- R(p) is the result of running the generated program p;
- r is an exact correction if the program did not fully restore the file;
- P(y, r) is a deterministic function that applies correction r to result y;
- x is the original file, restored byte for byte.
The full cost of the archive is:
L = |p| + |r| + |m| + |v|
Where:
- |p| is the size of the generator program;
- |r| is the size of the remainder or binary patch;
- |m| is the metadata: runtime version, parameters, and lengths;
- |v| is the cost of the validator or a reference to a shared standard;
- L is the final archive size.
The system accepts a candidate only after a strict check:
SHA256(decompress(archive)) == SHA256(original)
The LLM is needed during compression, when it searches for ideas. It may not be needed at all during decompression if the search produced a small, ordinary, deterministic decoder.
This point is essential:
A huge model may spend trillions of operations searching for a tiny program. The archive needs to contain the program it found, not the whole search process.
The cost of search time and the length of the description are different quantities.
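The correction step can be made concrete with a very simple patch format. This is a sketch under stated assumptions: the sparse "position and value" patch and its byte costs are invented for illustration, and a real system would more likely use an established binary diff format.

```typescript
// r: the target length plus every position where the generated bytes are wrong.
type Patch = { length: number; fixes: Array<[number, number]> };

// Diff(y, x): record what must change in y to obtain x.
function createPatch(generated: Uint8Array, original: Uint8Array): Patch {
  const fixes: Array<[number, number]> = [];
  for (let i = 0; i < original.length; i++) {
    if (i >= generated.length || generated[i] !== original[i]) {
      fixes.push([i, original[i]]);
    }
  }
  return { length: original.length, fixes };
}

// P(y, r): deterministic restoration of x from y and the patch.
function applyPatch(generated: Uint8Array, patch: Patch): Uint8Array {
  const restored = new Uint8Array(patch.length);
  restored.set(generated.subarray(0, Math.min(generated.length, patch.length)));
  for (const [index, value] of patch.fixes) {
    restored[index] = value;
  }
  return restored;
}

// Byte-for-byte comparison used as the acceptance check.
function sameBytes(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Archive cost as program + patch + metadata + validator, in bytes.
// Assumed patch encoding: 4 bytes for the length, 5 bytes per fix.
function archiveCostBytes(
  programBytes: number,
  patch: Patch,
  metadataBytes: number,
  validatorBytes: number,
): number {
  const patchBytes = 4 + 5 * patch.fixes.length;
  return programBytes + patchBytes + metadataBytes + validatorBytes;
}
```

A candidate is worth keeping only if `sameBytes(applyPatch(y, r), x)` holds and the total cost is below the baseline. A poor generator is not rejected by the check. It simply produces a patch so large that the total no longer wins.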
INTERPRETATION.
A simplified design:
Input: file x
1. Run normal archivers and get a baseline size.
2. Detect the likely file type and structure.
3. Ask the LLM to suggest families of generators:
   - a formula;
   - a program;
   - a template plus parameters;
   - a database plus a schema;
   - procedural graphics;
   - a dictionary;
   - a source-code model;
   - a generator of repeated blocks.
4. For each hypothesis:
   a. generate a program p;
   b. run p in a sandbox;
   c. compare result y with original x;
   d. create a remainder patch r = Diff(y, x);
   e. calculate |p| + |r| + metadata.
5. Modify the best programs:
   - simplify them;
   - replace tables with formulas;
   - extract repeated parts;
   - find symmetries;
   - synthesize shorter functions;
   - test other languages and virtual machines.
6. Save the shortest fully verified result.
Conceptual TypeScript pseudocode:
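// Conceptual only: Archive and the helper functions called here are placeholders
// for components that would have to be built; none of them is defined in this article.
function metaCompress(input: Uint8Array): Archive {
  // Start from the best standard codec, so the result is never worse than the baseline.
  const baseline = compressWithStandardCodecs(input);
  let best = baseline;

  const hypotheses = proposeGeneratorFamilies(input);
  for (const hypothesis of hypotheses) {
    const programs = synthesizePrograms(input, hypothesis);
    for (const program of programs) {
      // Run the untrusted candidate in isolation, then patch whatever it got wrong.
      const generated = runInSandbox(program);
      const residual = createBinaryPatch(generated, input);
      const archive = packageArchive(program, residual);

      // Accept only candidates that are smaller and restore the input exactly.
      if (
        archive.byteLength < best.byteLength &&
        bytesEqual(decompress(archive), input)
      ) {
        best = archive;
      }
    }
  }
  return best;
}
This is not a finished industrial algorithm. But its parts already exist: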
- LLMs generate code;
- program synthesis searches for expressions from examples;
- superoptimizers make programs smaller;
- SAT/SMT solvers check constraints;
- e-graphs search for equivalent expressions;
- sandboxes run candidate programs;
- diff algorithms encode the remainder;
- a cryptographic hash confirms exact restoration.
INTERPRETATION.
We can build several levels:
LLM controller
↓ writes
decoder generator G
↓ writes
decoder D_x for one specific file
↓ produces
approximation y
↓ patch r corrects it into
original file x
Formally, the composition can be written as:
x = P(U(G(s)), r)
Where:
- s is a short task description or a set of parameters;
- G is a program that creates a specialized decoder;
- U is a universal execution environment;
- U(G(s)) is the result of running the created decoder;
- r is the remaining correction;
- P is the patching function.
Recursion does not create information from nothing. Every special instruction, parameter, and remainder must still be paid for in bits.
But several levels can still be useful because:
- one general generator G can be used for many files;
- a small s can select a specific decoder;
- repeated structure can be expressed at a higher level;
- a decoder can create a specialized decoder for the next layer.
Compilers already do something similar: source code produces machine code, and machine code produces behavior. Parser generators write parsers. Metaprogramming writes programs.
The new part is using an LLM as a heuristic explorer of this space of constructions.
FACT + INTERPRETATION.
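An LLM meta-archiver can keep improving an upper estimate of Kolmogorov complexity:
K_U(x) ≤ L + c
Here L is the size of the best verified archive found so far, and c is the fixed size of the decompressor, which does not depend on x.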
It may find a program shorter than ZIP, Zstandard, Brotli, or a neural codec.
But in general, it cannot say:
“This is definitely the shortest possible program. No shorter program exists.”
Even if the system searches through programs, writes programs that search through programs, and improves itself, the halting problem remains.
It can move closer to Kolmogorov's ideal, but it cannot receive a certificate that it has reached the absolute minimum.
It is like going down a mine with no known bottom:
- every level found is real;
- the next level may be deeper;
- there is no sign saying “bottom.”
INTERPRETATION.
For some kinds of data, yes.
The best candidates are objects with a short generating description that a normal codec does not recognize:
- a table of values from a known function;
- a file almost completely generated by short source code;
- a procedural texture;
- a geometric scene;
- a repeated log with a hidden pattern;
- a database dump restored from a schema and a few parameters;
- a machine-generated document;
- a set of configurations that differ by only a few rules;
- an image of a diagram that is easier to draw again than to store pixel by pixel.
Example:
Normal PNG: 400 KiB
Short SVG program: 3 KiB
Remaining patch for an exact match: 20 KiB
Total: 23 KiB
But for cryptographically random, already compressed, or encrypted data, there will be almost no improvement.
The meta-archiver may also be extremely slow. Compressing one file may require hours, years, or an unacceptable amount of energy.
A new trade-off appears:
Kolmogorov complexity measures the length of the program, but not the time needed to find or run it. A practical archiver must consider both resources.
FACT.
No.
It can beat a particular fixed archiver on a particular set of files because it uses a richer class of models.
But:
- the size of the generated program is included in the archive;
- if the LLM itself is needed for decompression, its weights must either be shared by both sides or included in the cost;
- the meta-archiver will lose on some files;
- the average limit for the given true source remains;
- the impossibility of compressing every string remains.
For one individual file, the result may be far below a naive entropy estimate. This does not break the theory. It means the system found a model that the earlier estimate did not include.
FACT + INTERPRETATION.
It depends on the protocol.
In one protocol, the model searches for a small independent program. The receiver gets only the program and the patch. In this case, the LLM weights are not part of the archive: they are a computational tool used by the compressor, like a powerful server or a human mathematician.
In the other protocol, the model itself takes part in decompression. Then the receiver must have the same model, version, tokenizer, parameters, and exact execution environment.
If the model is already shared by millions of archives, its cost can be spread over them. If it must be installed for one file, an honest description length must include the model.
For example:
model://gpt-5.5-2026-04-23
runtime://python-3.14
library://numpy-X.Y
Then the archive is short only relative to shared infrastructure. This is not cheating, but the condition must be stated clearly.
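The accounting for a shared model can be sketched in a few lines. The byte counts below are made-up round numbers for illustration, not measurements of any real model or archive.

```typescript
// Honest size of one archive when decompression depends on a shared model:
// the archive itself plus its share of the model.
function effectiveArchiveBytes(
  archiveBytes: number,
  sharedModelBytes: number,
  archivesSharingModel: number,
): number {
  return archiveBytes + sharedModelBytes / archivesSharingModel;
}

const assumedModelBytes = 1e10;  // assumption: a 10 GB model
const assumedArchiveBytes = 1000; // assumption: a 1 KB archive

console.log(effectiveArchiveBytes(assumedArchiveBytes, assumedModelBytes, 1));   // installed for one file
console.log(effectiveArchiveBytes(assumedArchiveBytes, assumedModelBytes, 1e7)); // shared by ten million archives
```

With a single archive the model dominates the cost completely. With ten million archives its share falls to the same order as the archive itself.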
The Russian phrase Война и мир (War and Peace) is a very short reference for a person who knows Russian and has the book in a library. For an alien with no language and no library, it is almost useless.
FACT.
According to OpenAI's official documentation, GPT-5.5 is designed for complex professional work and programming. The API model supports adjustable reasoning effort, function calling, structured outputs, and tools including web search, file search, a code interpreter, hosted shell, patch application, and computer use.
The officially listed capabilities include:
- a context window of up to 1,050,000 tokens;
- text and image inputs;
- up to 128,000 output tokens;
- reasoning-effort modes;
- tool use and multi-step work.
Sources:
At the same time, OpenAI does not publish the model's full architecture, exact parameter count, complete training-data composition, or every detail of training.
So an honest description has two layers:
- the known general principles of GPT-like models and the officially published capabilities;
- the unknown internal details of this particular closed model.
INTERPRETATION.
When the model replaces one hundred almost identical functions with one parameterized function, a table of numbers with a formula, a manual set of rules with a finite-state machine, or repeated JSON with a schema and generator, it is searching for a compact generating description.
With tools, the cycle begins to look like experimental science:
hypothesis
→ program
→ execution
→ test
→ counterexample
→ corrected program
GPT-5.5 can read a large code context, suggest an architecture, write an implementation, run tests through a tool, see the error, change the program, and compare alternatives.
But this is not the calculation of K(x).
The model uses a learned distribution of likely programs. It searches where human culture has already left paths. An unknown short program may lie outside its familiar distribution.
GPT-5.5 is not a Kolmogorov archiver. It is a powerful heuristic that gives useful upper estimates of complexity in areas where its training and tools provide a good prior.
INTERPRETATION.
In a limited engineering sense, yes.
It can:
- analyze failed cases;
- write new transformations;
- choose new representation languages;
- create test sets;
- profile decoders;
- replace slow parts;
- select specialized models;
- store successful generators in a shared library;
- use that library as a dictionary for future files.
This creates an evolving system:
C[t + 1] = I(C[t], F[t], N[t])
Where:
- C[t] is the compressor at step t;
- F[t] is the set of files or tests where the current version failed;
- N[t] is the set of new models, rules, and transformations;
- I is the procedure that changes and tests the compressor;
- C[t + 1] is the next version of the compressor.
But improvement is not guaranteed forever. The system may overfit to the tests, make the shared decoder too large, find false patterns, or spend more energy than it saves.
FACT + INTERPRETATION.
It already models parts of the world.
Computers predict:
- weather;
- planetary motion;
- wave propagation;
- aerodynamics;
- molecular dynamics;
- the behavior of electronic circuits;
- the behavior of materials.
A small physical system—a computer—contains a model of another region of the physical world.
For a local region R and a time horizon T, the task looks like this:
S_hat_R(T) = M(S_R(0), B, T)
Where:
- S_R(0) is the known initial state of region R;
- B represents boundary conditions and outside influences;
- M is the model;
- S_hat_R(T) is the predicted state after time T.
The model does not need to store every atom. It can use pressure, temperature, average velocity, geometry, effective fields, and other large-scale variables.
This kind of compression of physical reality is normal scientific practice.
FACT + INTERPRETATION.
Because the model can skip details that do not matter for the chosen question.
To predict a planet's orbit, we do not need to model every quark.
To calculate an eclipse, we do not need to wait for the eclipse itself.
To estimate how a bridge will vibrate, we do not need to reproduce the history of every electron.
Speed-up is possible through:
- rough averaging;
- symmetries;
- analytical solutions;
- reduced models;
- effective theories;
- adaptive time steps;
- parallel computing;
- surrogate neural models;
- skipping periods in which nothing important happens.
Instead of the full microstate, the model uses statistics that are sufficient for the chosen task.
The cost of prediction can be written conceptually as:
Q = L(M) + L(S_R(0) | M) + C_run(M, T) + E
Where:
- Q is the total cost of building and running the prediction;
- L(M) is the description length of model M;
- L(S_R(0) | M) is the description length of the initial state of region R, assuming model M is already known;
- C_run(M, T) is the computational cost of running the model until time T;
- T is the prediction horizon;
- E is the cost of the allowed error or loss of accuracy.
A model can run faster than reality if we only need a limited set of observable values and finite accuracy.
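One of the listed shortcuts, an analytical solution, can be compared directly with step-by-step simulation. The example system, an object cooling toward room temperature under Newton's law of cooling, is an illustration chosen here and does not appear in the text. The ambient temperature plays the role of the boundary condition.

```typescript
// Model parameters: ambient temperature (boundary condition) and cooling rate per second.
type CoolingModel = { ambient: number; rate: number };

// Step-by-step prediction with explicit Euler steps: cost grows with horizon / dt.
function simulateBySteps(
  initial: number,
  model: CoolingModel,
  horizon: number,
  dt: number,
): { value: number; steps: number } {
  let temperature = initial;
  const steps = Math.round(horizon / dt);
  for (let i = 0; i < steps; i++) {
    temperature += -model.rate * (temperature - model.ambient) * dt;
  }
  return { value: temperature, steps };
}

// Closed-form prediction: one evaluation, whatever the horizon.
function predictClosedForm(initial: number, model: CoolingModel, horizon: number): number {
  return model.ambient + (initial - model.ambient) * Math.exp(-model.rate * horizon);
}

const cup: CoolingModel = { ambient: 20, rate: 0.01 };
const stepped = simulateBySteps(90, cup, 100, 0.001);
const direct = predictClosedForm(90, cup, 100);
console.log(stepped.value, stepped.steps); // many small updates
console.log(direct);                       // a single formula evaluation
```

Both routes agree to within the step error, but the run cost differs by five orders of magnitude. The description length of the model is essentially the same in both cases, which is why run cost has to be counted as a separate term.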
There is an interesting asymmetry between compression time and decompression time. An LLM-based meta-archiver may spend megawatts of energy during compression, but produce a tiny script that decompresses in milliseconds.
SPECULATION with physical limits.
We need to separate three different cases.
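The first case is a simpler world. It has fewer degrees of freedom, simpler laws, or a limited region. Such a system can physically be simulated faster than some processes in our world.
The second case is a coarse model of our own world. It keeps only large-scale features. Such a model may predict a local world for some time, until errors in the initial data, chaos, and outside influences destroy its accuracy.
The third case is an exact and complete simulation of the world that contains the simulator. Serious problems begin here: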
- the internal computer has fewer physical resources than the complete system that contains it;
- it must somehow encode its own state;
- the full model contains a model that contains a model;
- an exact unknown quantum state cannot be copied freely because of the no-cloning theorem;
- the complete state cannot be physically measured;
- publishing the prediction may change the behavior of the system that receives it;
- some computations may have no general shortcut.
There is no simple theorem saying, “No Universe can ever simulate itself faster in any sense.” But an exact, complete, and internally accessible self-simulation faces limits from resources, quantum theory, and self-reference.
FACT + careful interpretation.
Quantum theory allows nonlocal correlations: the results of measurements on entangled systems cannot be explained by a simple local model with hidden values fixed in advance.
But quantum entanglement does not allow controlled information to travel faster than light.
So “nonlocality” does not mean:
- instant access to the full state of the Universe;
- the ability to download the future for free;
- the absence of causal limits;
- a universal channel for faster-than-light compression.
Nonlocal correlations make the fundamental picture of the world less classical. They do not give an internal observer the administrator password.
SPECULATION, but logically consistent.
Imagine that a civilization creates simulated worlds. Each such world:
- contains simplified laws;
- runs on the physical substrate of its parent world;
- may contain its own observers;
- may model the parent world using received data;
- may create the next level of simulations.
A chain appears:
W_0 → M_1(W_0) → M_2(M_1(W_0)) → ...
Where:
- W_0 is the original physical world;
- M_1(W_0) is a model of part of the original world;
- each next level models the previous one.
Each level loses something:
- accuracy;
- available scale;
- energy;
- speed;
- information about boundary conditions.
But a level may gain in subjective or model time if its laws and resolution are simpler.
In this sense, a nested universe can act as an archiver of local history:
It does not store every fact about the original world. It stores a program, initial data, and rules that allow the required region to be reproduced approximately.
It is more like a game engine plus a save file than a ZIP archive.
INTERPRETATION + SPECULATION.
Locally and for a limited time, this may be possible in principle if it receives:
- observations accurate enough;
- a suitable model;
- outside boundary conditions;
- enough computing resources;
- an allowed error range.
It may outperform observers in the parent world if its internal computation is organized more efficiently or if it uses discovered shortcuts.
But it does not automatically receive:
- the full initial state of the parent world;
- data from causally inaccessible regions;
- exact future quantum outcomes;
- an unlimited prediction horizon;
- a guarantee that its model is fundamental.
The most realistic version of the idea is:
A physical system creates a nested computing system that builds a compressed model of a local region and predicts several observable properties faster than they develop in the original.
This already happens in digital twins, scientific simulations, and control systems.
The strongest version—“a child world completely computes its parent world before the parent world reaches that future”—remains a philosophical fantasy.
FACT, with qualifications.
General relativity describes gravity as the changing geometry of spacetime.
Einstein's equation:
G_(μν) + Λ g_(μν) = (8πG / c^4) T_(μν)
The symbols mean:
- g_(μν) is the metric, which defines the geometry of spacetime;
- G_(μν) is the Einstein tensor, which describes curvature;
- T_(μν) is the stress-energy tensor of matter;
- G is the gravitational constant;
- c is the speed of light;
- Λ is the cosmological constant;
- the indices μ, ν refer to spacetime coordinates.
In a rough form, the meaning is:
The distribution of energy and momentum is connected to the curvature of spacetime.
The classical equations have an initial-value problem: with suitable initial data, evolution can be calculated. But global determinism depends on the structure of spacetime. Singularities, Cauchy horizons, and the absence of global hyperbolicity make the picture more complicated.
So even classical general relativity does not give us a simple slogan: “Give me one complete snapshot of the Universe, and I will give you its whole future.”
FACT.
Quantum field theory describes fundamental particles as excitations of quantum fields. The Standard Model describes electromagnetic, weak, and strong interactions with extraordinary accuracy, but it does not include gravity in a complete unified quantum framework.
Source: CERN, The Standard Model.
The quantum state between measurements is usually described by unitary evolution. But the connection between the mathematical formalism and one observed measurement result is still interpreted in different ways.
Different approaches say, for example:
- the state collapses;
- extra hidden variables exist;
- unitarity remains and branching histories appear;
- the quantum state is a tool for prediction rather than a literal object.
The experimental probabilities agree with observations extremely well. Their ontological meaning does not have one accepted answer.
Source: Stanford Encyclopedia of Philosophy, Philosophical Issues in Quantum Theory.
FACT.
No confirmed single theory fully unites:
- quantum field theory;
- general relativity;
- the origin of dark matter;
- the nature of dark energy;
- the initial conditions of the Universe;
- a quantum description of spacetime.
There are candidate theories and research programs: string theory, loop quantum gravity, asymptotic safety, causal sets, holographic approaches, and others.
This does not mean that “physicists know nothing.” General relativity and the Standard Model are extremely successful in their own domains. But their final common architecture has not been established.
The final nail has not been driven in.
INTERPRETATION, compatible with neuroscience but not a complete theory of consciousness.
An organism does not store a complete copy of its sensory stream.
The retina receives a huge stream of changes in light. Experience organizes it into:
- objects;
- faces;
- movement;
- threats;
- intentions;
- causal stories;
- stable space.
The brain discards a large amount of detail and keeps structures useful for action.
A conceptual scheme:
signals → latent model → prediction and action
Where:
- signals are sensory data;
- latent model is a hidden compact model of causes;
- prediction and action are prediction and behavior.
In this sense, conscious experience can be compared to a lossy archive:
We experience not the microphysical state of the world, but a compact interface built by the nervous system.
But this does not mean that consciousness is a separate fundamental substance.
SPECULATION.
Let us make a strong and unproven step.
Suppose consciousness does not merely receive a shortened copy of reality, but is a mechanism that turns many physical possibilities into the experienced world.
Then consciousness can be imagined as a mapping:
A: Ω → E
Where:
- Ω is the space of possible physical states;
- E is the space of experienced states;
- A is the “ontological archiver” that discards almost all microscopic differences.
Billions of different microstates may be experienced as the same “red table.”
Consciousness would then not read the full world. It would create equivalence classes:
ω_1 ~ ω_2 if and only if A(ω_1) = A(ω_2)
This is a mathematically understandable metaphor, but it is not yet a physical theory. To become science, it would need to answer:
- What exactly does A measure?
- Where and how is it implemented?
- What experiment distinguishes it from ordinary neural processing?
- What numerical predictions are unique to this model?
- Why is fundamental consciousness needed instead of a physical computational process?
Until these questions are answered, the “ontological archiver” remains a philosophical image.
SPECULATION.
Imagine a radical model:
- original consciousness is the execution environment;
- the laws of physics are a virtual machine;
- our Universe is a running process;
- individual minds are local processes able to model the environment;
- the computers they create run new worlds;
- new observers may appear in those worlds.
This creates a recursive architecture:
original consciousness
└─ physical Universe
   └─ biological consciousness
      └─ computer model of the world
         └─ model of an observer
            └─ new model of the world
In this picture, the Universe archives itself not in one file, but in a hierarchy of observers and models.
Each observer receives a small fragment of the output and tries to reconstruct the program that produced it.
Physics becomes reverse engineering.
Science becomes decompilation.
Consciousness becomes a process in which code starts building a model of its own interpreter.
This is a powerful philosophical fresco. But it has no confirmed physical mechanism and no unique experiment.
ARTISTIC METAPHOR.
In Michelangelo's fresco, God reaches toward the reclining Adam. Their fingers almost touch, but a small gap remains.
That gap can be read as the distance between two sides. On God's side:
- the complete state;
- the shortest description;
- every causal connection;
- no computational uncertainty;
- knowledge of the program and its result at the same time.
On the engineer's side:
- local measurements;
- limited memory;
- finite time;
- approximate models;
- archivers that may always turn out not to be the best;
- the inability to prove that a program is absolutely shortest.
Shannon stands on the engineer's side:
Give me a source model, and I will calculate the achievable average cost.
Kolmogorov reaches farther:
Every object has a shortest program.
Turing keeps the gap open:
There is no general algorithm that will always find it and confirm that the search is over.
The LLM extends one more mechanical finger:
I will not search every program, but I will try to write many plausible ones and test the best.
INTERPRETATION, based on the practical difference between the theories.
Shannon's theory became the engineering standard because it gave industry computable limits.
We can:
- measure channel speed;
- estimate noise;
- build a code;
- calculate average length;
- compare codecs;
- transmit data;
- check errors.
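
Two of the steps above, building a code and calculating its average length, can be sketched in a few lines. The example builds a binary Huffman code and compares its average codeword length with the source entropy. The symbol probabilities are made up for illustration, and Huffman coding is only one of many possible code constructions.

```python
import heapq
from math import log2

def entropy(probs):
    """Shannon entropy of a discrete source, in bits per symbol."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)

def huffman_lengths(probs):
    """Return the codeword length of each symbol in a binary Huffman code."""
    # Heap items: (probability, tie_breaker, {symbol: depth_in_tree}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

def average_length(probs):
    lengths = huffman_lengths(probs)
    return sum(probs[s] * lengths[s] for s in probs)

# Illustrative source: probabilities are assumptions, not measured data.
probs = {"a": 0.6, "b": 0.3, "c": 0.1}
print("entropy:", entropy(probs))
print("average Huffman length:", average_length(probs))
```

For any source, the average length of a Huffman code lies between the entropy and the entropy plus one bit. When all probabilities are powers of one half, the two values coincide.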
Kolmogorov gave us a more absolute idea, but its exact value cannot be obtained by a general computation.
Engineering standards choose not the most ontologically deep quantity, but the one that allows decisions to be made.
So:
- Shannon built the foundation of digital communication;
- Kolmogorov set a limit on the idea of absolute compression;
- MDL connected learning with the length of a model and its data;
- Solomonoff described an ideal predictor;
- modern LLMs became practical heuristics for searching for programs and explanations.
Kolmogorov did not disprove Shannon. He asked a question that Shannon deliberately did not answer.
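
To make the MDL item above concrete, here is a minimal sketch of a crude two-part code for a binary sequence. The total cost is the number of bits spent on a quantized coin bias plus the ideal number of bits needed for the data under that bias. The quantization scheme and the names `two_part_length` and `mdl_choice` are assumptions chosen for the example. Refined MDL uses more careful codes.

```python
from math import log2

def data_bits(bits, p_one):
    """Ideal code length of the data, in bits, under a Bernoulli(p_one) model."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * log2(p_one) + zeros * log2(1 - p_one))

def two_part_length(bits, precision_bits):
    """Model cost (bits for the parameter) plus data cost under the best parameter."""
    levels = 2 ** precision_bits
    # Candidates sit at bin centres, so p is never exactly 0 or 1.
    # With precision 0 the only candidate is 0.5: the fair-coin model.
    candidates = [(k + 0.5) / levels for k in range(levels)]
    best_p = min(candidates, key=lambda p: data_bits(bits, p))
    return precision_bits + data_bits(bits, best_p), best_p

def mdl_choice(bits, max_precision=8):
    """Pick the parameter precision that minimizes the total description length."""
    options = {d: two_part_length(bits, d)[0] for d in range(max_precision + 1)}
    best = min(options, key=options.get)
    return best, options[best]

balanced = [0, 1] * 50            # no bias to exploit
biased = [1] * 20 + [0] * 180     # a strong bias toward zero
print(mdl_choice(balanced))
print(mdl_choice(biased))
```

For the balanced sequence the cheapest description spends nothing on the model. For the biased one it pays a few bits for a parameter because the data then becomes much cheaper to encode.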
FACT + INTERPRETATION.
Archivers differ in how they find structure. In rough order of ambition, an archiver may:
- use transformations chosen in advance (ZIP, Zstandard, Brotli, PNG, FLAC);
- predict the data and encode the remaining surprise (a neural codec, or a language model plus arithmetic coding);
- choose a model and pay for both the model and the remainder, as in MDL;
- write specialized generators and verify exact restoration (search for an idea → code → run → patch → verify → minimize);
- search through all programs in some priority order and look for a short one. In theory, this moves closer to Kolmogorov, but it quickly hits limits from time and the halting problem;
- know the shortest program immediately and know that it is the shortest. Such an object is not a computable universal algorithm in the ordinary mathematical sense.
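
The “predict the data and encode the remaining surprise” approach described above can be sketched without a full arithmetic coder. The function below only adds up the surprise, in bits, of each byte under a simple adaptive context model. An ideal arithmetic coder would need roughly that many bits in total. The model (byte counts per context with add-one smoothing) is an assumption chosen for brevity and is far weaker than a neural codec or a language model.

```python
from collections import defaultdict
from math import log2

def adaptive_code_length(data, order=1):
    """Total surprise, in bits, of `data` under an adaptive order-k byte model."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        # Add-one (Laplace) estimate over the 256 possible byte values.
        p = (counts[context][symbol] + 1) / (totals[context] + 256)
        bits += -log2(p)
        # Update only after coding, so a decoder could mirror the same model.
        counts[context][symbol] += 1
        totals[context] += 1
    return bits

text = b"ab" * 200
print("raw bits:    ", 8 * len(text))
print("order 0 bits:", adaptive_code_length(text, order=0))
print("order 1 bits:", adaptive_code_length(text, order=1))
```

The better the model predicts the next byte, the less surprise remains to be paid for. Here the order-1 model, which sees the previous byte, does better than the order-0 model, which sees no context.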
There are two temptations.
The first is to treat Shannon entropy as the absolute amount of information inside a thing. That is wrong. It belongs to a source and a probability model.
The second is to think that Kolmogorov gave us a ready path to absolute truth. That is also wrong. He defined an ideal, but no general algorithm can reach that ideal.
Real engineering lies between them.
LLMs can already write programs.
A program can generate data.
An agent can run it, compare the result with the original, and write another program.
A meta-archiver can spend huge computational resources to produce a small independent decoder.
It can discover laws where an ordinary codec saw noise.
This does not cancel Shannon. It expands the source model.
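
The loop described above can be sketched as a toy meta-archiver: try candidate generator programs, verify exact restoration, and keep the shortest verified description, falling back to an ordinary codec. Everything specific here is an assumption made for illustration. The hard-coded `CANDIDATES` table stands in for programs an LLM might propose, `lcg_bytes` is an invented “hidden law”, and the length accounting ignores the size of the helper function and of the interpreter, which a fair comparison of independent decoders would have to count.

```python
import zlib

def lcg_bytes(seed, n):
    """Toy 'law': the high byte of a 32-bit linear congruential generator."""
    out, x = bytearray(), seed
    for _ in range(n):
        x = (1664525 * x + 1013904223) % 2**32
        out.append(x >> 24)
    return bytes(out)

# Hypothetical candidate programs: source text mapped to a callable generator.
CANDIDATES = {
    "bytes(n)": lambda n: bytes(n),
    "bytes(i % 256 for i in range(n))": lambda n: bytes(i % 256 for i in range(n)),
    "lcg_bytes(7, n)": lambda n: lcg_bytes(7, n),
}

def meta_archive(data):
    """Return (kind, description) for the shortest verified description found."""
    best = ("zlib", zlib.compress(data, 9))      # fallback: ordinary codec
    for source, generate in CANDIDATES.items():
        if generate(len(data)) != data:          # verify exact restoration
            continue
        # Simplified cost: program text plus the length parameter only.
        description = f"{source};n={len(data)}".encode()
        if len(description) < len(best[1]):
            best = ("program", description)
    return best

data = lcg_bytes(7, 4096)
kind, description = meta_archive(data)
print(kind, len(description), "bytes; zlib needs", len(zlib.compress(data, 9)))
```

To zlib this data looks like noise, because a dictionary coder finds no repeated substrings in it. The generator hypothesis reproduces it exactly from a few characters. When no candidate passes verification, the sketch simply returns the zlib result.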
The physical Universe also creates archivers inside itself:
- genomes;
- nervous systems;
- languages;
- books;
- mathematical theories;
- computers;
- simulations;
- LLMs.
Each of them stores not the world itself, but a compressed reconstruction of it.
Perhaps short laws really allow a local model to run ahead of a physical process. Perhaps nested computational worlds will be able to predict limited regions of their parent world faster than those regions live through their own time. This already happens in part.
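
The contrast between a process with a shortcut and one without can be sketched in code. The first function uses a closed-form law to jump straight to the answer at time `t`. The second runs the Rule 30 cellular automaton, a commonly cited candidate for computational irreducibility, for which no general shortcut is known, so the sketch simulates every step. Both examples are illustrations chosen for this sketch, not claims about the Universe.

```python
def falling_distance(t, g=9.81):
    """Short law: distance fallen after t seconds, computed in one step."""
    return 0.5 * g * t * t

def rule30_center(steps):
    """Center column of Rule 30, obtained by running every step in turn."""
    cells = {0}                       # positions of live cells
    column = [1]
    for _ in range(steps):
        lo, hi = min(cells) - 1, max(cells) + 1
        cells = {
            i for i in range(lo, hi + 1)
            # Rule 30: new cell = left XOR (center OR right)
            if ((i - 1) in cells) ^ ((i in cells) or ((i + 1) in cells))
        }
        column.append(1 if 0 in cells else 0)
    return column

print(falling_distance(2.0))
print(rule30_center(16))
```

The model of the falling body runs ahead of the body itself. The automaton, as far as anyone knows, can only be predicted by living through its steps.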
But an exact model of the whole Universe, located inside that Universe, knowing its own future and guaranteeing the shortest description, faces quantum limits, insufficient resources, self-reference, and uncomputability.
The idea of original consciousness as a mainframe that runs physics and recursive models of itself remains a magnificent metaphysical fresco.
It has not been proven.
It cannot be dismissed with one sentence.
It has not yet become a physical theory.
The final nail has not been driven in.
Perhaps the Universe has short source code.
Perhaps the code is short, but the initial state cannot be compressed.
Perhaps the description is short, but it cannot be executed quickly.
Perhaps the idea of “code” itself is an interface created by the human mind.
Or perhaps the mind is the procedure through which the world searches for a shorter program of itself.
Shannon gave us the price of the unknown.
Kolmogorov pointed to the hidden program.
Turing blocked the universal path to it.
The engineer starts the search anyway.
And between the engineer's finger and God's finger, a few bits still remain.
| Term | Meaning |
|---|---|
| Shannon entropy | The average uncertainty of a given probabilistic source |
| Conditional entropy | The uncertainty that remains after known context is included |
| Cross-entropy | The average cost of encoding data with a particular probabilistic model |
| KL divergence | The penalty for a mismatch between the model and the real distribution |
| Kolmogorov complexity | The length of the shortest program that produces a particular object |
| MDL | The principle of minimizing the combined length of the model and the data under that model |
| Universal coding | Coding without exact advance knowledge of the source |
| Program synthesis | Automatic search for a program that satisfies examples and constraints |
| Meta-archiver | An archiver that creates specialized archivers or decoders |
| Computational irreducibility | A situation in which a result cannot be obtained much faster than by running the process directly |
| Ontological archiver | The speculative idea of consciousness as a mechanism that forms experienced reality |
- Claude Shannon — A Mathematical Theory of Communication (1948)
- Claude Shannon — Prediction and Entropy of Printed English (1951)
- A. N. Kolmogorov — Three Approaches to the Quantitative Definition of Information (1965)
- Ray Solomonoff — A Formal Theory of Inductive Inference, Part I (1964)
- Peter Grünwald and Teemu Roos — Minimum Description Length Revisited (2020)
- Vaswani et al. — Attention Is All You Need (2017)
- Brown et al. — Language Models are Few-Shot Learners (2020)
- OpenAI API — GPT-5.5
- OpenAI — Using GPT-5.5
- OpenAI — GPT-5.5 System Card
- CERN — The Standard Model
- Stanford Encyclopedia of Philosophy — Philosophical Issues in Quantum Theory
- GitHub Docs — Writing mathematical expressions
My blog with more articles - https://tomfun.co/2026/06/god-vs-engineer/#more