
Kolmogorov Complexity

Writer: Ethan Smith

Updated: Mar 14

 
Mandelbrot function: Zn+1 = (Zn)^2 + C | Location: -1.4732524061369524549 + -0.0058138265122775765014 i, Radius: 2.7866674154672533345e-05 made with Xaos
 

Kolmogorov Complexity, an approximate synonym for Minimum Description Length, is a concept in information theory: the shortest possible means of describing something without any loss of information. It's about distilling complexity to its purest form.

Consider this string of letters: "aaaaaaaaaaaaaaaaaaaa". We could present it verbatim, taking up 20 characters of space. Or we could express it more elegantly as "a"×20. The latter formulation—assuming there's no even more concise way to communicate "the letter 'a' repeated 20 times"—would be the Minimum Description.
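As a toy sketch of this (mine, not from the original post), here is the comparison in Python; the expression string stands in for whatever instruction language we choose:

program = '"a" * 20'              # a short instruction that regenerates the data
data = eval(program)              # running it recovers the full 20-character string
print(len(data), len(program))    # 20 characters of data vs. 8 characters of description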


The Minimum Description Length isn't just about counting characters, though. More precisely, it measures information in bytes, giving us insight into the fundamental informational content of data. At its core, MDL helps us understand what makes data truly meaningful versus what's merely redundant.


5000 pages of the letter "a" over and over, stored as an uncompressed file on our computer, could take up millions of bytes. But we recognize this is a ridiculous way to store predictable and useless data. We can reproduce an exact copy of the same document by just storing the instruction 'write the letter "a" until it fills 5000 pages', this instruction being much shorter.


This is effectively how compression works. Instead of storing the data as is, we can swap it out for an equivalent representation that we can convert to and from: a function, code, or instruction that produces the data.
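Here is a rough sketch of that swap using Python's built-in zlib; the characters-per-page figure is just an assumption for illustration:

import zlib

# Roughly 5000 "pages" of the letter "a" (assuming ~3000 characters per page).
raw = b"a" * (5000 * 3000)

compressed = zlib.compress(raw)
print(len(raw))         # 15,000,000 bytes stored as-is
print(len(compressed))  # a few kilobytes to tens of kilobytes: the redundancy collapses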


Another example is this image here.


A red square

This is 512x512 pixels of the color [255,0,0] RGB, a solid block of the reddest red your computer knows. In raw form, this would cost:


256 possible values for each color = 2^8 = 8 bits = 1 byte
3 color values x 1 byte x 512^2 pixels = 786,432 bytes ≈ 768KB

Now let's say we had a program that takes 3 values: an RGB color, a height, and a width. In this case, we would only need the bytes to represent 512 twice, the color value, and the bytes to represent the program itself.
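Here is a minimal sketch of such a program; the NumPy usage and the function name are my own, not from the post:

import numpy as np

def solid_color_image(rgb, width, height):
    """Rebuild the full pixel array from just a color, a width, and a height."""
    return np.full((height, width, 3), rgb, dtype=np.uint8)

img = solid_color_image((255, 0, 0), 512, 512)
print(img.nbytes)  # 786,432 bytes materialized from a handful of description bytes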


Taking a step back, why did we decide that every color value requires one byte? If we wanted to get very granular about the kind of red we wanted to specify, we could imagine it costing us more if we wanted to specify 254.99999999, for example.


On the other hand, we could instead feed our program color values as just 1 bit. Instead of each color having a scale of intensity, we just say whether it is all the way on or all the way off, present or absent.


There are many different ways to compress a piece of data. The best compression method we know of, in terms of how much smaller it can make the data without losing fidelity, might be said to yield an Empirical Minimum Description Length. It's the shortest way we've found to represent a piece of data, but for all we know there could be even shorter representations.

The Actual Minimum Description Length is often unknown, and the best compression method we have is only an "upper bound." We can't be sure there aren't even smaller compressions we haven't yet discovered.


The plot thickens as well when we consider lossy compression.


JPEG can allow for compressing an image to even smaller files, but when we materialize our image again, it is not quite the same. It can be close enough, though, depending on the context and on the message we want to convey. In other words, if I showed you a lightly JPEG compressed picture of yourself, and you still recognized it, and the intention was simply to convey that the picture is of you, then the message is intact. While we have lost information about the exact pixel intensities, we did not lose information about identity.
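To get a concrete feel for this tradeoff, here is a rough sketch using Pillow; the filename and quality setting are illustrative:

import io
from PIL import Image

img = Image.open("portrait.png").convert("RGB")  # hypothetical input image

png_buf, jpeg_buf = io.BytesIO(), io.BytesIO()
img.save(png_buf, format="PNG")                # lossless: pixels come back exactly
img.save(jpeg_buf, format="JPEG", quality=75)  # lossy: smaller, but still recognizable

print(png_buf.tell(), jpeg_buf.tell())         # the JPEG file is typically far smaller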



Some compression is also contextual. What I mean by this is that it can require you to know something about your data. The compression caters to the general rules of your typical slice of data, but not necessarily to the datatype more broadly.


An example of this is comparing language model tokenizers to raw character-byte representations.

Normally, uncompressed text, expressed by one ASCII character at a time, costs 1 byte of storage per character. We have 256 different symbols to choose from. To choose one is to provide a string of 8 0's or 1's to index the one we want, like 0010 1010.


A language model tokenizer learns a vocabulary, where symbols can still be individual characters as before, but they can also be chunks of words. A common vocabulary size could be ~64k = 2^16, which means each symbol costs us about 16 bits, or 2 bytes.


Despite each symbol costing us more, compared to the per-character storage method we can sometimes express more in fewer bytes. For example, here "super" is one of the symbols of our vocabulary, and so is "man".


For this language model, Superman becomes 2 tokens or symbols, each 2 bytes for 4 bytes total (assuming 64k vocab size).

Meanwhile per-character S-U-P-E-R-M-A-N comes out to 8 characters, and 8 bytes.

GPT Tokenizer here.

This can work in the opposite direction though. Here, we do not have symbols present in our vocabulary that can represent longer chunks of this specific string than simple individual characters.

Here, the tokenizer yields 5 tokens of 2 bytes each for 10 total, whereas byte tokenization yields only 5 bytes.
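A quick sketch of this comparison using tiktoken (OpenAI's tokenizer library); the random-looking string is just a stand-in for the characters in the screenshot, the 2-bytes-per-token cost follows the post's ~64k-vocabulary assumption, and exact token counts depend on the vocabulary:

import tiktoken  # OpenAI's BPE tokenizer library

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Superman", "xqzvk"]:  # a word-like string vs. a random-ish one
    n_tokens = len(enc.encode(text))
    n_char_bytes = len(text.encode("utf-8"))
    # 2 bytes per token (per the ~64k vocabulary assumption) vs. 1 byte per character
    print(f"{text!r}: {n_tokens * 2} bytes as tokens, {n_char_bytes} bytes as characters")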


Because we are usually using tokenizers on natural language and not random arrangements of characters like above, we achieve successful compression. We have chosen our vocabulary to have longer chunks of characters that reflect common parts of words. This kind of compression hinges on the fact that the data we're compressing is primarily a predictable subset of the full space of possibilities.

Namely, in this case, we are playing in the small corner of English-language text within the space of all strings of characters. Random strings of characters do not fare as well.


Trying to understand the fundamental information content of a message is what we aim to get at. Compression methods serve as approximate upper bounds of Kolmogorov complexity. In other words, because we found that the long string of "a"s could be expressed 1:1 as "a"×20, we know the information content of "aaaaaaaaaaaaaaaaaaaa" can be no more than that of the instruction "a"×20. But if we can find a better one, that becomes our new upper bound.


Machine Learning models are often thought of through the lens of compression. An image generator is trained on billions of images, up to petabytes (1 million gigabytes) of data. But the model itself may only be a few gigabytes of trained weights. This somewhat parallels our example of how a solid color square could be represented not as an array of every single pixel value, but as the program that generates the image. Granted, we won't be able to recover the billions of images seen, but what is recoverable is the general concepts, objects, and patterns seen in the images. The preserved message here is the "grammar" of what constitutes a natural image.


Another example is the Autoencoder. Autoencoders are typically drawn something like this.

They are trained to take in a big piece of data, compress it down to a smaller size, and then be able to recover it as accurately as possible. They look like what they do.

You can think of it sort of as how we've come up with all kinds of compression algorithms for images like JPEG, WEBP, PNG... but instead of hand-crafting the algorithms, we task a neural network with learning the compression. The bottleneck in the middle incentivizes the network to extract the most important features from the data, or the ones that offer the greatest improvement to the decoded reconstructions.
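A minimal PyTorch sketch of the idea (the layer sizes here are illustrative, not the ones used in the experiment below); setting the bottleneck to 1 gives the pendulum setup described next:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=64 * 64, bottleneck=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck),      # the bottleneck: the compressed code
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                  # compress
        return self.decoder(z)               # reconstruct

model = Autoencoder()
x = torch.rand(8, 64 * 64)                  # a batch of flattened frames
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective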


To make this more concrete, we can imagine a pendulum. We can denote where the mass is with just one variable: the angle of the string with respect to a vertical line down the middle. This is isomorphic with the number line; it's just a 1-dimensional variable describing the location of the pendulum (not including physical properties like mass, velocity, etc.).

So in theory, we could imagine compressing frames of this video into just a single number when we restrict the context of the compression to just this view of this exact pendulum. In practice, does this work?


Fairly well actually. After training an autoencoder with a bottleneck of only 1 dimension on this video, I tried decoding values from [-2.5, 2.5] into views of the pendulum.

Code here

It's a bit scuffed, which we can in part blame on my makeshift autoencoder and a lack of optimization hyperparameter tuning, but it mostly gets how moving along the continuous number line maps into continuous changes in the position of the pendulum. It's not uncommon for features discovered by autoencoders to have interpretable high-level meaning to us, suggesting something about the way we encode our visual field into understandings, and possibly something even more generally about fundamental compressed semantic structures of our world.


Another example, even simpler than using neural networks, is Principal Component Analysis (PCA). This is closely related to eigenvector analysis, and it is also worth noting that a one-layer linear autoencoder approximates the result we would get from PCA. PCA illuminates the directions of greatest variance, which are more or less the features of greatest importance as described earlier.
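A short sketch with scikit-learn; the data here is random placeholder data standing in for, say, flattened face images, and the component count is arbitrary:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 4096)       # e.g. 1000 flattened 64x64 images (placeholder data)

pca = PCA(n_components=50)           # keep only the 50 most important directions
codes = pca.fit_transform(X)         # each sample is now 50 numbers instead of 4096
reconstruction = pca.inverse_transform(codes)

print(pca.explained_variance_ratio_.sum())  # variance captured by those 50 components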


Eigenfaces was an experiment in this vein. A large dataset of many faces was constructed. Then, principal component analysis was performed to extract a specific set of eigenfaces, such that any face could be described as a weighted combination of said eigenfaces.


An analogy I like is food recipes. While every recipe is unique, they are composed of common ingredients mixed in different ratios. Some ingredients are more essential than others. A pizza won't go very far without dough. Meanwhile a garlic sprinkle at the end doesn't make or break the existence of the pizza.


Similarly, PCA can identify which ingredients contribute more subtly, and in the interest of compression, we can prune those out first, thereby preserving the most important components.


A real world experiment of this is from the DALLE2 aka UnCLIP paper, where image representations are decoded back into images, but using fewer and fewer principal components. I like to think of it as us reducing the "semantic resolution" of the decoded image, where we get a coarser, less specific look at its content.


Tangential to the main point here, but they also showed how you can interpolate between two images. While mixing two images in pixel space just looks like they've been superimposed on top of each other, mixing them in this compressed feature space yields interpretable outcomes, emphasizing how each feature corresponds to high-level explanatory traits like overall color, shape, patterns, etc.



Complexity Arises from Simplicity

Something interesting to ponder is: how much complexity can we achieve from the simplest possible programs? This is the final "golf" of our reality.


The featured image of this post is a picture of the Mandelbrot fractal. The fractal is something you can explore infinitely. You can zoom in deeper and deeper and still find patterns you've never seen before.


Surprisingly though, the Mandelbrot is criminally simple. This is all it is.

It is just a recursive function in the complex plane. To create an image, you sample across many points. If a point stays stable or oscillates, you color it black. Other points will diverge to infinity; when one does, we can color it based on the number of iterations it takes to diverge, e.g. red=1, blue=2, and so on.
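A compact escape-time sketch of this in Python/NumPy; the parameters are illustrative, and a real renderer would add nicer coloring:

import numpy as np

def mandelbrot(width=512, height=512, max_iter=100, center=-0.5 + 0j, radius=1.5):
    """Iterate z_{n+1} = z_n^2 + c per pixel and record when each point diverges."""
    y, x = np.mgrid[-radius:radius:height * 1j, -radius:radius:width * 1j]
    c = center + x + 1j * y
    z = np.zeros_like(c)
    escape = np.zeros(c.shape, dtype=int)          # 0 means "never diverged": color it black
    for i in range(1, max_iter + 1):
        active = escape == 0
        z[active] = z[active] ** 2 + c[active]
        escape[active & (np.abs(z) > 2)] = i       # color by iteration count at divergence
    return escape

img = mandelbrot()  # map escape counts to colors; zeros stay black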


The other numbers I left in the caption tell you where to look for this exact spot in the Mandelbrot set, and how zoomed in we are.


The saved image takes up several megabytes on my computer. However, we could write a program that uses the Mandelbrot equation, performs the coloring, and spits out an image at the requested coordinates, all of that costing us maybe tens of bytes or so.


If this isn't convincing enough, there's a neat cult of cracked individuals writing shaders with incredible visuals from very short programs that can fit in a single tweet, links in captions.




There are also other examples of this, like John Conway's Game of Life. A simple program over a grid that flips positions on or off depending on a very simple rule set.
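One update step of that rule set, as a sketch using NumPy and SciPy; the glider pattern below is just an example initial state:

import numpy as np
from scipy.signal import convolve2d

def life_step(grid):
    """Apply one step of Conway's rules: count live neighbors, then survive or be born."""
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbors = convolve2d(grid, kernel, mode="same", boundary="wrap")
    # A live cell survives with 2 or 3 neighbors; a dead cell is born with exactly 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

# The "code" we write is just the initial on/off cells; the rules carry the rest.
grid = np.zeros((16, 16), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1  # a glider
state = life_step(grid)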


Despite its simplicity, you can build all sorts of complex systems, like calculators, clocks, or even the whole Game of Life itself within it. I like to think of it as a programming language, where the code you write is the initial cells you set on/off, and the rules carry the rest.


In general, a rule of thumb for many of the kinds of compression discussed here is that we switch from a data representation (storing the thing as is) to a function representation (storing a program and a "seed input", such that when run, we recover the desired thing as an output of the program). This reduces memory/storage needs while raising compute needs, a common dilemma in computer science.


Where this gets interesting is the span of everything, our entire universe, and what its minimum description length is.


There are many estimates of the number of bits it would take to store a state of our universe. This paper estimates it to be around 10^90 bits, which is approximately 1.03 × 10^65 yottabytes. To get a sense of how massive this is, the entirety of the internet is just a few zettabytes. And to put it in more relatable terms, a yottabyte is 1000 zettabytes, a zettabyte is 1000 exabytes, an exabyte is 1000 petabytes, and a petabyte is 1000 terabytes.


The raw number of bytes is unfathomably large, but can we compress it?


Rather than trying to store a state of the universe as is, can we convert from the data representation to the function representation? In other words, can we describe it as a rule set, an initial state, and an amount of iterations/time it takes to reach this state?


Similar to the Mandelbrot and Game of Life, the universe's ruleset actually isn't terribly complicated.


We have to consider gravity, how electromagnetic waves work, how sub-atomic particles function, and so forth, but not really a ton beyond that. We don't have to write out the rules for every organic chemistry reaction or how cells function. We can skip to the most fundamental rules of the universe and trust that processes at higher levels of abstraction emerge naturally from the sum of processes at the smallest components.


One complication is that unlike the other examples, the universe evolves stochastically, a symptom of quantum indeterminacy. Our initial conditions are no longer the sole player in outcomes. Intractable randomness injected in via quantum events steers our course in ways seemingly out of our hands.

This does not present an issue for our function in achieving plausible universe states, but to achieve 1:1 compression, we'd have to treat it "pseudorandomly" where we fix all random events as deterministic, incurring additional storage costs. We would effectively need to maintain a history of the outcome of every quantum decision and ensure this too is reproduced.


But let's forget compression for a moment and pivot our focus to creating new universes and what that requires from us.

With a ruleset, an unimaginably powerful computer, and a random seed dictating the initial conditions and all further quantum events, we can simulate universes, just like that.


Not so fast though. Thus far, when we've talked about compression, we've talked about storage costs, not RAM costs. We've talked about it as being able to recompute some final result by only storing the instructions. Though in decoding to obtain and hold this final result, we may find ourselves back at the original memory cost.


I would describe this as similar to storing the materials for a tent or an inflatable bounce house. When it's not in use, we can keep it stashed away pretty compactly, but when we want to make use of it, it needs to be deployed to its full size.


Similarly, when we decode the 512x512 image of red pixels, we end up with 512x512 LCD squares on our screen all set to [255,0,0]. The .mp4 and .mov files of video can be compressed to a comfortable number of gigabytes. However, a 100,000 frame video (27 minutes at 60fps) at 1920x1080 resolution, when all frames are fully materialized simultaneously as floating point values, can exceed 2 terabytes.
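The back-of-the-envelope arithmetic behind that figure, assuming 3 color channels stored as 32-bit floats:

frames = 100_000              # ~27 minutes of video at 60 fps
pixels = 1920 * 1080          # per frame
channels = 3
bytes_per_value = 4           # float32

total_bytes = frames * pixels * channels * bytes_per_value
print(total_bytes / 1e12)     # ~2.5 terabytes when every frame is held in memory at once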


Another example is the Game of Life. While we can efficiently store the initial conditions, when performing the update steps we generally need the full state of the grid in memory to compute over. There are probably ways around this, given that the update of a square is contingent only on its immediate local surroundings rather than global context, though the mileage varies and the compression is contextual: it hinges on a certain amount of order.

This is somewhat similar to how in Minecraft, among other games, there is a "render distance" where only structures within some distance of the player are rendered into the world. Farther away, the structures vanish, but their data is maintained should we want to retrieve it. The fully rendered in-memory form is much more expensive, so where possible we aim to keep things in the compressed storage form when not in use.


Back to our universe, we have a similar issue. Even if we can represent a given moment of our universe as a smaller set of instructions that achieves that result, as we run the simulation forward, we effectively have a full state of the universe in memory. Again, we might be able to leverage certain heuristics. For a given timestep in the physical world, we also have a property of locality: at the most granular level, things generally require direct contact to interact. So as we go through and update the position of every single atom in the universe (or whatever the most basic unit is), it would theoretically be possible to keep things outside of this view in a more efficient, temporarily inactive, 1:1 representation, if one exists.


Though it's uncertain how far we could push this. A common view seems to be that a bottleneck to simulating the universe with full fidelity (capturing the exactness of every subatomic particle) is that it would require a computer as large or complex as the universe itself. Currently, it's believed we cannot create something of greater information content from something of less. It would require the entirety of our universe to create something matching the complexity of our universe.


Is this the nail in the coffin for living in a simulation and one day being able to create our own simulations? Not at all.


If we can drop our fidelity requirement just a smidge and accept a slightly lower resolution universe, we have much more leeway. Not too dissimilar from how JPEG on some images can yield files well beyond 10x smaller while still preserving most perceptual quality. Instead of carrying everything in bulk, we can start by identifying what facets of our universe are most important and start pruning the components less important toward an objective. We can aim to create universes that are extremely convincing even if not exact. Maybe everything outside of direct human perception can run on a "low-power mode." Or maybe the real original universe has recognized particles beyond the smallest we're aware of in our world and they've been omitted for the simulation we all exist in. Have you ever thought it was strange how the speed of light, this one strange magic number, is a cosmic speed limit?


The complexity of our universe is absurd. Every attempt to imagine something more complex inevitably draws from elements within it. Things like math, supernovae, computers, urban industrial sprawl, and the millions of unique lifeforms on Earth, at both the macro and molecular level, have intricacies that are quite complex, despite the fact that all of this was born from the starting conditions of the universe: a relatively simple ruleset and billions of years for things to sort themselves out and fall into place.





 
 
 


