One of the most interesting things about the development of AI has been the order in which milestones were achieved. Relatively small models can create convincingly masterful, lifelike art, while a much larger language model may fail common-sense problems that your average 10-year-old would have no trouble with. And yet, at the same time, these models are incredibly capable. The same language models that fail at trivial tasks may also explain difficult math concepts in a remarkably intuitive manner and write genuinely interesting prose. How do we reconcile this?
Does it say something fundamental about the difficulty of these tasks? Is it a matter of the medium and the modeling techniques applied? For instance, a diffusion model gets to de-jitter pixels until a work of art is revealed, while a human is bottlenecked to a local view of the canvas and must create it stroke by stroke. Or is it a matter of what our criteria for success are?
My personal view is that it's a mix of all of the above.
There may be a more objective, universal difficulty metric for producing an output, one that doesn't always correlate with what humans find difficult. The ways humans have adapted have prioritized the tasks most integral to survival and everyday functioning, whatever is most evolutionarily favorable. If there were somehow an intelligent creature on this earth whose evolutionary fitness had for many millennia been rated by the quality and/or quantity of the art it created, its lifeline and its shot at preserving its bloodline into the future depending on it, who knows in what ways it would adapt and what it would consider easy versus challenging?
Another avenue for human bias is the way we rate success, or our threshold for being impressed. Something like art or poetry has many "correct" answers, and the right dose of spontaneity is a quality often rewarded. Meanwhile, a math problem generally has one correct answer that must be reached, and the reasoning, along with the work shown, may only have a few acceptable paths. There are fewer valid trajectories, with strict criteria that must be met before we are convinced.
Recalling for a moment that popular generative models like diffusion models or autoregressive transformers effectively funnel random samples into desirable information and trajectories, it would seem that the number of desirable trajectories, relative to everything that could be output, at least partially explains what makes a task difficult. Going through hundreds of matrix multiplies with each forward pass and placing every token necessary to land on one of a few correct trajectories feels like trying to hit the last pin in bowling, but throwing from the moon (the way probability weighs outcomes very much constrains this, but it's still pretty nebulous! Just mentioning this before I get hit with "but so do humans!"). This perhaps explains part of the benefit of chain-of-thought, self-reflection, and other prompting methods. By offering scaffolding, signal from external verifiers, and the ability to extend rollouts by self-correcting and backtracking, we help the model avoid veering off track while opening the door to more paths that can reach an answer.
While it might be frustrating that the difference between a correct answer and an incorrect one could be a matter of RNG noise, or random samplings, it's an incredibly interesting part of the generative process. In this post, I want to talk about how in the right doses, with the right guidance or filtering, and with the right luck, noise generates novelty. Novelty right at the fine line between collapse into incomprehensibility and striking gold on an uncharted frontier.
Noise for exploration
It's well known that noise can be used to more extensively search a space and balance exploration vs exploitation tradeoffs. To make this more concrete, we'll go through an example.
Let's imagine we have a space of all the wolves in the world. Our job now is to somehow obtain dogs through successive steps of cross-breeding.
In this simplified model, each point represents a wolf. Really, we should imagine a space with thousands of dimensions instead of two, but suppose our x-axis corresponds to something like amount of fur and our y-axis to fur color, and then imagine many, many other dimensions we can't visualize representing similar traits.
To start, we'll imagine that genetic mutations are not present. There is no randomness. Crosses will be represented as linear interpolations in this space, the main tool at our disposal. We'll assume there are no "invalid" crosses; all result in viable offspring. We'll also remove gender from the equation, so we can perform crosses between any two points.
If this is the case, we are actually fully confined to a limited search space defined by the biggest outliers in each direction. Let's run some crosses to see what we mean by this.
The blue points represent our next generation. Each sits somewhere on the line between its two parents, with its position along that line chosen randomly. We'll run one more generation.
Notice that we can't leave the convex polygon defined by the original red points, vastly confining the extent of what we can search. The new points in each generation are a predetermined kind of "new": novel and unique in that they haven't previously existed, but sitting behind a fog of war we already have a clear path into. We still can't reach uncharted regions. Enter noise.
Some form of randomness is generally an essential part of running an evolutionary optimization algorithm; the main questions are around the right dosage for good coverage of a space while still being able to zero in on a solution once we've found one. Not too dissimilar from how different real-life species have different mutation rates, and how mutations can become as much an obstacle, resulting in unviable offspring, as a benefit that increases overall evolutionary fitness. (Side note: see Hinton's wonderful theory, in the motivation section of the dropout paper, on a possible evolutionary analogue of dropout in reducing reliance on highly co-dependent genes.)
Let's start again from the beginning, but this time after each linear interpolation we randomly jitter where the point ends up.
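To make the difference concrete, here's a toy numpy sketch (the population sizes, jitter scale, and 2-D trait space are made up for illustration): crossover picks a random point on the line between two parents, and mutation is just Gaussian jitter added on top.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross(parents, n_offspring, jitter=0.0):
    """Breed offspring by linear interpolation between random parent pairs,
    optionally adding Gaussian 'mutation' noise to each child."""
    i = rng.integers(0, len(parents), size=n_offspring)
    j = rng.integers(0, len(parents), size=n_offspring)
    t = rng.random((n_offspring, 1))  # how far along the parent-to-parent line
    children = (1 - t) * parents[i] + t * parents[j]
    return children + jitter * rng.normal(size=children.shape)

wolves = rng.normal(size=(20, 2))  # founding population in a toy 2-D trait space

pop_pure, pop_noisy = wolves, wolves
for _ in range(50):  # many generations of crosses
    pop_pure = cross(pop_pure, 20, jitter=0.0)    # interpolation only
    pop_noisy = cross(pop_noisy, 20, jitter=0.1)  # interpolation + mutation

# Without noise, no descendant can exceed the founders' extremes in any
# coordinate; with noise, the population can drift past them.
print(pop_pure.max(axis=0) <= wolves.max(axis=0))  # always [ True  True ]
print(pop_noisy.max(axis=0) > wolves.max(axis=0))  # often [ True  True ] by now
```

The jittered run exceeding the founders' extremes is exactly the escape from the convex hull.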
See now that we've broken free from our enclosure! It's much more reasonable to imagine that after many generations of crosses, and the right selection of which bloodlines continue, we could end up with dogs.
Kernel regression methods very literally interpolate data points, and trained neural network classifiers do something similar. For generative models like diffusion and LLMs, this is possibly also a valid reductionist view, though I'm a tad hesitant to say so with certainty.
When we generate sequences of text with a language model, there are two forms of generation: stochastic sampling and argmax sampling.
At every step of generation, the language model does not predict the next word outright but instead presents a probability for every word in its vocabulary (technically, every token) being the next one. Argmax sampling chooses the highest-probability word at every single step, yielding a single deterministic trajectory. Stochastic sampling randomly draws a word from the vocabulary, weighted by the predicted probabilities. Under stochastic sampling, any sequence of text can generally occur with non-zero probability, though it may be exceptionally rare.
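As a toy sketch of the two modes (pure numpy, with a made-up five-word vocabulary and hand-picked logits standing in for a real model):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["red", "blue", "green", "velvet", "lymphoma"]

# Pretend these are the model's logits for the next word after "The dress color was".
logits = np.array([3.0, 2.5, 2.0, 0.5, -4.0])
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over the vocabulary

# Argmax sampling: the same deterministic choice, every time.
print(vocab[int(np.argmax(probs))])  # -> "red"

# Stochastic sampling: draws weighted by probability; "lymphoma" is
# possible, just exceptionally rare.
print(rng.choice(vocab, size=10, p=probs))
```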
In this case, "The dress color was lymphoma" is a possible arrangement of words, though a very rare and unexpected outcome.
Even rarer might be completing "The dress color was _____" not with a single next word but with a description of your entire life story, from everything that has already happened to everything that will happen. Profoundly rare, but not impossible. This is the principle behind the Library of Babel, a thought experiment of an unfathomably vast library containing every possible text that could exist.
Nowadays, many sampling setups with LLMs use stochastic sampling. If we presented the LLM with a math problem and hoped to solve it with argmax sampling, we would have only one shot at getting it right, as there is only a single deterministic string we can produce. Instead, we could run stochastic sampling 100 times and try strategies like majority voting (what answer do the majority of generations give?), or, if we have a means of rating generations by their results or thought process, we could prioritize those with promising reasoning while pruning out those that screwed up early on.
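Majority voting is only a few lines; here's a hedged sketch assuming hypothetical `generate(prompt)` (returns one stochastic completion) and `extract_answer(text)` (pulls the final answer out of a completion) helpers:

```python
from collections import Counter

def majority_vote(prompt, generate, extract_answer, n=100):
    """Sample n stochastic completions, then return the most common final answer."""
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```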
Another trick is adjusting the temperature of the distribution. Effectively, decreasing temperature rescales the distribution such that higher probabilities surge higher while lower probabilities drop even lower, "sharpening" it. Raising temperature, in the limit, gives you back a uniform distribution where every word has equal probability. You can see in the diagram below that lower and lower temperatures will eventually place the entirety of the probability mass on a single word, bringing us right back to argmax sampling.
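Concretely (a minimal numpy sketch reusing the toy logits from above):

```python
import numpy as np

def sample_probs(logits, temperature=1.0):
    """Softmax with temperature: T < 1 sharpens the distribution, T > 1 flattens it."""
    z = logits / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([3.0, 2.5, 2.0, 0.5, -4.0])
for t in (0.1, 1.0, 10.0):
    print(t, np.round(sample_probs(logits, t), 3))
# T = 0.1 puts nearly all mass on the argmax; T = 10 is close to uniform.
```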
One interesting observation: many users have noted that higher-temperature sampling can yield more diverse and interesting outputs. This is not a coincidence. The diversity lies in the definition: we upweight the probability of less likely outputs, so you're bound to explore rarer regions of the space of text, similar to the point made with our "The dress color was lymphoma" example.
The connection to how interesting the outputs are could use some elaboration, though. Rarer outputs carry higher surprisal, the information-theoretic quantity -log p(x) (entropy is its expected value). This should make sense: lymphoma is a surprising way to describe a dress color. Lower-temperature sampling, on the other hand, can lead to extremely predictable, possibly cliche-sounding text. In the right doses, this "off the beaten path" style of sampling should let us encounter novel pieces of text. You increase the risk of nonsense, but some generations might yield unexpected and clever prose, hence the interestingness.
From this, my read is that novelty is born from far-fetched moonshot attempts at traversing uncharted areas of data space, specifically the subset of those attempts that happen to still make enough sense to carry meaning and captivate human attention.
This is a subjective evaluation, meaning the criteria of interestingness and making sense, and what we consider "order", are specific to the human experience. We can achieve maximum surprisal by picking words entirely at random out of a hat, like the following:
preship malproportion angelology anarchical bonnyvis
dowless embossman oryctics protonymphal amorpha
piratize determinate spyboat roentgenograph snod
peritrichan magnetoid dressline mocuck preprofess
Though ironically, despite these being the most surprising outputs we can create, giving the greatest potential for novelty, they all look almost equivalently nonsensical!
Similarly, here are four images of randomly sampled noise.
Again, they are all extremely different from each other, but not in any way that is meaningful to us. The eyes kind of just glaze over looking at this stuff.
Now, something that may have been novel and interesting is Van Gogh's pioneering work in the impasto painting style, which deviated from the typical art styles of his time.
Compared to the kinds of art prevalent at the time, this may have been properly surprising. Since his rise to fame and the widespread iconography of his work, it is probably less surprising, which reflects the fact that surprisal, or the amount of "information" something conveys, is relative to its context or distribution. In another universe, this could have been the norm, and photorealistic-style paintings would have been what shocked the world. This fact took me a long time to really understand deeply: any given data point in isolation is meaningless, perhaps as meaningless as the random noise sampled above, and meaning is only introduced when there is a context. Furthermore, a work in machine learning image classification showed that the features image classifiers latch onto may not always correlate with human perception, and what appears random to us may be deeply meaningful to machine learning models! It just goes to show how highly specific our perceptual preferences, our language, and our visual grammar are within the space of all possible understandings.
For a similar reason, I have appreciated the AI-generated imagery of smaller and older models. They hopelessly misfired at faithfully modeling the distribution of images, though enough sense often came through for the results to be interesting.
The question I have now is: is noise a fundamental component of creating novelty? Noise here being the sporadic, unpredictable, and sometimes unstable nature of humans.
In another post, I tried to break down what randomness even is, and concluded that much of what we call noise is actually meaningful processes lying beyond our modeling capabilities, which, for scientific purposes, it has sufficed to factor in as noise.
So possibly we shouldn't write this off as noise, but instead look at it as a deliberate seeking of novelty away from the rest of the crowd.
In thinking about the human drive to seek out novelty, balanced with the criterion that things must still make sense to us, I attempted to create a generative model judged by both, called NoveltyGAN. The GAN objective asks that generated images be perceptually convincing, but the trick is that we also use a pretrained density model (diffusion, autoregressive, normalizing flow, VAE, ...) and flip the sign of its loss, asking for images of lower probability with respect to their context, prompting exploration of rarer regions of the space. I had very limited play with it, and my gut tells me using diffusion for the density estimate might not be as good as something like a normalizing flow, which has recently shown the capability to make for expressive generative models on diverse datasets. A possible pitfall is that the model can create adversarial outputs that are not conventionally perceptually convincing but are still enough to trick the discriminator, satisfying both constraints while never reaching images that are intelligible to us. Though as an area of study, I think modeling how we can balance novelty-seeking with fidelity to an accurate model of the world is very interesting.
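To make the objective concrete, here is a minimal PyTorch-style sketch of how the generator loss could be expressed; `generator`, `discriminator`, and `density_model` are hypothetical modules, with `density_model.log_prob` standing in for whatever exact-likelihood model (e.g. a normalizing flow) is plugged in, and the weighting is arbitrary:

```python
import torch
import torch.nn.functional as F

def novelty_generator_loss(generator, discriminator, density_model, z, novelty_weight=0.1):
    """Generator loss = realism (fool the discriminator) + novelty
    (negated log-likelihood under a pretrained density model)."""
    fake = generator(z)

    # Standard non-saturating GAN term: push D(fake) toward the "real" label.
    target = torch.ones(fake.size(0), 1, device=fake.device)
    realism = F.binary_cross_entropy_with_logits(discriminator(fake), target)

    # Flipped density term: minimizing log p(fake), instead of maximizing it
    # as in likelihood training, seeks out rarer regions of image space.
    novelty = density_model.log_prob(fake).mean()

    return realism + novelty_weight * novelty
```

The tension described above lives in `novelty_weight`: too low and you get an ordinary GAN, too high and the adversarial-output pitfall takes over.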
Predetermined Novelty
Pivoting back to the previously mentioned point on predetermined novelty, I wanted to discuss the search processes that go into things like art and research, both things I have had some firsthand experience with.
Art and research can both be considered a search over a space, or an optimization algorithm. Ultimately, we want to reach an output that has value, whether that be academic merit, advancing new technologies, the observation that one's painting skills have improved, or aesthetics or expression appealing to oneself or others. Let's also not forget the value and enjoyment of the journey itself, especially for art, though that's a different topic.
I would argue that art, for a number of reasons, has a much more ambiguous search space and a noisier search process. For one, the landscape of value one consults when making art varies substantially from person to person. We could assess the aggregate sense of value held by the masses, and this is typically what reward regression models learn, assuming they've been trained on a properly diverse bag of ratings from many people. Regardless, it's likely that our own individual, subjective senses of value still inform the process. Secondly, as mentioned earlier, there are many "correct" answers, and it's not always obvious what they are. The criteria for "good art" are qualitatively ambiguous. In the face of that uncertainty, our limited perspective and biases, and our own desires driving the process, I would argue we are less driven toward a universally agreed-upon ideal and instead prompted to explore quite sporadically.
On the other hand, research can be made very rigorous. Research typically involves firm, quantifiable, objective criteria, using prior work to predict future outcomes, and a strict process to ensure reproducibility. Whereas art welcomes all kinds of approaches, research generally asks that everyone follow a set of rules to participate in the conversation and deliver robust findings. "Make a nice piece of art" is a far less constrained objective than "Make the language model faster", and depending on the kind of person you are, whether you love the expanse of creative freedom or need narrower, well-defined structure in your endeavors, either objective may be the more frightening one.
To emphasize, though, this is my read of the conventional cases for these lines of work. Art can be made into a stricter optimization process by trying to reward-hack whatever brings in revenue, and some independent researchers, especially in open source, may come up with their own processes leading to novelty and value in their own right.
In my experience, research in AI often feels like there is a looming hand of fate guiding its direction. In some cases the writing is practically on the wall as to what the next step in development is, and this can often lead to several concurrent works exploring the same technique. Somehow, and I say this naively, many of our milestones feel like they were destined to happen, and it's just a question of when. For instance, I wonder in how many universes, starting from the date the first computer was created, transformers became the dominant network run on GPU hardware, or the paradigm of self-supervised pre-training followed by reinforcement learning arose, or convolutional networks had a period of fame. It feels like fate because it feels like there is an already-existing set of rules written into the universe dictating what works and what doesn't, and in pushing along different fronts and discovering what continues to advance and what hits a dead end, we naturally follow some kind of predetermined river flowing toward desirable outcomes, though perhaps not in a required order.
This very simplified view won't capture how methods may resurface or how the river may split and recombine, but I think it gives a rough idea of the process. This river isn't something we can see, though it's there, shaping and filtering our work, not so different from feeling your way through the dark and using the walls to guide your path. I would describe the progress of research as most closely resembling something like Particle Swarm Optimization: we have parallel ongoing searches in different regions, the deepest minimum attracts the most attention and hands, and yet we keep the means for a good bit of coverage over less obvious regions as well.
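For the unfamiliar, here's a bare-bones toy version of PSO in numpy (the coefficients are conventional defaults, nothing tuned): each particle is pulled toward both its own best find and the swarm's best find, which is roughly the "deepest minimum attracts the most hands" dynamic.

```python
import numpy as np

rng = np.random.default_rng(7)

def pso(f, n_particles=30, dim=2, steps=200, w=0.7, c1=1.5, c2=1.5):
    """Bare-bones particle swarm optimization minimizing f over R^dim."""
    x = rng.uniform(-5, 5, (n_particles, dim))   # particle positions
    v = np.zeros_like(x)                         # particle velocities
    pbest = x.copy()                             # each particle's best-so-far position
    pbest_f = np.apply_along_axis(f, 1, x)       # and its value
    gbest = pbest[pbest_f.argmin()].copy()       # the swarm's best-so-far position

    for _ in range(steps):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # inertia + pull toward personal best + pull toward the swarm's best
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        fx = np.apply_along_axis(f, 1, x)
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest

# A toy bowl-shaped landscape; the swarm converges near the minimum at the origin.
print(pso(lambda p: (p ** 2).sum()))
```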
In due time, with lots of mistakes and some breakthroughs, we are coaxed to glide down the slopes of optimization toward what our universe (and current hardware) allows to be performant. Roughly, I like to imagine each era of research as spending time converging to a minimum, seeing how deep that minimum truly is, and then by chance discovering a new frontier suggesting there are even deeper minima out there, with an illuminated path toward them.
That red line, illustrating a breakthrough that initially has worse results than current methods, is what keeps me up at night. Quite a few times, a new method is produced and it is clear that it will change the game. Other times it may not be so clear. It makes me wonder how much has been shelved because the hardware didn't yet exist to make it practical, or because it never got the right attention for lots of minds to tinker with it and see the future paths. If you weren't around for it, try to imagine what it would have been like to see the first works on transformers or diffusion models, and whether the strength they have today would have appeared predictable. What is happening right now that could amount to a similar caliber of legacy, if anything?
Predetermined novelty is less obvious with art. There are far fewer rules. We could make a few educated guesses, like that the photorealism style was perhaps bound to arise; it aligns with humans' imitative nature and offers a contest of who can paint a reference most faithfully, catering to our competitiveness. Science-fiction art also feels pretty likely, in that a fascination with progress and extrapolation into the future may inspire how we create. Though it generally feels a bit handwavy and harder to formalize.
Strangely though, my experience with making art has felt closer to my description of research in some ways. I never picked up an aptitude for painting or other means of creating art. But at the start of high school, I discovered the remove-background tool for images in Microsoft Word, along with the formatting that allowed you to overlay multiple images into collages, which became a common pastime during unexciting classes. The degree of freedom was fairly limited, though. Soon after, I received a license to Photoshop, which my school offered. The Photoshop interface felt like an airplane cockpit with its overwhelming number of controls, though out of curiosity, I had to try all of them.
To me, this was a very different experience of creating art. I didn't need the dexterity to maneuver a brush, Ctrl+Z offered infinite forgiveness, and the layering/visibility switches allowed me to maintain multiple versions of an ongoing piece.
The overall process was pretty ritualistic. I'd grab a few images that seemed like they could have some potential, though generally without a clear vision in mind. Then I'd start running down my list of go-to strategies for adding flair or general interestingness to the canvas. I would visually assess which directions were worth continuing and which did not fit, often making several checkpoint copies so I could recover if a direction I took did not work out. From there, it was rinse and repeat, gradually moving outward from my typical methods to simply brute-force trying every single offered tool.
I ironically found this process stressful. To begin with, the space of all next steps was huge, and the possibilities for compositions of multiple steps grew combinatorially. Yet I had to try as many directions as I could for fear of overlooking a promising path. Secondly, my evaluation criteria were fickle. Toward the end of a piece, I would fiddle with fine adjustments to the color sliders, sometimes at what I think was an imperceptible level, for up to days on end. I would switch between versions back and forth obsessively while having no read on which was actually better.
This style of search over a set of paths made it feel less like I was creating and more like the end result already existed in a predetermined way. In other words, it was almost as though I was pulled toward the result by a means similar to my description of research: iterative improvement and filtering out dead ends. All of this was decided by my evaluation criteria and my policy for choosing actions, which, while always in flux and varying moment to moment, I realized were predictable enough that I had a general sense of where things would go and could feel the sense of inevitability.
I can't speak to the experience of creating art beyond my own, and while there is still lots of iteration, I do picture it as generally being less annoyingly structured than I have made it out to be.
This spectrum between free-flowing exploration and constrained optimization has made me think a lot about the industrialization of art and the "Marvelification" of media. Marvelification, the namesake taken from Marvel, a prime perpetrator, is a term I've used to describe the dwindling creative ambition in film, leading to excessive clichés, hollow characters, and heavy spend on flashy CGI eye candy to hide that the film was built not from expression but from a cash-grab formula.
It makes sense. The budget that goes into films is massive. Investors want faith that they will see a positive return. Identifying the key factors that have historically driven box office numbers and exploiting them is, from an economic perspective, pretty wise. Though as a viewer, you can smell it; it's palpable. It is clear which age demographic each inserted joke is meant to appeal to, how a film is engineered to draw in as many kinds of audiences as possible, and how the plan to turn the characters into McDonald's Happy Meal toys was made well in advance. Sequels, despite their poor reputation for falling short of the original, have the benefit of better odds of bringing back the same crowd from the first success. This is why we're now on Ice Age 6, Toy Story 5, and Shrek 5, upcoming in 2026. I have long felt that Marvel films were incredibly stale, but it has more recently reached a point of blatantness where people are very much picking up on it and memeing the shallow writing.
Aside from needing to serve as a profit machine, the other aspect that may make it hard for art and novelty to thrive in this setting is the size of the production team, generally at least hundreds if not thousands of total crew members. There are two potential issues I could see with this scale. One is that tasks become so divided and granular that I imagine it lends itself to tunnel vision and a loss of focus on the big picture. Granted, there are specific roles, like directors, tasked with maintaining that focus, but everyone's involvement and execution matters. Secondly, it may also suffer from the common pitfalls of complex bureaucracies and big corporations: friction in execution, a game of telephone to turn a vision into action, among other obstacles. It's strange to think that the creation of art needs Jira tickets.
Comments

I see this as a pro of AI art.
I'm on the Van Gogh section, which is a direction I really like. But this gave me a thought: what is the bare minimum of order in a system (a sentence, a painting, etc.) necessary for us to care about it? (Give it meaning, pay attention to it, find a narrative in it, an emotion, etc.)
I assume it's actually very little
Like how we see shapes and images in clouds or stars
exploitation vs exploration
I think you need to try shit
all over the place
but also somehow have a good knack or heuristic for singling out quality
kinda wild though how so much can be framed as a search space. people have really said the final frontier of skill for humans is taste, and I'm increasingly understanding that; having a good way to search a massive space and single out what works is definitely a skill. have you ever heard of the exploration/exploitation tradeoff?
yes, exactly this. for example, AlphaZero would basically consult a search tree over all possible moves in a chess game and simulate how they could play out if the opponent also acts optimally. but this quickly becomes way, way too massive to model computationally. so you need a heuristic: you need to be able to quickly trim down that tree and have a good way of only searching the paths that matter, and somehow know that ahead of time. another example is if you wanted to generate the works of Shakespeare word for word. the monkeys-on-typewriters solution would take a very long time, though they might eventually reach it. you could ease this a bit if you gave the monkeys keys that corresponded to picking a word rather than an individual character. meanwhile, an LLM is constrained by grammar and other predictive rules of text, vastly narrowing its search space