How do we tackle noisy recognition?
- Ethan Smith

Something I've been thinking about a lot lately is how humans handle noisy recognition. Maybe you recognize the image above; if not, you probably haven't seen the original. The fact is, many people do, astonishingly, recognize the person in the image, despite it being substantially downsampled and degraded. The message/signal, the identity of the person, remains intact for us. How?
Meanwhile, AI visual models tend to fail hard at these perturbed recognition tasks, being vulnerable to even relatively mild amounts of image distortion.
The fact that current AI models can't recognize this feels like a missing piece. Though another part of me wonders if we should even want AI models to be able to recognize these edge cases.
Not for reasons around robots beating CAPTCHAs or anything like that. Rather, is it a desirable trait for vision that we can look at a smear of pixels and hallucinate something there that isn't exactly present?
We can look at things like ink blots and clouds and see entire stories within them, and while this can serve as an interesting microscope into the psyche, is it beneficial to project our minds onto reality like this?
Here, I want to discuss the status quo of how AI currently "sees", how we might strategize an approach that allows for seeing obscured messages through "controlled hallucinations", and whether this might be desirable.
How AI Currently Sees
Computer vision, in most familiar forms, operates in a bottom-up fashion. An incoming visual signal, discretized onto a grid of pixels, arrives at the input layer of a neural network and is propagated all the way through to the end to be processed into a semantic representation. In short, image -> conceptual representation. This flow is unidirectional.
An alternative form of flow is top-down processing, where we proceed in the reverse direction, beginning from the idea and mapping back to the visual stimuli. In short conceptual representation -> image. This type of processing is typically reserved for generative tasks, where we may provide a text caption such as "a cat" to guide a model towards generating a picture of a cat.

There is some nuance we'll uncover as we go on, but these modes often appear separately from one another, and for tasks such as image classification, segmentation, encoding, and visual language modeling, processing is typically almost exclusively bottom-up.
Vision, at minimum, requires bottom-up processing. We need some means of taking in a signal and further processing it. However, it is possible to have visual processing that makes use of both forms.
How Humans See
This section summarizes this video and also borrows visuals from it.
When perceiving, humans make use of a mixture of top-down and bottom-up processing.
Higher cognitive areas constantly generate models of what we expect to perceive based on prior knowledge, context, and recent experience. These predictions flow downward through the neural hierarchy.
Meanwhile, raw sensory data flows upward from our sensory organs. Importantly, what moves up isn't all sensory data, but primarily the prediction errors - the differences between what was predicted and what was actually sensed.

The brain assigns different weights to prediction errors based on their estimated reliability. In clear conditions, sensory input gets higher weighting; in ambiguous situations, prior expectations may dominate.
The two streams attempt to align with each other and meet in the middle. We hold "hypotheses" about what might be in front of us, and send them downward. Meanwhile, the view of the object in front of us enters through the eyes and is sent upward. In other words, the final result of what we interpret is a mixture of what we guess and what actually enters the visual stream.
This is an efficient design that avoids computational waste by leveraging existing knowledge.
If the two streams agree early on, then there is no need for further processing: our hypothesis was correct. If not, the gates are opened for the visual signal to propagate further, and we adjust our hypotheses with the new incoming evidence. If you stare at the same thing for a while, there is no purpose in repeatedly processing the entire visual field. This becomes our new "default", and we then only need to process the changes or residuals from this new default. We depend on expectations, and only process unexpected information, which then becomes part of our expectation. The entirety of this scheme is known as Predictive Coding.
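To make this loop concrete, here is a minimal numerical sketch of the predictive coding idea, assuming a single scalar feature and a simple precision-weighted update; the variable names and numbers are my own illustration, not any particular neuroscience model.

```python
def predictive_coding_step(prediction, sensory_input, sensory_precision, prior_precision):
    """One update: only the prediction error flows 'upward',
    weighted by how reliable the senses are believed to be."""
    error = sensory_input - prediction                      # what was NOT expected
    gain = sensory_precision / (sensory_precision + prior_precision)
    return prediction + gain * error                        # move the hypothesis toward the evidence

# Clear viewing conditions: senses are trusted, so the hypothesis updates quickly.
prediction = 0.2                 # "I expect to see a dimly lit wall"
for observation in [1.0, 1.0, 1.0]:
    prediction = predictive_coding_step(prediction, observation,
                                        sensory_precision=4.0, prior_precision=1.0)
print(round(prediction, 3))      # converges toward 1.0 within a few steps

# Ambiguous conditions (seeing in the dark): senses are down-weighted,
# so the prior expectation dominates the percept.
prediction = 0.2
for observation in [1.0, 1.0, 1.0]:
    prediction = predictive_coding_step(prediction, observation,
                                        sensory_precision=0.1, prior_precision=1.0)
print(round(prediction, 3))      # stays close to the original guess
```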
Additionally, the top-down stream allows for generating plausible predictions when information is incomplete. For instance, if we see a cat walking out from around a wall, but can only see its head, based on the entirety of our past experiences, we make the reasonable leap of faith that this head is attached to a cat-like body with the same fur coloration as well.

The top-down stream, in a sense, has the final say in this operation. We constantly generate "guesses" and then seek evidence to either confirm or reject our guess. It is also responsible for adjusting its beliefs when it encounters errors or mismatches with new information entering our sight, aiming to minimize disagreement, or "Free Energy".

The top-down component is also quite important due to how little of the world the eyes actually take in. (Artist renderings are to be taken with a grain of salt.)
There is much in our vision that we see but that isn't actually present, and some things that are present but purely obstructive, like our blind spot. Our top-down processing can learn to ignore the constant parts and fill in the gaps of the parts we need to know.
This assumptive mechanism is also the reason optical illusions can fool our visual system, like how an inside-out (hollow) mask still appears to pop outward, given how accustomed we are to seeing faces that way. We have learned a number of predictive context clues based on lighting, shadows, and contiguity of edges that can sometimes fail us when applied in edge cases. It is also the reason we may see things in the dark that aren't there: our top-down processing is still generating guesses, but it has little to no sensory input to keep it in check, causing it to dominate the resulting interpretation. Psychoactive drugs that alter our perception can also be explained as interfering with the balance between top-down and bottom-up, as well as skewing the actual mechanism of each.



Perception that works by mixing our expectations (priors) and raw senses (evidence) to produce the final result (posterior) is considered a Bayesian approach, a statistical framework that takes into account not just the evidence coming in but also what we already know.
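A minimal sketch of that Bayesian mixing, assuming everything is Gaussian so the posterior has a simple closed form (a precision-weighted average); the specific numbers are just for illustration.

```python
def gaussian_posterior(prior_mean, prior_var, obs_mean, obs_var):
    """Combine a prior belief with noisy evidence; the result is pulled
    toward whichever source has lower variance (higher confidence)."""
    prior_prec, obs_prec = 1.0 / prior_var, 1.0 / obs_var
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_mean)
    return post_mean, post_var

# Strong prior ("faces bulge outward"), weak evidence (a dim, hollow mask):
print(gaussian_posterior(prior_mean=1.0, prior_var=0.1, obs_mean=-1.0, obs_var=2.0))
# -> posterior stays near +1.0: the expectation wins, as in the hollow-mask illusion.

# Weak prior, sharp evidence (bright, clear viewing conditions):
print(gaussian_posterior(prior_mean=1.0, prior_var=2.0, obs_mean=-1.0, obs_var=0.1))
# -> posterior lands near -1.0: the senses win.
```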
We know at least this method of perception is desirable from an efficiency standpoint, but here, we'll imagine AI has infinite compute. What is in question is whether the strategy of cycling guesses with evidence until agreement is reached is a trait we want AI to have.
Dreaming is also an interesting case here. In a dream you both construct an experience, and experience it as well.
Thus far, it appears that it can be useful for generalization to new unseen scenarios, but optical illusions reveal a few failure modes. Though ultimately, whether the benefits outweigh the cons is up for debate.
These controlled hallucinations are incredibly useful when they are right, but are problematic when wrong. So if we do opt for this route, it seems essential to understand how to build in assumptions that are correct most often and can adapt readily to failures.
AI that performs both bottom-up and top-down
There are a few cases of neural networks, I would argue, that do make use of the dual-stream approach.
While pure one-step generative models like normalizing flows or GANs do not appear to exercise bottom-up processing, multi-step methods like diffusion do.
Diffusion model generation begins with a canvas of pure noise, which we can analogize to a lack of signal, or a scrambled signal, like looking at a cloud or seeing in the dark.
At the very first step, without much context to work with, our guess at what the final image looks like is primarily top-down.
From our guess, we then slightly update or refine our canvas to add a bit of context to it, now treating it as signal we will then read bottom-up in the next step.
As we continue removing noise from the canvas, the data is revealed, lessening the role of top-down processing and increasing the role of bottom-up.
Typically when a diffusion generation is visualized, we are observing the current state of the noisy canvas. Here, however, is a visualization of our guess of the final denoised output at every step. In early steps, this changes significantly. Our context is low, so our hallucinated end result is in flux. But as we add more context, it starts to solidify as agreement is reached.
A key difference here is that we are allowed to update the data with our guesses. Normally in human perception, your guesses need to adapt to static, unchanging phenomena. Here, however, the guesses actually change reality and modify what is received bottom-up on the next step. The model takes a look, adds its own hallucination on top, and repeats until the hypothesis and the visual signal become the same.
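Schematically, the loop looks something like the sketch below: a toy DDPM-style sampler in which the predicted noise is converted into a guess of the final image (x0_hat) at every step, a little of that guess is written back onto the canvas, and the canvas is re-read on the next iteration. The `denoiser` here is a dummy stand-in whose only belief is a fixed `prior_image`; a real model would be a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Stand-in "denoiser": it simply believes the clean image is `prior_image`,
# i.e. the top-down expectation baked into its (here, imaginary) weights.
prior_image = np.ones((8, 8))
def denoiser(x_t, t):
    return (x_t - np.sqrt(alpha_bars[t]) * prior_image) / np.sqrt(1.0 - alpha_bars[t])

x = rng.standard_normal((8, 8))          # pure noise: no signal, all guess
for t in reversed(range(T)):
    eps_hat = denoiser(x, t)
    # The model's current guess of the *final* image (the hallucinated end state);
    # visualizing this at each step gives the "guess in flux" animation described above.
    x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    # Write a little of that guess back onto the canvas (standard DDPM posterior mean):
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise

print(np.abs(x - prior_image).mean())    # ~0: the canvas has converged to the model's prior
```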
Another example of this is Google's DeepDream. DeepDream takes a model that is purely bottom-up but runs optimization to reverse the flow of the model. At every step, we encode the canvas, compare it to a reference concept (for instance, how much does this image embody "dogness"?), and then optimize the image to embody more of that concept. Once again, we are modifying reality, so to speak. The resulting images are quite reminiscent of psychedelic visuals, possibly hinting at what happens when top-down processing overpowers bottom-up.
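A rough sketch of that optimization loop is below, assuming PyTorch. DeepDream proper maximizes intermediate-layer activations of a pretrained Inception network; this toy version just runs gradient ascent on a single class logit of a small stand-in network to show the mechanic.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained classifier; in practice you'd load a real one.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

image = torch.rand(1, 3, 64, 64, requires_grad=True)   # the "canvas"
target_class = 3                                        # e.g. "dogness"
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(100):
    optimizer.zero_grad()
    score = model(image)[0, target_class]   # bottom-up read of the canvas
    (-score).backward()                      # top-down push: make it embody more of the concept
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)               # keep pixels in a valid range

print(model(image)[0, target_class].item())  # the concept score should now be higher
```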

This is why outputs of models are often referred to as hallucinations. Not just when they hallucinate the wrong answer, but more generally: the entirety of their outputs begins from nothing or partial context, and is then built by a cycle of reading the current state and making a guess that is written back onto the canvas.
Hallucination is a feature, not a bug. It is the whole means by which we are able to generate new content. It is only problematic when the continued sentence by the language model tells me Paris is the capital of Antarctica. However, when the model correctly tells me Paris is the capital of France, it has still hallucinated one of many possible text continuations, just one that happens to align with our reality. Therefore, the holy grail lies in achieving controlled hallucinations.
However, when we feed an image to a visual language model for it to answer questions on, the visual signal itself is static. While the text thereafter is hallucinated, the image itself is processed entirely bottom-up.
How could we achieve top-down/generative recognition in AI? What are the advantages?
A few thoughts come to mind as to how we may develop recognition models with top-down processing.
This is a prime example of where we'd want to be cautious. Upscaling a heavily compressed image of someone reveals the model's priors/biases, biases that do not align with our own. A popular image of someone famous like Obama carries a high level of expectation/bias given its iconicity, and perhaps the features we most depend on for recognition remain more intact despite the destruction, so the distorted signal is enough to trigger proper identification for us. The model, however, imaginably trained on a dataset of disproportionately white people, is biased in a different direction.

Now, the funny thing with guesses from partial, compressed, or distorted information like this is that there is no right answer. For all we know, both we and the model are wrong and the distorted picture of Obama is actually a close Obama look-alike. When going from low information to higher information content, we inevitably have to fill in and hallucinate the gaps. It is a one-to-many problem, in that there are many possible high-information images that, when compressed, could give the exact observed downsampled version. Every person and AI model has different inductive biases as to what they will assume, and it's quite possible someone who has never seen Obama could make a guess like the example on the right.
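Here is a tiny numerical illustration of that one-to-many problem, under the assumption that downsampling is simple average pooling: two visibly different high-resolution patches collapse to the exact same low-resolution observation, so no upsampler, human or machine, can know which one was "really" there without leaning on a prior.

```python
import numpy as np

def downsample(img, factor=2):
    """Average pooling: each low-res pixel is the mean of a factor x factor block."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

a = np.array([[0, 2, 0, 2],
              [2, 0, 2, 0],
              [0, 2, 0, 2],
              [2, 0, 2, 0]], dtype=float)   # a checkerboard pattern
b = np.ones((4, 4))                          # a flat gray patch

print(downsample(a))                         # [[1. 1.] [1. 1.]]
print(downsample(b))                         # [[1. 1.] [1. 1.]]
print(np.array_equal(downsample(a), downsample(b)))  # True: different originals, same observation
```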
Confidence
Where this becomes appalling is the confidence the model appears to show. When people don't immediately recognize a downsampled image, we might admit we don't know who it is. Instead of jumping to a conclusion, an exact sample, we may sit in the superposition of all the possibilities it could be without actually committing to any answer. The model, on the other hand, whether a stochastic or deterministic upsampler, is asked to deliver a sample instead of a superposition. I would equate this to being given a downsampled image of a person and being forced to make a high-fidelity sketch of what this person could look like. You'd have to rely on your biases to fill in the gaps.
A lot of models are actually capable of exposing their uncertainty; it's just not often revealed. For instance, the sharpness of a VAE's output distribution, or of a categorical distribution over discrete outputs, reveals something of the model's confidence in its answer.
Nonetheless, it feels important to use confidence/uncertainty measurements as a guardrail against incorrect hallucinations. This is some of the philosophy behind projects like Entropix which use the entropy of model predictions to choose what words a language model puts out.
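A rough sketch of that kind of guardrail, assuming access to the model's next-token logits; the threshold and the abstain behavior are arbitrary choices for illustration, not Entropix's actual policy.

```python
import numpy as np

def entropy_of_logits(logits):
    """Shannon entropy (in nats) of the softmax distribution over candidates."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def guarded_decode(logits, entropy_threshold=1.0):
    """Commit to an answer only when the distribution is sharp; otherwise abstain."""
    if entropy_of_logits(logits) > entropy_threshold:
        return None          # e.g. respond "I'm not sure who this is" / ask for more context
    return int(np.argmax(logits))

confident = np.array([8.0, 0.1, 0.1, 0.1])   # sharp distribution -> low entropy
uncertain = np.array([1.0, 1.0, 1.0, 1.0])   # flat distribution  -> high entropy
print(guarded_decode(confident))              # 0
print(guarded_decode(uncertain))              # None
```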
A Common Language to communicate by
One neat thing about sharing similar biases and priors is that it enables communication of data that would otherwise be stored in a large number of bytes via very compressed versions. Despite the highly compressed Obama, we humans got the message, and you could probably compress the image quite a bit further. A similar case holds for the photo up by the title of this post.

We can speak in highly compressed codes because we know what the code represents. This is also why we can convey rich ideas in the language we speak, but are clueless in languages we are unfamiliar with. More subtly, when it comes to slang or words with multiple meanings, people in your in-group may immediately get what you mean, but those with different biases/expectations might understand your speech to mean something else. This is one place humans may still be able to beat machines at CAPTCHAs. We share biases and experiences that we have yet to crack with AI, and thus gravitate towards a common "answer" that is not an obvious choice for machines.
This was especially apparent for problems in the ARC challenge set. As a primer: given the sequence 1,2,1,2,1,2,_,_, what would you imagine comes next? Probably 1,2. Though this is not the "correct" answer. The masked portion of the sequence could be anything; maybe the sequence moves on to 1,2,3,4,1,2,3,4 and keeps growing until 1,2,3,4,5,6, a new pattern. But we are all commonly biased to recognize the original as a reasonable solution, so much so that it feels obvious. The ARC challenge was similarly interesting in that AI got some problems right that appear very difficult to humans, and some wrong that appear obvious to humans. We differ in the solutions we gravitate towards under ambiguity. Humans are right in similar ways and wrong in similar ways to other humans, emphasizing the common mental models we share.
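As a toy illustration, here is a sketch showing that the observed prefix is consistent with more than one generating rule, so the "obvious" continuation is only obvious because of shared priors; the second rule is just one of infinitely many alternatives I made up.

```python
def rule_repeat_12(n):
    """The 'obvious' rule: 1,2,1,2,... forever."""
    return [1 + (i % 2) for i in range(n)]

def rule_growing_cycles(n):
    """An alternative rule: repeat the cycle (1..2k) three times, then grow k."""
    seq, k = [], 1
    while len(seq) < n:
        seq.extend(list(range(1, 2 * k + 1)) * 3)
        k += 1
    return seq[:n]

prefix = [1, 2, 1, 2, 1, 2]
print(rule_repeat_12(6) == prefix, rule_growing_cycles(6) == prefix)  # True True
print(rule_repeat_12(12))        # [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
print(rule_growing_cycles(12))   # [1, 2, 1, 2, 1, 2, 1, 2, 3, 4, 1, 2]  <- diverges later
```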

I would go as far as to say that these specific biases, what's hallucinated in every scenario, embody not only the remaining gap between AI and humans, but also what differentiates humans from each other and makes each one unique.
Multimodal LLMs hallucinating in the image space.
Note, this is more fun speculation than anything; I don't have a ton of intuition to suggest these approaches would end up aligning with human biases and solving our problem.
Normally, multimodal LLMs receive clean images, in contrast to the diagrams of human vision above, which show a narrow focal point. So perhaps the opportunity to make use of hallucinated vision isn't really present in the first place, given the image is already fully intact.
So it would seem we'd want to somehow perturb the image and give the model a chance to restore the perturbation before generating text. The kind of noising we choose biases the model into relying on certain features; e.g., Gaussian noise wipes out higher-frequency details, forcing dependence on lower frequencies. In the limit, one could craft a perturbation strategy that aims to mimic human vision by giving it the same flaws (but we would also need to give it the same advantages somehow, like boosting outlines or contrast).
Though performing something like masking-and-prediction or noising-and-denoising with a separate model, just to feed the result through to the VLM afterwards, feels like mere data preprocessing; at the very least, this would need to be jointly trained as part of the same objective.
Somehow, what I'd like to do is create a sequence of refinements preceding the LLM's answer. Something like this

At train time, only the noisy observation would be shown. In place of noising, one could also remove patches of the image, or apply blur around a focal point.
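Here is a rough sketch of the kinds of perturbations described above: additive Gaussian noise, patch masking, and blur around a focal point. It assumes scipy is available for the blur, and the mask ratio, noise scale, and fovea radius are arbitrary illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def mask_patches(img, patch=8, drop_frac=0.5):
    """Zero out a random fraction of non-overlapping patches (MAE-style masking)."""
    out = img.copy()
    h, w = img.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            if rng.random() < drop_frac:
                out[i:i + patch, j:j + patch] = 0.0
    return out

def foveate(img, center, fovea_radius=16.0, blur_sigma=4.0):
    """Keep a sharp focal point and blur the periphery, loosely mimicking foveal vision."""
    blurred = gaussian_filter(img, sigma=blur_sigma)
    ys, xs = np.indices(img.shape)
    dist = np.hypot(ys - center[0], xs - center[1])
    sharpness = np.clip(1.0 - (dist - fovea_radius) / fovea_radius, 0.0, 1.0)
    return sharpness * img + (1.0 - sharpness) * blurred

img = rng.random((64, 64))                           # stand-in grayscale image
noisy = img + 0.3 * rng.standard_normal(img.shape)   # additive Gaussian noise
masked = mask_patches(img)
foveated = foveate(img, center=(32, 32))
print(noisy.shape, masked.shape, foveated.shape)
```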
I'd like to somehow also encapsulate a looped reasoning process, where we alternate between a text hypothesis and an image hypothesis. To do this, I think we could rearrange things to move the text to the middle. This can then be looped by truncating context, noising the new reconstruction, and letting that serve as the noisy observation.

What I would imagine doing here is looping this process until we reach a steady state where neither the text nor the image changes much; this would perhaps be a sign that the generative and recognition processes are in equilibrium. One might notice that this somewhat resembles a VAE where everything is in context: a noisy observation is encoded into text stochastically and then stochastically decoded.
One issue here, though, is that I could imagine the loop continually veering further away rather than converging. To ensure we have a way of anchoring ourselves to the original data, we could show two noisy observations (sketched in code after the list below):
- one which is the original noised data
- one which is the most recent noised sample
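Pseudocode for the loop I have in mind, with both anchors in context. The `caption_model`, `image_model`, and `add_noise` callables are hypothetical placeholders rather than any existing API, and the convergence test is just a pixel-space distance; the stand-ins at the bottom only exercise the control flow.

```python
import numpy as np

def refine(original_noisy, caption_model, image_model, add_noise,
           max_iters=10, tol=1e-2):
    """Alternate text and image hypotheses until neither changes much.

    caption_model(noisy_a, noisy_b) -> str      # hypothetical: text hypothesis from two noisy views
    image_model(text, noisy_a, noisy_b) -> arr  # hypothetical: image hypothesis from text + noisy views
    add_noise(img) -> arr                       # hypothetical: re-noising of the latest reconstruction
    """
    latest_noisy = original_noisy
    prev_image = None
    for _ in range(max_iters):
        text_hypothesis = caption_model(original_noisy, latest_noisy)
        image_hypothesis = image_model(text_hypothesis, original_noisy, latest_noisy)
        if prev_image is not None and np.abs(image_hypothesis - prev_image).mean() < tol:
            break                                    # steady state: generation and recognition agree
        prev_image = image_hypothesis
        latest_noisy = add_noise(image_hypothesis)   # the guess becomes next step's observation
    return text_hypothesis, image_hypothesis

# Toy stand-ins just to exercise the control flow (not meaningful models):
rng = np.random.default_rng(0)
obs = rng.random((8, 8))
text, img = refine(
    obs,
    caption_model=lambda a, b: "something blurry",
    image_model=lambda t, a, b: (a + b) / 2.0,
    add_noise=lambda x: x + 0.01 * rng.standard_normal(x.shape),
)
print(text, img.shape)
```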

Multimodal LLMs hallucinating in the text space.
The second approach I would imagine is... already happening! When you give a VLM an image and have it reason in text space before giving an answer, I would argue that the generated text serves as hallucination on top of the original visual observation.

As more text is generated, later text conditions on both the original image and the previous text it has seen.
If we imagine prompt-injecting "This may look like a cat, but don't be fooled, it's a human in a costume; consider this when delivering your answer," this would inevitably affect the final output.
I would call this "External Hallucinations," where text is injected into the reasoning chain influencing how the model interprets the image.
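To make that concrete, here is a sketch of how the injected text sits in the same context window as the image and the model's own reasoning; the message structure is a generic chat-style format (the exact schema varies by API), and the strings are invented for illustration.

```python
# Generic chat-style context for a VLM; the exact schema differs between APIs.
context = [
    {"role": "user", "content": [
        {"type": "image", "data": "<encoded image of an ambiguous furry shape>"},
        {"type": "text", "text": "What is in this picture?"},
    ]},
    # The model's own reasoning so far: an internal hallucination layered on the image.
    {"role": "assistant", "content": "It appears to be a cat sitting on a chair."},
    # Injected text: an external hallucination that now conditions every later token.
    {"role": "user", "content": "This may look like a cat, but don't be fooled, "
                                "it's a human in a costume. Consider this in your answer."},
]
# Whatever the model generates next attends to the (unchanged) image features
# *and* to all of this text, so the injected claim can steer the final answer.
print(context[-1]["content"])
```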
If we imagine having a multi-turn conversation with an LLM, giving it an ambiguous picture like the downsampled Obama, it might volley ideas at us as to what the picture contains and be met with our feedback. After enough back and forth, its answer is influenced by the picture plus the whole script of text, containing both its own generated text and ours.
A criticism of this approach is that the actual image features, once encoded, are entirely deterministic and invariant to anything else the model writes afterwards.
As stated previously, I don't know whether this is exactly a desirable trait, nor how much these methods will give us what we're looking for. After all, I could imagine the first method possibly harming performance. Though it would be really interesting to know what it would take to get a model to recognize the picture at the top of this post without seeing something like it in a dataset.