In my own work circa 2016 I used this approach in Counterpoint by Convolution (https://arxiv.org/abs/1903.07227), where we in turn argued that despite being an approximation, it leads to better results. Sadly, since it was dressed up as an application paper, we weren't able to draw enough attention to get those sweet diffusion citations.
Pretty sure it goes further back than that still.
I collected a few other text-diffusion early references here about 3 years ago: https://github.com/madaan/minimal-text-diffusion?tab=readme-....
This doesn't have an explicit diffusion tie in, but Savinov et al. at DeepMind figured out that doing two steps at training time and randomizing the masking probability is enough to get it to work reasonably well.
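For flavor, here's a toy sketch of that recipe (my own reconstruction, not the paper's code; `identity` stands in for a real model): draw the masking probability itself at random per example, denoise once, then denoise the model's own output a second time.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0

def corrupt(tokens):
    # Randomize the masking probability itself, so training covers
    # everything from lightly masked to fully masked sequences.
    p = rng.uniform(0.0, 1.0)
    mask = rng.random(len(tokens)) < p
    return np.where(mask, MASK_ID, tokens), mask

def two_step_loss(model, tokens):
    # Step 1: denoise a randomly masked input.
    corrupted, _ = corrupt(tokens)
    pred1 = model(corrupted)
    # Step 2: denoise the model's own step-1 prediction, so training
    # resembles the iterative refinement used at sampling time.
    pred2 = model(pred1)
    # Both steps are scored against the clean target (error rate here
    # stands in for cross-entropy).
    return (pred1 != tokens).mean() + (pred2 != tokens).mean()

identity = lambda x: x              # stand-in "model"
toks = np.array([5, 3, 7, 2, 9])
print(two_step_loss(identity, toks))
```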
https://joecooper.me/blog/crosstalk/
I’ve still got a few ideas to try though so I’m not done having fun with it.
Autoregressive LLMs don't do that either actually. Sure with one forward pass you only get one token at a time, but looking at what is happening in the latent space there are clear signs of long term planning and reasoning that go beyond just the next token.
So I don't think it's necessarily more or less similar to us than diffusion: we do say one word at a time, sequentially, even if we have the bigger picture in mind.
Well no, there is attention in the LLM, which allows it to look back at its "internal thought" during the previous tokens.
Token T at layer L can attend to a projection of the hidden states of all tokens < T at layer L. So it's definitely not starting anew at every token, and is able to iterate on an existing plan.
It's not a perfect mechanism, for sure, and there is work to make LLMs able to carry more information forward (e.g. feedback transformers), but they can definitely do some of that today.
If you append tokens from another source, like in a turn-based conversation, then the LLM will process all the newly appended tokens in parallel while still being able to look back at its previous internal state (and thus past reasoning/planning in latent space) from the already-processed tokens, then adjust the plan based on the new information.
What happens to you as a human if you come up with a plan with limited information and new information is provided to you?
Yes, they can plan within a single forward pass like you said, but I still think they "start anew at each token" because they have no state/memory that is not the output.
I guess this is differing interpretations of the meaning of "start anew", but personally I would agree that having no internal state and simply looking back at its previous output to form a new token is "starting anew".
But I'm also not well informed about the topic so happy to be corrected.
At token 1, the model goes through, say, 28 transformer blocks, and for each of those blocks we save 2 projections of the hidden state in a cache.
At token 2, on top of seeing the new token, the model is now also able, in each of those 28 blocks, to look at the previously saved hidden states from token 1.
At token 3, it can see the states from tokens 2 and 1, etc.
However, I still agree that this is not a perfect information-passing mechanism, because of how those models are trained (something like the feedback transformer would be better), but information very much is being passed from earlier tokens to later ones.
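A toy numpy sketch of that cache (single block, single head, made-up dimensions; real models do this in every layer with learned weights): the two saved projections are the key and value, and each new token's query attends over all of them.

```python
import numpy as np

d = 8  # toy hidden size
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the two saved projections per block

def attend(h):
    """One attention step for the newest token's hidden state h."""
    # Save this token's key/value projections for all future tokens.
    k_cache.append(h @ Wk)
    v_cache.append(h @ Wv)
    K = np.stack(k_cache)        # (t, d): states from tokens 1..t
    V = np.stack(v_cache)
    q = h @ Wq                   # query is computed for the current token only
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                 # softmax over all cached positions
    return w @ V                 # mixes earlier tokens' states into this one

# Token 2 attends to token 1's cached projections, token 3 to both, etc.
for t in range(3):
    out = attend(rng.standard_normal(d))
```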
It's correct to state that the LLM starts anew for each token.
The workaround for this is to pass the existing plan back into it as part of the context.
It's still recalculating, just that intermediate steps are cached.
This lets them "save" the plan between tokens, so when regenerating the new token it is following the plan.
In other words, the "recalculated" plan will be exactly the same as before, just extended with new planning at the position of each newly appended token.
Karpathy recently referred to LLMs having more "working memory" than a human, apparently referring to these unchanging internal activations as "memory". But it's an odd sort of "working memory" if you can't actually update it to reflect progress on what you are working on, or to reflect new information (a new, unexpected token having been sampled).
Where humans have a single evolving memory state, LLMs have access to all the states of their "memories" across time. And while past states can't be changed, the new state can: this is the current token's hidden state, and to form it the model looks both at the history of previous states and at the new information (the last token having been sampled, or an external token from RAG or whatnot appended to the context).
This is how progress is stored.
Presumably the internal state at any given token position must also encode information specific to that position, as well as this evolving/current memory... So, can this be seen in the internal embeddings? Are they composed of a position-dependent part that changes a lot between positions, and an evolving memory part that is largely similar between positions, changing only slowly?
Are there any papers or talks discussing this?
I realize my argument is hand-wavy, I haven't defined "efficient" (in space? time? energy?), and there are other shortcomings, but I feel this is "good enough" to be convincing.
They call diffusion a form of "spectral autoregression", because it tends to first predict lower frequency features, and later predict higher frequency features.
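That intuition is easy to check in a toy 1-D setting (a random walk as a stand-in for natural data, which concentrates its power at low frequencies): adding white noise drowns the high frequencies first, so a denoiser recovers low frequencies first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
signal = np.cumsum(rng.standard_normal(n))  # random walk: power ~ 1/f^2
noise = rng.standard_normal(n)              # white noise: roughly flat power

S = np.abs(np.fft.rfft(signal)) ** 2        # signal power per frequency
N = np.abs(np.fft.rfft(noise)) ** 2         # noise power per frequency

# Band-averaged signal-to-noise ratio: low frequencies survive the
# noise far longer than high frequencies do.
snr_low = S[1:11].sum() / N[1:11].sum()
snr_high = S[-10:].sum() / N[-10:].sum()
print(snr_low > snr_high)
```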
But before starting your sentence, you internally formulate the gist of the sentence you're going to say.
Which is exactly what happens in an LLM's latent space too, before it starts outputting the first token.
I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think that people center themes and moods in their mind as they compose their thoughts into sentences.
Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to diffusion than to one token at a time.
I can confess to not always knowing where I'll end up when I start talking. That said, I don't always just open my mouth and start: sometimes I do have a goal and a conclusion in mind.
Words rise from an abyss and are served to you, you have zero insight into their formation. If I tell you to think of an animal, one just appears in your "context", how it got there is unknown.
So really there is no argument to be made, because we still don't mechanistically understand how the brain works.
When I read that text, something like this happens:
Visual perception of text (V1, VWFA) → Linguistic comprehension (Angular & Temporal Language Areas) → Semantic activation (Temporal + Hippocampal Network) → Competitive attractor stabilization (Prefrontal & Cingulate) → Top-down visual reactivation (Occipital & Fusiform) → Conscious imagery (Prefrontal–Parietal–Thalamic Loop).
and you can find experts in each of those areas who understand the specifics a lot more.
You are undoubtedly technically correct, but I prefer the simplicity, purity and ease-of-use of the abysmal model, especially in comparison with other similar competing models, such as the below-discussed "tarpit" model.
This can be captured by generating reasoning tokens (outputting some representation of the desired conclusion in token form, then using it as context for the actual tokens), or even by an intermediate layer of a model not using reasoning.
If a certain set of nodes are strong contributors to generating the concluding sentence, and they remain strong throughout all generated tokens, who's to say those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?
(This is also in the context of the LLM being able to use long-range attention to not need to encode in full detail what it "wants to say" - just the parts of the original input text that it is focusing on over time.)
Of course, this doesn't mean that this is the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that allows us to reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.
Which is not to say that it's wrong or a bad approach. And I get why people are feeling a connection to the "diffusive" style. But, at the end of the day, all of these methods do build as their ultimate goal a coherent sequence of words that follow one after the other. It's just a difference of how much insight you have into the process.
But I can't believe the actual literal words are diffusing when you're thinking.
When asked "How are you today", there is no way that your thoughts are literally like "Alpha zulu banana" => "I banana coco" => "I banana good" => "I am good". The diffusion does not happen at the output token layer; it happens much earlier, at a higher level of abstraction.
"I ____ ______ ______ ______ and _____ _____ ______ ____ the ____ _____ _____ _____."
If the images in the article are to be considered an accurate representation, the model is putting meaningless bits of connective tissue way before the actual ideas. Maybe it's not working like that. But the "token-at-a-time" model is also obviously not literally looking at only one word at a time either.
The first person experience of having a thought, to me, feels like I have the whole thought in my head, and then I imagine expressing it to somebody one word at a time. But it really feels like I’m reading out the existing thought.
Then, if I’m thinking hard, I go around a bit and argue against the thought that was expressed in my head (either because it is not a perfect representation of the actual underlying thought, or maybe because it turns out that thought was incorrect once I expressed it sequentially).
At least that’s what I think thinking feels like. But, I am just a guy thinking about my brain. Surely philosophers of the mind or something have queried this stuff with more rigor.
Then, glorifies wrestling in said tarpit: how do people actually compose sentences? Is an LLM thinking or writing? Can you look into how actors memorize lines before responding?
The error, beyond the tarpit, is that these are all ineffable questions that assume a singular answer to an underspecified question across many bags of sentient meat.
Taking a step back to the start, we're wondering:
Do LLMs plan for token N + X, while purely working to output token N?
TL;DR: yes.
via https://www.anthropic.com/research/tracing-thoughts-language....
A clear, quick example they have: ask it to write a poem, take the state at the end of line 1, and scramble the feature that looks ahead to line 2's rhyming word.
In order to model poetry autoregressively, you're going to need a variable that captures rhyme scheme. At the point where you've ended the first line, the model needs to keep track of the rhyme that was used, just like it does for something like coreference resolution.
I don't think that the mentioned paper shows that the model engages in a preplanning phase in which it plans the rhyme that will come. In fact such would be impossible. Model state is present only in so-far-generated text. It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable. (Now yes, as you increase the posterior probability of 'being in a poem' given context so far, you would expect that you also increase the probability of the rhyme-scheme variable's existing.)
Wrong. There's "model state" (I assume you mean hidden layers) not just in the generated text, but also in the initial prompt given to the model. I.e. the model can start its planning from the moment it's given the instruction, without having predicted even a single token yet. That's actually what they show in the paper above...
> It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable
This is an assertion based on flawed reasoning.
(Also, these ideas should really be backed up by evidence and experimentation before asserting them so definitively.)
Might I trouble you for help getting from there to "such would be impossible", where "such" is "the model…plans the rhyme to come"?
Edit: I’m surprised to be at -2 for this. I am representing the contents of the post accurately. It’s unintuitive for sure, but it’s the case.
That's why saying "it's just predicting the next word" is a misguided take.
I did it multiple times while writing this comment, and it is only four sentences. The previous sentence once said "two sentences," and after I added this statement it was changed to "four sentences."
It's statements like these that make me wonder if I am the same species as everyone else. Quite often, I've picked adjectives and idioms first, and then filled in around them to form sentences. Often it's because there is some pun or wordplay, or just something that has a nice ring to it, and I want to lead my words in that direction. If you're only choosing them one at a time and sequentially, have you ever considered that you might just be a dimwit?
It's not like you don't see this happening all around you in others. Sure you can't read minds, but have you never once watched someone copyedit something they've written, where they move phrases and sentences around, where they switch out words for synonyms, and so on? There are at least dozens of fictional scenes in popular media, you must have seen one. You have to have noticed hints at some point in your life that this occurs. Please. Just tell me that you spoke hastily to score internet argument points, and that you don't believe this thing you've said.
What happens in the black box of the human mind to determine the next word to write/say is made irrelevant at this level of abstraction: regardless of how, it would still result in a linear sequence of actions as observed by the environment.
Clearly communication is sequential.
LLMs are not more sequential than your vocal cords or your handwriting. They also plan ahead before writing.
The associated paper[2] goes into a lot more detail, and includes interactive features that help illustrate how the model "thinks" ahead of time.
[1] https://www.anthropic.com/research/tracing-thoughts-language...
[2] https://transformer-circuits.pub/2025/attribution-graphs/bio...
I have a hot dog _____
The word in the blank is not necessarily determined when the sentence is started. Several verbs fit at the end, and the LLM doesn't need to know which it's going to pick when it starts. Each word narrows down the possibilities:
I - trillions
have - billions
a - millions
hot - thousands
dog - dozens
_____ - could be eaten, cooked, thrown, whatever.
If it chooses "cooked" at this point, that doesn't necessarily mean the LLM was going to do that when it chose "I" or "have".
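You can make the narrowing concrete with a toy corpus (made-up sentences; the counts stand in for the trillions/billions above): at each prefix, several continuations remain live, and nothing forces an early commitment.

```python
from collections import Counter

corpus = [
    "i have a hot dog stand",
    "i have a hot dog cooked",
    "i have a hot shower daily",
    "i have a plan",
    "you have a hot dog",
]

def continuations(prefix):
    """Words that can follow `prefix` in the corpus, with counts."""
    c = Counter()
    p = prefix.split()
    for line in corpus:
        w = line.split()
        if w[:len(p)] == p and len(w) > len(p):
            c[w[len(p)]] += 1
    return c

# The final word is still undetermined even five tokens in.
for pfx in ["i", "i have a", "i have a hot dog"]:
    print(pfx, "->", dict(continuations(pfx)))
```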
This is a common but fundamentally weird assumption people have about neurology: that what they consciously perceive has some bearing on what's actually happening at the operational or physical level.
When you speak or do anything, you focus on what you’re going to do next. Your next action. And at that moment you are relying on your recent memory, and on things you have put in place while doing the overall activity (context).
In fact, what’s actually missing from AI currently is simultaneous collaboration, like a group of people interacting; it is very one-on-one for now, like human conversations.
Diffusion is like looking at a cloud and trying to find a pattern.
One of my stumbling blocks with text diffusers is that ideally you wouldn’t treat the tokens as discrete but rather as probability fields. Image diffusers have the natural property that a pixel is a continuous value: you can smoothly transition from one color to another. Not so with tokens; in that case you just do a full replacement. You can’t add noise to a token, so you have to work in the embedding space. But how can you train embeddings directly? I found a bunch of different approaches that have been tried, but they are all much more complicated than the image-based diffusion process.
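For illustration, here is the crude version of working in embedding space (toy random table; the nearest-neighbour decode back to discrete tokens is exactly the awkward step those more complicated approaches try to avoid):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat"]
E = rng.standard_normal((len(vocab), 4))   # toy embedding table

def noise_step(ids, t):
    """Continuous corruption: noise the *embeddings*, then snap back
    to the nearest token, since tokens themselves are discrete."""
    z = E[ids] + t * rng.standard_normal(E[ids].shape)  # Gaussian noise
    # Nearest-neighbour decode: the only way back to discrete symbols.
    d = ((z[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return d.argmin(-1)

ids = np.array([0, 1, 2])        # "the cat sat"
print(noise_step(ids, 0.01))     # tiny noise: usually decodes unchanged
print(noise_step(ids, 5.0))      # large noise: tokens jump discretely
```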
https://openreview.net/forum?id=c05qIG1Z2B
They're doing continuous latent diffusion combined with autoregressive transformer-based text generation. The autoencoder and transformer are (or can be) trained in tandem.
Feeding the model the following input pattern:
[Source UTF8 bytes] => [Corrupted Target UTF8 bytes]
I expect it to output the full corrected target bytes. The overall training process follows this curriculum:
Curriculum Level 0: Corrupt nothing and wait until the population/model masters simple repetition.
Curriculum Level 1: Corrupt 1 random byte per target and wait until the population/model stabilizes.
Curriculum Level N: Corrupt N random bytes per target.
Rinse & repeat until all target sequences are fully saturated with noise.
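In sketch form (assuming, for the toy, that the clean target is just the source repeated):

```python
import random

random.seed(0)

def corrupt_target(target: bytes, n: int) -> bytes:
    """Curriculum level n: corrupt n random bytes of the target."""
    out = bytearray(target)
    for i in random.sample(range(len(out)), k=min(n, len(out))):
        out[i] = random.randrange(256)
    return bytes(out)

def make_example(source: bytes, level: int) -> tuple[bytes, bytes]:
    # Input pattern: [source] => [corrupted target]; label: clean target.
    corrupted = corrupt_target(source, level)
    return source + b" => " + corrupted, source

src = b"hello world"
for level in [0, 1, 4]:         # ratchet the level up as the model stabilizes
    x, y = make_example(src, level)
    print(level, x, y)
```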
An important aspect is to always score the entire target sequence each time, so that we build upon prior success. If we just evaluated the masked tokens, the step between each level of difficulty would be highly discontinuous in the learning domain.

I've stopped caring about a lot of the jargon & definitions. I find that trying to stick things into buckets like "is this diffusion?" gets in the way of thinking and trying new ideas. I am more concerned with whether or not it works than with what it is called.
One of the powerful abilities of text diffusion models is supposedly in coding. Autoregressive LLMs don't inherently come with the ability to edit; they can only generate instructions that another system interprets as editing commands. Being able to literally unmask the parts you want to edit is a pretty powerful paradigm that could improve, if not just speed up, many coding tasks.
I suspect that elements of text diffusion will be baked into coding models like GPT Codex (if they aren't already). There's no reason you couldn't train a diffusion output head specifically designed for code editing, with the same model making use of that head when it makes the most sense to do so.
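A sketch of that unmasking loop (confidence-ordered infilling, as masked-diffusion samplers do; the "model" here is a canned lookup, purely illustrative):

```python
MASK = "<mask>"

def unmask_edit(tokens, fill_fn):
    """Iteratively fill masked positions, most confident first --
    the editing-by-unmasking loop a diffusion head could run."""
    tokens = list(tokens)
    while MASK in tokens:
        # fill_fn proposes (position, token, confidence) for each mask.
        proposals = [fill_fn(tokens, i)
                     for i, t in enumerate(tokens) if t == MASK]
        i, tok, _ = max(proposals, key=lambda p: p[2])
        tokens[i] = tok          # commit only the most confident fill
    return tokens

# A stand-in "model": fills each masked position from a canned lookup.
canned = {3: ("y", 0.9), 5: ("+", 0.7), 6: ("1", 0.8)}
fill = lambda toks, i: (i, *canned[i])

code = ["def", "f", "(", MASK, "):", MASK, MASK]
print(unmask_edit(code, fill))
```

Only the masked positions ever change, which is the appeal for editing: the rest of the file is untouched by construction.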
Some start with random tokens, or with masks, others even start with random vector embeddings.