Once you get to "British cats <next-token-here>" you can't get to "British munchkin cats <next-token-here>"; the tokens to the left are done and dusted.
It's kind of a feature. Diffusion is used for images, right? It's like saying that once the image of a door has started to form right next to a kitchen counter, the model can't insert a refrigerator there anymore. Well, maybe it doesn't "want to", because that layout is already settled by then.
Furthermore, you're applying the logic of AR LLMs to diffusion models. AR LLMs only model the probability of the next token (a chain of conditional probabilities), while diffusion LLMs model the probability of the entire output at once. Because of this, token structures that lead to invalid outputs should be extremely low probability if the model is properly trained.
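To make the contrast concrete, here's a toy sketch of the two sampling regimes (the model callables are random stand-ins, not any real library API): the AR loop can only ever append, while the diffusion loop re-predicts every position at every step.

```python
import random

VOCAB = list(range(100))
MASK = -1

def sample_autoregressive(next_token_dist, prompt, max_new):
    """AR: p(x) = prod_t p(x_t | x_<t). Tokens to the left are frozen."""
    ids = list(prompt)
    for _ in range(max_new):
        ids.append(next_token_dist(ids))  # only ever appends; never revises
    return ids

def sample_diffusion(denoise, prompt, length, steps):
    """Diffusion: models the whole sequence jointly; every position stays
    mutable until the final denoising step."""
    ids = [MASK] * length                 # start fully masked/noisy
    for t in reversed(range(steps)):
        ids = denoise(ids, prompt, t)     # re-predicts ALL positions each step
    return ids

# Toy stand-ins (random, just to make the control flow executable):
print(sample_autoregressive(lambda ctx: random.choice(VOCAB), [1, 2], 5))
print(sample_diffusion(lambda ids, p, t: [random.choice(VOCAB) for _ in ids], [1, 2], 5, 3))
```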
And note how inference prices have come down a lot, despite increasing pressure to make money. Opus 4.6 is $25/MTok; Opus 4.1 was $75/MTok, the same as Opus 4 and Opus 3. OpenAI's o1 was $60/MTok and o1 pro $600/MTok; gpt-5.2 is $14/MTok and 5.2-pro is $168/MTok.
Also note how GPT-4 was rumored to be in the 1.8T-parameter realm, and now Chinese models in the 1T realm can match or surpass it. And I doubt the Chinese have a monopoly on those efficiency improvements.
I doubt frontier models have actually grown substantially in size in the last 1.5 years; they may well have far fewer parameters than the frontier models of old.
It was the very first thing I noticed: it looks suspiciously like they just rebranded Sonnet as Opus and raised the price.
I don't know why more people aren't talking about this. Even on X, where the owner directly competes in this market, it's rarely brought up. I strongly suspect there is a sort of tacit collusion between competitors in this space. They all share a strong motivation to kill any deep discussion of token economics, even about each other, because transparency only arms the customers. By keeping the underlying mechanics nebulous, they can all justify higher prices. Just look at the subscription tiers: every single major player has settled on the exact same pricing model, a $20 floor and a $200 cap, no exceptions.
I'm convinced they're all doing everything they can in the background to cut costs and increase profits.
I can't prove that Gemini 3 is dumber than when it came out, because of the non-deterministic nature of this technology, but it sure feels like it.
Google caps at $250
Train one large model, then down-configure it for different pricing tiers.
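One plausible mechanism for that, purely as a sketch: ship the same trained weights at different precisions. The tier names here are made up, and nothing says providers actually do it this way; this just shows how cheap "down-configuring" can be once training is done, using PyTorch's dynamic quantization.

```python
import torch
import torch.nn as nn

base = nn.Sequential(  # stand-in for the one large trained model
    nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)
)

tiers = {
    "premium": base,  # full-precision weights
    "budget": torch.ao.quantization.quantize_dynamic(  # int8 linear layers
        base, {nn.Linear}, dtype=torch.qint8
    ),
}

x = torch.randn(1, 4096)
for name, model in tiers.items():
    print(name, model(x).shape)  # same interface, different serving cost
```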
Also, if you have inference optimizations, why not apply them to all models?
It seems completely implausible.
I could believe that if a $20 sub used every possible token granted, it would cost $250. But certainly almost no one was completely milking their subscription, in the same way that no one is streaming Netflix literally 24/7.
For example, the Qwen3 technical report[1] says that the Qwen3 models are architecturally very similar to Qwen2.5, with the main change being a tweak in the attention layers to stabilize training. And if you compare table 1 in the Qwen3 paper with table 1 in the Qwen2.5 technical report[2], the layer count, attention configuration, and so on are very similar. Yet Qwen3 was widely regarded as a significant upgrade over Qwen2.5.
However, for training, they doubled the pre-training token count and tripled the number of languages. It's been shown that training on more languages can actually help LLMs generalize better. They used Qwen2.5-VL and Qwen2.5 to generate additional training data by parsing a large number of PDFs and turning them into high-quality training tokens. They improved their annotations so they could more effectively provide diverse training tokens to the model, improving training efficiency.
They continued this trend with Qwen3.5, where even more and better training data[3] made their Qwen3.5-397B-A17B model match the 1T-parameter Qwen3-Max-Base.
That said, there's also been a lot of work on model architecture[4], getting more speed and quality per parameter. In the case of the Qwen3-Next architecture, which 3.5 is based on, that means things such as hybrid attention for faster long-context operation, and sparse MoE and multi-token prediction for less compute per output token.
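For a feel of why sparse MoE cuts compute per token, here's a minimal top-k routing layer. This is illustrative only, not Qwen's actual implementation (real kernels batch tokens by expert instead of looping):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer: only k of n expert MLPs run
    per token, so active compute per token is a small fraction of the
    total parameter count."""

    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        scores, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(scores, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):          # per token...
            for slot in range(self.k):      # ...run only the k chosen experts
                e = idx[t, slot].item()
                out[t] += weights[t, slot] * self.experts[e](x[t])
        return out

print(SparseMoE()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```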
I used Qwen as an example here; from what I gather, they're just one instance of the general trend.
[1]: https://arxiv.org/abs/2505.09388
[2]: https://arxiv.org/abs/2412.15115
[3]: https://qwen.ai/blog?id=qwen3.5
[4]: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d...
AFAICT that's mostly because what you're getting when you select a "model" from most of these cloud chat model providers today isn't a specific concrete model, but rather a model family, where your inference request is routed to varying models within the family during the request. There's thus no single number of weights for "the model", since several entirely independent models can be involved in generating each response.
And to be clear, I'm not just talking about how selecting e.g. "ChatGPT 5.2" sometimes gets you a thinking model and sometimes doesn't, etc.
I'm rather saying that, even when you specifically request the strongest / most intelligent "thinking" models, there are architectural reasons the workload could be (and probably is) routed to several component "sub-models" that handle inference during different parts of the high-level response "lifecycle", with the inference framework detecting transition points in the response stream and "handing off" the context + response stream from one sub-model to another.
(Why? Well, imagine how much "smarter" a model could be if it had a lot more of its layers available for deliberation, because it didn't have to spend so many layers on full-fat NLP parsing of input or full-fat NLP generation of output. Split a model into a pipeline of three sub-models, where the first one is trained to "just understand" — i.e. deliberate by rephrasing whatever you say to it into simpler terms; the second one is trained to "just think" — i.e. assuming pre-"understood" input and doing deep scratch work in some arbitrary grammar to eventually write out a plan for a response; and the third one is trained to "just speak" — i.e. attend almost purely to the response plan and whatever context-tokens that plan attends to, to NLP-generate styled prose, in a given language, with whatever constraints the prompt required. Each of these sub-models can be far smaller and hotter in VRAM than a naive monolithic thinking model. And these sub-models can make a fixed assumption about which phase they're operating in, rather than having to spend precious layers just to make that determination, over and over again, on every single token generation step.)
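If that speculation were true, the handoff might look something like this sketch. Everything here is hypothetical: the phase names come from the paragraph above, and the stand-in lambdas exist only to make the control flow executable.

```python
# Sketch of the hypothetical three-sub-model pipeline described above.
# None of this reflects any provider's disclosed architecture.

def run_pipeline(understand, think, speak, user_prompt):
    # Phase 1: "just understand" -- rewrite the input into simpler terms.
    simplified = understand(user_prompt)
    # Phase 2: "just think" -- scratch work in an arbitrary grammar,
    # ending in a response plan. Only this phase needs the big weights.
    plan = think(simplified)
    # Phase 3: "just speak" -- attend to the plan (plus whatever context
    # it references) and render styled prose.
    return speak(plan, user_prompt)

# Toy stand-ins so the control flow is executable:
answer = run_pipeline(
    understand=lambda p: p.lower().strip("?!"),
    think=lambda s: f"PLAN: answer '{s}' in one sentence",
    speak=lambda plan, ctx: f"[rendered from {plan!r}]",
    user_prompt="Why is the sky blue?",
)
print(answer)
```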
And, presuming they're doing this, the cloud provider can then choose to route each response lifecycle phase to a different weight-complexity-variant for that lifecycle phase's sub-model. (Probably using a very cheap initial classifier model before each phase: context => scalar nextPhaseComplexityDemand.) Why? Because even if you choose the highest-intelligence model from the selector, and you give it a prompt that really depends on that intelligence for a response... your response will only require a complex understanding-phase sub-model if your input prose contained the high-NLP-complexity tokens that would confuse a lesser understanding-phase sub-model; and your response will only require a complex responding-phase sub-model if the thinking-phase model's emitted response plan specifies complex NLP or prompt-instruction-following requirements that only a more-complex responding-phase sub-model knows how to manage.
Which is great, because it means that now even when using the "thinking" model, most people with most requests are only holding a reservation on a GPU holding a copy of the (probably still hundreds-of-billions-of-weights) high-complexity-variant thinking-phase sub-model weights, for the limited part of that response generation lifecycle where the thinking phase is actually occurring. During the "understanding" and "responding" phases, that reservation can be released for someone else to use! And for the vast majority of requests, the "thinking" phase is the shortest phase. So users end up sitting around waiting for the "understanding" and "responding" phases to complete before triggering another inference request. Which brings the per-user duty cycle of thinking-phase sub-model use way down.
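And the per-phase routing I'm speculating about could be as dumb as a threshold table. All names, variants, and thresholds below are invented; the only thing taken from above is the idea of a cheap scalar nextPhaseComplexityDemand computed before each phase:

```python
VARIANTS = {  # weight-complexity variants per lifecycle phase (invented)
    "understand": [("small", 0.0), ("large", 0.7)],
    "think":      [("small", 0.0), ("large", 0.5)],
    "speak":      [("small", 0.0), ("large", 0.8)],
}

def pick_variant(phase, next_phase_complexity_demand):
    """Return the most complex variant whose threshold the demand clears
    (lists are in ascending threshold order)."""
    chosen = VARIANTS[phase][0][0]
    for name, threshold in VARIANTS[phase]:
        if next_phase_complexity_demand >= threshold:
            chosen = name
    return chosen

for demand in (0.2, 0.9):
    print(demand, {p: pick_variant(p, demand) for p in VARIANTS})
# 0.2 routes every phase to "small"; 0.9 routes every phase to "large".
```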
... and you'd most likely be very correct in your doubt, given the evidence we have.
What has improved disproportionately more than the software or hardware side is density[1] per parameter, indicating that there's a "Moore's Law"-esque relationship between parameter count, density per parameter, and compute requirements. As long as more and more information/abilities can be squeezed into the same number of parameters, inference will become cheaper and cheaper, quicker and quicker.
I write "quicker and quicker", because next to improvements in density there will still be additional architectural-, software- and hardware-improvements. It's almost as if it's going exponential and we're heading for a so called Singularity.
Since it's far more efficient and "intelligent" to have many small models competing with and correcting each other for the best possible answer, in parallel, there simply is no need for giant, inefficient, monolithic monsters.
They ain't gonna tell us that, though, because then we'd know that we don't need them anymore.
[1] for lack of a better term that I am not aware of.
I take "you can't have human-level intelligence without roughly the same number of parameters (hundreds of trillions)" as a null hypothesis: true until proven otherwise.
It's unfair to pick some high number that either just reflects disagreement or assumes that size-equality is meaningful.
> level of quality
What is quality, though? What is high quality, though? Do MY FELLOW HUMANS really know what "quality" is comprised of? Do I hear someone yell "QUALITY IS SUBJECTIVE" from the cheap seats?
I'll explain.
You might care about accuracy (repetition of learned/given text) more than about actual cognitive abilities (clothesline/12 shirts/how long to dry).
From my perspective, the ability to repeat given/learned text has nothing to do with "high quality". Any idiot can do that.
Here's a simple example:
Stupid doctors exist. Plentifully so, even. Every doctor can pattern-match symptoms to medication or further tests, but not every doctor is capable of recognizing when two seemingly different symptoms are actually connected. (simple example: a stiff neck caused by sinus issues)
There is not one person on the planet who wouldn't prefer a doctor who is deeply considerate of the complexities and feedback loops of the human body over a doctor who is simply not smart enough to do so and thus can't. A doctor can learn texts all he wants, but the memorization of text does not require deeper understanding.
There are plenty of benefits to running multiple models in parallel. A big one is specialization and caching. Another is context expansion. Context expansion is what "reasoning" models can be observed doing when they support themselves with their very own feedback loop.
One does not need a hundred small models to achieve whatever you might consider worthy of being called "quality". All these models can not only reason independently of each other, but also interact contextually, expanding each other's contexts around what actually matters.
They also don't need to learn all the information about "everything", like big models do. It's simply not necessary anymore. We have very capable systems for retrieving information and feeding it to models with gigantic context windows, if needed. We can create purpose-built models. Density per parameter is always increasing.
Multiple small models, specifically trained for high reasoning/cognitive capabilities and given access to relevant texts, can develop multiple perspectives on a matter in parallel, boosting context expansion massively.
A single model cannot refactor its own path of thoughts during an inference run, which is massively inefficient. A single model can only provide itself with feedback sequentially, while multiple models can do it all in parallel.
See... there are two things which cover the above fundamentally:
1. No matter how you put it, we've learned that models are "smarter" when there is at least one feedback-loop involved.
2. No matter how you put it, you can always have yet another model process the output of a previously run model.
These two things, in combination, strongly indicate that multiple small, high-efficiency models running in parallel, providing themselves with the independent feedback they require to actually expand contexts in depth, is the way to go.
Or, in other words:
Big models scale Parameters, many small models scale Insight.
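To make that concrete, a minimal sketch of the "many small models" loop: each model answers independently, then revises with its peers' answers as extra context. The model calls are stand-ins; a real system would fan these out as concurrent API calls or batched inference.

```python
def parallel_refine(models, question, rounds=2):
    """Several small models answer in parallel, read each other's answers,
    and revise -- the feedback loops from points 1 and 2 above."""
    answers = [m(question, peers=[]) for m in models]  # independent first pass
    for _ in range(rounds):
        # Each model revises with the others' answers as extra context:
        answers = [
            m(question, peers=answers[:i] + answers[i + 1:])
            for i, m in enumerate(models)
        ]
    return answers  # a judge/aggregator model could pick or merge these

# Toy stand-ins so it runs:
toy = lambda name: (lambda q, peers: f"{name}: {q} (saw {len(peers)} peers)")
print(parallel_refine([toy("A"), toy("B"), toy("C")],
                      "How long do 12 shirts take to dry?"))
```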
But a smart person who hasn’t read all the texts won’t be a good doctor, either.
Chess players spend enormous amounts of time studying openings for a reason.
> Multiple small models, specifically trained for high reasoning/cognitive capabilities, given access to relevant texts
So, even assuming that one can train a model on reasoning/cognitive abilities, how does one pick the relevant texts for a desired outcome?
Repeat ad nauseam.
I wish the people who quote the blog post actually read it.
Scaling laws are real! But they don't preclude faster processing.
There is no gain for anyone anywhere in reducing parameter count overall, if that's what you mean. That sounds more like a dislike of transformer models than a real performance concern.
A lot of inference code is set up for autoregressive decoding now; diffusion is less mature. Not sure if Ollama or llama.cpp support it.
So if you have a user that has many automobiles, maybe instead of Autos: [...] you could do Auto1Make, Auto2Make, etc.
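I.e., something like this (the field names are invented to match the example):

```python
def flatten_autos(user):
    """Turn a nested list of records into numbered scalar keys
    (Auto1Make, Auto2Make, ...) so every fact sits at a fixed key."""
    flat = {k: v for k, v in user.items() if k != "Autos"}
    for i, auto in enumerate(user.get("Autos", []), start=1):
        for field, value in auto.items():
            flat[f"Auto{i}{field}"] = value
    return flat

user = {"Name": "Pat", "Autos": [{"Make": "Honda", "Year": 2019},
                                 {"Make": "Ford", "Year": 2021}]}
print(flatten_autos(user))
# {'Name': 'Pat', 'Auto1Make': 'Honda', 'Auto1Year': 2019,
#  'Auto2Make': 'Ford', 'Auto2Year': 2021}
```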
I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.
But I wonder how Taalas' product can scale. Making a custom chip for one single tiny model is different from running models trillions of parameters in size for a billion users.
Roughly, that's 53B transistors for every 8B params. For a 2T-param model, you'd need 13 trillion transistors, assuming the scaling is linear. One chip uses 2.5 kW of power? That's 4x H100 GPUs. How does it draw so much power?
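The transistor arithmetic checks out under the linear assumption:

```python
# Back-of-the-envelope check of the scaling above:
transistors_per_param = 53e9 / 8e9        # ~6.6 transistors per parameter
needed = transistors_per_param * 2e12     # for a 2T-param model
print(f"{needed:.2e} transistors")        # ~1.3e13, i.e. ~13 trillion
```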
If you assume that the frontier model is 1.5 trillion parameters, you'd need an entire N5 wafer to run it. And if you then need to change something in the model, you can't, since it's physically printed on the chip. So this is something you do only if you know you're going to use this exact model, without changing anything, for years.
Very interesting tech for edge inference, though. Robots and self-driving cars could make use of these in the distant future if power draw comes down drastically. A 2.4 kW chip running inside a robot is not realistic; maybe a 150 W chip.
> The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. ... Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
I'd take an army of high-school graduate LLMs to build my agentic applications over a couple of genius LLMs any day.
This is a whole new paradigm of AI.
But I wish there were more "let's scale this thing to the skies" experiments from those who actually can afford to scale things to the skies.
It would certainly be nice though if this kind of negative result was published more often instead of leaving people to guess why a seemingly useful innovation wasn't adopted in the end.
https://huggingface.co/tencent/WeDLM-8B-Instruct
Diffusion isn't natively supported in the transformers library yet, so you have to use their custom inference code.
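Concretely, that means the usual trust_remote_code dance (a sketch; which Auto class and which generation method you need depends on what the repo's custom code actually registers, so check the model card):

```python
# Loading a HF repo that ships its own modeling code: trust_remote_code
# tells transformers to use the repo's custom classes instead of built-ins.
from transformers import AutoModel, AutoTokenizer

repo = "tencent/WeDLM-8B-Instruct"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
# The generation entry point is defined by the repo's custom code,
# not by the standard transformers generate() contract.
```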
If people can make RL scalable -- make it so that RL isn't just a final phase, but something as big as the supervised stuff -- then diffusion models are going to have an advantage.
If not, I think autoregressive models will still be preferred. Diffusion models' outputs become fixed very fast; they can't actually refine them, so we're not talking about the kind of refinement that goes: initial idea -> better idea -> something actually sound.
I'm really curious about this, I'm but a simple client developer, so I don't actually grok some of the differences.
For lack of a better word, there's a "normie" position that "omg diffusion means it can edit!!111! big unlock!" -- I think that's cute but I also don't see it as intuitively correct. And I guess I don't even know why I don't see it that way. But regardless, it sounds like I'm correct there.
> If not, I think autoregressive models will still be preferred.
But here I get lost: at least so far, diffusion models seem strictly faster, and on par in quality with autoregressive models of the same parameter count.
If that is the case, why would autoregressive models still be preferred?
Asking this also makes me realize I am treating "diffusion models are better" as a premise, if I'm asserting they're always faster and ~same quality...
Check out this paper where they use diffusion during inference on the autoencoded prediction of an autoregressive model: https://openreview.net/forum?id=c05qIG1Z2B
Although the lab that did this research (Chris Re and Tri Dao are involved) is run by the world's experts in squeezing CUDA and Nvidia hardware for every last drop of performance.
At the API level, the primary differences will be the addition of text infill capabilities for language generation. I also somewhat expect certain types of generation to be more cohesive (e.g. comedy or stories where you need to think of the punchline or ending first!)
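For instance, an infill request might carry both sides of the gap explicitly. This shape is entirely hypothetical (invented field names, no real endpoint); AR APIs typically emulate infill with special FIM tokens instead.

```python
# A hypothetical shape for a diffusion-backed infill request. The point is
# that the request carries both the prefix and the suffix of the gap.
import json

infill_request = {
    "model": "some-diffusion-lm",          # placeholder name
    "prefix": "The detective opened the door and saw ",
    "suffix": " which explained everything.",
    "max_infill_tokens": 64,
}
print(json.dumps(infill_request, indent=2))
```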