This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
The work on the capacity of discriminators was super cool.
> much harder to train than transformers
There are plenty of GANs that use transformers. PWC seems to be redirecting to GitHub currently, but IIRC about half of the top scores on FFHQ-256 were GANs with transformers in them. I know the number 2 was; I saw it at CVPR. It was a lot smaller and had higher throughput than the diffusion models it was outperforming.

The main reason diffusion took over was its ability to encode more diversity. I still think there's a place for GANs, and we overcorrected by putting too much focus on diffusion, but diffusion does have some fundamental advantages. That said, it isn't strictly better; there's no global optimum for solution spaces this large. I think the ML community (maybe CS in general) has a tendency to take an all-or-nothing approach, and I don't think that's a good strategy...
[0] https://arxiv.org/abs/2211.05770
[1] https://arxiv.org/abs/2106.07631
[2] https://arxiv.org/abs/2206.02262
And beyond that. Sometimes I feel like AI research is reinventing wheels that exist elsewhere. Maybe just the wheels, but still.
This all reminds me of the bias term of a perceptron.
And with transformers, we started with no bias at all, and the network repurposed one of the input tokens for it. That annoyed some people, because dropping that particular token now affects the whole output unreasonably, and it annoyed others because the attention weight on that token was unreasonably high, since it was effectively balancing all the rest.
So initially people (from HAN Lab) tried to pin that token in place so it doesn't get dropped. Then others (from OpenAI this time) decided to skip the token entirely by providing a learnable bias inside the network (doing what the classical perceptron does). Now this author proposes a further optimization: just set the bias to 1 everywhere. That might work perfectly fine, since we don't really care about absolute values; ultimately we just pick the largest one and don't care what it was. During training, all the other weights simply get rescaled so the bias can be 1. It's a little like doing physics calculations with the speed of light set to 1.
If you have a simple feed-forward network of perceptrons where, in the end, you just pick the largest output and don't care about absolute values, then maybe you'd also be fine with setting all the perceptron bias terms to 1 and excluding them from learning.
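To make the bias-in-the-softmax idea concrete, here's a toy sketch (the function names and the NumPy framing are mine, not from any of the papers): appending an extra "sink" logit to the attention scores is the same as adding a constant to the softmax denominator, and a sink logit of 0 contributes exp(0) = 1, i.e. the "just add 1" variant.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_with_sink(logits, sink_logit=0.0):
    # Append one extra "sink" logit that absorbs probability mass,
    # then drop it from the output. sink_logit can be a learned
    # per-head parameter (the learnable-bias approach) or a fixed
    # constant (sink_logit=0.0 adds exactly 1 to the denominator).
    pad = np.full(logits.shape[:-1] + (1,), sink_logit)
    probs = softmax(np.concatenate([logits, pad], axis=-1))
    return probs[..., :-1]

scores = np.array([1.0, 2.0, 3.0])
plain = softmax(scores)            # sums to 1: all mass on real tokens
sunk = softmax_with_sink(scores)   # sums to less than 1: sink took some mass
```

Note that the ranking of the real tokens is unchanged; only the absolute values shrink, which is exactly why "we just pick the largest one" makes this safe.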
Is bias learnable in biological neurons? Doesn't the activation potential threshold (or whatever it's called) rely on chemistry, and isn't it the same for all neurons?
The llama.cpp implementation for gpt-oss this week showed 2-3x improvements: https://github.com/ggml-org/llama.cpp/pull/15157 https://www.reddit.com/r/LocalLLaMA/comments/1mkowrw/llamacp...
This sounds like it is working for the wrong reasons. Surely the right behavior is for the right neurons to receive attention rather than the first handful. Jamming everything there is the complementary sin of blurring. I would investigate attention equalization paired with a sparsity prior or something similar to prevent blurring.
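Purely to sketch what I mean by equalization plus a sparsity prior (this is my own illustration, not from any of the linked work): penalize each attention row's entropy for deviating from a target. A high target pushes back against collapsing onto one sink token; a low target acts as a sparsity prior against blurring.

```python
import numpy as np

def attention_entropy_penalty(attn, target_entropy):
    # attn: attention weights where rows over the last axis sum to 1
    # (e.g. shape (heads, queries, keys)). Compute per-row entropy and
    # penalize its squared deviation from a chosen target entropy.
    ent = -(attn * np.log(attn + 1e-9)).sum(axis=-1)
    return ((ent - target_entropy) ** 2).mean()

# A uniform row hits a target of log(n_keys) exactly; a peaked row
# (e.g. everything dumped on the first token) pays a large penalty.
uniform = np.full((1, 4), 0.25)
peaked = np.array([[0.97, 0.01, 0.01, 0.01]])
```

This would be added to the training loss with a small coefficient; the point is just that "don't dump everything on the first token" and "don't blur across everything" are both expressible as constraints on the attention distribution's entropy.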
Another approach I've seen is the "Diff transformer" from MS Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).
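The core trick there, as I understand it (this sketch is mine and simplified; the paper learns λ with a particular reparameterization): compute two softmax attention maps and subtract one from the other, so common-mode noise, like mass dumped on irrelevant tokens, cancels out, much like a differential amplifier.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    # Differential attention: two query/key projections produce two
    # attention maps; the second, scaled by lam, is subtracted from
    # the first before attending over v. With lam=0 this reduces to
    # ordinary scaled dot-product attention.
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v
```

Note the subtracted map means rows no longer sum to 1 and entries can go negative, which is intentional: shared noise in both maps cancels.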
I wonder if it makes sense to use the first word as a title of sorts, rather than going straight into a grammatically correct sentence, when prompting.
However, that seems to be contradicted by what was shown recently with the successful International Math Olympiad effort. Their prompts, such as https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro... , were very terse. It's hard to tell where the prompt stops and the CoT response starts, in fact.
So there is probably some interplay between the need for attention sinks and the use of step-by-step reasoning. It might not be too surprising if the latter works because it's an indirect way to optimize the former.
It would be amusing if all that gratuitous sycophancy actually helped with inference accuracy. It would also be worth treating that as a bug to be fixed, of course.
That's because you're looking at the final output that includes neither the prompt nor the intermediate chain of thought.
That said, now I'm wondering if all those dashes it spews out are more than just window dressing.
Oh, the irony.