I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.
it said "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.
Specifically, your model now has two "correct" classes, p(class=y|x) and p(class=⊥|x). This makes the results ambiguous. The way you resolve this is by adding a cost of answering wrong and a cost of not answering.
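$$
L(y, y') =
\begin{cases}
0 & \text{if } y' = y \\
l_{\text{err}} & \text{if } y' \neq y \text{ and } y' \neq \bot \\
l_{\bot} & \text{if } y' = \bot
\end{cases}
$$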
You can then estimate the expected error over your dataset. Notice that this now gives you additional degrees of freedom: Depending on how expensive answering wrong is compared to not answering at all, your predictor might be really bad or really good.
This means when benchmarking with a "no answer" action, you are often not actually benchmarking whether the model works well or not, but rather are benchmarking how well the model _happens_ to agree with the class-error weight you (implicitly) chose in your model.
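To make the degrees-of-freedom point concrete, here is a minimal sketch of that loss in Python. The two predictors (`bold`, `cautious`) and the cost values are made up for illustration, and `None` stands in for ⊥; nothing here comes from the benchmark being discussed.

```python
ABSTAIN = None  # stands in for the "no answer" class ⊥


def loss(y, y_pred, l_err, l_abstain):
    """Per-example loss: 0 if correct, l_err if wrong, l_abstain if no answer."""
    if y_pred is ABSTAIN:
        return l_abstain
    return 0.0 if y_pred == y else l_err


def expected_loss(labels, preds, l_err, l_abstain):
    """Average loss over a dataset."""
    total = sum(loss(y, p, l_err, l_abstain) for y, p in zip(labels, preds))
    return total / len(labels)


# Hypothetical toy data: 10 questions whose true answer is always "a".
labels = ["a"] * 10
bold = ["a"] * 7 + ["b"] * 3            # always answers: 7 right, 3 wrong
cautious = ["a"] * 5 + [ABSTAIN] * 5    # 5 right, 5 abstentions, never wrong

for l_err, l_abstain in [(1.0, 1.0), (1.0, 0.25)]:
    print(
        f"l_err={l_err}, l_abstain={l_abstain}: "
        f"bold={expected_loss(labels, bold, l_err, l_abstain):.3f}, "
        f"cautious={expected_loss(labels, cautious, l_err, l_abstain):.3f}"
    )
```

With equal costs the predictor that always answers comes out ahead; once abstaining is four times cheaper than a wrong answer, the ranking flips, even though neither predictor changed.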
So we have a situation where models that can solve challenging problems also tend to have problems with hallucinating, but those hallucinations seem to be the breeding ground for the solutions that got them their high "Wow"-factor intelligence.
Fable model being removed from Anthropic because of security concerns by the US government (or well, also partially because of the personal vendetta between US govt and Anthropic)
An LLM outputs tokens, one-by-one. It stops the loop if it outputs the end-of-text token. Which is, of course, statistically much rarer than any other kind of token.
(This is why you cannot, in general, prompt an LLM with something like "don't answer if the result is incorrect". It has to output something, by design.)
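A rough sketch of the loop being described, in Python. `toy_next_token` is a hypothetical stand-in for a real model (a fixed distribution in which the end-of-text token is rare), not any actual library API; the vocabulary and probabilities are invented.

```python
import random

EOS = "<eos>"  # end-of-text token


def toy_next_token(context, rng):
    """Hypothetical stand-in for a model: samples one token, EOS is rare."""
    vocab = ["the", "answer", "is", "42", EOS]
    weights = [0.30, 0.30, 0.20, 0.15, 0.05]
    return rng.choices(vocab, weights=weights, k=1)[0]


def generate(prompt_tokens, max_tokens=50, seed=0):
    """Emit tokens one by one until EOS is sampled or the budget runs out."""
    rng = random.Random(seed)
    output = []
    for _ in range(max_tokens):
        token = toy_next_token(prompt_tokens + output, rng)
        if token == EOS:
            break  # the only way the loop ends early
        output.append(token)
    return output


print(generate(["what", "is", "6", "times", "7", "?"]))
```

There is no separate "decline to answer" branch in this loop: staying silent would just mean sampling the rare EOS token early, and anything else is more text.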
This leads to answer bloat and/or hallucination if you benchmaxx on those
Let's say there are 100 questions, with 4 answers each. A good answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.
If instead a wrong answer is -1 point, by just guessing you get on average -50/100 (25 right minus 75 wrong), way worse than 0/100.
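The guessing arithmetic can be checked with a few lines of Python. The function name and parameters are just illustrative.

```python
def expected_guess_score(n_questions, n_choices, reward_right, penalty_wrong):
    """Expected total score when every answer is a uniform random guess."""
    p_right = 1 / n_choices
    per_question = p_right * reward_right + (1 - p_right) * penalty_wrong
    return n_questions * per_question


print(expected_guess_score(100, 4, 1, 0))    # no penalty for wrong answers -> 25.0
print(expected_guess_score(100, 4, 1, -1))   # wrong answers cost a point   -> -50.0
```

With four choices, guessing only breaks even with not replying when a wrong answer costs 1/3 of a point; any smaller penalty still rewards guessing.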
Data at https://gertlabs.com/rankings
In the coding index, GPT-5.5 gets 59.1, 58.5, 56.2, and 52.1 for xhigh, high, medium, and low while Muse Spark is behind at 47.5. For agentic, GPT-5.5 gets 74.1, 72.0, 69.4, and 59.7 (xhigh, high, medium, low) while Muse Spark gets 62.0 (beating only GPT-5.5 low).
GPT-5.5 is beaten only by Opus 4.8 in their general index, takes the top spot for coding, and is #3 behind Opus 4.8 and GLM-5.2 for agentic (excluding Fable 5, which takes the top spot but is unavailable).
It avoided answering 2/21 tests in this specific benchmark; that alone caps its maximum score at about 90%.
Whatever it is you're measuring, it's not anything related to what I use models for.
What are you using Claude models for? Coding only? Computer use? Which harness?
I've experimented with a few models for all this and have found Gemini the best at OCR but quite a bit worse at the rest. Claude is worse than GPT at web research-shaped things, but Opus 4.8 wins my anecdote benchmark for the other tasks besides those two.
But really, for code or knowledge stuff Gemini is markedly worse than the others, while Opus and GPT 5.5 are very very close.
AMD’s stock price reflects a hope they launch a CUDA alternative. But this is unlikely for the near future.
There is a lot of interest in preventing China from coming in with cheap AI hardware.
So I expect the direction to be good local models that few can run effectively.
I can't say I'm as optimistic about there continuing to be an open market for foreign LLMs.