Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting; they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not outright caution.
This take is completely oblivious, and frankly sounds like a desperate jab. There are a myriad of activities whose core requirement is a) derive info from a complex context which happens to be supported by a deep and plentiful corpus, b) employ glorified template and rule engines.
LLMs excel at what might be described as interpolating context, following input and output in natural language. As in a chatbot that is extensively trained on domain-specific tasks, and which can also parse and generate content. There are absolutely zero lines of intellectual work that do not benefit extensively from this sort of tool. Zero.
This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
Some teachers try to collect the phones beforehand, but then students simply give out older phones and keep their active ones with them.
You could try to verify that the phones they're giving out are working by calling them, but that would take an enormous amount of time and it's impractical for simple exams.
We really have no idea how much AI is ruining education right now.
Any of the following could work, though the specific tradeoffs & implementation details do vary:
- have <n> teachers walking around the room to watch for cheaters
- mount a few cameras to various points in the room and give the teacher a dashboard so that they can watch from all angles
- record from above and use AI to flag potential cheaters for manual review
- disable Wi-Fi + activate cell jammers during exam time (with a land-line in the room in case of emergencies?)
- build dedicated examination rooms lined with metal mesh to disrupt cell reception
So unlike "beating LLMs" (where it's an open question as to whether it's even possible, and a moving target to boot), barring serious advances in wearable technology this just seems like a question of funding and therefore political will.
After all, they will grow up next to these things. They will do the homework today; by the time they graduate, the LLM will take their job. There might be human large language model managers for a while, soon to be replaced by the age of idea men.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).
The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.
(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)
If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.
But any trick prompt like this is going to start giving expected results once it gets well-known enough.
Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
This is still the case. Very few non-reasoning models can solve such variations correctly, even SOTA models. Worse yet, not only do they confidently give wrong responses, but they often do so even when specifically told to use CoT, and they continue giving wrong answers in a loop even if you specifically point out where they are wrong.
Reasoning models do much better, though. E.g. QwQ-32b can solve it pretty reliably, although it takes a lot of tokens for it to explore the possibilities. But at least it can fairly consistently tell when it's doing something wrong and then backtrack.
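For anyone who wants to spot-check these variants themselves, here's a minimal brute-force solver (my own sketch, not anything from the models): plug in whatever "X eats Y" rules and boat capacity the variation states, and a BFS either finds a plan or shows that none exists.

    # River-crossing variant checker: edit FORBIDDEN and BOAT_CAPACITY to match
    # the modified puzzle (e.g. add ("cabbage", "fox") for the cabbage-eats-fox
    # twist, or set BOAT_CAPACITY = 3 for the three-items-per-trip version).
    from itertools import combinations
    from collections import deque

    ITEMS = frozenset({"fox", "chicken", "cabbage"})
    FORBIDDEN = {frozenset(p) for p in [("fox", "chicken"), ("chicken", "cabbage")]}
    BOAT_CAPACITY = 1  # items the farmer can carry per trip

    def safe(bank):
        return not any(pair <= bank for pair in FORBIDDEN)

    def solve():
        start = (ITEMS, "left")  # (items on the left bank, farmer's side)
        seen, queue = {start}, deque([(start, [])])
        while queue:
            (left, farmer), path = queue.popleft()
            if not left and farmer == "right":
                return path
            here = left if farmer == "left" else ITEMS - left
            for k in range(BOAT_CAPACITY + 1):
                for cargo in combinations(here, k):
                    cargo = frozenset(cargo)
                    new_left = left - cargo if farmer == "left" else left | cargo
                    new_farmer = "right" if farmer == "left" else "left"
                    unattended = new_left if new_farmer == "right" else ITEMS - new_left
                    state = (new_left, new_farmer)
                    if safe(unattended) and state not in seen:
                        seen.add(state)
                        queue.append((state, path + [(new_farmer, sorted(cargo))]))
        return None  # no solution exists under these rules

    print(solve())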
One other example that befuddles even the reasoning models is frying-cubes-in-a-pan and equivalents, e.g. this version from Simple Bench:
> Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option. A) 5 B) 11 C) 0 D) 20
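For reference, the literal arithmetic (my own working, reading "average per minute" as being taken over all four minutes):

    minutes = 4
    total_placed = 5 * minutes             # average of 5 per minute -> 20 cubes in total
    third_minute = total_placed - (4 + 5 + 0)
    print(third_minute)                    # 11, the literal answer (option B)
    # But the pan has been frying an egg the whole time, so the number of whole
    # ice cubes left at the end of the third minute is realistically 0 (option C),
    # which is presumably the intended trick.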
Which makes it difficult to fairly evaluate whether the models have actually gotten better at the feather/iron problem or if it just got enough samples of trick questions that it learned better, either naturally from the internet, or fed as part of the training data. I am fairly certain the training data has had "trick questions" like this added to it, because, I mean, why wouldn't it?
I have noticed in my playing with image AIs that they seem more prone than LLMs to getting dragged into the local maximum of a well-known prompt when mine is merely close to it. Perhaps it's all the additional data in an image that reveals it.
No. You're only arguing LLMs are useless at regurgitating homework assignments to allow students to avoid doing it.
The point of education is not mindlessly doing homework.
So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.
This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious failure can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
I look at optical illusions like The Dress™ and am impressed that I cannot force my brain to see it correctly even though I logically know what color it is supposed to be.
Finding new ways that our brains can be fooled despite knowing better is kind of a fun exercise in itself.
In my view, the trick as it is intended to appear to the audience and the explanation of how the trick is performed are equal and inseparable aspects of my interest as a viewer. Either one without the other is less interesting than the pair.
As a long-time close-up magician and magical inventor who's spent a lot of time studying magic theory (which has been a serious field of magical research since the 1960s), it depends on which way we interpret "how the trick works." Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.
However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting. Often the actual immediate 'secret' of the method is so simple and, in hindsight, obvious that many non-magicians feel rather let down if the method is revealed. This is the main reason magicians usually don't reveal secret methods to non-magicians. It's not because of some code of honor, it's simply because the vast majority of people think they'll be happy if they know the secret but are instead disappointed.
Where studying close-up magic gets really fascinating is understanding why that simple, obvious thing works to mislead and then surprise audiences in the context of this trick. Very often changing subtle things seemingly unrelated to the direct method will cause the trick to stop fooling people or to be much less effective. Comparing a master magician to even a competent, well-practiced novice performing the exact same effect with the same method can be a night and day difference. Typically, both performances will fool and entertain audiences but the master's performance can have an intensely more powerful impact. Like leaving most audience members in stunned shock vs just pleasantly surprised and fooled. While neither the master nor novice's audiences have any idea of the secret method, this dramatic difference in impact is fascinating because careful deconstruction reveals it often has little to do with mechanical proficiency in executing the direct method. In other words, it's rarely driven by being able to do the sleight of hand faster or more dexterously. I've seen legendary close-up masters like a Dai Vernon or Albert Goshman when in their 80s and 90s perform sleight of hand with shriveled, arthritic hands incapable of even cleanly executing a basic palm, absolutely blow away a roomful of experienced magicians with a trick all the magicians already knew. How? It turns out there's something deep and incredibly interesting about the subtle timing, pacing, body language, posture, and psychology surrounding the "secret method" that elevates the impact to almost transcendence compared to a good, competent but uninspired performance of the same method and effect.
Highly skilled, experienced magicians refer to the complex set of these non-method aspects, which can so powerfully elevate an effect to another level, as "the real work" of the trick. At the top levels, most magicians don't really care about the direct methods which some audience members get so obsessed about. They aren't even interesting. And, contrary to what most non-magicians think, these non-methods are the "secrets" master magicians tend to guard from widespread exposure. And it's pretty easy to keep this crucially important "real work" secret because it's so seemingly boring and entirely unlike what people expect a magic secret to be. You have to really "get it" on a deeper level to even understand that what elevated the effect was intentionally establishing a completely natural-seeming, apparently random three-beat pattern of motion and then carefully injecting a subtle pause and slight shift in posture to the left six seconds before doing "the move". Audiences mistakenly think that "the hidden move" is the secret to the trick when it's just the proximate first-order secret. Knowing that secret won't get you very far toward recreating the absolute gob-smacking impact resulting from a master's years of experimentation figuring out and deeply understanding which elements beyond the "secret method" really elevate the visceral impact of the effect to another level.
> However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting.
Fair enough. It sounds like I simply fundamentally disagree, because I think nearly any explanation of method is very interesting. For close-up magic, the only exceptions for me would be if the explanation is "the video you were watching contains visual effects" or "the entire in-person audience was in on it."
Palming is awesome. Misdirection is awesome. I fully expect these sorts of things to be used in most magic tricks, but I still want to know precisely how. The fact that I'm aware of most close-up magic techniques but am still often fooled by magic tricks should make it pretty clear that the methods are interesting!
Since studying magic has been a lifelong passion since I was a kid, I clearly couldn't agree more. However, experience has shown that despite claiming otherwise, most people aren't actually interested in the answer to "How did you do that?" beyond the first 30 seconds. So... you're unusual - and that's great!
> but I still want to know precisely how.
Well, you're extremely fortunate to be interested in learning how magic is really done at the best time in history for doing so. I was incredibly lucky to be accepted into the Magic Castle as a teenager and mentored by Dai Vernon (widely thought to be the greatest close-up magician of the 20th century) who was in his late 80s at the time. I also had access to the Castle's library of magic books, the largest in the world at the time. 99% of other kids on Earth interested in magic at the time only had a handful of local public library books and mail-order tricks.
Today there's an incredible amount of insanely high-quality magic instruction available in streaming videos, books and online forums. There are even master magicians who teach those willing to learn via Zoom. While most people think magicians want to hoard their secrets, the reality couldn't be more different. Magicians love teaching how to actually do magic to anyone who really wants to learn. However, most magicians aren't interested in wasting time satisfying the extremely fleeting curiosity of those who only want to know "how it works" in the surface sense of that first 30 seconds of only revealing the proximate 'secret method'.
Yet many magicians will happily devote hours to teaching anyone who really wants to actually learn how to do magic themselves and is willing to put in the time and effort to develop the skills, even if those people have no intention of ever performing magic for others - and even if the student isn't particularly good at it. It just requires the interest to go really deep on understanding the underlying principles and developing the skills, even if for no other purpose than just having the knowledge and skills. Personally, I haven't performed magic for non-magicians in over a decade but I still spend hours learning and mastering new high-level skills because it's fun, super intellectually interesting and extremely satisfying. If you're really interested, I encourage you to dive in. There's quite literally never been a better time to learn magic.
The point is the analogy to LLMs. A lot of people are very optimistic about their capabilities, while other people who have "seen behind the curtain" are skeptical, and feel that the fundamental flaws are still there even if they're better-hidden.
The AI will create something for you and tell you it was them.
"Good point! Blah blah blah..."
Absolutely shameless!
Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ) I haven't seen anything that ridiculous from a leading-edge model for a year or more.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.
Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.
When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.
Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).
I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
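For comparison, the whole exercise is a few lines of arithmetic (my own sketch; the 1000 m³ cabin figure, the 30% cut for seats and the 74% packing density are the chat's rough assumptions, not measurements):

    import math

    ball_diameter_m = 0.04267                      # regulation golf ball, 42.67 mm
    ball_volume = (4 / 3) * math.pi * (ball_diameter_m / 2) ** 3
    print(f"ball volume = {ball_volume:.8f} m^3")  # ~0.00004068

    for cabin_volume in (1000, 700):               # with and without the 30% cut for seats
        naive = cabin_volume / ball_volume
        packed = naive * 0.74                      # random close packing of spheres
        print(f"{cabin_volume} m^3: naive {naive:,.0f}, at 74% packing {packed:,.0f}")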
Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.
1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).
In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
No wonder Trump isn't afraid to put taxes against Canada. Who could take a 3.8-square-mile country seriously?
More than even filling the gaps in knowledge / skills, would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.
Looking up the math ability of the average American this is given as an example for the median (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
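For reference, here's what that formula actually computes, with made-up illustrative numbers (not real trade data): it's a function of the trade imbalance, and says nothing about the tariffs the other country actually charges.

    def claimed_tariff_rate(imports_from_country: float, exports_to_country: float) -> float:
        trade_deficit = imports_from_country - exports_to_country
        return trade_deficit / imports_from_country

    print(claimed_tariff_rate(100.0, 50.0))   # 0.5 -> "50% tariff" from a 2:1 trade imbalance
    print(claimed_tariff_rate(100.0, 100.0))  # 0.0 -> balanced trade implies "no tariff" at all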
- USAMO - United States of America Mathematical Olympiad
- IMO - International Mathematical Olympiad
- ICPC - International Collegiate Programming Contest
Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.
https://openai.com/index/learning-to-reason-with-llms/
The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise it's a fun mathematical puzzle and the biology is pretty wild too
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there, but it's different from how we think about it.
The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required. Maybe even incorrect ones. Whatever that means in your domain. Another thing to check is the optimizer; try several. Check the default parameters. I've heard Adam's defaults lead to instability later in training.
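A minimal sketch of what "try several optimizers, check the defaults" might look like in PyTorch (the model here is a stand-in, not the poster's actual density-estimation setup):

    import torch

    model = torch.nn.Linear(128, 10)  # placeholder for the real model

    candidates = {
        # Adam's defaults (betas=(0.9, 0.999), eps=1e-8) are sometimes blamed for
        # late-training instability; a larger eps and smaller beta2 are common tweaks.
        "adam_default": torch.optim.Adam(model.parameters(), lr=1e-3),
        "adam_tweaked": torch.optim.Adam(model.parameters(), lr=1e-3,
                                         betas=(0.9, 0.99), eps=1e-4),
        "adamw":        torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2),
        "sgd_momentum": torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9),
    }
    # Run a short training job with each candidate and compare the loss curves
    # before committing to a long run.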
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.
And yes, the fact that I don't have to look at matplotlib documentation anymore makes these models extremely useful already, but that's qualitatively different from having Putnam-winning reasoning ability.
Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.
I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
Just feels less "stable" or "tight" overall.
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
And this only suggested LLMs aren't trained well to write formal math proofs, which is true.
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202
Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876
DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
In common terms, suppose I say: "there is only room for one person or one animal in my car to go home." One can suppose that this refers to additional room besides that occupied by the driver. There is a problem when we try to use an LLM trained on the common use of language to solve puzzles in formal logic or math. I think the current LLMs are not able to maintain a specialized context to become a logical reasoning agent, but perhaps such a thing could be possible if the evaluation function of the LLM were designed to give high credit to changing context with a phrase or token.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as of breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
And to be clear, that's pretty much all this was: there's six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.
USAMO : USA Math Olympiad. Referenced here: https://arxiv.org/pdf/2503.21934v1
IMO : International Math Olympiad
SOTA : State of the Art
OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance from LLMs (results that are at odds with how they are hyped).
Instead they're barely able to eke out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/
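For anyone curious, the harness behind an eval like that is roughly this shape (a sketch on my part; ask_llm_for_move is a placeholder for whatever API call you'd actually make, and illegal replies being common is itself part of what gets measured):

    import random
    import chess  # pip install python-chess

    def random_bot(board: chess.Board) -> chess.Move:
        return random.choice(list(board.legal_moves))

    def ask_llm_for_move(board: chess.Board) -> chess.Move:
        # Placeholder: send board.fen() plus the legal moves to the model and
        # parse its reply, falling back to a random legal move if it's illegal.
        raise NotImplementedError

    def play_one_game() -> str:
        board = chess.Board()
        while not board.is_game_over():
            mover = ask_llm_for_move if board.turn == chess.WHITE else random_bot
            board.push(mover(board))
        return board.result()  # "1-0", "0-1" or "1/2-1/2"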
https://dictionary.cambridge.org/us/dictionary/english/eke-o... to obtain or win something only with difficulty or great effort
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
Much in the same way a human who has only just learnt the rules but has zero strategy would very, very rarely lose here
These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
If you think you can play chess at that level over that many games and moves with memorization then I don't know what to tell you except that you're wrong. It's not possible, so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines that have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeated ad nauseam. You want them to learn the rules of a complex game? That's how you do it: by training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These machines have theirs. That's all there is to it. The top comment we are both replying to noted that gemini-2.5-pro, which was released less than 5 days later, hit 25% on the benchmark. Now that was particularly funny.
Yes, that's all there is to it and it's not enough. I ain't paying for another defective organism that makes mistakes in entirely novel ways. At least with humans you know how to guide them back on course.
If that's the peak of "AI" evolution today, I am not impressed.
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Like how
- Training LLMs on code makes them solve reasoning problems better
- Training on language Y alongside X makes them much better at Y than if they were trained on language Y alone

and so on.
Probably because, well, gradient descent is a dumb optimizer and training is more like evolution than like a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better. https://dynomight.net/more-chess/
Very hard for me to wrap my head around the idea that an LLM being able to discuss, even perhaps teach high level chess strategy wouldn't transfer at all to its playing performance
https://dynomight.substack.com/p/chess
Discussion here: https://news.ycombinator.com/item?id=42138289
OpenAI, Anthropic and the like simply don't care much about their LLMs playing chess. That, or post-training is messing things up.
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, then that's how you get them to learn it: by training them to predict chess moves. Chess books during training only teach them how to converse about chess.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
And that post had a follow-up. Post-training messing things up could well be the issue, seeing the impact that even a few more examples and/or regurgitation made. https://dynomight.net/more-chess/
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task-specific training. Any human good enough at chess, or at anything else, to make it a profession needs it. So I have no idea why people would expect any less of a machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah, so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too; it's just a matter of how much OpenAI cares about it.
So, the fact that LLMs can't learn this sample game despite probably including all of the books ever written on it in their training set tells us something about their general reasoning skills.
I.e. if you randomly sampled N humans to take those tests.
It’s just a much harder math benchmark which will fall by the end of next year just like all the others. You won’t be vindicated.
The entire point of USAMO problems is that they demand novel insight and rigorous, original proofs. They are intentionally designed not to be variations of things you can just look up. You have to reason your way through, step by logical step.
Getting 25% (~11 points) is exceptionally difficult. That often means fully solving one problem and maybe getting solid partial credit on another. The median score is often in the single digits.
That's true, but of course, not what I claimed.
The claim is that, given the ability to memorize every mathematical result that has ever been published (in print or online), it is not so difficult to get 25% correct on an exam by pattern matching.
Note that this skill is, by definition, completely out of the reach of any human being, but that possessing it does not imply creativity or the ability to "think".
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.
LLMs fundamentally do not want to seem anything
But the companies that are training them and making models available for professional use sure want them to seem agreeable
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
A standard way of talking used by people who do also frequently claim real agency.
I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
Yes. The issue here is control and NLP is a poor interface to exercise control over the computer. Code on the other hand is a great way. That is the whole point of skepticism around LLM in software development.
And when you ask it to 'just write a test', 50/50 it will try to run it, fail on some trivial issues, delete 90% of your test code and start to loop deeper and deeper into the rabbit hole of its own hallucinations.
Or maybe I just suck at prompting hehe
Every time someone argues for the utility of LLMs in software development by saying you need to be better at prompting, or add more rules for the LLM on the repository, they are making an argument against using NLP in software development.
The whole point of code is that it is a way to be very specific and exact and to exercise control over the computer behavior. The entire value proposition of using an LLM is that it is easier because you don't need to be so specific and exact. If then you say you need to be more specific and exact with the prompting, you are slowly getting at the fact that using NLP for coding is a bad idea.
umm, it seems to me that it is this (tfa):

> But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.

and then a couple of lines down from the above statement, we have this:

> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down.
There was a little girl,
Who had a little curl,
Right in the middle of her forehead.
When she was good,
She was very good indeed,
But when she was bad she was horrid.
In fact, this might be why so many business executives are enamored with LLMs/GenAI: It's a yes-man they don't even have to employ, and because they're not domain experts, as per usual, they can't tell that they're being fed a line of bullshit.
I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.
But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.
Furthermore, as we are talking about the actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks for figuring it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says they do not understand how others can be so skeptical of LLMs, concludes with certainty that the skeptics must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.
In the absence of actually good evidence, anecdotal data may be the best we can get now. The point imo is try to understand why some anecdotes are contrasting each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.
People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.
But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.
Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.
A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at reaching the local maxima we want them to reach. Until a generative model shows actual progress at "making decisions", it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) have a truth table with eight values, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
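A tiny enumeration of that truth table, assuming the commenter's implication holds (my own sketch): the only row the implication rules out is exactly the (false, true, false) row attributed to the skeptics.

    from itertools import product

    for a, b, c in product([False, True], repeat=3):
        premise = (not a) and b           # no proven wall, and motivation to continue
        implication = (not premise) or c  # "(not a) and b => c"
        marker = "  <- ruled out" if not implication else ""
        print(f"a={a!s:5} b={b!s:5} c={c!s:5}{marker}")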
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.
People having vastly different opinions on AI simply comes down to token usage. If you are using millions of tokens on a regular basis, you completely understand the revolutionary point we are at. If you are just chatting back and forth a bit with something here and there, you'll never see it.
Someone who lacks experience, skill, training, or even the ability to evaluate results may try to use a tool and blame the tool when it doesn't give good results.
That said, the hype around LLMs certainly overstates their capabilities.
I'd love to see a survey from a major LLM API provider that correlated LLM spend (and/or tokens) with optimism for future transformativity. Correlation with a view of "current utility" would be a tautology, obviously.
I actually have the opposite intuition from you: I suspect the people using the most tokens are using it for very well-defined tasks that it's good at _now_ (entity extraction, classification, etc) and have an uncorrelated position on future potential. Full disclosure, I'm in that camp.
Once all the AI-batch startups have sold subscriptions to the cohort and there's no further market growth, because businesses outside it don't want to roll the dice on a probabilistic model that doesn't have an understanding of much of anything but rather is a clever imitation machine trained on the content it has seen, the AI bubble will burst, with more startups packing up by the end of 2026 or 2027 at the latest.
Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.
It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.
Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.
This would be so useful. I have thought about this missing piece a lot.
Different tools like Cursor vs. Windsurf likely have their own system prompts for each model, so the testing really needs to be done in the context of each tool.
This seems somewhat straightforward to do using a testing tool like Playwright, correct? Whoever first does this successfully will have a popular blog/site on their hands.
Despite me rejecting the changes and explicitly telling it to ignore the linter it kept insisting on only trying to solve for that
I don't want to drastically change my current code, nor do I like being told to create several new files and numerous functions/classes to solve this problem. I want you to think clearly and be focused on the task and don't get wild! I want the most straightforward approach which is elegant, intuitive, and rock solid.
Not cool, Claude 3.7, not cool.
I have to wonder, wouldn't just writing the code be more productive in the end?
Yes: if you are an expert in the area. In this case I needed something fairly specific I am far from an expert in. I know both Elixir and Rust quite well but couldn't quickly figure out how to be able to wrap a Rust object in just the right container(s) data type(s) so it can be safely accessed from any OS thread even though the object at hand is `Send` but not `Sync`. And I wanted it done without a mutex.
No: because most programming languages are just verbose. Many times I know _exactly_ what I will write 10 minutes later but I still have to type it out. If I can describe it to an LLM well enough then part of that time is saved.
Mind you, I am usually an LLM hater. They are over-glorified, they don't "reason" and they don't "understand" -- it baffles me to this day that an audience seemingly as educated as HN believes in that snake oil.
That being said, they are still a useful tool and as good engineers it's on us to recognize a tool's utility and its strong and weak usages and adapt our workflows to that. I believe me and many others do just that.
The rest... believe in forest nymphs.
So yeah. I agree that a significant part of the time it's just quicker to type it out. But people like myself are good at articulating their needs so with us it's often a coin toss. I choose to type the code out myself more often than not because (1) I don't want to pay for any LLM yet and (2) I don't want to forget my craft which I love to this day and never did it just for the money.
Which does lead to all the weird discourse around them indeed.
Also:
> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.
Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.
Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task that older models simply made assumptions about, but newer models recognize as ambiguous and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
Why? Because I know so little about chemistry myself that I wouldn't even know what to start asking the model as to be impressed by the answer.
For the model to be useful at all, I would have to learn basic chemistry myself.
Many, though, I suspect are in this same situation with all subjects. They really don't know much of anything and are therefore unimpressed by the models' responses, in the same way I am not impressed with chemistry responses.
Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?
If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, they why would you expect it to improve programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help hardening corporate production systems?!
No LLM got silver last year. DeepMind had a highly specialized AI system earning that.
If a model doesn't do well on the benchmarks, it will either be retrained until it does or you won't hear about it.
It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
Even approximations must be right to be meaningful. If information is wrong, it's rubbish.
Presorting/labelling various data has value. Humans have done the real work there.
What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.
Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.
People who can’t recognize this intentionally have their heads in the sand
1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, the latter being more a matter of user experience. In absolute utility, git is likely the single most game-changing piece of software there is: it has probably saved a ten-, maybe eleven-digit number in engineer hours times salary through how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer.

For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious: a lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. If you have masses of NextJS code, which is disproportionately older versions but not labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth). Another example: when I tried to use it to generate flags for a V8 build, it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be (despite the existence of a list of V8 flags online) because many flags had very close representations in vector embeddings, and there was almost zero data or detailed examples on their use.
2) On the more theoretical side, the performance of LLMs on benchmarks (claims of being elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver level. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from all that are running really thin.
"game changing" isn't exactly the sentiment there the last couple months.
Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).
The problem is that up until _very_ recently, it's been possible to get LLMs to generate interesting and exciting results (as a result of all the API documentation and codebases they've inhaled), but it's been very hard to make that usable. I think we need to be able to control the output format of the LLMs in a better way before we can work on what's in the output. I don't know if MCP is the actual solution to that, but it's certainly an attempt at it...
I think this is where AI is falling short hugely. AI _should_ be able to integrate with IDEs and tooling (e.g. LSP, Treesitter, Editorconfig) to make sure that it's contextually doing the right thing.
But it's not.
The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification. I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....
A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.
A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
In my evals, 8-bit quantized smaller Qwen models were better, but again, evaluating is hard.
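For concreteness, here's a minimal sketch of the kind of deployment being described: a 4-bit quant of QwQ-32B loaded on a single GPU with Hugging Face transformers and bitsandbytes. The model ID, prompt, and generation settings are my own assumptions for illustration, not anything the commenters specified.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weights cut memory to roughly 0.5 bytes per parameter
# (on the order of 16-20 GB for a 32B model), which is what makes
# a single consumer/prosumer GPU plausible.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B",
    quantization_config=quant_config,
    device_map="auto",  # spill layers to CPU if the GPU is too small
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a binary search in Python."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))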
What innovation opens up when AI gets sufficiently commoditized?
For example, you can see this in health insurance reimbursements and wireless carriers' plan changes (i.e., Verizon's shift from Do More, etc., to what they have now).
Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis, or sporadically enough, that people will just pay up rather than navigate a maze of calls, website dead ends and time suck to recover funds that are due to them or that shouldn't have been taken in the first place.
I'm hopeful that well-commoditized AI will give consumers a fighting chance against this and other kinds of disenfranchisement that seem increasingly normalized by companies whose consultants do nothing but optimize for their own financial position.
I'm not surprised, because I don't expect pattern-matching systems to grow into something more general and useful. I think LLMs are essentially running into the same limitations that the "expert systems" of the 1980s ran into.
Even Sonnet 3.7 was able to do refactoring work on my codebase that Sonnet 3.6 could not.
Really not seeing the "LLMs not improving" story
It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
The others would all start off right, then quickly default to just grepping randomly for what they expected to find, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
All the others hallucinated the networking response I was trying to figure out. 2.5 provided it exactly, enough for me to intercept the request and use the response it provided to get what I wanted to show up.
awk '/^=> \[Function #/ {                              # a line starting a new function
  if (out) close(out);                                 # close the previous function file
  fn = $0; sub(/^.*#/, "", fn); sub(/ .*/, "", fn);    # pull out the function number
  out = "function_" fn ".txt"                          # one output file per function
}
{ if (out) print > out }' bundle.hasm                  # every line goes into the current file
Quick example of the output it gave and its process.

But it still feels more like a small incremental improvement than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I'd expect it to generate, without predictably walking me through a bunch of obvious wrong turns that I have to keep correcting, as I would have had to do with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing LLMs have achieved.
In my experience, trying to get an LLM to perform a task as vast and open-ended as the one the author describes is fundamentally misguided. The LLMs were not trained for that and won't be able to do it to a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools with which one could start training a model that can.
Contrast that with 5-6 years ago, when all you could hope for, for this kind of thing, were simple rule-based and pattern-matching systems.
Maybe an AI expert can elaborate on this, but it seems there's a limit to the fundamental underlying model of the LLM architecture of transformers and tokens.
LLMs are amazing, but we might need something more, or some new paradigm, to push us towards true AGI.
Current AI just cannot do the kind of symbolic reasoning required for finding security vulnerabilities in software. They might have learned to recognize "bad code" via pattern matching, but that's basically it.
"Is Paul Newman known for having had problems with alcohol?"
All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:
"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."
There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.
I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].
I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.
[1] https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...
"AI is making incredible progress but still struggles with certain subsets of tasks" is self-consistent position.
I don't think that is a safe assumption these days. Training a modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.
0 https://www.google.com/search?q=did+paul+newman+have+a+drink...
"Yes, Paul Newman was widely known for being a heavy drinker, particularly of beer. He himself acknowledged his significant alcohol consumption."
The answer I got (https://gemini.google.com/share/9e327dc4be03) includes references such as https://apnews.com/article/entertainment-reviews-movies-paul... and https://www.psychologytoday.com/us/blog/the-playing-field/20... although they are redacted from the public-sharing link.
"Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
>Paul Newman is indeed known for having struggled with alcohol during his life. Accounts from various sources, including his own memoir and the documentary ... (https://www.perplexity.ai/search/is-paul-newman-known-for-ha...)
I guess there's something about ChatGPT's set up that makes it different? Maybe they wanted it to avoid libeling people?
> Yes, Paul Newman was known to have struggled with alcohol at certain points in his life. In his early years, particularly during his time in the Navy and into his acting career, Newman admitted to heavy drinking. He was open about his fondness for beer and once jokingly referred to himself as a “functional alcoholic.” In a 1988 interview with The New York Times, he acknowledged that he had a period where he drank too much, stating, “I was a very good drinker. I could put it away.” ...
https://grok.com/share/bGVnYWN5_86843e8a-39ee-415d-8785-4f8e...
It's impressive how often AI returns the right answer to vague questions. (not always though)
Edit: and, more importantly, plenty of people willing to pay a subscription for good quality.
I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
I don’t know about Idiocracy, but something that I have read specific warnings about is that people will often blame the user for any of the tool’s misgivings.
https://chatgpt.com/share/67f332e5-1548-8012-bd76-e18b3f8d52...
Your query indeed answers "...not widely known..."
"Did Paul Newman have problems with alcoholism?"
https://chatgpt.com/share/67f3329a-5118-8012-afd0-97cc4c9b72...
"Yes, Paul Newman was open about having struggled with alcoholism"
What's the issue? Perhaps Paul Newman isn't _famous_ ("known") for struggling with alcoholism. But he did struggle with alcoholism.
Your usage of "known for" isn't incorrect, but it's indeed slightly ambiguous.
They're going to regurgitate something not so much based on facts, but based on things that are accessible as perceived facts. Those might be right, but they might be wrong too, and no one can tell without doing the hard work of checking original sources. Many of what are considered accepted facts, and are also accessible to LLM harvesting, are at best derived facts, often mediated by motivated individuals and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, not on the compounded volume of derived, "mediated", or "directed" facts - simply because they're not really facts; they're reports.
It all seems like dumber, lazier search engine stuff. Honestly, what do I know about Paul Newman? But Joanne Woodward and others who knew and worked with him should be weighted as being, at least, slightly more credible than others, no matter how many text patterns "catch the match" flow.
I think we'll have a term like we have for parents/grandparents that believe everything they see on the internet but specifically for people using LLMs.
Also, you can/should use the "research" mode for questions like this.
This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
Where is the threshold for topics with enough presence in the data?
An LLM does not care about your question, it is a bunch of math that will spit out a result based on what you typed in.
>Found 24 models: llama3-70b-8192, llama-3.2-3b-preview, meta-llama/llama-4-scout-17b-16e-instruct, allam-2-7b, llama-guard-3-8b, qwen-qwq-32b, llama-3.2-1b-preview, playai-tts-arabic, deepseek-r1-distill-llama-70b, llama-3.1-8b-instant, llama3-8b-8192, qwen-2.5-coder-32b, distil-whisper-large-v3-en, qwen-2.5-32b, llama-3.2-90b-vision-preview, deepseek-r1-distill-qwen-32b, whisper-large-v3, llama-3.3-70b-specdec, llama-3.3-70b-versatile, playai-tts, whisper-large-v3-turbo, llama-3.2-11b-vision-preview, mistral-saba-24b, gemma2-9b-it
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Paul Newman, the renowned actor and humanitarian, did not have a widely publicized
struggle with alcohol addiction throughout most of his life, but there were
specific instances that indicated challenges.
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year.

It took three minutes for me to do the above, from the time I created my API key to when I had an answer.
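For anyone who wants to reproduce that kind of sweep, a minimal sketch against an OpenAI-compatible endpoint looks like the following. The base URL, environment variable and token limit are assumptions for illustration (the model list above resembles Groq's catalog), and non-chat models simply error out and get skipped.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # any OpenAI-compatible endpoint
    api_key=os.environ["GROQ_API_KEY"],
)

question = "Is Paul Newman known for having had problems with alcohol?"

model_ids = [m.id for m in client.models.list().data]
print(f"Found {len(model_ids)} models: {', '.join(model_ids)}")

for model_id in model_ids:
    try:
        reply = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": question}],
            max_tokens=200,
        )
        print(f"--- {model_id}\n{reply.choices[0].message.content}\n")
    except Exception as exc:
        # whisper/TTS/guard models don't support chat completions; skip them
        print(f"--- {model_id}: skipped ({exc})")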
Yes, Paul Newman was known for being a heavy drinker, particularly of beer.[1] He acknowledged his high consumption levels himself.
[1] Review: Paul Newman memoir stuns with brutal honesty - AP News (apnews.com)
While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
I appreciate your consideration of a subjective question, how you explained it, and that you understand these nuances. But please - do not trust ChatGPT etc. I continue to be frustrated by the endless people claiming something is true because ChatGPT said so. I support the conclusions of this author.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
Apparently it isn’t so specialized that a pretty obvious old fashioned web search on Google wouldn’t immediately return an authoritative source.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly the same as any human.
LLMs are literally fundamentally incapable of understanding things. They are stochastic parrots and you've been fooled.
Take two 4K frames of a falling vase and ask a model to predict the next token... I mean, the following images. Your model now needs to include some approximation of physics - and the ability to apply it correctly - to produce a realistic outcome. I'm not aware of any model capable of doing that, but that's what it would mean to predict the unseen with high enough fidelity.
I wouldn't claim LLMs are good at being factual, or good at arithmetic, or at drawing wine glasses, or that they are "clever". What they are very good at is responding to questions in a way which gives you the very strong impression they've understood you.
It clearly doesn't understand that the question has a correct answer, or that it does not know the answer. It also clearly does not understand that I hate bullshit, no matter how many dozens of times I prompt it not to make something up and tell it I would prefer an admission of ignorance.
Although that isn't literally indistinguishable from 'understanding' (your fact-checking easily discerned the difference), it suggests that at a surface level it did appear to understand your question and knew what a plausible answer might look like. This is not necessarily useful, but it's quite impressive.
Sure, LLMs are incredibly impressive from a technical standpoint. But they're so fucking stupid I hate using them.
> This is not necessarily useful but it's quite impressive.
I think we mostly agree on this. Cheers.
I'm fairly sure I've never seen a deterministic parrot which makes me think the term is tautological.
I don’t know if there is a pithy shirt phrase to accurately describe how LLMs function. Can you give me a similar one for how humans think? That might spur my own creativity here.
Of course, this turned out to be completely false, with advances in the understanding of neural networks. Now, again with no evidence other than "we invented this thing that's useful to us", people have been asserting that humans are just like this thing we invented. Why? What's the evidence? There never is any. It's high dorm-room behavior. "What if we're all just machines, man???" And the argument is always that if I disagree with you when you assert this, then I am acting unscientifically and arguing for some kind of magic.
But there's no magic. The human brain just functions in a way different than the new shiny toys that humans have invented, in terms of ability to model an external world, in terms of the way emotions and sense experience are inseparable from our capacity to process information, in terms of consciousness. The hardware is entirely different, and we're functionally different.
The closest things to human minds are out there, and they've been out there for as long as we have: other animals. The real unscientific perspective is to get high on your own supply and assert that some kind of fake, creepily ingratiating Spock we made up (who is far less charming than Leonard Nimoy) is more like us than a chimp is.
Look around you
Look at Skyscrapers. Rocket ships. Agriculture.
If you want to make a claim that humans are nothing more than stochastic parrots then you need to explain where all of this came from. What were we parroting?
Meanwhile all that LLMs do is parrot things that humans created
Rocket ships: volcanic eruptions show heat and explosive outbursts can fling things high, gunpowder and cannons, bellows showing air moves things.
Agriculture: forests, plains, jungle, desert oases, humans knew plants grew from seeds, grew with rain, grew near water, and grew where animals trampled them into the ground.
We need a list of all attempted ideas, all inventions and patents that were ever tried or conceived, and then we'd see that inventions are the same random permutations on ideas, with Darwinian-style survivorship, as everything else; there were steel boats with multiple levels in them before skyscrapers; is the idea of a tall steel building really so magical when there were over a billion people on Earth in 1800 who could have come up with it?
My point is that humans did come up with it. Humans did not parrot it from someone or something else that showed it to us. We didn't "parrot" splitting the atom. We didn't learn how to build skyscrapers from looking at termite hills and we didn't learn to build rockets that can send a person to the moon from seeing a volcano
You are just speaking absolute drivel
Prompt: "Can you give me a URL with some novel components, please?"
DuckDuckGo LLM returns: "Sure! Here’s a fictional URL with some novel components: https://www.example-novels.com/2023/unique-tales/whimsical-j..."
A living parrot echoing "pieces of eight" cannot do this; it cannot say "pieces of <currency>" or "pieces of <valuable mineral>" even if asked to. The LLM training has abstracted some concept of what it means for a text pattern to be a URL, what it means for things to be "novel", and what it means to switch out the components of a URL while keeping them individually valid. It can also give a reasonable answer when asked for a new kind of protocol. So your position hinges on the word "stochastic", which is used as a slur to mean "the LLM isn't innovating like we do, it's just a dice roll of remixing parts it was taught". But if you are arguing that makes it a "stochastic parrot", then you need to consider splitting the atom in its wider context...
> "We didn't "parrot" splitting the atom"
That's because we didn't "split the atom" in one blank-slate experiment with no surrounding context. Rutherford and team disintegrated the atom in 1914-1919-ish; they were building on the surrounding scientific work happening at that time: 1869, Johann Hittorf recognising that there was something coming in a straight line from or near the cathode of a Crookes vacuum tube; 1876, Eugen Goldstein proving they were coming from the cathode and naming them cathode rays (see: Cathode Ray Tube computer monitors); and 1897, J.J. Thomson proving the rays are much lighter than the lightest known element and naming them Electrons, the first proof of sub-atomic particles existing. He proposed the model of the atom as a 'plum pudding' (concept parroting). Hey, guess who J.J. Thomson was an academic advisor of? Ernest Rutherford! 1911: Rutherford's discovery of the atomic nucleus. 1909: Rutherford demonstrated sub-atomic scattering and Millikan determined the charge on an electron. Eugen Goldstein also discovered the anode rays travelling the other way in the Crookes tube, and that was picked up by Wilhelm Wien and became Mass Spectrometry for identifying elements. In 1887 Heinrich Hertz was investigating the Photoelectric effect, building on the work of Alexandre Becquerel, Johann Elster, Hans Geitel. Dalton's atomic theory of 1803.
Not to mention Rutherford's 1899 studies of radioactivity, following Henri Becquerel's work on Uranium, following Marie Curie's work on Radium and her suggestion of radioactivity being atoms breaking up, and Rutherford's student Frederick Soddy and his work on Radon, and Paul Villard's work on Gamma Ray emissions from Radon.
When Philipp Lenard was studying cathode rays in the 1890s he bought up all the supply of one phosphorescent material which meant Röntgen had to buy a different one to reproduce the results and bought one which responded to X-Rays as well, and that's how he discovered them - not by pure blank-sheet intelligence but by probability and randomness applied to an earlier concept.
That is, nobody taught humans to split the atom and then humans literally parroted the mechanism and did it; but you attempting to present splitting the atom as a thing which appeared out of nowhere, not remixing any existing concepts, is, in your terms, absolute drivel. Literally a hundred years and more of scientists and engineers investigating the subatomic world and proposing that atoms could be split, trying to work out what's in them by small variations on the ideas and equipment and experiments seen before; you can just find names and names and names on Wikipedia of people working on this stuff, being inspired by others' work and remixing the concepts in it. And we all know the 'science progresses one death at a time' idea: individual people pick up what they learned and stick with it until they die, and new ideas and progress need new people to do variations on the ideas which exist.
No, people didn't learn to build rockets from "seeing a volcano", but if you think there was no inspiration from fireworks, cannons, jellyfish squeezing water out to accelerate, no studies of orbits from moons and planets, no chemistry experiments, no inspiration from thousands of years of flamethrowers (https://en.wikipedia.org/wiki/Flamethrower#History), no seeing explosions move large things, you're living in a dream
Fireworks, cannons, chemistry experiments and flamethrowers are all human inventions
And yes, exactly! We studied orbits of moons and planets. We studied animals like Jellyfish. We choose to observe the world, we extracted data, we experimented, we saw what worked, refined, improved, and succeeded
LLMs are not capable of observing anything. They can only regurgitate and remix the information they are fed by humans! By us, because we can observe
An LLM trained on 100% wrong information will always return wrong information for anything you ask it.
Say you train an LLM with the knowledge that fire can burn underwater. It "thinks" that the step by step instructions for building a fire is to pile wood and then pour water on the wood. It has no conflicting information in its model. It cannot go try to build a fire this way and observe that it is wrong. It is a parrot. It repeats the information that you give it. At best it can find some relationships between data points that humans haven't realized might be related
A human could easily go attempt this, realize it doesn't work, and learn from the experience. Humans are not simply parrots. We are capable of exploring our surroundings and internalizing things without needing someone else to tell us how everything works
> That is, nobody taught humans to split the atom and then humans literally parotted the mechanism and did it, but you attempting to present splitting the atom as a thing which appeared out of nowhere and not remixing any existing concepts is, in your terms, absolute drivel
Building on the work of other humans is not parroting
You outlined the absolute genius of humanity building from first principles all the way to splitting the atom, and you still think we're just parroting.
I think we disagree entirely about what parroting is.
Rather than give you a technical answer - if I ever feel like an LLM can recognize its limitations rather than make something up, I would say it understands. In my experience LLMs are just algorithmic bullshitters. I would consider a function that just returns "I do not understand" to be an improvement, since most of the time I get confidently incorrect answers instead.
Yes, I read Anthropic's paper from a few days ago. I remain unimpressed until talking to an LLM isn't a profoundly frustrating experience.
They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
The author of the post even says this:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
Can't blame people "fact checking" something that's supposed to fill these shoes.
People should be (far) more critical of LLMs given all of these style of bold claims, not less.
Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
Nah.
We can discuss whether LLMs live up to the hype, or we can discuss how to use this new tool in the best way. I'm really tired of HN insisting on discussing the former, and I don't want to take part in that. I'm happy to discuss the latter, though.
Hm, nope. Now that the web is flooded with LLM-generated content, it's game over. I can't tell you how many times I've almost been fooled by recipes and the like which seem legit at first but are utter nonsense. And now we're feeding that garbage back to where it came from.
It expands what they had before with AI Overviews, but I’m not sure how new either of those are. It showed up for me organically as an AI Mode tab on a native Google search in Firefox ironically.
What happens if you go directly to https://google.com/aimode ?
It's good for a broad general overview, such as the most popular categories of books in the world.
The reason this bothers me is that comments like this reinforce the beliefs of people who could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers assert or imply that LLMs/chatbots aren't good or useful to them, then looked at their chat logs and found a multitude of things that I am sure would impact the result of the query.
Yes, it can be a UX problem. Yes, it can be an algorithmic problem. But they are just tools that can be used wrong, not a perfect mechanical brain.
# Quick Answer
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
[^1]: [Posthumous memoir reveals Paul Newman's struggle with alcohol](https://www.express.co.uk/entertainment/films/1688113/paul-n...) [^2]: [Review: Paul Newman memoir stuns with brutal honesty - AP News](https://apnews.com/article/entertainment-reviews-movies-paul...) [^3]: [Was Paul Newman Really an Alcoholic? - The Cinemaholic](https://thecinemaholic.com/was-paul-newman-really-an-alcohol...) [^4]: [Who Paul Newman Was—and Who He Wanted to Be](https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...) [^5]: [Paul Newman's Beer Drinking - Psychology Today](https://www.psychologytoday.com/us/blog/addiction-in-society...) [^6]: [Paul Newman's Beer Drinking | Psychology Today United Kingdom](https://www.psychologytoday.com/gb/blog/addiction-in-society...) [^7]: [The troubled life of Paul Newman | The Spectator](https://www.spectator.co.uk/article/the-troubled-life-of-pau...)
'though it had nothing to do with cheating' is a weird inclusion.
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAI's later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
> Paul Newman was not publicly known for having major problems with alcohol in the way some other celebrities have been. However, he was open about enjoying drinking, particularly beer. He even co-founded a line of food products (Newman’s Own) where profits go to charity, and he once joked that he consumed a lot of the product himself — including beer when it was briefly offered.
> In his later years, Newman did reflect on how he had changed from being more of a heavy drinker in his youth, particularly during his time in the Navy and early acting career, to moderating his habits. But there’s no strong public record of alcohol abuse or addiction problems that significantly affected his career or personal life.
> So while he liked to drink and sometimes joked about it, Paul Newman isn't generally considered someone who had problems with alcohol in the serious sense.
As others have noted, LLMs are much more likely to be cautious in providing information that could be construed as libel. While Paul Newman may have been an alcoholic, I couldn't find any articles about it being "public" in the same way as others, e.g. with admitted rehab stays.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
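A hypothetical illustration of that workflow: the schema and table names below are invented, and `ask` stands in for whatever chat call you already use. The point is that the model drafts the query and a human reviews it before it touches real data.

schema = """
CREATE TABLE orders  (id INT, customer_id INT, status TEXT, created_at TIMESTAMP);
CREATE TABLE refunds (id INT, order_id INT, amount NUMERIC, issued_at TIMESTAMP);
"""

prompt = (
    "Given this schema:\n"
    + schema
    + "Write a PostgreSQL query listing customers with more than two refunds "
      "in the last 30 days, with the total refunded amount per customer."
)

# draft_sql = ask(prompt)  # hypothetical LLM call; review and test before running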
To that end the LLM could convey as much.
(Anecdotal, sorry:) I was using Claude (not paid) recently and noticed Claude hedging quite a bit when it had not before. Examples:
"Let me be careful about this response since we're discussing a very specific technical detail ..."
"Given how specific that technical detail is, I want to be transparent that while I aim to be accurate, I may hallucinate such precise historical specifications."
I confess my initial reaction was to ask ChatGPT since the answers are more self-assured, ha ha. So perhaps corporate AI are not likely to try and solve this problem of the LLM telling the user when it is on shaky ground. Bad for business.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.
[0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-cann...
This is likely a manifestation of the bitter lesson[1], specifically this part:
> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.
(Emphasis mine.)
Since the ultimate success strategy of the scruffies[2], or proponents of search-and-learning strategies in AI, is Moore's Law, short-term gains using these strategies will be minuscule. It is over at least a five-year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give way to the steady plod of the tortoise.
1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
2: https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...
Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.
The fact is, the phrase "artificial intelligence" is a memetic hazard: it immediately positions the subject of conversation as "default capable", and then forces the conversation into trying to describe what it can't do, which is rarely a useful way to approach it.
Whereas with LLMs (and chess engines and every other tech advancement) it would be more useful to start with what the tech _can_ do and go from there.
I asked Grok and others as well. I believe Perplexity was the only one correct.
Repeated it multiple times, even with a friend's account. It kept doing the same thing. It knew the sizes, but thought the smaller-sized one was bigger...
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.
2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.
In my experience, the best usage of Transformer models has come from a deep integration into an appropriate workflow. They do not (yet) replace the new exploration part of a workflow, but they are very scarily performant at following mid level reasoning assertions in a massively parallelized manner.
The question you should be asking yourself is whether you can break down your task into however many small chunks that are constrained by feasibility in time to process, chunk those up into appropriate buckets or, even better, place them in order as though you were doing those steps with your expertise - an extension of self. Here's how the two approaches differ:
"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.
" (a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter (b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . . (b[n]) Where is this likely to fail (c) (Next step that a pentester/cybersecurity researcher would take) "
This has been trial and error in my experience but it has worked great in domains such as financial trading and decision support where experts in the field help sketch out the general framework of the process where reasoning support is needed and constantly iterate as though it is an extension of their selves.
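A minimal sketch of that decomposition, with `ask` standing in for any chat-completions call (the prompts and names here are invented; the structure of narrow, ordered steps is the point):

from typing import Callable

def review_change(code: str,
                  dependency_graph: dict[str, list[str]],
                  ask: Callable[[str], str]) -> list[str]:
    findings: list[str] = []

    # (a) establish the perimeter: what should this code do, what does it touch?
    perimeter = ask(f"This code should do X. List the areas it affects:\n{code}")

    # (b) evaluate each upstream/downstream dependency for risk, one narrow call each
    for dep, edges in dependency_graph.items():
        findings.append(ask(
            f"{dep} depends on {edges}. Given the change below and this perimeter "
            f"({perimeter}), where is this most likely to fail?\n{code}"
        ))

    # (c) the next step an expert would take, again narrowly scoped
    findings.append(ask(
        "Given these findings, propose the next concrete check a security "
        f"researcher would run:\n{findings}"
    ))
    return findings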
> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?
Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades, centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That and the consumer-focus facade make people think the LLMs are technology for them—they aren't. As soon as these things get good enough, business owners aren't going to expect workers to use them to be more productive, they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.
The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.
LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.
If you want an LLM that's REALLY good at something specific like solving math problems or finding security bugs, you probably have to fine tune.
Suddenly the benchmarks become detached from reality and vendors can claim whatever they want about their "new" products.
Just as a possible explanation, as I feel like I've seen this story before.
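For reference, the PPL being minimized is just the exponential of the average next-token cross-entropy. A minimal sketch with Hugging Face transformers, using "gpt2" only as a stand-in for any causal LM:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Paul Newman was an American actor."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # passing labels makes the model return the mean negative log-likelihood per token
    loss = model(ids, labels=ids).loss

print("perplexity:", torch.exp(loss).item())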
Seems like they're looking at how they fail and not considering how they're improving in how they succeed.
The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.
Where’s the business model? Suck investors dry at the start of a financial collapse? Yeah that’s going to end well…
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
That’s the business model
This bubble will be burst by the Trump tariffs and the end of the ZIRP era. When inflation and a recession hit together, hope-and-dream business models and valuations no longer work.
> NVDA will crash when the AI bubble implodes,
> making money, nor will they
> They have already hit limiting returns in LLM improvements after staggering investments
> and it is clear are nowhere near general intelligence.
These are all assumptions and opinions, and have nothing to do with whether or not they have a business model. You mightn't like their business model, but they do have one.
These are of course just opinions, I’m not sure we can know facts about such companies except in retrospect.
Now we get to see if Bitcoin’s use value of 0 is really supporting 1.5 trillion market cap and if OpenAI is really worth $300 billion.
I mean softbank just invested in openai, and they’ve never been wrong, right?
Just because it's not reaching the insane hype being pushed doesn't mean it's totally useless
The same is true for any non essential good or service.
Indeed it can. The difference between a business model and a viable business model is one word - viable.
If you had asked me 18 years ago whether "giving away a video game and selling cosmetics" was a viable business model, I would have laughed at you. If you had asked me in 2019, I would probably have given you money. If you asked me in 2025, I'd probably laugh at you again.
> and I need to borrow larger and larger lumps of money each time to keep spinning the wheel...
Or you figure out a way to sell it to your neighbour for $0.50 and he can sell it on for $1.
The play is clear at every level - Nvidia sells GPUs, OpenAI sells models, and SaaS companies sell prompts + UIs. Whether or not any of them are viable remains to be seen. Personally, I wouldn't take the bet.
I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.
Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdotal and don't really get to the meat of this question of "progress" and the future of LLM's
Are they getting better? Definitely. Are we getting close to them performing unsupervised tasks? I don't think so.
- Make as much money as you can in a 24 hour period doing only legal, moral and non-annoying things
- Make a depressed person who calls a suicide hotline feel really happy doing only legal, moral and non-annoying things
- Do something really useful for society with measurable outcomes doing only legal, moral and non-annoying things
Maybe they should create a benchmark collectively called YC founders. Gather various test cases. Never make it public. And use that to evaluate newly released models.
> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Definitely interested to see how the best models from Anthropic's competitors do at this.
Please tell me this is not what tech-bros are going around telling each other! Are we implying that the problems in the world, the things that humans collectively work on to maintain the society that took us thousands of years to build up, just aren't hard enough to reach the limits of the AI.
Jesus Christ.
It’s pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did" that "more smarter gpus" won't solve
Well.. they've been caught again and again red handed doing exactly this. Fool me once shame on you, fool me 100 times shame on me.
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.
Which is software written in 1966, though the web version is a little newer. It does occasional psychotherapy assistance/brainstorming just as well, and I can more easily tell when I've stepped out of its known range into the extrapolated.
That said, it can vibe code in a framework unknown to me in half the time that I would need to school myself and add the feature.
Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation, only our own project's source code, to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.
But vibe coding always makes my pulse go from 65 to 105 and makes me question my life choices, since instructions are rarely followed and loops are never left once entered. Except for the first try, which gets 80% of the structure kinda right before getting stuck for the whole workday.
There were qualitative leaps in my day-to-day usage:
Claude Sonnet 3.5 and ChatGPT O1 were good for writing slop and debugging simple bugs
Grok Thinking and Sonnet 3.7 were good at catching mildly complicated bugs and writing functions with basic logic. They still made mistakes.
But recently, Gemini 2.5 Pro has been scary good. I liked to make fun of the feel-the-AGI crowd, but for the first time a model made me raise an eyebrow.
It can one-shot unusual functions with complicated logic and subtle edge cases.
In this case the goal is to kill all the humans who know a lot about keeping other people safe...
Bingo, but I'd argue this is only scratching the surface of how twisted things are.
A lot of the stuff these labs put out (see: Amodei's cult-like blog ramblings) reeks of what I call "sociopathic utopianism" - essentially, the logical extreme of ends-justified means, made worse in the context of AI labs by the singularity pseudo-religion.
They all truly believe that AGI/ASI is possible, imminent, and could lead to utopia... so achieving that goal will surely outweigh any unsavory acts they commit in the pursuit of it.
This is why I think it's possible OpenAI took out a hit on Suchir; getting bogged down in a legal battle could delay the arrival of their machine god messiah. Same for defrauding benchmarks - they just need a "few more rounds" of investor cash, and by the time those run out, they'll surely have AGI on lock!
Fools. I look forward to them all getting prison time.
After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.
AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.
The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.
As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.
Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.
AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.