For my part, I'd give 80% confidence that LLMs will be able to do this within two years, without fundamental architectural changes.
By "large" I mean 300K lines (strong prediction), or 10 times the context window (weaker prediction).
I don't shy away from looking stupid in the future, you've got to give me this much.
(For what it's worth, I respect and greatly appreciate your willingness to put out a prediction based on real evidence and your own reasoning. But I think you must be lacking experience with the latest tools & best practices.)
Indeed I have no experience with Claude Code, but I use Claude via chat, and it fails all the time on things not remotely as hard as orientation in a large code base. Claude Code is the same thing with the ability to run tools. Of course tools help to ground its iterations in reality, but I don't think it's a panacea absent a consistent ability to model the reality you observe thru your use of tools. Let's see...
That being said, I think your analysis is 100% correct. LLMs are fundamentally stupid beyond belief :P
Where is the corpus of SwiftUI animations to train Claude what probable soup you probably want regurgitated?
Hypothesis: iOS devs don't share their work openly for reasons associated with how the App Store ecosystem (mis)behaves.
Relatedly, the models don't know about Swift 6 except from maybe mid-2024 WWDC announcements. It's worth feeding them your own context. If you are 5.10, great. If you want to ship iOS 26 changes, wait till 2026 or again, roll your own context.
These are not even remotely similar, despite the name. Things are moving very fast, and the sort of chat-based interface that you describe in your article is already obsolete.
Claude is the LLM model. Claude Code is a combination of internal tools for the agent to track its goals, current state, priorities, etc., and a looped mechanism for keeping it on track, focused, and debugging its own actions. With the proper subagents it can keep its context from being poisoned from false starts, and its built-in todo system keeps it on task.
Really, try it out and see for yourself. It doesn't work magic out of the box, and absolutely needs some hand-holding to get it to work well, but that's only because it is so new. The next generation of tooling will have these subagent definitions auto selected and included in context so you can hit the ground running.
We are already starting to see a flood of software coming out with very few active coders on the team, as you can see on the HN front page. I say "very few active coders" not "no programmers" because using Claude Code effectively still requires domain expertise as we work out the bugs in agent orchestration. But once that is done, there aren't any obvious remaining stumbling blocks to a PM running a no-coder, all-AI product team.
It's obvious LLMs can't do the job without these external tools, so the claim above - that LLMs can't do this job - is on firm ground.
But it's also obvious these hybrid systems will become more and more complex and capable over time, and there's a possibility they will be able to replace humans at every level of the stack, from junior to CEO.
If that happens, it's inevitable these domain-specific systems will be networked into a kind of interhybrid AGI, where you can ask for specific outputs, and if the domain has been automated you'll be guided to what you want.
It's still a hybrid architecture though. LLMs on their own aren't going to make this work.
It's also short of AGI, never mind ASI, because AGI requires a system that would create high quality domain-specific systems from scratch given a domain to automate.
Whether you draw the system boundary of an LLM to include the tools it calls or not is a rather arbitrary distinction, and not very interesting.
This isn't being pedantic, it's deliberately misinterpreting a commonly used term by taking every word literally for effect. Terms, like words, can take on a meaning that is distinct from looking at each constituent part and coming up with your interpretation of a literal definition based on those parts.
https://www-formal.stanford.edu/jmc/generality.pdf
Or look at the old / early AGI conference series:
Or read any old, pre-2009 (ImageNet) AI textbook. It will talk about "narrow intelligence" vs "general intelligence," a dichotomy that exists more in GOFAI than the deep learning approaches.
Maybe I'm a curmudgeon and this is entering get-off-my-lawn territory, but I find it immensely annoying when existing clear terminology (AGI vs ASI, strong vs weak, narrow vs. general) is superseded by a confused mix of popular meanings that lack any clear definition.
I looked at the AGI conference page for 2009: https://agi-conference.org/2009/
When it uses the term "artificial general intelligence", it hyperlinks to this page: http://www.agiri.org/wiki/index.php?title=Artificial_General...
Which seems unavailable, so here is an archive from 2007: https://web.archive.org/web/20070106033535/http://www.agiri....
And that page says "In Nov. 1997, the term Artificial General Intelligence was first coined by Mark Avrum Gubrud in the abstract for his paper Nanotechnology and International Security". And here is that paper: https://web.archive.org/web/20070205153112/http://www.foresi...
That paper says: "By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed."
I think that your insisting that AGI means something different than what everyone else means when they say it is not useful, and will only lead to people getting confused and disagreeing with you. I agree that it's not a great term.
Goalposts are moving though. Through the efforts of various people in the rationalist-connected space, the word has since morphed to be implicitly synonymous with the notion of superintelligence and self-improvement, hence the vague and conflicting definitions people now ascribe to it.
Also, fwiw the training process behind the generation of an LLM is absolutely able to discover new and novel ideas, in the same sense that Kepler’s laws of planetary motion were new and novel if all you had were Tycho Brahe’s astronomical observations. Inference can tease out these novel discoveries, if nothing else. But I also suspect that your definition of creative and novel would exclude human creativity if it were rigorously applied; our brains, after all, are merely remixing our own experiences too.
Monorepos are large but the projects inside may, individually, not be that complex. So there are ways of making LLMs work well with monorepos (e.g., providing a top-level index of what's inside, how to find projects, and explaining how the repo is set up). Complexity within an individual project is something current-gen SOTA LLMs (I'm counting Sonnet 4, Opus 4.1, Gemini 2.5 Pro, and GPT-5 here) really suck at handling.
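To make the "top-level index" idea a bit more concrete, here is a minimal sketch of how such an index could be generated. Everything in it is an assumption for illustration: the marker file names, the function names `build_index` and `render_index`, and the output format are hypothetical and not taken from any particular tool.

```python
import os

# Hypothetical marker files that signal "a project lives here" (an assumption;
# adjust to whatever build systems the repo actually uses).
PROJECT_MARKERS = {"package.json", "pyproject.toml", "Cargo.toml", "go.mod"}


def build_index(repo_root):
    """Return sorted (relative_path, marker) pairs, one per project directory."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune hidden directories such as .git so they are never descended into.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        found = sorted(PROJECT_MARKERS.intersection(filenames))
        if found:
            rel = os.path.relpath(dirpath, repo_root)
            entries.append((rel.replace(os.sep, "/"), found[0]))
    return sorted(entries)


def render_index(entries):
    """Format the entries as a short plain-text list to paste into context."""
    lines = ["Projects in this repo:"]
    lines += [f"- {path} ({marker})" for path, marker in entries]
    return "\n".join(lines)
```

The rendered text could be kept in a file at the repo root and handed to the model up front. It only covers the "what's inside and where" part; the harder "why is it designed this way" part still has to be written by someone who knows the codebase.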
Sure, you can assign discrete little tasks here and there. But bigger efforts that require not only understanding how the codebase is designed but also why it's designed that way fall short. Even more so if you need them to make good architectural decisions on something that's not "cookie cutter".
Fundamentally, I've noticed the chasm between those that are hyper-confident LLMs will "get there soon" and those that are experienced but doubtful depends on the type of development you do. "ticket pulling" type work generally has the work scoped well enough that an LLM might seem near-autonomous. More abstract/complex backend/infra/research work not so much. Still value there, sure. But hardly autonomous.
This seems quite doable with even a small amount of tooling around Claude Code, even though I agree it doesn't have this capability out of the box. I think a large part of this gulf is "it doesn't work out of the box" vs "it can be made to work with a little customization."
Amazingly-written software is necessary for LLMs to work well, but it isn't sufficient: LLMs tend to make nonsensical changes that, while technically implementing what they're asked to do (much of the time), reduce the quality of the software. As this repeats, the LLMs become less and less able to modify the program. This is because they can't program: they can translate, plagiarise, and interpolate, but they're missing several key programming skills, and probably cannot learn them.
I work on codebases that you could describe as 'large', and you could describe some of the LLM driven work being done on them as 'autonomous' today.
time to prove hypothesis: infinity years
Similarly, Newton's laws say that bodies always stay at rest unless acted upon by a force. Strictly speaking, if a billiard ball jumps up without cause tomorrow that would disprove Newton. So we'd have to wait an infinite amount of time to prove Newton right.
However no one has to wait so long, and we found ways to express how Newton's ideas are _better_ than those of Aristotle without waiting an eternity.
Absolutely nothing about that statement is concrete or falsifiable.
Hell, you can already deal with large code bases 'autonomously' without LLMs - grep and find and sed go a long way!
LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.
My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.
Let's see how my predictions hold up; I have made enough to look very wrong if they don't.
Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And I have much better examples than the 2+2=4, which I am citing as something that sorta works these days.
However, I'm not completely sure. Eg object oriented programming was basically a useless fad full of empty, never-delivered-on promises, but software companies still lapped it up. (If you happen to like OOP, you can probably substitute your own favourite software or wider management fad.)
Another objection: even an LLM with limited capabilities and glaring flaws can still be useful for some commercial use-cases. Eg the job of first line call centre agents that aren't allowed to deviate from a fixed script can be reasonably automated with even a fairly bad LLM.
Will it suck occasionally? Of course! But so does interacting with the humans placed into these positions without authority to get anything done for you. So if the bad LLM is cheaper, it might be worthwhile.
For my part I don’t really have a lot of doubts that coding agents can be a useful productivity boost on real-world tasks. Setting aside personal experience, I’ve talked to enough developers at my company using them for a range of tickets on a large codebase to know that they are. The question is more, how much: are we talking a 20% boost, or something larger, and also, what are the specific tasks they’re most useful on. I do hope in the next few years we can get some systematic answers to that as an industry, that go beyond people asking LLMs random things and trying to reason about AI capabilities from first principles.
Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.
Well, it depends a bit on what you mean by blunders. But eg I've seen it confidently assert mathematically wrong statements with nonsense proofs, instead of admitting that it doesn't know.
I think my biggest complaint is that the essay points out flaws in LLM’s world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.
A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the AlphaZero chess engine was trained, in fact.
I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:
* LLMs have imperfect models of the world, which are conditioned by how they’re trained on next token prediction.
* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.
* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)
* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.
I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.
Anyway thanks for writing this and responding!
I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math.)
Which says to me there are two camps on this and the verdict is still out on this and all related questions.
I think "compel" is such a uniquely human trait that machines will never replicate it to a T.
The article did mention specifically about this very issue:
"And of course people can be like that, too - eg much better at the big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million dollar bonus for getting it right, they will do well enough on the job, too. And with 200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform."
It should already be evident that an LLM is, by definition, a limited stochastic AI tool, and that its distant cousins are the deterministic approaches: logic, optimization and constraint programming [1],[2],[3]. Perhaps one of the two breakthroughs that the author was predicting will come from this deterministic domain in order to assist LLMs, and it will be a hybrid approach rather than a purely LLM one.
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
https://www.youtube.com/live/TknN8fCQvRk
[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:
https://youtube.com/watch?v=HB5TrK7A4pI
[3] Google OR-Tools:
https://developers.google.com/optimization
[4] MiniZinc:
If you ask an expert, they know the bounds of their knowledge and can understand questions asked to them in multiple ways. If they don’t know the answer, they could point to someone who does or just say “we don’t know”.
LLMs just lie to you and we call it “hallucinating“ as though they will eventually get it right when the drugs wear off.
Why? A bunch of human workers can get a lot more done with a capable leader who helps prompt them in the right direction and corrects oversights etc.
And overall, prompt engineering seems like exactly the kind of skill AI will be able to develop by itself. You already have a bit like this happening: when you ask Gemini to create a picture for you, then the language part of Gemini will take your request and engineer a prompt for the picture part of Gemini.
There are 2 AI conversations on HN occurring simultaneously.
Convo A: Is it actually reasoning? does it have a world model? etc..
Convo B: Is it good enough right now? (for X, Y, or Z workflow)
It's closer to AlphaGo, which first trained on expert human games and then 'fine tuned' with self-play.
AlphaZero specifically did not use human training data at all.
I am waiting for an AlphaZero style general AI. ('General' not in the AGI sense but in the ChatGPT sense of something you can throw general problems at and it will give it a good go, but not necessarily at human level, yet.) I just don't want to call it an LLM, because it wouldn't necessarily be trained on language.
What I have in mind is something that first solves lots and lots of problems, eg logic problems, formally posed programming problems, computer games, predicting of next frames in a web cam video, economic time series, whatever, as a sort-of pre-training step and then later perhaps you feed it a relatively small amount of human readable text and speech so you can talk to it.
Just to be clear: this is not meant as a suggestion for how to successfully train an AI. I'm just curious whether it would work at all and how well / how badly.
Presumably there's a reason why all SOTA models go 'predict human produced text first, then learn problem solving afterwards'.
> I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. [...]
Yes, I agree. But 'on-the-job' training is also such an obvious idea that plenty of people are working on making it work.
It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.
In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/
LLMs are able to encode geospatial relationships because they can be represented well by token relationships. Two countries that are close together will be talked about together much more often than two countries far from each other.
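The co-occurrence intuition can be illustrated with a toy count. This is only a sketch of the claim in the comment above, not of how GeoLLM or any real model works; the three-sentence corpus, the country set, and the function name `cooccurrence` are all made up for the example.

```python
from collections import Counter
from itertools import combinations

# Made-up toy corpus (an assumption for illustration, not real data).
TOY_CORPUS = [
    "France and Spain share a border in the Pyrenees",
    "Trains run from Spain to France every day",
    "France and Japan signed a trade agreement",
]
COUNTRIES = {"France", "Spain", "Japan"}


def cooccurrence(sentences, names):
    """Count how often each pair of names appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        present = sorted(names.intersection(sentence.split()))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


counts = cooccurrence(TOY_CORPUS, COUNTRIES)
# In this toy corpus the neighbouring pair is mentioned together more often
# than the distant pair.
print(counts[("France", "Spain")], counts[("France", "Japan")])
```

A model trained on text sees statistics of roughly this kind at a vastly larger scale, which is the sense in which proximity could leak into token relationships.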
I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.
Your argument is based on a flawed assumption, that they can't see letters. If they couldn't, they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.
LLMs don't ingest text a character at a time. The difficulty with analyzing individual letters just reflects that they don't directly "see" letters in their tokenized input.
A direct comparison would be asking someone how many convex Bézier curves are in the spoken word "monopoly".
Or how many red pixels are in a visible icon.
We could work out answers to both. But they won't come to us one-shot or accurately, without specific practice.
Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.
Other writing systems come with "tokenization" built in making it still a live issue. Think of answering:
1. How many n's are in 日本?
2. How many ん's are in 日本?
(Answers are 2 and 1.)
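The two answers come from counting in different representations of the same word, which a small sketch makes explicit. The lookup table below is hand-written for this one word (using the reading にほん, romanized "nihon"); the names `READINGS` and `count_in` are hypothetical.

```python
# Hand-written readings for a single word (illustrative, not a real dictionary).
READINGS = {
    "日本": {"kana": "にほん", "romaji": "nihon"},
}


def count_in(word, symbol, script):
    """Count a symbol in the chosen reading of the word, not in the word itself."""
    return READINGS[word][script].count(symbol)


print(count_in("日本", "n", "romaji"))  # counts in "nihon"
print(count_in("日本", "ん", "kana"))   # counts in "にほん"
print("日本".count("n"))                # the written form contains neither symbol
```

Neither symbol appears in the kanji as written, so answering requires first converting to another representation, which is the analogy being drawn with tokens.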
Is this a real defect, or some historical thing?
I just asked GPT-5:
How many "B"s in "blueberry"?
and it replied: There are 2 — the letter b appears twice in "blueberry".
I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.
On a trillion dollar budget, you could just crawl the web for AI tests people came up with and solve them manually. We know it's a massively curated game. With that kind of money you can do a lot of things. You could feed every human on earth countless blueberries for starters.
Calling an algorithm to count letters in a word isn’t exactly worth the hype tho is it?
The point is, we tend to find new ways these LLMs can’t figure out the most basic shit about the world. Horses can count. Counting is in everything. If you read every text ever written and still can’t grasp counting you simply are not that smart.
https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...
Perhaps they have a hot fix that special cases HN complaints?
My prediction is that this will be like the 2000 dot com bubble. Both dot com and AI are real and really useful technologies, but hype and share prices have got way ahead of them, so they will need to readjust.
Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.
It depends on context. English is often not very precise and relies on implied context clues. And that's good. It makes communication more efficient in general.
To spell it out: in this case I suspect you are talking about English letter case? Most people don't care about case when they ask these questions, especially in an informal question.
https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...
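The case point is easy to show with a plain letter counter. This is a minimal sketch; the function name `count_letter` and its `ignore_case` flag are made up for the example.

```python
def count_letter(word, letter, ignore_case=True):
    """Count occurrences of a letter, optionally ignoring case."""
    if ignore_case:
        # casefold() is the standard way to normalize case for comparison.
        return word.casefold().count(letter.casefold())
    return word.count(letter)


# The question from the thread: how many "B"s in "blueberry"?
print(count_letter("blueberry", "B"))                     # informal reading
print(count_letter("blueberry", "B", ignore_case=False))  # literal reading
```

Under the literal, case-sensitive reading the count differs from the informal one, so the "right" answer depends on which reading the asker had in mind.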
Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.
What are you basing this certainty on?
And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.
It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.
People make up new questions for each IMO.
OpenAI got flamed over announcing their results before the embargo was up:
IMO had asked companies to wait at least a week or so after the human winners were announced to announce the AI results. OpenAI did not wait.
Sure, the questions were probably in a similar genre as existing questions or required similar techniques that could be found in solutions that are out there. So what? You still need some kind of world model of mathematics in which to understand the new problem and apply the different techniques to solve it.
Are you really claiming that SOTA LLMs don’t have any world model of mathematics at all? If so, can you tell us what sort of example would convince you otherwise? (Note that the ability to do novel mathematics research is setting the bar too high, because many capable mathematics majors never get to that point, and they clearly have a reasonable model of mathematics in their heads.)
And doing well on XYZ isn't evidence of a world model in particular. The point that these things aren't always using a world model is reinforced by systems being easily confused by extraneous information, even systems as sophisticated as those that can solve Math Olympiad questions. The literature has said "ad-hoc predictors" for a long time and I don't think much has changed - except things do better on benchmarks.
And, humans too can act without a consistent world model.
https://www.lesswrong.com/posts/yzGDwpRBx6TEcdeA5/a-chess-gp...
First, chess is perfect for such modeling. The game is basically a tree of legal moves. The "world model" representation is already encoded in the dataset itself, and at a certain scale the chance of making an illegal move is minimal, as the dataset itself includes an insane amount of legal moves compared to illegal moves, let alone when you are training it on a chess dataset like a PGN one.
Second, the probing is quite... a subjective thing.
We are cherry-picking activations across an arbitrary number of dimensions, on a model specifically trained for chess, taking these arbitrary representations and displaying them on a 2D graph.
Well yeah, with enough dimensions and cherry-picking, we can also show how "all zebras are elephants, because all elephants are horses and look their weights overlap in so many dimensions - large four-legged animals you see on safari!" - especially if we cherry-pick it. Especially if we tune a dataset on it.
This shows nothing other than "training LLMs on a constrained move dataset makes LLM great at predicting next move in that dataset".
From your link: "...The first was gpt-3.5-turbo-instruct's ability to play chess at 1800 Elo"
These things don't play at 1800 Elo, though maybe someone measured this Elo without cheating but rather relying on some artifacts of how an engine told to play at a low rating does against an LLM (engines are weird when you ask them to play badly, as a rule); a good start to a decent measurement would be to try it on Chess960. These things do lose track of the pieces in 10 moves. (As do I absent a board to look at, but I understand enough to say "I can't play blindfold chess, let's set things up so I can look at the current position somehow".)
Why are you saying 'these things'? That statement is about a specific model which did play at that level and did not lose track of the pieces. There's no cheating or weirdness.
I don't believe that it is a fundamentally incorrect approach. I believe that the human mind does something like that all the time; the difference is our minds have some additional processes that can, for example, filter hallucinations.
Kids in a specific age range are afraid of their imagination. Their imagination can place a monster into any dark place where nothing can be seen. An adult mind can do the same easily, but the difference is that kids have difficulty distinguishing imagination from perception, while adults generally manage.
I believe the ability of the human mind to tell imagination/hallucinations on the one hand from perception and memory on the other is not a fundamental thing stemming from the architecture of brains but a learned skill. Moreover, people can be tricked into acquiring false memories[1]. If an LLM fell for the tricks of Elizabeth Loftus, we'd say the LLM hallucinated.
What LLMs need is to learn some tricks to detect hallucinations. Probably they will not get 100% reliable detector, but to get to the level of humans they don't need 100% reliability.
When the child is afraid of the monster in the dark, they are not literally visually hallucinating a beast in the dark; they are worried that there could be a beast in the dark, and they are not sure that there is due to a lack of sensory information confirming a lack of the monster. They are not being hyper precise because they are 3, so they say "there is a monster under my bed"! Children have instincts to be afraid of the dark.
Similarly with imaginary friends and play, it's an instinct to practice through smaller stakes simulations. When they are emotionally attached to their imaginary friends, it's much like they are emotionally attached to their security blanket. They know that the "friend" is not perceptible.
It's much like the projected anxieties of adults or teenagers, who are worried that everyone thinks they are super lame and thus act like people do, because on the balance of no information, they choose the "safer path".
That is pretty different than the hallucinations of LLMs IMO.
The simplest one is fight/flight/freeze. The brain starts the process by being afraid, and hormones get released, but the next step is triggered by the nerve feedback coming from the body. If you are using beta-blockers and can't get panicked, the initial trigger fizzles and you return to your pre-panic state.
An LLM doesn't model a complete body. It just models the language. That's just a very small part of what the brain handles, so assuming that modelling the language, or even the whole brain, is going to answer all the questions we have is a flawed approach.
Latest research shows body is a much more complicated and interconnected system than we learnt in school 30 years ago.
The fundamental foundation of science and engineering is reliability.
If you start saying reliability doesn't matter, you're not doing science and engineering any more.
> The point is that knowledge/language can't work reliably unless it's grounded in something outside of itself.
Just, what? Knowledge is facts, somehow held within a system allowing recall and usage of those facts. Knowledge doesn't have a 'self', and I'm totally not understanding how pure knowledge as a concept or medium needs "grounding"?
Being charitable, it sounds more like you're trying to describe "wisdom" - which might be considered as a combination of knowledge, lived experience, and good judgement? Yes, this is valuable in applying knowledge more usefully, but has nothing to do with the other bodily systems which interact with the brain, which is where you started?
> The fundamental foundation of science and engineering is reliability.
> If you start saying reliability doesn't matter, you're not doing science and engineering any more.
No-one mentioned reliability - not you in your original post, or me in my reply. We were discussing whether the various (unconscious) systems which link to the brain in the human body (like the gut:brain axis) might influence its knowledge/language/interpretation abilities.
Can we repeat the feat of Archimedes? Yes, we can, but first we'd have to forget what we were told and taught.
The way we actually discover things is very different from amassing lots of hearsay. Indeed, we do have an internal part that behaves the same way an LLM does. But to get to real understanding we actually shut down that part, forget what we "know", and start from a clean slate. That part does not help us think; it helps us to avoid thinking. The reason it exists is that it is useful: thinking is hard and slow, but recalling is easy and fast. But it is not thinking; it is the opposite.
Close, but not exactly. To start from a clean slate is not very difficult, the trick is to reject some chosen parts of existing knowledge, or more specifically the difficulty is to choose what to reject. Starting from a clean slate you'll end up spending millennia to get the knowledge you've just rejected.
So the overall process of generating knowledge is to look under the streetlight till finding something new becomes impossible or too hard, and then you start experimenting with rejecting some bits of your knowledge to rethink them. I was taught to read works of Great Masters of the past critically, trying to reproduce their path while looking for forks where you can try to go the other way. It is a little bit like starting from a clean slate, but not exactly.
She's strongly oversold how and when false memories can be created. She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.
She's spent a career over-generalizing data about implanting false minor memories to make money discrediting victims' traumatic memories and defend abusers.
You conflate "hallucination" with "imagination" but the former has much more in common with lying than it does with imagining.
Did she have financial incentives? Was this a live demonstration? :P
Absolutely not. Human brains have online one-shot training. LLM weights are fixed and fine-tuning them is a huge multi-year enterprise.
Fundamentally it's two completely different architectures.
I often even describe the results, e.g. "this fails in X manner when the image has grainy regions", and it figures out what is going on and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)
And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.
Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what exactly is in the training sets. I don’t know either. They don’t tell ;)
> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
We tend to overestimate the novelty of our own work and our methods and, at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn’t mean that what you are doing specifically has been done in this exact way before; rather, the patterns adapted and the approach may not be one of a kind.
> is a purely philosophical question
It is indeed. A question we need to ask ourselves.
If LLMs are stochastic parrots, but also we’re just stochastic parrots, then what does it matter? That would mean that LLMs are in fact useful for many things (which is what I care about far more than any abstract discussion of free will).
I have never understood the stochastic parrot interpretation. LLMs (and general deep learning models) are not statistical/stochastic based models. Statistics trivially apply, as they apply to all measurements of judge-able behavior. But the models do not perform statistical operations, nor do their architectures form tunable statistically driven systems.
They learn topological representations of relationships. Entirely different from statistics/stochastics.
--
Within their "style" of cognition, LLMs are very creative. They readily propose solutions to problems involving uncommon or unique combinations of disparate topics.
Coming up with artificial examples is easy (and they come up naturally for me all the time).
I think the best characterization of LLM knowledge, reasoning and creativity is: extremely wide (in ability to weave topics and communication constraints - one shot), but somewhat shallow (not being able to reason too deep.)
Within those bounds, they far far exceed human capabilities.
0(?): there’s no provided definition of what a ‘world model’ is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model, right?
1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to give a SOTA model an image of a chess board, or the notation, and ask it about the position. It might not give you GM-level analysis but it definitely has a model of what’s going on.
2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.
You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!
The examples are from all the major commercial American LLMs as listed in a sister comment.
You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.
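To make "track the pieces" concrete, here is a minimal sketch of the bookkeeping involved. It assumes moves arrive in coordinate form like `e2e4` (an assumption for illustration, not something from the article), and it deliberately ignores legality, castling, en passant and promotion. The function names are hypothetical.

```python
FILES = "abcdefgh"
BACK_RANK = "RNBQKBNR"


def starting_position():
    """Return the standard start position as a square -> piece dict.

    White pieces are uppercase, black pieces lowercase.
    """
    board = {}
    for file, piece in zip(FILES, BACK_RANK):
        board[f"{file}1"] = piece
        board[f"{file}2"] = "P"
        board[f"{file}7"] = "p"
        board[f"{file}8"] = piece.lower()
    return board


def apply_move(board, move):
    """Move whatever sits on the source square to the destination square.

    No legality checking: this is only state tracking. A capture simply
    overwrites the piece on the destination square.
    """
    src, dst = move[:2], move[2:4]
    if src not in board:
        raise ValueError(f"no piece on {src}")
    board[dst] = board.pop(src)


def track(moves):
    """Replay a list of coordinate moves from the start position."""
    board = starting_position()
    for move in moves:
        apply_move(board, move)
    return board


if __name__ == "__main__":
    position = track(["e2e4", "e7e5", "g1f3", "b8c6"])
    print(position["e4"], position["e5"], position["f3"], position["c6"])
```

The state is a few dozen key/value pairs, so ten moves of history is a small amount of information compared to any modern context window.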
https://arxiv.org/abs/2501.17186
PS "Major commercial American LLM" is not very meaningful, you could be using GPT4o with that description.
That means it does not build a mirror of a system based on its interactions.
It just outputs fragments of the world models it was built on and tries to give you a string of fragments that should match the fragment of your world model that you provided through some input method.
It cannot abstract the code base fragments you share; it cannot extend them with details using a model of the whole project.
> but because I know you and I get by with less.
Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.
> Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.
Which I find important because, well, hallucinating facts is what you would expect from an LLM, but isn't necessarily an inherent issue with machine intelligence writ large if it's trained from the ground up on different principles, or modelling something else. We use LLMs as a stand-in for tutors because being really good at language incidentally makes them able to explain math or history as a side effect.
Importantly it doesn't show that hallucinating is a baked in problem for AI writ large. Presumably different models will have different kinds of systemic errors based on their respective designs.
When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.
Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.
But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.
In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.
In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.
This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their mind, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, an LLM is fantastic as it just glues together plausible related words for you to examine.
The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.
I've graded many exams in my university days (and set some myself), and it's exceedingly obvious that that's what many students are doing. I do wonder though how often they manage to fly under the radar. I'm sure it happens, as you described.
(This is also the reason why I strongly believe that in exams where students write free-form answers, points should be subtracted for incorrect statements even if a correct solution is somewhere in the word salad.)
Even when it was right the first time!
https://deepmind.google/discover/blog/genie-3-a-new-frontier...
We also have multimodal AIs that can do both language and video. Genie 3 made multimodal with language might be pretty impressive.
Focusing only on what pure language models can do is a bit of a straw man at this point.
Do: use LLMs to talk shit to you while a real chess AI plays chess against you.
The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.
Why would anyone choose to awkwardly play using natural language rather than a reliable, fast and intuitive UI?
I think this is quite an amusing idea, as the LLM would see the moves the chess engine made and comment along the lines of "wow, I didn't see that one coming!" very Roger Sperry.
How does the presence/absence of a world model, er, blend into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have it because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.
The current models lack "remember/use/update" parts.
Yeah, they seem to be subject to the universal approximation theorem (it needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayered network).
That is, at a certain size they can do anything a human can do at a certain point in their life (that is, with no additional training), regardless of whether humans have world models and what those models are on the neuronal level.
But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.
However:
"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches 99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves, reinforcing that continuous error correction and learning the correct moves significantly improve ELO"
You should be able to reach the move legality of around 100% with few resources spent on it. Failing to do so means that it has not learned a model of what chess is, at some basic level. There is virtually no challenge in making legal moves.
I'm not sure about this. Among a standard amateur set of chess players, how often when they lack any kind of guidance from a computer do they attempt to make a move that is illegal? I played chess for years throughout elementary, middle and high school, and I would easily say that even after hundreds of hours of playing, I might make two mistakes out of a thousand moves where the move was actually illegal, often because I had missed that moving that piece would continue to leave me in check due to a discovered check that I had missed.
It's hard to conclude from that experience that players that are amateurs lack even a basic model of chess.
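For scale: two illegal attempts per thousand moves is the same 99.8% per-move figure quoted from the paper above. A rough sketch of what that means per game, assuming (purely for illustration) that errors are independent from move to move and that one side plays about 40 moves:

```python
def legal_move_accuracy(illegal_moves: int, total_moves: int) -> float:
    """Fraction of attempted moves that were legal."""
    return 1 - illegal_moves / total_moves


def clean_game_probability(per_move_accuracy: float, moves: int) -> float:
    """Chance of getting through `moves` moves with no illegal attempt.

    Assumes each move is an independent trial, which is a simplification.
    """
    return per_move_accuracy ** moves


if __name__ == "__main__":
    accuracy = legal_move_accuracy(illegal_moves=2, total_moves=1000)
    print(f"per-move accuracy: {accuracy:.3f}")
    print(f"clean 40-move game: {clean_game_probability(accuracy, 40):.3f}")
```

Under those assumptions a 99.8% player still slips in roughly one game out of thirteen, so "around 100%" per move and "never makes an illegal move" are quite different bars.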
Can you say 100% you can generate a good next move (example from the paper) without using tools, and will never accidentally make a mistake and give an illegal move?
Question to GPT5: I am looking straight on to some objects. Looking parallel to the ground.
In front of me I have a milk bottle, to the right of that is a Coca-Cola bottle. To the right of that is a glass of water. And to the right of that there’s a cherry. Behind the cherry there’s a cactus and to the left of that there’s a peanut. Everything is spaced evenly. Can I see the peanut?
Answer (after choosing thinking mode)
No. The cactus is directly behind the cherry (front row order: milk, Coke, water, cherry). “To the left of that” puts the peanut behind the glass of water. Since you’re looking straight on, the glass sits in front and occludes the peanut.
It doesn’t consider transparency until you mention it, then apologises and says it didn’t think of transparency
It seems to me it would only actually work in an orthographic perspective, which is not how our reality works
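A minimal sketch of the scene, to show how much the answer depends on the assumptions. The grid layout, the `transparent` flag, the viewer standing directly in front of the milk bottle, and all names here are my own illustrative choices, not anything GPT-5 produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    x: int                      # column, counted left to right
    row: int                    # 0 = front row, 1 = back row
    transparent: bool = False


SCENE = [
    Item("milk bottle", 0, 0),
    Item("Coca-Cola bottle", 1, 0),
    Item("glass of water", 2, 0, transparent=True),
    Item("cherry", 3, 0),
    Item("cactus", 3, 1),
    Item("peanut", 2, 1),
]


def find(name, scene=SCENE):
    return next(item for item in scene if item.name == name)


def visible_orthographic(name, scene=SCENE):
    """Orthographic view: only an opaque item in the same column, in a
    row closer to the viewer, can hide the target."""
    target = find(name, scene)
    return not any(
        item.x == target.x and item.row < target.row and not item.transparent
        for item in scene
    )


def front_row_crossing(name, viewer_distance, scene=SCENE):
    """Perspective view: x position where the sight line from a viewer at
    (0, -viewer_distance) to the target crosses the front row (y = 0)."""
    target = find(name, scene)
    t = viewer_distance / (viewer_distance + target.row)
    return target.x * t


if __name__ == "__main__":
    print(visible_orthographic("peanut"))
    print(front_row_crossing("peanut", viewer_distance=1.0))
```

In the orthographic model the peanut is visible only because the glass is flagged transparent. With the viewer one spacing unit in front of the milk bottle, the sight line to the peanut crosses the front row at the Coca-Cola bottle's column rather than the glass's, so the answer changes again.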
Shows that even if you have a world model, it might not be the right one.
https://g.co/gemini/share/362506056ddb
Time to get the ol' goalpost-moving gloves out.
Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
They only have meaning to sentient beings, and that meaning is heavily subjective and contextual.
But there appear to be some who think that we can grasp truth through mechanical symbol manipulation. Perhaps we just need to add a few million more symbols, they think.
If we accept the incompleteness theorem, then there are true propositions that even a super-intelligent AGI would not be able to express, because all it can do is output a series of placeholders. Not to mention the obvious fallacy of knowing super-intelligence when we see it. Can you write a test suite for it?
This is missing the lesson of the Yoneda Lemma: symbols are uniquely identified by their relationships with other symbols. If those relationships are represented in text, then in principle they can be inferred and navigated by an LLM.
Some relationships are not represented well in text: tacit knowledge like how hard to twist a bottle cap to get it to come off, etc. We aren't capturing those relationships between all your individual muscles and your brain well in language, so an LLM will miss them or have very approximate versions of them, but... that's always been the problem with tacit knowledge: it's the exact kind of knowledge that's hard to communicate!
Now, maybe there are other possible experiences that would result in me behaving identically, such that from my behavior (including what words I say) it is impossible to distinguish between different potential experiences I could have had.
But, “caused me to say” is a relation, is it not?
Unless you want to say that it wasn’t the experience that caused me to do something, but some physical thing that went along with the experience, either causing or co-occurring with the experience, and also causing me to say the word I said. But, that would still be a relation, I think.
It's like trying to describe a color to a blind person: poetic subjective nonsense.
I don’t think describing colors to a blind person is nonsense. One can speak of how the different colors relate to one-another. A blind person can understand that a stop sign is typically “red”, and that something can be “borderline between red and orange”, but that things will not be “borderline between green and purple”. A person who has never had any color perception won’t know the experience of seeing something red or blue, but they can still have a mental model of the world that includes facts about the colors of things, and what effects these are likely to have, even though they themselves cannot imagine what it is like to see the colors.
You exist in the full experience. That lossy projection to words is still meaningful to you, in your reading, because you know the experience it's referencing. What do I mean by "lossy projection"? It's the experience of seeing the color blue to the word "blue". The word "blue" is meaningless without already having experienced it, because the word is not a description of the experience, it's a label. The experience itself can't be sufficiently described, as you'll find if you try to explain a "blue" to a blind person, because it exists outside of words.
The concept here is that something like an LLM, trained on human text, can't have meaningful comprehension of some concepts, because some words are labels of things that exist entirely outside of text.
You might say "but multimodal models use tokens for color!", or even extend that to "you could replace the tokens used in multimodal models with color names!" and I would agree. But the understanding wouldn't come from the relation of words in human text, it would come from the positional relation of colors across a space, which is not much different than our experience of the color on our retina.
tldr: to get AI to meaningfully understand something, you have to give it a meaningful relation. Meaningful relations sometimes aren't present in human writing.
First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.
But to your philosophical point, assuming there are only a finite number of things and places in the universe - or at least the part of which we care about - why wouldn't they be representable with a finite set of symbols?
What you're rejecting is the Church-Turing thesis [1] (essentially, that all mechanical processes, including that of nature, can be simulated with symbolic computation, although there are weaker and stronger variants). It's okay to reject it, but you should know that not many people do (even some non-orthodox thoughts by Penrose about the brain not being simulatable by an ordinary digital computer still accept that some physical machine - the brain - is able to represent what we're interested in).
> If we accept the incompleteness theorem
There is no if there. It's a theorem. But it's completely irrelevant. It means that there are mathematical propositions that can't be proven or disproven by some system of logic, i.e. by some mechanical means. But if something is in the universe, then it's already been proven by some mechanical process: the mechanics of nature. That means that if some finite set of symbols could represent the laws of nature, then anything in nature can be proven in that logical system. Which brings us back to the first point: the only way the mechanics of nature cannot be represented by symbols is if they are somehow infinite, i.e. they don't follow some finite set of laws. In other words - there is no physics. Now, that may be true, but if that's the case, then AI is the least of our worries.
Of course, if physics does exist - i.e. the universe is governed by a finite set of laws - that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.
It should be capable of something similar (fsvo similar), but the largest difference is that humans have to be power-efficient and LLMs do not.
That is, people don't actually have world models, because modeling something is a waste of time and energy insofar as it's not needed for anything. People are capable of taking out the trash without knowing what's in the garbage bag.
Wouldn't physics still "exist" even if there were an infinite set of laws?
"We could learn to sail the oceans and discover new lands and transport cargo cheaply... But in a few centuries we'll discover we were wrong and the Earth isn't really a sphere and tides are extra-complex so I guess there's no point."
.000000000000001% of infinity is still infinite.
That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.
The Humean way of looking at physics is that we notice relationships and model those with various symbols. The symbols form incomplete models because we can't get to the bottom of why the relationships exist.
> that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.
The indeterminism of quantum mechanics limits how precise measurement can be and how predictable the future is.
What I meant was that since physics is the scientific search for the laws of nature, then if there's an infinite number of them, then the pursuit becomes somewhat meaningless, as an infinite number of laws aren't really laws at all.
> The symbols form incomplete models because we can't get to the bottom of why the relationships exist.
Why would a model be incomplete if we don't know why the laws are what they are? A model pretty much is a set of laws; it doesn't require an explanation (we may want such an explanation, but it doesn't improve the model).
It would be interesting to know what the percentage of people is, who invoke the incompleteness theorem, and have no clue what it actually says.
Most people don't even know what a proof is, so that cannot be a hindrance on the path to AGI ...
Second: ANY world model that can be digitally represented would be subject to the same argument (if stated correctly), not only LLMs.
Why would you require an LLM to have proof for the things it says? I mean, that would be nice, and I am actually working on that, but it is not anything we would require of humans and/or HN commenters, would we?
I am hearing the term super intelligence a lot and it seems to me the only form that would take is the machine spitting out a bunch of symbols which either delight or dismay the humans. Which implies they already know what it looks like.
If this technology will advance science or even be useful for everyday life, then surely the propositions it generates will need to hold up to reality, either via axiomatic rigor or empirically. I look forward to finding out if that will happen.
But it's still just a movement from the known to the known, a very limited affair no matter how many new symbols you add in whatever permutation.
Symbols, maps, descriptions, and words are useful precisely because they are NOT what they represent. Representation is not identity. What else could a “world model” be other than a representation? Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?
I was following the string of questions, but I think there is a logical leap between those two questions.
Another question: is language the only way to define models? An imagined sound or an imagined picture of an apple in my mind's eye are models to me, but they don't use language.
Now, given a program that is supposed to output text that encodes true statements (in some language), one can probably define some sort of inference system that corresponds to the program such that the inference system is considered to “prove” any sentence that the program outputs (and maybe also some others based on some logical principles, to ensure that the inference system satisfies some good properties), and upon defining this, one could (assuming the language allows making the right kinds of statements about arithmetic) show that this inference system is, by Gödel’s theorems, either inconsistent or incomplete.
This wouldn’t mean that the language was unable to express those statements. It would mean that the program either wouldn’t output those statements, or that the system constructed from the program was inconsistent (and, depending on how the inference system is obtained from the program, the inference system being inconsistent would likely imply that the program sometimes outputs false or contradictory statements).
But, this has basically nothing to do with the “placeholders” thing you said. Gödel’s theorem doesn’t say that some propositions are inexpressible in a given language, but that some propositions can’t be proven in certain axiom+inference systems.
Rather than the incompleteness theorems, the “undefinability of truth” result seems more relevant to the kind of point I think you are trying to make.
Still, I don’t think it will show what you want it to, even if the thing you are trying to show is true. Like, perhaps it is impossible to capture qualia with language, sure, makes sense. But logic cannot show that there are things which language cannot in any way (even collectively) refer to, because to show that there is a thing it has to refer to it.
————
“Can you write a test suite for it?”
Hm, might depend on what you count as a “suite”, but a test protocol, sure. The one I have in mind would probably be a bit expensive to run if it fails the test though (because it involves offering prize money).
It has been noted for several years in US national labs and elsewhere that there is an almost perfect overlap between data models LLMs are poor at learning and data models that we struggle to index at scale. If LLMs were actually good at these things then there would be a straightforward path to addressing these longstanding non-AI computer science problems.
The incompleteness is that the LLM tech literally can't represent elementary things that are important enough that we spend a lot of money trying to represent them on computers for non-AI purposes. A super-intelligent AGI being right around the corner implies that we've solved these problems that we clearly haven't solved.
Perhaps more interesting, it also implies that AGI tech may look significantly different than the current LLM tech stack.
And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision. So if the incompleteness theorem means that neural nets can never find truth, it also means that the human brain can never find truth.
Human neuron firing patterns, after all, only represent a thing; they are not the same as the thing. Your experience of seeing something isn't recreating the physical universe in your head.
Wouldn't it become harder to simulate a human brain the larger a machine is? I don't know nothing, but I think that pesky speed of light thing might pose a challenge.
[1]: https://www.experimental-history.com/p/you-cant-reach-the-br...
There are a lot of negatives in there, but I feel like it boils down to: a model of a thing is not the thing. Well duh. It's a model. A map is a model.
Where do humans get new insights from?
SOTA LLMs do play legal moves in chess; I don't know why the article seems to say otherwise.
https://dynomight.net/more-chess/
This is significant in general because I personally would love to get these things to code-switch into "hackernews poster" or "writer for the Economist" or "academic philosopher", but I think the "chat" format makes it impossible. The inaccessibility of this makes me want to host my own LLM...
https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...
King Frederick the Great of Prussia had a very fine army, and none of the soldiers in it were finer than the Giant Guards, who were all extremely tall men. It was difficult to find enough soldiers for these Guards, as there were not many men who were tall enough.
Frederick had made it a rule that no soldiers who did not speak German could be admitted to the Giant Guards, and this made the work of the officers who had to find men for them even more difficult. When they had to choose between accepting or refusing a really tall man who knew no German, the officers used to accept him, and then teach him enough German to be able to answer if the King questioned him.
Frederick, sometimes, used to visit the men who were on guard around his castle at night to see that they were doing their job properly, and it was his habit to ask each new one that he saw three questions: “How old are you?” “How long have you been in my army?” and “Are you satisfied with your food and your conditions?”
The officers of the Giant Guards therefore used to teach new soldiers who did not know German the answers to these three questions.
One day, however, the King asked a new soldier the questions in a different order; he began with, “How long have you been in my army?” The young soldier immediately answered, “Twenty-two years, Your Majesty”. Frederick was very surprised. “How old are you then?”, he asked the soldier. “Six months, Your Majesty”, came the answer. At this Frederick became angry. “Am I a fool, or are you one?” he asked. “Both, Your Majesty”, the soldier answered politely.
As in, an alien could teach one of our AIs their language faster than an alien could teach a human, and vice versa..
..though the potential for catastrophic disasters is also great there lol