Instead of VICReg, they induced their latent state with sparse auto-encoding. They also predicted in pixel space, as opposed to latent space. The white paper describing their tech is a bit of a mess, but schematically, at least, the hierarchical architecture they describe bears a strong resemblance to the hierarchical JEPA models LeCun outlined in his big position paper from a few years ago. A notable difference, though, is that their system is essentially a reflex agent, as opposed to possessing a planning/optimization loop.
Over the last few months I've been inventing almost exactly this approach in my head as a hobby, without consciously knowing it had already been done. I love their little RC car demo.
How does this compare with existing alternatives? Maybe I'm just lacking proper context, but a minimum 20% failure rate sounds pretty bad. The paper compares their results with older approaches, which apparently had something like a 15% success rate, so jumping to 80% does seem significant. If I'm reading the paper correctly, the time required to compute and execute each action also went down from 4 minutes to 16 seconds, which likewise seems significant.
Having to specify an end goal as an image seems pretty limited, but at least the authors acknowledge it in the paper:
> Second, as mentioned in Section 4, V-JEPA 2-AC currently relies upon tasks specified as image goals. Although this may be natural for some tasks, there are other situations where language-based goal specification may be preferable. Extending the V-JEPA 2-AC to accept language-based goals, e.g., by having a model that can embed language-based goals into the V-JEPA 2-AC representation space, is another important direction for future work. The results described in Section 7, aligning V-JEPA 2 with a language model, may serve as a starting point.
I think it would be interesting if the authors answered whether they think there's a clear trajectory towards a model that can be trained to achieve a >99% success rate.
You train a VLA (vision-language-action) model for a specific pair of robotic arms and a specific task. The end-actuator actions are embedded in the model itself. So say you train a pair of arms to pick an apple: you cannot zero-shot it to pick up a glass. What you see in demos is the result of lots of training and fine-tuning (few-shot) on specific object types and with specific robotic arms or bodies.
The language intermediary embedding brings some generalising skill to the table, but not much. The vision -> language -> action translation is, how do I put this, brittle at best.
What these guys are showing is a zero-shot approach to new tasks in new environments with 80% accuracy. That is a big deal. Pi0 from Physical Intelligence is probably the best model to compare against.
Work that was once done by 10 humans can now be done by 10 robots + 2 humans for the 20% failure cases, at a lower total cost.
All statistical models of the kind in use are interpolations through historical data -- there's no magic. So when you interpolate through historical texts, your model is a model of historical text.
Text is not a measure of the world. Saying "the sky is blue" is not even reliably associated with the blueness of the sky, let alone with the fact that the sky isn't blue (there is no sky, and the atmosphere isn't blue).
These models appear to "capture more" only because, when you interpret the text, you attribute meaning/understanding to it as the cause of its generation -- but that wasn't the cause, so this is necessarily an illusion. There is no model of the world in a model of historical text -- there is a model of the world in your head which you associate with text, and that association is exploited when you use LLMs to do more than mere syntax transformation.
LLMs excel most at "fuzzy retrieval" and things like coding -- the latter is principally a matter of syntax, and the former of recollection. As soon as you require the prompt-completion to maintain "semantic integrity" under constraints that are neither syntactical nor retrievable, it falls apart.
One other nitpick is that you confine this to "historical data", although other classes of data are trained on as well, such as simulated and generated data.
Generalisation is the opposite process: hypothesising a universal and finding counter-examples to constrain that universal generalisation. E.g., "all fire burns" is hypothesised by a competent animal upon encountering fire once.
Inductive "learners" take the opposite approach: fire burns in "all these cases", and if you have a case similar to those, then fire will burn you.
They can look the same within the region of interpolation, but look very different when you leave it: all of these systems fall over quickly when more than a handful of semantic constraints are imposed. That number is a measure of the distance from the interpolation boundary (e.g., consider this interpretation of Apple's latest paper on reasoning in LLMs: the "environment complexity" is nothing other than a measure of interpolation-dissimilarity).
Early modern philosophers of science were very confused by this, but it's in Aristotle plain as day, and it has been extremely well established since the 80s, when the development of formal computational stats necessitated making it clear: interpolation is not generalisation. The former does not get you robustness to irrelevant permutation (i.e., generalisation); it does not permit considering counterfactual scenarios (i.e., generalisation); it does not give you a semantics/theory of the data generating process (i.e., generalisation, i.e. a world model).
Interpolation is a model of the data. Generalisation requires a model of the data generating process, the former does not give you the latter, though it can appear to under strong experimental assumptions of known causal models.
Here LLMs model the structure of language-as-symbolic-ordering. That structure, "in the interpolated region", expresses reasoning, but it isn't a model of reasoning. It's a model of reasoning as captured in historical cases of it.
However, the whole issue of Othello is a non sequitur, which indicates that the people involved here don't really seem to understand the issue, or what a world model is.
A "world model" is a model of a data generating process which isn't reducible-to or constituted by its measures. Ie., we are concerned for the case where there's a measurement space (eg., that of the height of mercury in a thermometer) and a target property space (eg., that of the temperature of the coffee). So that there is gap between the data-as-measure and its causes. In language this gap is massive: the cause of my saying, "I'm hungry" may have nothing to do with my hunger, even if it often does. For "scientific measuring devices", these are constructed to minimize this gap as much as possible.
In any case, with board games and other mathematical objects, there is no gap. The data is the game. The "board state" is an abstract object constituted by all possible board states. The game "is made out of" its realisations.
However, the world isn't made out of language, nor coffee out of thermometers. So a model of the data isn't a model of its generating process.
So whether an interpolation of board states "fully characterises", in some way, an abstract mathematical object, "the game", is so irrelevant to the question that it betrays a fundamental lack of understanding of even what's at issue.
No one is arguing that a structured interpolative model (i.e., one given an inductive bias by an NN architecture) doesn't express properties of the underlying domain in its structure. The question is what happens to this model of the data when you have the same data generating process but you aren't in the interpolated region.
This problem, in the limit of large data, cannot arise for abstract games, by their nature: e.g., a model classifying the input X into legal/illegal board states just is the game.
Another way of phrasing this: ML/AI textbooks often begin by assuming there's a function you're approximating. But in the vast majority of cases where NNs are used, there is no such function -- there is no function tokens -> meanings (e.g., "I am hungry" is ambiguous).
But in the abstract maths case there is a function: {boards} -> {Legal, Illegal} is a function; there are no ambiguous boards.
So: of the infinite number of approximations f* to f_game, any is valid in the limit len(X) -> inf. Of the infinite number of approximations f*_lang to f_language, all are invalid (each in its own way).
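A toy way to see the asymmetry (tic-tac-toe standing in for the board-game case; everything below is invented for illustration):

```python
# The board-game case: {boards} -> {legal, illegal} is a genuine function.
# Every input has exactly one correct output, fixed by the rules alone.
def tictactoe_is_legal(board: str) -> bool:
    """board: 9 chars of 'X', 'O', '.' read row by row (simplified check:
    ignores play-continuing-after-a-win states)."""
    if len(board) != 9 or any(c not in "XO." for c in board):
        return False
    x, o = board.count("X"), board.count("O")
    return x - o in (0, 1)  # X moves first, so counts differ by at most one

# The language case: tokens -> meaning is not a function. The same string
# has several incompatible "outputs" depending on context absent from the
# text itself, so there is no single f to approximate.
utterance = "I am hungry"
possible_causes = [
    "the speaker is hungry",
    "the speaker wants to end the meeting",
    "the speaker is quoting someone else",
    "the speaker is rehearsing a line",
]
```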
So is V-JEPA 2 actually generating a world model, as you've defined it here? It's still just sampling data -- visual data, tactile feedback, etc. are all reducible to quantised data. It seems like you could build useful models that seem to generalise without that. For example, a model could learn to stop dropping things without ever developing a theory of gravity.
Probably I'm still misunderstanding too much for this to be useful, but what I've read from you in this thread is way more useful to my understanding than what I've seen before.
While they may not be world models under my definition above, they are something like world-model-generating-models. They work like our sensory-motor system, which itself builds "procedural proxy models" of the world -- and these become world models when they are cognised (conceptualised, made abstract, made available to the imagination, etc.).
Contrast a very simple animal which can move a leaf around with a more complex one (e.g., a mouse) which can imagine the leaf in various orientations. It's that capacity, especially of mammals (and birds, etc.), to reify their sensory-motor "world-model-generating" capacity, e.g., in imagination, which allows them to form world models in their heads. We require something like imagination in order to be able to hypothesise a general model, form a hypothetical action, and try that action out.
I'm less concerned about making this distinction clear for casual observers in the case of robotics, because, imv, competent acting in the world can lead to building world models, whereas most other routes cannot.
What these robots require, to have world models in my view, are firstly these sensory-motor models and then a reliable way of 1) acquiring new SM models live (i.e., learning motor techniques); and 2) reporting on what they have learned in a reasoning/cognitive context.
Robotics is just at stage 0 here: the very basics of making a sensory-motor connection.
This too could form the basis of a productive skepticism towards the usefulness of coding agents, unlike what has caught attention here. (Referring specifically to the post by tptacek)
For example, we could look at feedback from the Lisp community (beyond anecdata) on the usefulness of LLMs. Since Lisp is what one might call "syntax-lite", a lack of true generalisation ability ("no possible world model for an unavoidably idiosyncratic DSL-friendly metalanguage") could show up as a lack of ability not just to generate code, but even to fix it.
Beyond that, there's the issue of how much the purported world-shattering usefulness of proof assistants based on, say, Lean 4 must depend on interpolating, say, mathlib.
In short, please link the papers :)
>There are two follow up papers showing the representations are "entangled", a euphemism for statistical garbage, but I can't be bothered at the moment to find them.
a token & misguided attempt to surface relevant lit might incite you to shove the obviousness down my throat :)
Here's one on disentangling representations in LLMs: https://arxiv.org/abs/2505.18774v1
Generalisation in the popular sense (science, stats, philosophy of science, popsci) is about reliability and validity. Validity: does the model track the target properties of the system we expect? Reliability: does it continue to do so in environments in which those features are present but irrelevant permutations are made?
Interpolation is "curve fitting", which is almost all of ML/AI. The goal of curve fitting is to replace a general model with a summary of the measurement data. This is useful when you have no way of obtaining a model of the data generating process.
What people in ML assume is that there is some true distribution of measurements, and "generalisation" means interpolating the data so that you capture the measurement distribution.
I think it's highly likely there's a profound conceptual mistake in assuming measurements themselves have a true distribution, so even the sense of generalisation as "have we interpolated correctly?" is, in most cases, meaningless.
Part of the problem is that ML textbooks frame all ML problems with the same set of assumptions (e.g., that there exists an f: X -> Y, and that X has a "true distribution" Dx, so that finding f* implies learning Dx). For many datasets, these assumptions are false. Compare running a linear regression on photos of the sky, through stars, to get star signs, vs. running it on V = IR electric circuit data to get `R`.
In the former case, there is no f_star_sign to find; there is no "true distribution" of star-sign measurements; etc. So any model of star signs cannot even be a model of measurements of star signs. ML textbooks do not treat "data" as having these kinds of constraints, or relationships to reality, which breeds pseudoscientific and credulous misunderstandings of the issues (such as, indeed, the Othello paper).
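To make the contrast concrete, here's a minimal sketch of the circuit case (numbers invented), where the fitted coefficient really does track a property of the data generating process:

```python
# Fitting V = I * R on simulated circuit measurements. The slope recovers a
# real property (R) of the generating process; the same procedure run on
# star-sign "data" would produce a coefficient that tracks nothing.
import numpy as np

rng = np.random.default_rng(0)
R_true = 47.0                                   # ohms (assumed for the example)
I = rng.uniform(0.01, 0.1, size=200)            # applied currents (A)
V = R_true * I + rng.normal(0, 0.05, size=200)  # measured voltages, with noise

# least-squares slope through the origin: R_hat = sum(I*V) / sum(I*I)
R_hat = np.sum(I * V) / np.sum(I * I)
print(f"estimated R ~= {R_hat:.1f} ohms")       # close to 47
```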
This is painfully accurate.
The conversations go like this:
Me: “guys, I know what I’m talking about, I wrote my first neural network 30 years ago in middle school, this tech is cool but it isn’t magic and it isn’t good enough to do the thing you want without getting us sued or worse.”
Them: “Bro, I read a tweet that we are on the other side of the singularity. We have six months to make money before everything blows up.”
This learning technology didn't exist until this moment in time. That probably has more to do with why no one is using it in the wild.
Or the robot is supervised all the time.
Or just operates in an area without humans.
But so far this is research, not market ready.
For text, with a two-byte tokenizer you get 2^16 (65,536) possible next tokens, and computing a probability distribution over them is very much doable. But the number of "possible next frames" in a video feed is astronomically larger. If one frame is 1 megabyte uncompressed (instead of just 2 bytes for a text token), there are 2^(8*2^20) possible next frames, which is far too large a number. So we somehow need to predict only an embedding of the next frame: roughly how it will look, not its exact pixels.
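The back-of-the-envelope arithmetic behind that gap:

```python
import math

vocab_size = 2 ** 16              # 65,536 possible next tokens
bits_per_frame = 8 * 2 ** 20      # one 1 MB uncompressed frame

# number of decimal digits in the count of possible next frames: bits * log10(2)
digits = bits_per_frame * math.log10(2)

print(vocab_size)                 # 65536 -- a softmax over this is routine
print(round(digits))              # ~2,525,223 digits -- no explicit distribution
                                  # over raw frames is feasible, hence predicting
                                  # an embedding of the frame instead
```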
Moreover, for robotics we don't want to just predict the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals do, including humans. We constantly anticipate what happens to us in "the future", approximately, and where the farther future is predicted progressively less exactly. We are relatively sure of what happens in a second, but less and less sure of what happens in a minute, or a day, or a year.
There's then evidence of what's called Predictive Coding. When that future happens, a higher level circuit decides how far off we were, and then releases appropriate neuromodulators to re-wire that circuit.
That would mean that to learn faster, you want to expose yourself to situations where you are often wrong: be often surprised and go down the wrong paths. Have a feedback mechanism which will tell you when you're wrong. This is maybe also why the best teachers are the ones who often ask the class questions for which there are counter-intuitive answers.
Yes, and ideally there would be whole backpropagation passes which update the entire model depending on how much the current observation diverges from past predictions. (Though brains use an updating mechanism which diverges from the backpropagation algorithm.)
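A very rough toy of that idea (this is not how brains update, and not the JEPA recipe; just the "update scaled by surprise" intuition, with invented shapes and a stand-in linear predictor):

```python
# Toy "predictive coding"-flavoured loop: the size of the weight update is
# driven by how wrong the previous prediction was (the surprise), rather
# than by a fixed schedule. Purely illustrative.
import torch
import torch.nn.functional as F

predictor = torch.nn.Linear(4, 4)          # stand-in: state_t -> state_{t+1}
opt = torch.optim.SGD(predictor.parameters(), lr=0.01)

def observe() -> torch.Tensor:
    return torch.randn(4)                  # placeholder sensory stream

state = observe()
for step in range(100):
    prediction = predictor(state)
    next_state = observe()
    surprise = F.mse_loss(prediction, next_state)
    opt.zero_grad()
    surprise.backward()
    for p in predictor.parameters():       # big error -> big "rewiring"
        p.grad *= surprise.detach().clamp(max=10.0)
    opt.step()
    state = next_state
```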
Edit: Apparently the theory of this is broadly known (apart from "JEPA" and "predictive coding") also under the names "free energy principle" and "active inference": https://en.wikipedia.org/wiki/Free_energy_principle
An LLM takes in input, transforms it into an embedding, and makes predictions off that embedding. The only high-level difference I can see is that currently LLMs do it in a "single pass" where they output tokens directly (and CoT is sort of a hack to get reasoning by "looping" in autoregressive output-token space), but IIRC there are some experimental variants that do looped latent reasoning.
Any high-level comparison I can find almost strawmans LLMs: yes, they take in token embeddings directly, but the first few layers of an LLM almost surely convert those into more abstract embeddings, as seen in RepE research. Since the best way to predict is to actually internalize a world model, there's no reason to believe that multimodal LLMs can't make predictions about physical changes in the same way that JEPA claims to. That said, JEPA may be able to do it more efficiently; attention almost surely isn't the _optimal_ architecture for doing all this.
But an analogous pretraining approach isn't available for robotics. Robots take in sensory data and return movements, in real-time. There is no large data corpus of this pairing to do self-supervised learning on, like there is for text.
Even if we only consider pure video-to-video models, for which there is a large amount of training data for self-supervised learning, the autoregressive next-token-predictor approach wouldn't work. That's why Veo 3 & Co are diffusion models. Predicting the next frame directly doesn't work: it's far too much data. Text comes in relatively tiny, discrete amounts with high useful information content per bit. Video is huge, basically continuous, and has quite low useful information content per bit (because of things like irrelevant details and noise), at least as far as robotics is concerned.
Moreover, even if next frame-prediction would work, this doesn't really do what we want for robotics. The robot doesn't just need a prediction about the next frame (or embedding of the next frame) when planning its movements, but potentially broadly about the next millions of frames, about things that are much further out in the future.
But the residual stream of LLMs doesn't "just" encode the next token prediction, it is high-level enough to encode predictions for a few tokens out, as seen with things like Multi-token prediction.
But yes I can see that in terms of input, you probably don't want to take in video frames directly and training via teacher-forcing is probably inefficient here. So some world-model-tailored embedding like JEPA is probably better. I guess my confusion is that Yann seems to frame it as JEPA vs LLM, but to me JEPA just seems like an encoder to generate embeddings that can be fed into an LLM. They seem complementary rather than a substitute.
This is easily generated synthetically from a kinematic model, at least up to a certain level of precision.
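For instance (a hypothetical two-link planar arm; link lengths and noise levels are arbitrary), synthetic (state, action, next-state) tuples can be rolled out from forward kinematics alone:

```python
# Generating synthetic arm trajectories from a kinematic model: forward
# kinematics gives the end-effector position for any joint configuration,
# so training tuples can be produced without touching a real robot.
import numpy as np

L1, L2 = 0.30, 0.25  # link lengths in metres (assumed)

def forward_kinematics(theta1: float, theta2: float) -> np.ndarray:
    x = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    y = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return np.array([x, y])

rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, size=2)
dataset = []
for _ in range(10_000):
    action = rng.normal(0, 0.05, size=2)   # small joint-angle deltas
    next_theta = theta + action
    dataset.append((forward_kinematics(*theta), action,
                    forward_kinematics(*next_theta)))
    theta = next_theta
```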
It did work for AlphaGo Zero (and later AlphaZero), which were entirely trained on synthetic data. But that's for very simple games with strict formal rules, like Go and chess.
Are you saying that LLMs hold concepts in latent space (weights?), but the actual predictions are always in tokens (thus inefficient and lossy), whereas JEPA operates directly on concepts in latent space (plus encoders/decoders)?
I might be using the jargon incorrectly!
BTW I'm very much not an expert here and I'm just trying to understand how this system works end to end. Don't take anything I write here as authoritative.
https://ai.meta.com/research/publications/sonar-sentence-lev...
That's very much an unsolved problem, and I don't know how far Meta is along that path. Not very far, I assume.
> V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
> After the actionless pre-training stage, the model can make predictions about how the world might evolve—however, these predictions don’t directly take into account specific actions that an agent would take. In the second stage of training, we focus on making the model more useful for planning by using robot data, which includes visual observations (video) and the control actions that the robot was executing. We incorporate this data into the JEPA training procedure by providing the action information to the predictor. After training on this additional data, the predictor learns to account for specific actions when making predictions and can then be used for control. We don’t need a lot of robot data for this second phase—in our technical report, we show that training with only 62 hours of robot data already results in a model that can be used for planning and control.
> We demonstrate how V-JEPA 2 can be used for zero-shot robot planning in new environments and involving objects not seen during training. Unlike other robot foundation models—which usually require that some training data come from the specific robot instance and environment where the model is deployed—we train the model on the open source DROID dataset and then deploy it directly on robots in our labs. We show that the V-JEPA 2 predictor can be used for foundational tasks like reaching, picking up an object, and placing it in a new location.
> For short-horizon tasks, such as picking or placing an object, we specify a goal in the form of an image. We use the V-JEPA 2 encoder to get embeddings of the current and goal states. Starting from its observed current state, the robot then plans by using the predictor to imagine the consequences of taking a collection of candidate actions and rating the candidates based on how close they get to the desired goal. At each time step, the robot re-plans and executes the top-rated next action toward that goal via model-predictive control. For longer horizon tasks, such as picking up an object and placing it in the right spot, we specify a series of visual subgoals that the robot tries to achieve in sequence, similar to visual imitation learning observed in humans. With these visual subgoals, V-JEPA 2 achieves success rates of 65% – 80% for pick-and-placing new objects in new and unseen environments.
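Schematically, the planning loop in that last quoted paragraph looks something like the following. The `encoder` and `predictor` here are hypothetical stand-ins for the learned V-JEPA 2-AC components, and the sampler is the simplest possible random-shooting variant rather than whatever Meta actually uses:

```python
# Goal-image model-predictive control, as described above: sample candidate
# action sequences, roll them out in latent space with the predictor, score
# by distance to the goal embedding, execute the first action, then re-plan.
import numpy as np

def plan_next_action(encoder, predictor, current_frame, goal_frame,
                     horizon=5, n_candidates=256, action_dim=7, rng=None):
    rng = rng or np.random.default_rng()
    z_goal = encoder(goal_frame)              # embedding of the goal image
    z_now = encoder(current_frame)            # embedding of the current state

    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        actions = rng.normal(0.0, 1.0, size=(horizon, action_dim))
        z = z_now
        for a in actions:                     # imagined rollout in latent space
            z = predictor(z, a)
        cost = np.linalg.norm(z - z_goal)     # how close we end up to the goal
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action                  # execute, observe, then re-plan
```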
The training objective is, schematically,

minimize over θ, φ of || P_φ(E_θ(x), Δ) - sg(E_θ'(y)) ||_1

where

y - full video; x - masked video; E_θ(.) - learned encoder (semantic embedding); P_φ(.) - learned predictor; Δ - learned mask tokens (indicating which patches in the video were dropped); sg(.) - stop gradient, preventing gradient propagation into E_θ'(.), which in turn is an exponential moving average of E_θ(.), i.e. θ'_new <- τ θ'_old + (1-τ) θ. So the loss is applied only to the predictions of the masked patches, while the target encoder of the full video follows the learned one. This asymmetry in learning prevents the encoder from collapsing to a trivial constant.
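A minimal sketch of that objective (PyTorch-style, toy shapes, linear stand-ins for what are really ViT encoders and a transformer predictor; the mask handling is simplified to zeroing patches rather than feeding mask tokens):

```python
# Masked latent prediction with an EMA target encoder: predict the target
# encoder's embeddings of the full clip from the encoding of a masked clip,
# L1 loss on masked positions only, EMA update of the target afterwards.
import torch
import torch.nn.functional as F

dim, tau = 64, 0.996
encoder = torch.nn.Linear(dim, dim)            # E_theta   (stand-in)
predictor = torch.nn.Linear(dim, dim)          # P_phi     (stand-in)
target_encoder = torch.nn.Linear(dim, dim)     # E_theta'  (EMA copy)
target_encoder.load_state_dict(encoder.state_dict())
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def train_step(y):                             # y: (n_patches, dim) "full video"
    mask = torch.rand(y.shape[0]) < 0.5        # which patches are dropped
    x = y.clone(); x[mask] = 0.0               # masked video
    pred = predictor(encoder(x))               # P_phi(E_theta(x), Delta)
    with torch.no_grad():
        target = target_encoder(y)             # sg(E_theta'(y))
    loss = F.l1_loss(pred[mask], target[mask]) # loss only on masked patches
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                      # theta' <- tau*theta' + (1-tau)*theta
        for pt, p in zip(target_encoder.parameters(), encoder.parameters()):
            pt.mul_(tau).add_(p, alpha=1 - tau)
    return loss.item()

train_step(torch.randn(16, dim))
```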
It's one of those ideas I've had kicking around for a while: if you fused decent object tracking with an understanding of Verlet integration, you should, in principle, start being able to measure all sorts of physical quantities quite easily.
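For example (a toy version that assumes you already have a calibrated, clean track of an object's position over time):

```python
# Recovering velocity and acceleration from tracked positions using the
# central-difference relations underlying Verlet integration. With the
# acceleration in hand you can start estimating forces, restitution on
# bounces, masses given known forces, and so on.
import numpy as np

dt = 1 / 30                                     # frame interval (30 fps assumed)
t = np.arange(0, 2, dt)
# synthetic "track" of a thrown ball: constant horizontal velocity plus gravity
positions = np.stack([2.0 * t, 1.0 * t - 0.5 * 9.81 * t ** 2], axis=1)

# v_t = (x_{t+1} - x_{t-1}) / (2 dt),  a_t = (x_{t+1} - 2 x_t + x_{t-1}) / dt^2
velocity = (positions[2:] - positions[:-2]) / (2 * dt)
acceleration = (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt ** 2

print(acceleration.mean(axis=0))                # ~[0, -9.81]: gravity falls out
```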
As usual, comparisons with humans provide little practical insight into what's achievable with ML. Humans don't have to learn everything from scratch like ML models do; you don't expect ML models to learn language from a few thousand tokens just because humans can, so similarly you shouldn't expect neural networks to learn reasoning from world interaction alone.
I mean, it still takes them much more time than it takes to train even the largest LLMs we use (a couple months)
Human: 1000 tok * 60 * 86400 * 365 = 2 Trillion tokens / year
GPT-4: 13 Trillion tokens
Llama-3: 15 Trillion tokens
Between the non-uniformity of receptor density (e.g. fovea vs. peripheral vision, though this is general across all senses), dynamic receptive fields, and the fact that information is encoded in spike rates and timing patterns across neural populations, the idea of pixels in some bitmap at some resolution is beyond misleading. There is no pixel data, just sparsely coded feature representations capturing things like edges, textures, motion, and color contrast, already at the retina.
While hundreds of trillions of photons might hit our photoreceptors, > 99% of that is filtered and or compressed before even reaching retinal ganglion cells. Only a tiny fraction, about 10 million bits/sec, of the original photon signal rate is transferred through the optic nerve (per eye). This pattern of filtering and attentive prioritization of information in signals continues as we go from sensory fields to thalamus to higher cortical areas.
So while we might encounter factoids like: on the order of a billion bits per second of data hit photoreceptors or [10Mb/s transferred](https://www.britannica.com/science/information-theory/Physio...) along optic nerves, it's important to keep in mind that a lot of the intuition gained from digital information processing does not transfer in any meaningful sense to the brain.
J.E.P.A.
Those models don't have any understanding of physics, they just regurgitate what they see in their vision-based training set, just like any image or video generation model does.
Monkey see other monkey cannot go through wall, monkey don't try go through wall.
Of course these models do not understand physics in the way a physicist or a mathematician would. But they do form a model of the world that can be used for forecasting and reasoning, in a way possibly not unlike how humans and other animals operate when interacting with the physical world.
I mean... we are just monkeys. Did we not learn this way when we were younger?
These models/robots aren't superintelligent by any means, but "Monkey see other monkey cannot go through wall, monkey don't try go through wall" isn't far off from how some animals/humans "learn".
A good definition of "real AGI" might be: a multimodal model which understands time-based media, space, and object behavior, and hence has true agency.
Phenomenology is the philosophy of "things as they seem," not "knowledge (words) about things." Seem to our senses, not understood through language.
LLMs of course trade in language tokens.
We can extend their behavior with front ends which convert other media types into such tokens.
But we can do better with multimodal models which are trained directly on other inputs. E.g. integrating image classifiers with language models architecturally.
With those one can sort of understand time-based media, by sampling a stream and getting e.g. transcripts.
But again, it's even better to build a time-based multimodal model which directly ingests time-based media rather than sampling it. (Other architectures than transformers are going to be required to do this well, IMO...)
The bootstrapping continues. This work is about training models to understand world and object properties by introducing agency.
Significant footnote: implicitly models trained to interact with the world necessarily have a "self model" which interacts with the "world model." Presumably they are trained to preserve their expensive "self." Hmmmmm....
When we have a model that knows about things not just as nodes in a language graph but also how such things look, and sound, and move, and "feel" (how much mass they have, how they move, etc.)...
...well, that is approaching indistinguishable from one of us, at least wrt embodiment and agency.