AI Architecture · World Models · JEPA

Yann LeCun's Bet That Intelligence Starts in the World

JEPA is not a new chatbot trick. It is a wager that the next useful AI systems will learn to predict consequences, not just complete sentences.

June 2, 2026 · 18 min read
Yann LeCun at École Polytechnique in 2018. Photo by Jérémy Barande / École polytechnique Université Paris-Saclay, CC BY-SA 2.0, via Wikimedia Commons.

The most expensive argument in AI right now is not "open weights versus closed weights" or "agents versus copilots."

It is whether intelligence begins with language.

The mainstream answer has been yes, or close enough. Take a transformer, feed it text, scale the hell out of it, bolt on vision, tool use, memory, RL, product screenshots, and vibes. The model starts as autocomplete and somehow becomes useful enough that everyone pretends the ontology is clean.

It is not clean.

Yann LeCun's counter-bet is JEPA, short for Joint Embedding Predictive Architecture. The claim is simple enough to be dangerous: real intelligence does not start with producing words. It starts with building internal models of the world, then using those models to predict what will happen if you act.

This is why LeCun's new company, AMI Labs, reportedly raised $1.03 billion in March 2026 to build world models. That is an absurd amount of capital for what is still mostly a research program. But absurdity is normal now. We gave autocomplete a trillion-dollar cap table. The bar for financial sanity left the building a while ago.

The thesis

LLMs learned language. JEPA tries to learn consequence. The difference matters because agents do not just need to answer. They need to choose actions before reality slaps them for choosing badly.

Why Next-Pixel Prediction Fails

The podcast starts in the obvious place: next-token prediction worked weirdly well for language. You give the model text, hide the next token, and train it to predict the missing piece. Do that at scale and you get GPT-style models with useful internal representations.

So why not do the same thing with video?

Take a few frames. Predict the next frame. Train on the pixels. In theory, the model learns how the world moves.

In practice, it learns blur.

The reason is not mysterious. If a ball might bounce left or right, and your model is forced to output one pixel-level future under a regression loss such as mean squared error, the loss-minimizing answer is the average of the possible futures. Average a left bounce and a right bounce and you get a smear. Repeat that autoregressively and your future becomes soup.
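The averaging argument is easy to check on a toy case. This is a minimal sketch, assuming a made-up 8-pixel "frame", two equally likely futures, and mean squared error as the loss; none of these specifics come from a real video model.

```python
import numpy as np

# Two equally likely futures on a 1-D "frame" of 8 pixels:
# the ball (value 1.0) ends up near the left edge or near the right edge.
left = np.zeros(8)
left[1] = 1.0
right = np.zeros(8)
right[6] = 1.0
futures = np.stack([left, right])  # shape (2, 8)


def expected_mse(prediction, outcomes):
    """Mean squared error averaged over equally likely outcomes."""
    return float(np.mean((outcomes - prediction) ** 2))


# The single prediction that minimizes expected MSE is the per-pixel mean.
blur = futures.mean(axis=0)

mse_blur = expected_mse(blur, futures)
mse_left = expected_mse(left, futures)
mse_right = expected_mse(right, futures)

print("blurred prediction:", blur)
print("expected MSE, blur:", mse_blur)
print("expected MSE, commit to left:", mse_left)
print("expected MSE, commit to right:", mse_right)
```

The blurred frame shows half a ball in two places, which matches neither real future, yet it scores better than committing to either one. The loss is rewarding the smear.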

Language can dodge some of this because tokens are discrete. "Left" and "right" are different outputs. Video does not have a neat 50,000-token vocabulary. A frame is millions of continuous values, most of them irrelevant.

Leaves moving behind a car have lots of pixels. They should not get lots of intelligence.

Fig. 01 · The wrong target eats the model. [Diagram: two self-supervised targets. Language: predict the next token ("the ball bounced to the left" or "right"); discrete choices, so ambiguity can stay explicit. Video: predict the next pixels and get an averaged future; continuous pixels, so ambiguity becomes blur.]
Generative video prediction asks the model to render every uncertainty. JEPA changes the target so the model can focus on the state that matters.

JEPA Moves the Loss Into Meaning Space

JEPA does not ask the model to generate the missing image, frame, text, or action. It asks the model to predict an embedding.

The setup is clean:

Encode X. Encode Y. Train a predictor to map embedding(X) to embedding(Y).

That is the whole trick. The system is still predictive, but it predicts in latent space. It predicts the compressed representation of the future, not the raw sensory surface.
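In code, the shape of the idea looks roughly like this. It is a minimal sketch and not any published JEPA implementation: the data is synthetic, the encoders are frozen random linear maps (an assumption to keep it short; real variants train the context encoder and update the target encoder more slowly), and only a linear predictor is trained. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: Y is a noisy linear function of X, standing in for
# "context" and "target" views of the same scene.
n, d_in, d_emb = 256, 16, 4
X = rng.normal(size=(n, d_in))
Y = (X @ rng.normal(size=(d_in, d_in))) * 0.5 + 0.1 * rng.normal(size=(n, d_in))

# Frozen random linear encoders (assumption made for brevity).
W_ctx = rng.normal(size=(d_in, d_emb)) / np.sqrt(d_in)
W_tgt = rng.normal(size=(d_in, d_emb)) / np.sqrt(d_in)

s_x = X @ W_ctx  # embedding(X)
s_y = Y @ W_tgt  # embedding(Y), treated as a fixed target

# Linear predictor mapping embedding(X) -> embedding(Y).
P = np.zeros((d_emb, d_emb))


def embedding_loss(P):
    """Mean squared error measured in embedding space, never in input space."""
    return float(np.mean((s_x @ P - s_y) ** 2))


initial_loss = embedding_loss(P)
lr = 0.1
for _ in range(200):
    residual = s_x @ P - s_y
    grad = 2.0 * s_x.T @ residual / residual.size  # gradient of the loss w.r.t. P
    P -= lr * grad
final_loss = embedding_loss(P)

print(f"embedding-space loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Notice what never appears: a decoder back to the 16-dimensional input. The loss only ever compares 4-dimensional embeddings.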

This is why the architecture is not simply "non-generative" as a branding move. It is non-generative because generating the surface is often the wrong job. If I ask you what happens after a cup gets nudged off a table, you do not need to hallucinate every photon. You need the useful state transition: cup falls, gravity wins, floor gets involved.

The floor is not impressed by prose.

Fig. 02 · JEPA in one diagram. [Diagram: Joint Embedding Predictive Architecture. Context X (visible video patches) passes through an encoder to a state vector. Target Y (the missing future state) passes through a target encoder to a target vector. A predictor predicts embedding(Y), and the loss is computed in embedding space to match the meaningful future. No pixel reconstruction. No caption required. Learn the representation that predicts what matters.]
The target is not "draw the missing patch." The target is "predict the representation of the missing patch." That one sentence is most of the philosophical shift.

The Old Problem: Collapse

Joint embedding systems have a dumb failure mode. If the objective says "make the two embeddings similar," the model can cheat by returning the same embedding for everything.

Dog? Same vector.

Truck? Same vector.

Founder pretending the roadmap is under control? Same vector, unfortunately plausible.

This is representation collapse. The loss looks happy. The model learned nothing.
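Collapse takes about ten lines to reproduce. A minimal sketch, assuming a deliberately broken encoder and a plain "make the pair match" loss, both hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
view_a = rng.normal(size=(5, 8))                 # five different inputs
view_b = view_a + 0.1 * rng.normal(size=(5, 8))  # a second, perturbed view of each


def collapsed_encoder(x):
    """Ignores its input and returns the same vector for every example."""
    return np.ones((x.shape[0], 3))


def similarity_loss(z_a, z_b):
    """Only asks that paired embeddings match; nothing rewards diversity."""
    return float(np.mean((z_a - z_b) ** 2))


z_a, z_b = collapsed_encoder(view_a), collapsed_encoder(view_b)
loss = similarity_loss(z_a, z_b)
per_dim_std = z_a.std(axis=0)  # spread of each embedding dimension across inputs

print("loss:", loss)
print("per-dimension std across inputs:", per_dim_std)
```

The loss is exactly zero and every embedding dimension has zero spread across inputs. A perfect score, and no information.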

The history here matters. LeCun worked on Siamese networks in the 1990s. Later contrastive systems avoided collapse by using negative examples: pull matching pairs together, push non-matching pairs apart. That works, but at scale it can get computationally ugly.

Barlow Twins was one of the cleaner turns. Instead of relying on a mountain of negative samples, it tries to make the cross-correlation matrix between two distorted views look like the identity matrix. Same features should agree across views. Different features should not become redundant.
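The objective described above fits in one small function. This is a sketch of the cross-correlation idea in NumPy, not the authors' code; `lam` and `eps` are illustrative values chosen here, and the random "embeddings" stand in for encoder outputs.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-9):
    """Cross-correlation loss between two batches of embeddings, shape (batch, dim)."""
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / z_a.shape[0]  # (dim, dim) cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)            # same feature should agree across views
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # different features should not be redundant
    return float(on_diag + lam * off_diag)


rng = np.random.default_rng(0)
z = rng.normal(size=(512, 4))

# Two slightly different views of informative, roughly decorrelated embeddings.
good = barlow_twins_loss(z, z + 0.05 * rng.normal(size=z.shape))

# A collapsed encoder: every input maps to the same vector.
collapsed = barlow_twins_loss(np.ones((512, 4)), np.ones((512, 4)))

print("informative embeddings:", good)
print("collapsed embeddings:  ", collapsed)
```

Under this objective the constant-vector cheat stops being free: a collapsed batch has no correlation on the diagonal, so it pays the full on-diagonal penalty, with no negative samples involved.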

That sounds academic until you realize what it unlocked: a credible path for self-supervised vision models to learn useful representations without reconstructing pixels and without turning every batch into a contrastive cage fight.

2021 · Barlow Twins attacks collapse through redundancy reduction
2023 · I-JEPA predicts image target-block representations
2025 · V-JEPA 2 scales video world modeling past one million hours

The World Model Part

LeCun's 2022 position paper is not really about beating ChatGPT at chat. It is about autonomous machine intelligence. That means systems that perceive, predict, plan, and act across time.

A world model is the engine in the middle.

At time t, the agent has a state. It imagines an action. The world model predicts the next state. Then the agent searches over possible action sequences until it finds one that leads toward the goal.
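That loop can be sketched with the simplest planner available, random shooting: sample action sequences, roll each one out inside the model, keep the best. Everything here is an assumption for illustration. The `world_model` function is a hand-written stand-in for a learned latent transition model, the state is a 2-D point, and random shooting is a placeholder, not the planner used by any particular JEPA system.

```python
import numpy as np

rng = np.random.default_rng(0)


def world_model(state, action):
    """Hypothetical stand-in for a learned latent transition model.

    The "latent state" is a 2-D position and an action is a small step.
    """
    return state + 0.1 * action


def plan(state, goal, horizon=5, n_candidates=256):
    """Random-shooting planner: imagine many action sequences, keep the one
    whose predicted final state lands closest to the goal."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))
    best_cost, best_seq = np.inf, None
    for seq in candidates:
        s = state
        for a in seq:
            s = world_model(s, a)  # imagined step, no real-world interaction
        cost = float(np.sum((s - goal) ** 2))
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost


start = np.array([0.0, 0.0])
goal = np.array([0.3, -0.2])
start_cost = float(np.sum((start - goal) ** 2))
best_seq, best_cost = plan(start, goal)

print("cost of doing nothing:", start_cost)
print("cost of best imagined plan:", best_cost)
print("first action to execute:", best_seq[0])
```

A real agent would typically execute only the first action, observe the new state, and plan again. The point of the sketch is that every candidate is evaluated inside the model before anything touches the world.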

That sounds like classical control because it is classical control, wearing a neural network hoodie. The new part is that the state and transition model are learned in a latent representation instead of hand-written with equations.

For simple systems, equations win. If you are NASA and the physics is known, use the equations. Do not vibe-code orbital mechanics.

The interesting cases are messy systems: factories, bodies, chemical plants, engines, robots in kitchens. Systems where you can observe patterns but cannot write the whole dynamical model cleanly.

Fig. 03 · Action-conditioned planning. [Diagram: the world model as a search engine for actions. The current state (image, sensors, memory) is encoded into a latent state z. The predictor takes z plus an imagined action and returns a predicted state: what would happen? The planner searches action sequences and scores them against the goal. Inference becomes search, not one-step imitation.]
In the robot setting, the model predicts the next latent state conditioned on an action. Planning means testing hypothetical actions inside the learned model before touching the real world.

Why LeCun Keeps Picking on VLA

Vision-language-action models are the current impressive robot demo stack. They take camera frames, language instructions, and robot state, then output actions. The best demos are genuinely wild. Robots fold laundry, move objects, open locks, and look less like lab toys every quarter.

LeCun's critique is not that VLA demos are fake. It is that the learning signal is wrong for robust agency.

Most VLA systems lean heavily on behavioral cloning. They learn from demonstrations. That gets you surprisingly far, but it has a brutal scaling problem: the real world contains more variations than your dataset contains demos.

The second critique is sharper. VLA systems often act without an explicit learned model for predicting the consequences of candidate actions. They may reason internally. They may generalize. But the control loop is still too close to "see state, output action, repeat."

LeCun's jab is short: "VLA are doomed."

I would not say it that confidently. The VLA people are not exactly asleep. But the critique is directionally serious. If you want reliable agents, you eventually need something that can ask, before acting: if I do this, what happens next?

VLA-style behavioral cloning

  • Learns actions from demonstrations.
  • Can produce striking end-to-end robot behavior.
  • Generalization depends on the coverage and structure of the data.
  • Planning can be hard to inspect and control.

JEPA-style world model planning

  • Learns state transitions from observation and action-conditioned video.
  • Plans by searching over predicted latent futures.
  • Can solve tasks without imitating the exact human trajectory.
  • Still early, especially for long-horizon real-world control.

What Has Actually Worked So Far

The honest answer: enough to be interesting, not enough to declare victory.

I-JEPA showed that predicting image representations can learn semantic visual features without hand-crafted data augmentations. V-JEPA moved the idea into video. V-JEPA 2 scaled it hard: Meta reports pretraining on more than one million hours of internet video, strong motion understanding, state-of-the-art action anticipation, and zero-shot robot pick-and-place after post-training an action-conditioned model on less than 62 hours of unlabeled robot videos.

That last sentence sounds like a grant application learned to bench press. But the details matter.

It is not a general household robot. It is not replacing multimodal LLMs tomorrow. It is a research system showing that latent video prediction can support understanding, prediction, and planning in a physical environment.

VL-JEPA is another clue. Instead of making a vision-language model autoregressively generate answer tokens during training, it predicts continuous embeddings of target text. In controlled comparisons, the paper reports stronger performance with fewer trainable parameters. That does not kill token generation. It says token generation may be the wrong training target for some multimodal tasks.

LeWorldModel is the newer piece that feels closer to the core bet. It trains a compact JEPA world model end-to-end from raw pixels with only two loss terms, then uses the learned latent model for planning. The reported system is small, fast, and competitive on several control tasks. It also still lives in constrained environments. Both things can be true.

Where the evidence points

JEPA is strongest as a representation and planning framework for physical and multimodal problems. LLMs are strongest as language interfaces and reasoning workhorses. The near-term future is probably not JEPA replacing LLMs. It is JEPA giving agents a better world-state layer while language models remain the interface.

The Hard Part Is Time

Short-horizon prediction is one thing. Long-horizon action is another.

If a robot needs to nudge a block, it can predict a few steps ahead. If it needs to clean a kitchen for ten minutes, pixel-level detail becomes poison. The farther you look, the less detail you can afford.

LeCun's answer is hierarchy.

Low-level models predict short-term detail. Higher-level models predict longer-term abstractions. The high level says "get to the airport." The low level handles shoes, door, elevator, taxi, awkward human interaction, and the fact that the suitcase wheel has decided today is its last day on Earth.

This is probably the right shape. It is also where the research debt lives. Learning the right abstraction hierarchy from data is not a small detail. It is most of the problem.

A world model that cannot plan across time is just a prettier representation.

My Read

LeCun is right about the missing piece.

He may or may not be right that JEPA is the winning implementation of that missing piece.

Those are different claims. People collapse them because discourse is lazy and Twitter trained everyone to round nuance down to team sports. We can do better. Barely, but still.

The LLM path gave us language competence, tool use, coding, summarization, retrieval, and agent shells that can do useful work when the environment is forgiving. That is not nothing. It is one of the largest software shifts in history.

But agents that operate in the world need more than language competence. They need predictive state. They need persistent memory. They need uncertainty. They need planning. They need to simulate candidate actions before executing them.

In other words, they need something like a world model.

The interesting question over the next few years is not whether JEPA can win a benchmark. It is whether it can become a boring production primitive. Can it scale to messy long-horizon tasks? Can it integrate with language models without turning into another opaque soup? Can it make agents safer by letting them search futures before acting in reality?

That is the bar.

Not a cool robot clip.

Not a paper title with "world model" in it.

A system that can act because it can predict consequences, and can explain enough of that prediction that a human operator has a fighting chance of trusting it.

That is a harder path than scaling next-token prediction.

It is also probably closer to intelligence.