Through Our Eyes

A world model is a mind’s best guess at what happens next. For a large language model, that guess lives in text. For a humanoid robot, it has to live in the physical world: in contact forces, depth, occlusion, friction, gravity. The frontier in AI is no longer language. It is physics. And the systems that will act in the physical world, the humanoids walking onto factory floors, the domestic robots about to enter our homes, the world models that drive them, are starving for the one thing that cannot be scraped: footage of the real world, seen from the inside.

This is a manifesto about that missing piece. It is also a statement of what World Factory is willing to build to supply it.

The Physics of the World

Jensen Huang keeps repeating the same line on every stage he takes:

“The ChatGPT moment for general robotics is just around the corner.” (NVIDIA)

He is right about the moment, but what he is really describing is a famine. Large language models had the internet. Physical AI does not. There is no Common Crawl for the way a hand closes around a doorknob, no Wikipedia for what it feels like to carry a toddler up a flight of stairs, no Stack Overflow for the thousands of micro-corrections a body makes when balancing a cup of coffee on a moving tray.

Humanoid and world-model research has, in a few short years, compressed the entire manipulation stack into a single question: how do you teach a machine physics without making it live a life first? Physics, here, is not a page of Newtonian equations. It is everything the body already knows before language arrives — the give of fabric, the weight of groceries, the exact distance between your fingers and a stove.

First-person. Where a humanoid will one day sit.

The Data Bottleneck

Every serious lab building general-purpose robots now agrees on one thing: hardware has caught up, compute is plentiful, architectures are converging. The bottleneck is experience.

Brett Adcock, the CEO of Figure, has been saying this for two years without varying the wording much:

“We don’t have a GPU problem. We have a data problem and a robot problem.”

Or, more bluntly: “We believe that data can solve almost all current problems.” (36Kr) The lines are almost boring now in how often they are echoed across the field. Imitation learning from teleoperation does not scale: a skilled operator spends a day of work to produce minutes of usable training data. Simulation closes part of the gap, but breaks the moment friction, deformation, or real human context enters the scene. Internet video is plentiful but overwhelmingly exocentric, shot from the spectator’s seat, and models trained on it struggle to reach into a world they have only ever watched from across the room.

Why Egocentric

The missing modality is first-person. Egocentric footage, captured from where a hand or a head actually is, is the closest analog to the input stream a humanoid will one day receive. It carries gaze, reach, and intent. It captures the moment of contact: the instant a finger meets a fabric fold, a palm wraps a mug, a foot lands on a wet tile. It records, in place, the thousand tiny corrections a human body makes without noticing.

In February 2026, NVIDIA’s GEAR lab made the egocentric bet explicit. Their paper, EgoScale (arxiv:2602.16710), trained a vision–language–action model on 20,854 hours of action-labeled egocentric human video, more than twenty times any prior effort, and uncovered a log-linear scaling law between human video scale and validation loss that tracks directly with downstream real-robot performance. The paper’s conclusion is blunt: dexterous manipulation transfer is, fundamentally, a scaling phenomenon. Add egocentric data; the robot gets better. Add more; it gets better still. The final policy improved average success rate by 54% over a no-pretraining baseline on a 22-DoF dexterous hand, and transferred cleanly to lower-DoF robots. First-person human data, in other words, behaves as a reusable, embodiment-agnostic motor prior.
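
To make "log-linear" concrete: under such a law, validation loss falls by a roughly constant amount every time the hours of video are multiplied by a constant factor. The sketch below fits loss = a + b · log10(hours) by ordinary least squares. The data points and the function names (`fit_log_linear`, `predict_loss`) are hypothetical, chosen only to illustrate the shape of the relationship; they are not EgoScale's measurements or code.

```python
import math

def fit_log_linear(hours, losses):
    """Ordinary least-squares fit of: loss = a + b * log10(hours)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx              # change in loss per tenfold increase in data
    a = mean_y - b * mean_x    # extrapolated loss at 1 hour of data
    return a, b

def predict_loss(a, b, hours):
    """Loss the fitted line predicts for a given number of hours."""
    return a + b * math.log10(hours)

# Hypothetical points (NOT from the paper): loss drops by 0.1 per 10x data.
hours = [100, 1_000, 10_000]
losses = [0.90, 0.80, 0.70]

a, b = fit_log_linear(hours, losses)
extrapolated = predict_loss(a, b, 100_000)  # one more order of magnitude
```

A negative slope `b` is the whole argument in one number: as long as the line holds, another order of magnitude of first-person video buys another fixed step down in loss.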

Physical Intelligence arrived at the same conviction from a different angle. Their latest model, π0.7 (Physical Intelligence), shows the first signs of compositional generalization: a robot that has never seen laundry folding can fold laundry; a robot that has never met a particular kitchen appliance can use one anyway. The company describes it as an early but meaningful step toward a general-purpose robot brain, one that can be pointed at an unfamiliar task and coached through it in plain language. Sergey Levine, one of π’s co-founders, describes what comes next as a “self-improvement flywheel”:

“Getting real-world data and then deploying robots that are going to collect more experience and get better and better is a lot easier than inventing some other technology just to avoid having to do that.” (Dwarkesh)

Read those two research programs side by side and the same sentence emerges: the shortest path to a capable humanoid is not cleverer architectures, but more first-person video of the world.

EgoScale’s log-linear scaling law · data → model quality
The curve is a conceptual visualization of the log-linear relationship established in EgoScale (NVIDIA GEAR, 2026). The anchor point — 20,854 h → +54% real-robot success uplift — is the paper’s reported figure.
NEO, the humanoid robot by 1X. (1X, via YouTube)

The Scale That Matters

The same conviction runs through every lab that sets the pace today.

  • NVIDIA Isaac GR00T N1.7 (NVIDIA): an open reasoning vision–language–action model for humanoids, rebuilt on a new Cosmos-Reason2-2B backbone and now being deployed by Humanoid, LG Electronics, NEURA, and Noble Machines.
  • NVIDIA Cosmos (NVIDIA): a family of world foundation models that Huang calls a breakthrough for physical AI, pitched as the infrastructure for a world where, in his words, everything that moves will be robotic and embodied by AI.
  • Physical Intelligence π0 and π0.7 (Physical Intelligence): generalist robot policies trained across eight distinct embodiments and partially open-sourced, each new release pushing harder into emergent, untaught behavior.
  • Meta Ego-Exo4D (arxiv:2311.18259): 1,422 hours of skilled activity captured across 131 scenes in 13 cities, simultaneously from egocentric and exocentric viewpoints, with multichannel audio, eye gaze, 3D point clouds, and IMU.
  • Meta V-JEPA 2 (arxiv:2506.09985): a self-supervised video world model trained on more than one million hours of video, which develops an intuitive grasp of physics (close to 98% on the IntPhys plausibility benchmark) and can be adapted for robot planning with fewer than 62 hours of unlabeled robot video.
  • EgoDex (arxiv:2505.11709), EgoVLA (project page), and a wave of adjacent work, all of it converging on the same thesis: pretraining on first-person human video is the densest, cheapest supervision signal available to any humanoid or world model trying to operate in the real world.

Taken together, the last eighteen months of frontier research read like a single extended sentence: more egocentric video, please.

Physical AI · 30 real dataset & model releases, 2022–2026
By category: Dataset · 11, World model · 7, Robot policy · 10, Foundation model · 2.

World Models Need a World to Model

Bernt Børnich, the CEO of 1X and the person shipping one of the most widely deployed humanoids in the world, puts the idea in plain terms:

“After years of developing our World Model and making NEO’s design as close to human as possible, NEO can now learn from internet-scale video and apply that knowledge directly to the physical world.” (The Robot Report)

The recipe is now out in the open. Scale a world model on video. Give it a humanoid-shaped body so the knowledge transfers. Let the embodied experience of deployed fleets feed the next generation of models. Repeat. Capital agrees: more than $2B flowed into dedicated world-model labs in the first quarter of 2026 alone, led by Fei-Fei Li’s World Labs in February and Yann LeCun’s JEPA-native AMI Labs in March; the latter’s $1.03B seed at a $4.5B valuation is the largest seed round in European history (TechCrunch). A widening cohort of younger world-model startups sits behind them. The loop has a missing ingredient, though, and it is the first one: video that looks like what a body actually sees. Most footage on the internet was filmed to entertain another human, not to teach a machine how to move. The supply of content produced from the first-person viewpoint, across diverse environments, cultures, and tasks, is vanishingly small compared to what these systems can now absorb.

A world model needs a world to model. Today we are feeding it a postcard of one.

A Planet, First-Person

Every existing egocentric dataset leaves a hole. Ego4D (arxiv:2110.07058) assembled 3,670 hours; Ego-Exo4D (arxiv:2311.18259), 1,422; EgoDex (arxiv:2505.11709), 829; EgoScale (arxiv:2602.16710), 20,854. Impressive, and yet, taken together, a rounding error on the world’s actual geography. They skew toward a handful of labs, a handful of countries, and a narrow catalog of pre-scripted tasks. They do not contain the room you are sitting in right now. They do not contain the weather in Lagos, the commute in Jakarta, the kitchen of a grandmother in Naples. They do not contain your hands.
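
The arithmetic behind "taken together" is short enough to write down. This sketch simply adds the four hour counts quoted above and converts the total into years of continuous footage; it assumes the datasets do not overlap, which the figures above do not establish.

```python
# Hour counts as quoted in the text above (overlap between datasets ignored).
dataset_hours = {
    "Ego4D": 3_670,
    "Ego-Exo4D": 1_422,
    "EgoDex": 829,
    "EgoScale": 20_854,
}

total_hours = sum(dataset_hours.values())

# Equivalent length of a single camera recording around the clock.
HOURS_PER_YEAR = 24 * 365
years_continuous = total_hours / HOURS_PER_YEAR

print(f"{total_hours:,} hours is about {years_continuous:.1f} years of continuous video")
```

Four landmark datasets, in other words, add up to roughly what one camera would record if it ran around the clock for about three years.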

Scale alone does not close the hole. A dataset can grow by an order of magnitude without widening the world it was filmed in. Build AI’s recent release (huggingface.co/builddotai) — roughly a million hours of head-mounted factory-floor video from workers across Southeast Asia — is the clearest current illustration: the largest egocentric dataset ever published, collected almost entirely on assembly lines. Useful, at that scale, for the manipulation primitives a humanoid arm will need to know. Silent on almost everything else it will encounter the moment it leaves the factory. The lesson is not that scale is wrong. It is that hours are a unit, and diversity is the other one. A foundation model meant to act in the physical world has to be shown the physical world — every room of it, every task in it, every culture inside it — not one slice, filmed a million times.
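
One way to put a number on that second unit, offered as an illustrative sketch and not as World Factory's actual metric: treat a dataset as a distribution of hours over environment types and take the exponential of its Shannon entropy, which gives an "effective number of environments". The function name `effective_environments`, the environment labels, and the hour counts below are all hypothetical.

```python
import math

def effective_environments(hours_by_env):
    """exp(Shannon entropy) of the share of hours per environment.

    Returns 1.0 when every hour comes from one environment and N when
    hours are spread evenly across N environments.
    """
    total = sum(hours_by_env.values())
    entropy = 0.0
    for h in hours_by_env.values():
        if h > 0:
            p = h / total
            entropy -= p * math.log(p)
    return math.exp(entropy)

# Hypothetical datasets: one huge and narrow, one small and broad.
factory_only = {"assembly_line": 1_000_000}
small_but_broad = {
    "kitchen": 2_500,
    "street": 2_500,
    "workshop": 2_500,
    "market": 2_500,
}

narrow = effective_environments(factory_only)     # about 1 effective environment
broad = effective_environments(small_but_broad)   # about 4 effective environments
```

In this toy example the second dataset has a hundredth of the hours and four times the effective coverage, which is the sense in which hours and diversity are separate axes.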

Public egocentric datasets · hours of video
Sources: Meta AI, Apple Research, NVIDIA GEAR, Build AI.

World Factory is a bet that the next step change in physical AI will not come from any single lab scaling a single collection site. It will come from crowdsourcing the world. Our aim is simple, and it is enormous:

Scan the planet — every environment, every task, every culture, every light condition — and convert it into training data for the models that will act in it.

We start from the first-person view because that is the vantage a humanoid will inherit. But we are deliberately source-agnostic on the capture device: a phone in a coat pocket, an action cam on a harness, a dashcam on a taxi in São Paulo, a home camera in Seoul, a smart-glasses stream from a kitchen in Marseille. The diversity is the product. A foundation model trained on World Factory should have seen more kitchens, more staircases, more sidewalks, more handshakes than any other foundation model on earth. And it should have seen them from the inside.

Our Commitment

A manifesto is cheap. The work is not. Here is what we owe, and what we are building toward.

  • Contributors get paid. The people who capture the world are not free labor. Our payouts reward diversity, rarity, and quality of capture, not volume for its own sake.
  • Privacy is a design constraint. Every pipeline we build assumes the end dataset must be free of identifiable personal data. Our approach is documented in full in our Privacy Policy, with on-device anonymization on the roadmap wherever the originating hardware permits.
  • Research-grade data, not content. Every clip is curated, versioned, and quality-controlled. We ship datasets, not feeds.
  • Open, where we can be. A portion of our anonymized data is periodically released to the public research community to support reproducible science and benchmarking, in the same spirit as Ego4D and Ego-Exo4D, scaled beyond any single institution.

The CEOs of the companies building tomorrow’s robots agree on one thing: data is the bottleneck. The labs building tomorrow’s world models agree on another: first-person video is the densest supervision signal available to them. World Factory exists to make that data abundant, ethically sourced, and planetary in scope.

The ChatGPT moment for physical AI is close. We would like to help bring it forward.