News
Meta Unveils V-JEPA 2 to Advance AI Understanding of the Physical World
- By John K. Waters
- 06/17/2025
Meta has introduced V-JEPA 2, a self-supervised video model designed to improve machine understanding of the physical world and enable robotic control without requiring task-specific training data. The model achieves state-of-the-art results on multiple visual understanding and prediction tasks and has been released alongside new benchmarks intended to evaluate physical reasoning in AI systems.
V-JEPA 2, which stands for Video Joint Embedding Predictive Architecture 2, builds on previous efforts from Meta to develop general-purpose models trained on video data. The new model uses more than one million hours of video and limited robot interaction data—just 62 hours from the open-source DROID dataset—to learn how to act in new environments and manipulate unfamiliar objects.
Training and Capabilities
V-JEPA 2 is trained in two stages. The first is an action-free pretraining stage where the model uses a mask-denoising objective across a large, curated video dataset. The second stage introduces action-conditioned learning through unlabeled robot videos, allowing the model to predict future states based on actions.
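Meta's training procedure is described only at this high level in the announcement. As a rough illustration of the action-free stage, a masked-prediction step in representation space, with a learned context encoder, an exponential-moving-average (EMA) target encoder, and a predictor, might look like the sketch below. The module sizes, mask ratio, and EMA rate are toy assumptions for illustration, not values from V-JEPA 2.

```python
# Illustrative sketch of JEPA-style, action-free pretraining: predict the
# representations of masked video patches from the visible ones.
# All sizes and hyperparameters are toy assumptions, not Meta's.
import torch
import torch.nn as nn

DIM, N_PATCH, MASK_RATIO = 256, 128, 0.75             # toy sizes

def small_transformer():
    layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

pos_emb = nn.Parameter(torch.zeros(1, N_PATCH, DIM))  # shared positional embeddings
mask_token = nn.Parameter(torch.zeros(1, 1, DIM))     # learned placeholder for masked patches
encoder, predictor = small_transformer(), small_transformer()
target_encoder = small_transformer()                  # EMA copy of the context encoder
target_encoder.load_state_dict(encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)

params = [pos_emb, mask_token, *encoder.parameters(), *predictor.parameters()]
opt = torch.optim.AdamW(params, lr=1e-4)

def train_step(patch_tokens):                         # (B, N_PATCH, DIM) video patch embeddings
    B, n_mask = patch_tokens.shape[0], int(MASK_RATIO * N_PATCH)
    perm = torch.randperm(N_PATCH)
    masked, visible = perm[:n_mask], perm[n_mask:]

    ctx = encoder(patch_tokens[:, visible] + pos_emb[:, visible])   # encode visible patches only
    with torch.no_grad():                                           # targets come from the full clip
        tgt = target_encoder(patch_tokens + pos_emb)[:, masked]

    queries = mask_token.expand(B, n_mask, DIM) + pos_emb[:, masked]
    pred = predictor(torch.cat([ctx, queries], dim=1))[:, -n_mask:] # predict masked representations
    loss = nn.functional.l1_loss(pred, tgt)

    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                           # slow EMA update of the target encoder
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.lerp_(p, 0.004)
    return loss.item()

print(train_step(torch.randn(2, N_PATCH, DIM)))       # dummy batch of pre-embedded patches
```

The detail that matters here is that the loss is computed on predicted feature vectors rather than reconstructed pixels, which is what distinguishes the JEPA family from pixel-reconstruction objectives.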
The model achieves 77.3 percent top-1 accuracy on the Something-Something v2 benchmark for motion understanding, and 39.7 recall-at-5 on the Epic-Kitchens-100 dataset for human action anticipation. In video question answering, V-JEPA 2 reaches 84.0 on the PerceptionTest benchmark and 76.9 on TempCompass.
Performance gains were attributed to scaling efforts in both data and model architecture. Increasing the dataset from 2 million to 22 million videos, scaling the model size to over 1 billion parameters, and increasing training iterations and resolution each contributed measurable accuracy improvements.
Applications in Robot Control
The model is used in planning tasks through a predictor component that imagines the outcome of potential actions. In short-horizon tasks such as picking or placing objects, V-JEPA 2 plans actions based on image goals. In long-horizon tasks, it sequences sub-goals, enabling multi-step operations without additional training.
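The release describes this planning loop only in prose. The sketch below illustrates the general pattern with placeholder `encode` and `predict_next` functions and a simple random-shooting search over sampled action sequences; the function names, dimensions, and search strategy are illustrative assumptions rather than details of V-JEPA 2-AC.

```python
# Illustrative sketch of goal-image planning with an action-conditioned predictor:
# imagine the outcome of candidate action sequences in latent space and pick the
# one whose predicted final state lands closest to the encoded goal image.
import torch

STATE_DIM, ACTION_DIM, HORIZON, N_CANDIDATES = 64, 7, 5, 256   # toy sizes

W = torch.randn(STATE_DIM, STATE_DIM + ACTION_DIM) * 0.1       # stand-in predictor weights

def encode(image):
    # Placeholder for the video/image encoder: image -> latent state vector.
    return torch.tanh(image.flatten()[:STATE_DIM])

def predict_next(state, action):
    # Placeholder for the action-conditioned predictor: imagine the next latent state.
    return torch.tanh(torch.cat([state, action], dim=-1) @ W.T)

def plan(current_image, goal_image):
    state0, goal = encode(current_image), encode(goal_image)
    candidates = torch.randn(N_CANDIDATES, HORIZON, ACTION_DIM) * 0.1  # sampled action sequences

    states = state0.expand(N_CANDIDATES, STATE_DIM)
    for t in range(HORIZON):                       # roll every candidate forward in latent space
        states = predict_next(states, candidates[:, t])

    scores = torch.linalg.norm(states - goal, dim=-1)   # distance of imagined outcome to goal
    best = candidates[scores.argmin()]
    return best[0]                                 # execute the first action, then replan

action = plan(torch.randn(3, 64, 64), torch.randn(3, 64, 64))
print(action)
```

In practice a loop like this runs in receding-horizon fashion: only the first planned action is executed, the scene is re-observed, and planning repeats toward the current goal or sub-goal.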
In evaluations, V-JEPA 2-AC, the action-conditioned version of the model, demonstrated high success rates across reaching, grasping, and pick-and-place tasks. Its performance compared favorably against baseline models such as Octo and Cosmos.
Model Availability and Benchmarks
The model and associated resources, including code and pretrained checkpoints, are publicly available.
In addition, Meta has released three new benchmarks to assess physical reasoning: IntPhys 2, which tests whether models can distinguish physically plausible from implausible scenes; MVPBench (Minimal Video Pairs), which probes physical understanding with minimally different video pairs to guard against shortcut answers; and CausalVQA, which evaluates causal video question answering about what could happen next or under counterfactual changes.
Limitations and Future Directions
Meta researchers identified limitations including sensitivity to camera position and the challenge of long-horizon planning. Because the model infers its coordinate frame from the camera view, its predictions shift with camera placement, an effect that can be mitigated through post-processing transformations.
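Meta does not specify the post-processing step. One common approach, shown below as a hypothetical example, is to map camera-frame coordinates into the robot base frame with a fixed rigid transform obtained from calibration; the specific rotation and offset here are made up for illustration.

```python
# Hypothetical example (not from Meta's release): correcting for camera
# placement by mapping camera-frame points into the robot base frame
# with a fixed rigid transform from calibration.
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

# Made-up calibration: camera rotated 90 degrees about z, offset 0.5 m along x.
R_z90 = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
camera_to_base = make_transform(R_z90, np.array([0.5, 0.0, 0.0]))

def to_base_frame(points_camera):
    """Map (N, 3) camera-frame points into the robot base frame."""
    homogeneous = np.hstack([points_camera, np.ones((len(points_camera), 1))])
    return (camera_to_base @ homogeneous.T).T[:, :3]

print(to_base_frame(np.array([[0.1, 0.2, 0.3]])))
```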
Future work will explore hierarchical models that plan across multiple time scales and multimodal JEPA variants capable of integrating vision, audio, and tactile data. The research is part of Meta's broader initiative to develop AI systems with human-like world models and planning capabilities.
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].