News
Hugging Face Releases SmolVLA, a Compact Open-Source Robotics Model
- By John K. Waters
- 06/10/2025
Hugging Face has introduced SmolVLA, a lightweight open-source Vision-Language-Action (VLA) model for robotics that operates on consumer-grade hardware and is trained entirely on community-contributed data. At 450 million parameters, SmolVLA aims to offer efficient, reproducible performance for robotic tasks without reliance on proprietary datasets or expensive infrastructure.
Generalist Performance, Minimal Resources
SmolVLA-450M is designed to perform general-purpose manipulation tasks from visual and language cues. The model pairs a compact vision-language backbone with a flow-matching transformer that predicts chunks of robot actions. Despite its small size and modest training data (fewer than 30,000 episodes), SmolVLA matches or outperforms models such as ACT on both simulated (LIBERO, Meta-World) and real-world (SO100, SO101) benchmarks.
SmolVLA’s architecture includes a modified SmolVLM2 encoder-decoder stack, reduced visual token use, and selectively truncated layers during inference. These optimizations cut response time by up to 30 percent and double task throughput in real-world settings.
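To make the flow-matching action expert concrete, here is a minimal sketch of the general technique: a network learns a velocity field, and sampling integrates that field from Gaussian noise to a chunk of actions conditioned on backbone features. All module names, dimensions, and the Euler schedule below are illustrative assumptions, not SmolVLA's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes; SmolVLA's real dimensions differ.
ACTION_DIM, CHUNK_LEN, COND_DIM, STEPS = 6, 50, 512, 10

class ActionExpert(nn.Module):
    """Toy velocity-field network: (noisy action chunk, conditioning, time) -> velocity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK_LEN * ACTION_DIM + COND_DIM + 1, 1024),
            nn.GELU(),
            nn.Linear(1024, CHUNK_LEN * ACTION_DIM),
        )

    def forward(self, noisy_chunk, cond, t):
        flat = noisy_chunk.flatten(1)                      # (B, CHUNK_LEN * ACTION_DIM)
        x = torch.cat([flat, cond, t], dim=1)
        return self.net(x).view(-1, CHUNK_LEN, ACTION_DIM)

@torch.no_grad()
def sample_action_chunk(expert, cond):
    """Euler-integrate the learned velocity field from noise to an action chunk."""
    x = torch.randn(cond.shape[0], CHUNK_LEN, ACTION_DIM)  # start from Gaussian noise
    for i in range(STEPS):
        t = torch.full((cond.shape[0], 1), i / STEPS)
        x = x + expert(x, cond, t) / STEPS                 # x <- x + v(x, t) * dt
    return x

cond = torch.randn(1, COND_DIM)   # stand-in for features from the truncated VLM
chunk = sample_action_chunk(ActionExpert(), cond)
print(chunk.shape)                # torch.Size([1, 50, 6])
```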
Asynchronous Inference and Efficient Control
The model's asynchronous inference capability is key to its performance. Unlike synchronous modes that pause between predictions, SmolVLA pipelines execution and inference, allowing robots to request the next action chunk while performing the current one. This results in greater responsiveness and smoother task execution.
Inference can also be offloaded to a remote policy server, enabling real-time deployment even on low-cost consumer devices. Benchmarks show that asynchronous inference allows SmolVLA-equipped robots to complete 2× more tasks within fixed time constraints compared to synchronous setups.
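In rough terms, the pipelining works by requesting the next chunk before the current one finishes executing. The sketch below illustrates the idea with a background thread standing in for the (possibly remote) policy server; request_next_chunk and the timing constants are hypothetical, not SmolVLA's actual API.

```python
import queue
import threading
import time

CHUNK_LEN = 50

observations = queue.Queue(maxsize=1)   # latest observation sent to the policy
chunks = queue.Queue(maxsize=1)         # predicted action chunks coming back

def request_next_chunk(observation):
    """Stand-in for a call to the policy server (hypothetical API)."""
    time.sleep(0.3)                     # simulated inference + network latency
    return [f"action_{i}" for i in range(CHUNK_LEN)]

def inference_worker():
    while True:
        chunks.put(request_next_chunk(observations.get()))

threading.Thread(target=inference_worker, daemon=True).start()

observations.put("initial_observation")
current = chunks.get()                       # only the first chunk requires a full wait
for step in range(3):
    observations.put(f"observation_{step}")  # ask for the next chunk up front...
    for action in current:
        time.sleep(0.01)                     # ...while executing the current one
    current = chunks.get()                   # next chunk should already be waiting
print("finished without pausing between chunks")
```

A synchronous loop would instead block for the full 0.3 s of inference between every chunk; here that latency is hidden behind the 0.5 s it takes to execute the current chunk.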
Community Data, Real-World Variability
SmolVLA is pretrained on 10 million frames curated from 487 community datasets tagged under “lerobot” on Hugging Face. These datasets span a variety of environments — from labs to living rooms — and were selected for diversity over size. Unlike benchmark datasets, these include noisy labels, inconsistent camera views, and suboptimal demonstrations, mimicking real-world complexity.
To standardize this data, the team remapped inconsistent camera views to a common naming scheme and used Qwen2.5-VL-3B-Instruct to automatically refine task instructions, rewriting labels to be short and clear for more consistent training.
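A minimal sketch of that kind of rewriting pass is below, assuming a text-only prompt per label; the team's exact prompt, and whether they also conditioned on video frames, are not specified here.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def refine_instruction(raw_label: str) -> str:
    """Ask the VLM to rewrite a noisy task label as one short, clear instruction."""
    messages = [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": f"Rewrite this robot task label as one short, clear instruction: {raw_label}",
        }],
    }]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    # Decode only the newly generated tokens.
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0].strip()

print(refine_instruction("task: pick red block tabletop cam2 retry"))
```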
Pretraining on this dataset raised SmolVLA’s success rate on the SO100 task suite from 51.7% to 78.3%. Further multitask finetuning improved generalization on unseen object configurations and control setups.
Training and Deployment
SmolVLA is released with a complete training and deployment stack. Users can fine-tune the model with the LeRobot framework or build on its architecture-level components. The model runs on a single consumer GPU and even on CPU, including on MacBooks.
Training from scratch or fine-tuning from the base checkpoint is supported using simple commands from the lerobot repository.
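As a rough illustration of the deployment path, the snippet below loads the released base checkpoint through LeRobot's Python API. The import path and observation keys are assumptions that may differ across lerobot versions, and the exact training flags are documented in the lerobot repository.

```python
# A minimal sketch, assuming lerobot's SmolVLA policy API (paths/keys may vary by version).
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# Hypothetical observation batch: one camera frame, robot state, and a language goal.
observation = {
    "observation.images.top": torch.rand(1, 3, 256, 256),
    "observation.state": torch.zeros(1, 6),
    "task": ["pick up the red cube"],
}
with torch.no_grad():
    action = policy.select_action(observation)
print(action.shape)

# Fine-tuning is launched from the repository's training script, e.g. (flags approximate;
# see the lerobot docs):
#   python lerobot/scripts/train.py --policy.path=lerobot/smolvla_base \
#       --dataset.repo_id=<your_dataset>
```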
A Step Toward Open Robotics
With SmolVLA, Hugging Face continues its push for open and reproducible AI tools. By releasing a performant robotics model built entirely on decentralized data and low-cost hardware, the company hopes to lower the barrier to generalist robotics research.
SmolVLA and its datasets are available on GitHub and the Hugging Face Hub. More technical details are included in the accompanying report and documentation.
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].