
Microsoft's VinVL "Sees" Richer Collection of Objects

Researchers at Microsoft recently unveiled a new object-attribute detection model for image encoding that they claim can generate representations of a richer collection of visual objects and concepts than the most widely used bottom-up and top-down models.

Dubbed VinVL (Visual features in Vision-Language), the new model is "bigger, better-designed for [vision language] tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets," the researchers explain in a recently published paper ("VinVL: Making Visual Representations Matter"). In that paper, the researchers explain that training on these larger corpora is what enables VinVL to generate representations of a richer collection of visual objects and concepts.

"Humans understand the world by perceiving and fusing information from multiple channels, such as images viewed by the eyes, voices heard by the ears, and other forms of sensory input," researchers Pengchuan Zhang, Lei Zhang, and Jianfeng Gao, explained in a blog post. "One of the core aspirations in AI is to develop algorithms that endow computers with a similar ability: to effectively learn from multimodal data like vision-language to make sense of the world around us."

Vision-language (VL) systems, the researchers point out, allow searching for images relevant to a text query (or vice versa) and describing the content of an image using natural language. The current crop of VL research focuses primarily on improving the VL fusion model, they said, "and leaves the object detection model improvement untouched."

In their experiments, the researchers fed the visual features generated by the new object detection model into a transformer-based VL fusion model called OSCAR (Object-Semantics Aligned Pre-training), which they unveiled in May. "Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks," they wrote.
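For readers who want a concrete picture of what "feeding visual features into a transformer-based fusion model" means in practice, the rough sketch below shows the general idea: region features produced by an object detector are projected into the same space as word embeddings and passed through a single transformer encoder so that words and image regions can attend to one another. This is only an illustrative approximation, not Microsoft's OSCAR or VinVL code; the class name, dimensions, and region count here are assumptions chosen for the example.

```python
# Illustrative sketch only -- not Microsoft's actual OSCAR/VinVL implementation.
# Detector region features and text token embeddings are fused in one
# transformer encoder so self-attention runs across both modalities.
import torch
import torch.nn as nn

class SimpleVLFusion(nn.Module):
    def __init__(self, vocab_size=30522, region_dim=2048, hidden=768,
                 layers=4, heads=12):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)   # word tokens -> vectors
        self.region_proj = nn.Linear(region_dim, hidden)     # detector features -> same width
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, token_ids, region_features):
        text = self.text_embed(token_ids)             # (batch, num_tokens, hidden)
        regions = self.region_proj(region_features)   # (batch, num_regions, hidden)
        fused = torch.cat([text, regions], dim=1)     # one sequence of words + regions
        return self.encoder(fused)                    # contextualized word/region vectors

# Example: a 12-token caption and 36 detected regions per image
model = SimpleVLFusion()
tokens = torch.randint(0, 30522, (2, 12))
regions = torch.randn(2, 36, 2048)
out = model(tokens, regions)   # shape (2, 48, 768)
```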

A transformer is a deep learning model introduced by Google in 2017. It's based on a self-attention mechanism that directly models relationships among all words in a sentence, regardless of their respective positions, rather than processing them one by one in order. This capability made transformers much faster than recurrent neural networks (RNNs), the leading approach to natural language processing (NLP) at the time. Google open-sourced BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language model, in 2018, and in 2019 began using it in Search to better understand the context of words in queries.
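To make the self-attention idea more concrete, here is a minimal sketch of scaled dot-product attention, the computation at the heart of a transformer: every position's query is compared against every other position's key, and the resulting weights mix the value vectors regardless of word order. The function name, dimensions, and random inputs are illustrative assumptions, not code from any of the systems discussed in this article.

```python
# Minimal, illustrative scaled dot-product self-attention in NumPy.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                # weighted mix of the whole sequence

rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((5, d))                       # a "sentence" of 5 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (5, 8)
```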

The researchers' experiments yielded some impressive stats: "Our object-attribute detection model can detect 1,594 object classes and 524 visual attributes," they wrote. "As a result, the model can detect and encode nearly all the semantically meaningful regions in an input image, according to our experiments."

The Microsoft researchers plan to release the new object-detection model to the public. No date was given for that release at press time.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at jwaters@converge360.com.
