Microsoft's VinVL "Sees" Richer Collection of Objects

Researchers at Microsoft recently unveiled a new object-attribute detection model for image encoding that they claim can generate representations of a richer collection of visual objects and concepts than the most widely used bottom-up and top-down models.

Dubbed VinVL (Visual features in Vision-Language), the new model is "bigger, better-designed for [vision language] tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets," the researchers explain in a recently published paper ("VinVL: Making Visual Representations Matter"). In that paper, the researchers explain that the new model is trained on larger collections of annotated data ("corpora"), which makes it possible for VinVL to generate representations of a richer collection of visual objects and concepts.

"Humans understand the world by perceiving and fusing information from multiple channels, such as images viewed by the eyes, voices heard by the ears, and other forms of sensory input," researchers Pengchuan Zhang, Lei Zhang, and Jianfeng Gao, explained in a blog post. "One of the core aspirations in AI is to develop algorithms that endow computers with a similar ability: to effectively learn from multimodal data like vision-language to make sense of the world around us."

Vision-language (VL) systems, the researchers point out, make it possible to search for images relevant to a text query (or vice versa) and to describe the content of an image in natural language. Current VL research focuses primarily on improving the VL fusion model, they said, "and leaves the object detection model improvement untouched."

In their experiments, the researchers fed the visual features generated by the new object detection model into a transformer-based VL fusion model called OSCAR (Object-Semantics Aligned Pre-training), which they unveiled in May. "Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks," they wrote.

A transformer is a deep learning architecture introduced by Google in 2017. It's based on a self-attention mechanism that directly models relationships among all words in a sentence, regardless of their respective positions, rather than processing them one by one in order. This capability made transformers much faster to train than recurrent neural networks (RNNs), the leading approach at the time to natural language processing (NLP). Google open-sourced BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language model, in 2018, and in 2019 began using it to better understand the context of words in search queries.
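
For readers who want to see the idea in code, here is a minimal sketch of self-attention in plain Python. It is an illustration only, not the VinVL or OSCAR implementation: the three two-dimensional word vectors are made up, and the learned query, key, and value projections that a real transformer applies are left out for brevity.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: real transformers first multiply each vector by learned
    query, key, and value matrices; that step is omitted here.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sentence,
        # regardless of where each one sits.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted mix of all word vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical embeddings for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sentence)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence, itself included. Because each word is scored against all the others independently, the rows can be computed in parallel, whereas an RNN has to step through the sentence in order.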

The researchers' experiments concluded with some impressive stats: "Our object-attribute detection model can detect 1,594 object classes and 524 visual attributes," they wrote. "As a result, the model can detect and encode nearly all the semantically meaningful regions in an input image, according to our experiments."

The Microsoft researchers plan to release the new object-detection model to the public. No date was given for that release at press time.

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.