OpenAI's New Neural Net Learns Visual Concepts from Natural Language Supervision

Artificial intelligence (AI) company OpenAI is introducing a neural network called CLIP (Contrastive Language–Image Pre-training), which it claims "efficiently learns visual concepts from natural language supervision." CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, the company says. That capability is similar to the "zero-shot" capabilities of the company's GPT-2 and GPT-3 neural-network-powered language models.

Zero-shot learning refers to a type of machine learning (ML) in which the model must classify data from categories for which it has seen no labeled examples (the related few-shot setting allows only a handful). In other words, the model classifies on the fly.
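To make the idea of classifying "by simply providing the names of the visual categories" concrete, here is a minimal sketch of the general technique: compare an image embedding with one text embedding per class name and pick the closest match. Everything in it is an assumption for illustration. The function names, the scaling factor, and the hand-made three-dimensional vectors are hypothetical stand-ins for the output of trained image and text encoders, and this is not OpenAI's code.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, class_names, scale=100.0):
    """Pick the class whose text embedding is most similar to the image embedding.

    `scale` is an illustrative sharpening factor, not a value from the article.
    """
    image = normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = normalize(text_embedding)
        logits.append(scale * sum(a * b for a, b in zip(image, text)))
    probs = softmax(logits)
    best = max(range(len(class_names)), key=lambda i: probs[i])
    return class_names[best], probs

# Hypothetical toy embeddings; a real system would get these from trained encoders.
class_names = ["dog", "cat", "car"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_embedding, text_embeddings, class_names)
print(label)
```

Note that adding a new category in this sketch only requires appending a name and its text embedding; no labeled images and no retraining are involved, which is the sense in which the classification is zero-shot.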

"Although deep learning has revolutionized computer vision, current approaches have several major problems," the company explained in a blog post: "typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts; standard vision models are good at one task and one task only, and require significant effort to adapt to a new task; and models that perform well on benchmarks have disappointingly poor performance on stress tests, casting doubt on the entire deep learning approach to computer vision."

OpenAI is seeking to address these problems with CLIP, which builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning dating back more than a decade. The concept is explained in the lengthy blog post and a 47-page white paper ("Learning Transferable Visual Models From Natural Language Supervision"). The paper's abstract reads, in part:

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn [state of the art] image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as [optical character recognition], action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
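The pre-training task described in the abstract, "predicting which caption goes with which image," is commonly expressed as a symmetric contrastive loss over a batch of (image, text) pairs. The sketch below shows that general idea under stated assumptions: the embeddings are hand-made toy vectors rather than encoder outputs, and the function names and the `scale` value are hypothetical. It is meant to illustrate the objective, not to reproduce OpenAI's training code.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embeddings, text_embeddings, scale=10.0):
    """Symmetric loss: pair i's image should match caption i, and vice versa.

    `scale` is an illustrative sharpening factor, not a value from the paper.
    """
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and caption j.
    logits = [[scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
              for img in images]
    # For each image, the correct caption sits on the diagonal (row-wise).
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # For each caption, the correct image sits on the diagonal (column-wise).
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

# Hypothetical toy batch: caption k is constructed to lie close to image k.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.2, 0.8]]

matched = contrastive_loss(images, captions)
mismatched = contrastive_loss(images, captions[::-1])  # captions swapped
print(matched < mismatched)
```

The loss is small when each image is most similar to its own caption and large when the captions are swapped, which is the signal that would push the two encoders toward a shared embedding space during training.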

OpenAI was originally founded as a non-profit open-source organization by a group of investors that included Tesla CEO Elon Musk. Today it comprises two entities: the non-profit OpenAI Inc. and the for-profit OpenAI LP. Microsoft, which is OpenAI's cloud services provider, has invested $1 billion in the company.

The company made a splash last year with the general availability release of GPT-3, which can write everything from articles and poems to working computer code and guitar tablature.

The latest blog post concludes with high hopes for the CLIP neural net: "With CLIP, we've tested whether task agnostic pre-training on internet scale natural language, which has powered a recent breakthrough in [natural language processing], can also be leveraged to improve the performance of deep learning for other fields. We are excited by the results we've seen so far applying this approach to computer vision. Like the GPT family, CLIP learns a wide variety of tasks during pre-training which we demonstrate via zero-shot transfer. We are also encouraged by our findings on ImageNet that suggest zero-shot evaluation is a more representative measure of a model's capability."

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.