At DevDay, OpenAI Makes AI Fine-Tuning a Multimodal Affair

Fresh off a funding round that boosted its valuation to $157 billion, OpenAI has introduced new capabilities for developers using its GPT models.

OpenAI announced the new perks at its invitation-only DevDay event this week.

Streamlined Model Distillation
With new API-based "model distillation," developers now have a single, unified platform where they can fine-tune OpenAI's small language models (SLMs) using data from its more powerful large language models (LLMs).

SLMs are cheaper and consume fewer resources than their larger, more powerful counterparts, making them ideal for highly specialized tasks and for running on edge devices. Examples of OpenAI SLMs are GPT-4o-mini and o1-mini, the smaller-scale versions of GPT-4o and o1-preview, respectively. Developers can improve an SLM's performance by training it on input-output pairs generated by an LLM -- a process called model distillation.

Typically, as OpenAI explained in this post, model distillation is a manual, error-prone and time-consuming process. It "required developers to manually orchestrate multiple operations across disconnected tools, from generating datasets to fine-tuning models and measuring performance improvements," the company said. "Since distillation is inherently iterative, developers needed to repeatedly run each step, adding significant effort and complexity."

A new model distillation suite now available from OpenAI aims to streamline this process. On a single platform, developers can collect input-output pairs, use them to fine-tune their SLM and evaluate the SLM's resulting performance. More information on the model distillation capability is available here.
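In practice, the workflow runs through OpenAI's standard developer tooling. Below is a minimal sketch, assuming the official Python SDK, its store flag for saving completions as reusable data, and the existing fine-tuning jobs endpoint; the file name, metadata tag and model snapshot names are illustrative, not taken from OpenAI's announcement.

```python
from openai import OpenAI

client = OpenAI()

# 1. Generate input-output pairs with the large model and store them
#    so they can later be exported as a training dataset.
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    store=True,
    metadata={"task": "ticket-summary"},  # illustrative tag for filtering stored pairs
)

# 2. After exporting the stored pairs to a JSONL file, fine-tune the small model on them.
training_file = client.files.create(
    file=open("distilled_pairs.jsonl", "rb"),  # illustrative file name
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",  # snapshot name may differ
    training_file=training_file.id,
)
print(job.status)
```

Because evaluation runs on the same platform, the loop of generating data, fine-tuning and measuring performance can be repeated without stitching together separate tools.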

Support for Image-Based Fine-Tuning
In more fine-tuning news, OpenAI is enabling developers to ground its flagship model, GPT-4o, on images. OpenAI already allows fine-tuning using text-based datasets, but the addition of image-based fine-tuning unlocks a new set of vision capabilities for applications.

"Developers can customize the model to have stronger image understanding capabilities which enables applications like enhanced visual search functionality, improved object detection for autonomous vehicles or smart cities, and more accurate medical image analysis," OpenAI said in this blog.

Developers can upload their vision-based datasets the same way they upload text-based datasets to OpenAI's platform for GPT-4o fine-tuning. According to OpenAI, a dataset of as few as 100 images can measurably improve GPT-4o's vision capabilities.
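To make the format concrete, here is a rough sketch of a single vision training example in the JSONL chat format used for fine-tuning, followed by the upload and job-creation calls via the Python SDK; the image URL, labels and model snapshot are placeholders rather than details from OpenAI's announcement.

```python
import json
from openai import OpenAI

client = OpenAI()

# One training example: a chat exchange whose user turn includes an image.
example = {
    "messages": [
        {"role": "system", "content": "You identify street signs in photos."},
        {"role": "user", "content": [
            {"type": "text", "text": "What sign is shown here?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sign_001.jpg"}},  # placeholder URL
        ]},
        {"role": "assistant", "content": "A no-parking sign."},
    ]
}

# Write the examples (OpenAI says as few as 100 can help) to JSONL and upload them.
with open("vision_train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

training_file = client.files.create(file=open("vision_train.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(model="gpt-4o-2024-08-06",  # snapshot may differ
                                     training_file=training_file.id)
```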

The capability is now available to developers subscribed to any paid OpenAI usage tier. For the month of October, OpenAI is waiving image training costs: until Oct. 31, developers can use up to 1 million training tokens per day to fine-tune GPT-4o on images for free. Afterward, OpenAI said, "GPT-4o fine-tuning training will cost $25 per 1M tokens and inference will cost $3.75 per 1M input tokens and $15 per 1M output tokens." More details are on this page.
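As a back-of-the-envelope illustration of those prices, the snippet below computes the bill for a hypothetical workload; the token counts are invented for the example, and only the per-million rates come from OpenAI.

```python
# Rates quoted by OpenAI for GPT-4o fine-tuning, per 1M tokens.
TRAIN_PER_M, INPUT_PER_M, OUTPUT_PER_M = 25.00, 3.75, 15.00

# Hypothetical workload: 2M training tokens, then 1M input and 0.2M output tokens at inference.
train_tok, in_tok, out_tok = 2_000_000, 1_000_000, 200_000

cost = (train_tok / 1e6) * TRAIN_PER_M \
     + (in_tok / 1e6) * INPUT_PER_M \
     + (out_tok / 1e6) * OUTPUT_PER_M
print(f"${cost:.2f}")  # $50.00 training + $3.75 input + $3.00 output = $56.75
```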

Speech and Audio Capabilities
Two audio-based features were announced at DevDay. OpenAI is now rolling out a public beta of the Realtime API to paying developers, while audio support in the chat completions API will be released in the coming weeks.

The Realtime API supports application scenarios that closely resemble real-world conversations, requiring minimal latency to ensure a user-friendly experience. This is particularly useful for voice assistant applications, where users' spoken inputs need to trigger specific actions, and the outputs are also delivered in spoken form. Adding to the realism, the Realtime API can intuit context even from interruptions, per OpenAI.
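Under the hood, the Realtime API is served over a persistent WebSocket rather than one-off HTTP requests. The sketch below is an assumption-heavy illustration of opening that connection and requesting a spoken reply with the websockets Python package; the event shapes and preview model name follow OpenAI's beta documentation and may change.

```python
import asyncio, json, os
import websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer websockets releases rename extra_headers to additional_headers.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Ask the model to respond with both audio and a text transcript.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the caller briefly.",
            },
        }))
        # Stream server events (audio chunks, transcripts) as they arrive.
        async for event in ws:
            print(json.loads(event)["type"])

asyncio.run(main())
```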

This capability complements the new audio support in chat completions. The chat completions API lets developers program AI models to generate text outputs in response to prompts. On DevDay, OpenAI announced that the chat completions API can now process audio as both input and output. "With this update, developers can pass any text or audio inputs into GPT-4o and have the model respond with their choice of text, audio, or both," the company said in a blog.

Previously, this process took multiple steps with distractingly high latency. The input audio first had to be transcribed to text, which then had to be fed to an AI model. The AI would then generate a text output that, in turn, had to be fed to a text-to-speech model. By comparison, with the updated chat completions API, "developers can handle the entire process with a single API call, though it remains slower than human conversation."
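Here is a minimal sketch of that single call, assuming the audio-capable preview model and the modalities and audio parameters OpenAI has documented for it; the voice, prompt and output file name are illustrative.

```python
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",      # audio-capable preview model
    modalities=["text", "audio"],      # ask for both a transcript and speech
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Read back today's order confirmation."}],
)

# The spoken reply arrives base64-encoded alongside its text transcript.
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(completion.choices[0].message.audio.data))
print(completion.choices[0].message.audio.transcript)
```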

According to OpenAI, future updates will add support for other formats beyond voice, including video.

Prompt Caching
Finally, OpenAI is reducing the cost of developers' most frequently used API calls with a new "prompt caching" feature.

"By reusing recently seen input tokens, developers can get a 50% discount and faster prompt processing times," OpenAI said in a blog.

Prompt caching is turned on automatically for GPT-4o, GPT-4o-mini, o1-preview and o1-mini -- both the off-the-shelf versions and versions that users have fine-tuned.
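No code changes are needed to benefit, but because caching applies to repeated prompt prefixes (per OpenAI's documentation, identical prefixes past a minimum length), it helps to keep stable content such as system instructions at the front of the prompt and the variable part at the end. A minimal sketch of that structure, with a hypothetical system prompt:

```python
from openai import OpenAI

client = OpenAI()

# A long, unchanging prefix (instructions, policies, few-shot examples).
# Keeping it byte-for-byte identical across calls lets repeated requests reuse the cache.
STATIC_PREFIX = "You are a support agent for Acme Corp. Policies: ..."  # hypothetical

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": STATIC_PREFIX},  # stable content first
            {"role": "user", "content": question},         # variable content last
        ],
    )
    return response.choices[0].message.content

print(answer("How do I reset my password?"))
```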

OpenAI provides more details on prompt caching in this document.

About the Author

Gladys Rama (@GladysRama3) is the editorial director of Converge360.