OpenAI Tool Will Let Content Owners Opt Out of AI Training

In response to persistent concerns about what data is being used to train public LLMs, OpenAI is creating a tool that would let content owners control whether and how their data is used to train its AI models.

The tool, called "Media Manager," is slated for release sometime in 2025, OpenAI announced in a blog post Tuesday.

"We're collaborating with creators, content owners, and regulators as we develop Media Manager," the company said, adding, "[W]e hope it will set a standard across the AI industry."

Details about the tool are still scant, but OpenAI positions Media Manager as a way to give content creators -- from artists, journalists, musicians and writers to public figures in general -- more control over how they communicate their willingness to participate in OpenAI's LLM training processes. It will do this via a two-pronged approach:

  • By identifying what types of online content -- including text, video, audio and images -- are copyrighted, regardless of whether that content has been reposted in multiple places.
  • By letting the owners of that content specify to what extent they want it included in AI training.

Building a tool like this will require "cutting-edge machine learning research," said OpenAI.

Currently, OpenAI uses its GPTBot Web crawler to scour the Internet for content that it can use to train its models. GPTBot automatically ignores data that's behind a paywall, contains personally identifiable information (PII), or has been flagged as content that violates OpenAI policies. Site owners can also block GPTBot by modifying their site's robots.txt.
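Blocking GPTBot follows the standard Robots Exclusion Protocol. A minimal robots.txt sketch, assuming a site owner wants to exclude GPTBot from the whole site (the directory paths below are hypothetical placeholders), might look like:

```
# Block OpenAI's GPTBot from crawling the entire site
User-agent: GPTBot
Disallow: /

# Alternatively, allow some sections while blocking others
# (example paths only):
# User-agent: GPTBot
# Allow: /public-articles/
# Disallow: /archive/
```

Note that robots.txt is an opt-out honored by well-behaved crawlers rather than an enforcement mechanism, which is part of why OpenAI says a dedicated tool like Media Manager is needed for content that spreads beyond the original site.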

However, OpenAI conceded, the crawler doesn't account for content that has been "quoted, reviewed, remixed, reposted and used as inspiration across multiple domains," often without the original owner's knowledge. The forthcoming Media Manager will aim to address those issues.

How OpenAI's Model Sausage Doesn't Get Made
The rest of OpenAI's post, titled "Our approach to data and AI," takes pains to describe what OpenAI doesn't do to train its LLMs.

It said it doesn't store or retain access to data that's been used in training, for instance.

As mentioned above, it doesn't use paywalled data, data that's been flagged as against OpenAI policy, or data that contains PII.

It doesn't train its models to simply parrot original content. "If on rare occasions a model inadvertently repeats expressive content, it is a failure of the machine learning process," OpenAI said. "This failure is more likely to occur with content that appears frequently in training datasets, such as content that appears on many different public websites due to being frequently quoted."

It doesn't train new models on old datasets.

It doesn't use its business customers' data, "including data from ChatGPT Team, ChatGPT Enterprise, or our API Platform."

Primarily, according to the post, OpenAI's LLMs are trained on data provided via three avenues:

  • Human input
  • Publicly available data
  • Partnerships with entities, such as governments and libraries

"While we believe legal precedents and sound public policy make learning fair use, we also feel that it’s important we contribute to the development of a broadly beneficial social contract for content in the AI age," the company said. "We believe AI systems should benefit and respect the choices of creators and content owners. We’re continually improving our industry-leading systems to reflect content owner preferences."

About the Author

Gladys Rama (@GladysRama3) is the editorial director of Converge360.