Google's DeepMind AI Beats Classic Atari Games Benchmark

An artificial intelligence (AI) agent developed by Google's DeepMind group in the United Kingdom has outperformed human players on a key benchmark of so-called reinforcement learning (RL).

The long-standing benchmark is a suite of 57 classic Atari games. DeepMind's Agent57 is the first deep RL agent to score above the human baseline on all 57 games.

RL is a type of machine learning that trains algorithms using a system of rewards and punishments. An RL algorithm, or agent, learns by interacting with its environment. Google's AI lab recently used deep RL to program a robot to teach itself to walk.

Why use classic arcade games to benchmark this capability?

"Games are an excellent testing ground for building adaptive algorithms," the researchers said in a blog post. "They provide a rich suite of tasks that players must develop sophisticated behavioral strategies to master, but they also provide an easy progress metric -- game score -- to optimize against."

The Arcade Learning Environment (ALE) was first proposed in 2012 by Canadian researchers. It comprises a suite of 57 Atari 2600 games (dubbed Atari57). These are "canonical" Atari games that pose a broad range of challenges for an agent to master. The research community commonly uses this benchmark to measure progress in building successively more intelligent agents.

The DeepMind researchers trained a neural network to parameterize a family of policies that range from very exploratory to purely exploitative, and they developed an adaptive mechanism to choose which policy to prioritize throughout the training process. They also used a novel parameterization of the architecture that allows for more consistent and stable learning.
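
To make the idea of adaptively choosing among policies concrete, here is a minimal sketch in Python. It is not DeepMind's code: it swaps in a plain UCB1 multi-armed bandit as a stand-in for the adaptive mechanism, and the class name PolicySelector, the function run_demo, the exploration_bonus parameter, and the simulated returns are all hypothetical, chosen only for illustration. Each "arm" represents one policy from the family, and the selector gradually favors whichever policy has been yielding the highest episode return.

```python
import math
import random


class PolicySelector:
    """Illustrative UCB1 bandit that picks which policy to run next.

    A simplified stand-in, not Agent57's actual mechanism.
    """

    def __init__(self, num_policies, exploration_bonus=1.0):
        self.counts = [0] * num_policies          # episodes run per policy
        self.mean_returns = [0.0] * num_policies  # running mean return per policy
        self.exploration_bonus = exploration_bonus

    def choose(self):
        # Run every policy once before comparing them.
        for index, count in enumerate(self.counts):
            if count == 0:
                return index
        total = sum(self.counts)
        # Mean return plus a bonus that shrinks as a policy is tried more often.
        scores = [
            mean + self.exploration_bonus * math.sqrt(math.log(total) / count)
            for mean, count in zip(self.mean_returns, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, policy, episode_return):
        # Incremental update of the running mean for the chosen policy.
        self.counts[policy] += 1
        step = (episode_return - self.mean_returns[policy]) / self.counts[policy]
        self.mean_returns[policy] += step


def run_demo(episodes=2000, seed=0):
    """Simulate three policies whose (made-up) average returns differ."""
    rng = random.Random(seed)
    true_means = [0.2, 0.5, 0.8]  # hypothetical values, not measured results
    selector = PolicySelector(num_policies=len(true_means))
    for _ in range(episodes):
        policy = selector.choose()
        episode_return = rng.gauss(true_means[policy], 0.1)
        selector.update(policy, episode_return)
    return selector
```

Calling run_demo() and inspecting selector.counts shows the selector spending most episodes on the policy with the highest simulated return while still sampling the others occasionally, which is the explore-versus-exploit trade-off in miniature.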

The ultimate goal, of course, is not to develop systems that excel at games, they said, but to use games as a stepping stone for developing systems that learn to excel at a broad set of challenges.

"Typically, human performance is taken as a baseline for what doing 'sufficiently well' on a task means," they said. "The score obtained by an agent on each task can be measured relative to representative human performance, providing a human normalized score; 0% indicates that an agent performs at random, while 100% or above indicates the agent is performing at human level or better."

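The human normalized score described in the quote above can be written as a short Python function. This is a sketch based on the stated endpoints (0% for random play, 100% for human level); the function name and the example scores are hypothetical and are not taken from the Atari57 results.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Return the agent's score as a percentage of the random-to-human gap.

    0 means the agent matches random play; 100 means it matches the
    representative human score; values above 100 are better than human.
    """
    gap = human_score - random_score
    if gap == 0:
        raise ValueError("human and random scores must differ")
    return 100.0 * (agent_score - random_score) / gap


# Hypothetical raw game scores, for illustration only.
example_random = 100
example_human = 1100
example_agent = 2100
example_result = human_normalized_score(example_agent, example_random, example_human)
```

With these made-up numbers the agent scores twice as far above random play as the human does, so example_result is 200.0.
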
With Agent57, the researchers have succeeded in building a more generally intelligent agent that has "above-human performance" on all tasks in the Atari57 benchmark. This agent builds on lessons learned developing the group's previous agent, dubbed "Never Give Up," and "instantiates an adaptive meta-controller that helps the agent to know when to explore and when to exploit, as well as what time-horizon it would be useful to learn with," they said.

The researchers concluded: "Agent57 was able to scale with increasing amounts of computation: the longer it trained, the higher its score got. While this enabled Agent57 to achieve strong general performance, it takes a lot of computation and time; the data efficiency can certainly be improved. Additionally, this agent shows better 5th percentile performance on the set of Atari57 games. This by no means marks the end of Atari research, not only in terms of data efficiency, but also in terms of general performance ... [A]ll current algorithms are far from achieving optimal performance in some games. To that end, key improvements to use might be enhancements in the representations that Agent57 uses for exploration, planning, and credit assignment."

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.