Reliable Training Data Still Needs a Human Touch

The evolution of enterprise AI and machine learning continues at an amazing (and occasionally disturbing) pace, but it has yet to outpace the need for human input at critical junctures during its development and deployment. The creation of training data sets, for example, still requires lots of people power.

Austin-based Alegion's namesake training data platform combines automated and human intelligence to gather, validate, and enrich data to build high-quality data sets for AI and ML modeling. The platform integrates trained data specialists with data task management and distribution capabilities to accelerate machine learning projects through the creation of training data, model testing, and exception handling at scale.

"It comes down to the concept of confidence," explained Nathanial Gates, the company's CEO and co-founder. "Our customers require confidence in their models, both from the standpoint of the standard definition of 'confidence' and the actual figure of confidence data scientists are always chasing. We help them get there through developing training data using human-and-machine-generated data. The other day one of our customers called it 'organic and synthetic data.'"

Alegion recently announced the release of a new version of its training data platform with features designed to enhance the quality and efficiency of large-scale machine learning projects and deliver that model confidence for enterprise AI initiatives.

Alegion (pronounced a legion, as in "a legion of workers") started out in 2012 focused on providing automation for organizations using Amazon's Mechanical Turk (MTurk) crowdsourcing marketplace for on-demand human labor. Organizations used the MTurk service to cover such tasks as identifying objects in a photo or video, performing data de-duplication, transcribing audio recordings, and researching data details. Toward the end of 2016, Gates and company began getting requests from data science teams, who wanted them to use the crowds to develop training data for their AI projects.

"Training data is highly contextual, and data precision at scale is paramount for model confidence," said Alegion data scientist Cheryl Martin, in a statement. "We've developed a platform that gives enterprises the flexibility of working with a combination of humans and automated quality controls to achieve high-quality training data sets for their AI initiatives."

The latest version of the platform adds several capabilities, including:

  • Machine learning-augmented quality, which uses machine learning and predictive indicators to score the judgments on each task and dynamically determine appropriate additional quality control stages, such as consensus judgments, review/adjudicate/exception workflows, and administrative reviews. "The per-task confidence functionality learns updates from subsequent quality control stages and is calculated relative to both use case and context, allowing the platform to continually improve escalation decisions enhancing accuracy without increasing time and cost," the company said.
  • Flexible workforce composition, which introduces flexibility for sourcing human intelligence, including the ability to supply private or specialized workforces and create hybrid workforces that leverage Alegion's own data specialists and partners. "This expands Alegion's ability to serve customers with a wide array of training needs through the creation of purpose-designed workforce, the ability to 'bring your own crowd' of domain or geographic specialists, including employees, and through the support of strong security requirements by isolating data access to cleared and qualified specialists," the company said.
  • Artificial intelligence system integration, which supports end-to-end integration with an enterprise's AI infrastructure through APIs that support real-time or batch data exchange and flexible input/output formats. "This programmatic integration of human intelligence into the AI lifecycle accelerates model training and testing by taking data as it is generated, processing it, and returning it to the model in real-time," the company said. "This process allows for continuous model testing and as models mature, escalation of low-confidence results to human judgment."
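
To make the idea of escalating low-confidence results concrete, here is a minimal Python sketch of confidence-based routing. It is not Alegion's implementation or API: the Judgment class, the route function, the stage names, and the threshold values are all hypothetical, chosen only to illustrate how a per-task confidence score might decide whether a judgment is accepted or sent on to a further quality control stage.

```python
from dataclasses import dataclass

# Hypothetical stage names, loosely modeled on the quality control
# stages mentioned in the article. Ordered from lightest to heaviest.
STAGES = ["consensus", "review", "admin_review"]


@dataclass
class Judgment:
    task_id: str
    label: str
    confidence: float  # assumed: a score in [0, 1] from some predictive model


def route(judgment, accept_at=0.9, floors=(0.7, 0.4)):
    """Return 'accept' or the name of the next quality control stage.

    The thresholds are illustrative assumptions, not published values.
    """
    if judgment.confidence >= accept_at:
        return "accept"
    # Pair each stage except the last with the lowest confidence it handles.
    for stage, floor in zip(STAGES, floors):
        if judgment.confidence >= floor:
            return stage
    # Anything below the lowest floor goes to the heaviest stage.
    return STAGES[-1]


if __name__ == "__main__":
    for score in (0.95, 0.8, 0.5, 0.1):
        j = Judgment(task_id="t1", label="cat", confidence=score)
        print(score, route(j))
```

In a system like the one described, the thresholds would presumably be adjusted over time using feedback from the later quality control stages rather than fixed by hand as they are here.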

"We are still very much at the stage where there's a great dependency on human intelligence to inform the training of machines," Gates said. "We're trying to take this industry from human powered, to empowered humans, giving humans 'superpowers' with the machine enablement."

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.