VainuLabs Solves 'Apple-to-Apples' Problem with AI and Machine Learning
- By John K. Waters
When the founders of VainuLabs set out four years ago to build a data-driven B2B sales prospecting platform with real scope, they knew they would need to collect a vast amount of accurate data on companies worldwide into a comprehensive database, from which they could create a structured training set for their machine learning technologies.
Along the way they ran into what you might call an Apple-to-apples problem.
"We are reading 1.5 million news stories per day," VainuLabs' CTO and co-founder Tuomas Rasila told PureAI. "And we want to understand whenever a company is being mentioned in the text, and then find that company in our database. But let's say there's a news story saying Apple just released a new iPhone. We also need to be able, from natural language, to understand that this is a company name."
Sorting "Apple" the company from "apple" the fruit is just the tip of the iceberg. Thousands of companies have names that are also plain English expressions (Red Hat, Oracle, Alphabet, Twitter, Amazon), as well as proper names (Dell, Suzuki). It happens in other languages, too, of course; Samsung is Korean for "three stars." And different companies can have the same name. (My domain name is "watersworks" and I get e-mail meant for pool cleaning services at least once a week.)
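To see why plain name matching falls short, here is a minimal Python sketch of a naive lookup against a company list. The names in `COMPANY_NAMES`, the function `naive_company_mentions`, and the two sample sentences are all hypothetical and chosen for illustration; this is not how VainuLabs' system works.

```python
import re

# Hypothetical mini-database of company names (illustrative only).
COMPANY_NAMES = {"Apple", "Amazon", "Oracle", "Red Hat"}


def naive_company_mentions(text, names=COMPANY_NAMES):
    """Return every listed name that appears as a whole phrase in the text.

    This is deliberately naive: it matches strings, not meaning, so it
    cannot tell a company from a fruit or a river.
    """
    found = []
    for name in sorted(names):
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append(name)
    return found


news = "Apple just released a new iPhone."
recipe = "Apple pie needs butter, and the Amazon is a river."

print(naive_company_mentions(news))    # the company really is mentioned
print(naive_company_mentions(recipe))  # both hits are false positives
```

The lookup flags "Apple" and "Amazon" in the second sentence even though neither refers to a company, which is the gap a context-aware NER model is meant to close.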
The process for making these distinctions is a well-known subtask of natural language processing called Named-Entity Recognition (NER). The benchmark standard for NER has been the Stanford NER, a Java implementation of the Named Entity Recognizer, and Google Cloud Natural Language has been the leading provider of a natural language processing API that includes entity recognition capabilities. Until recently, VainuLabs relied on the Google NER.
"Our goal is to read and understand the whole textual content of the Internet, and every time a company is mentioned, find that story, and then extract information from that unorganized text," Rasila said. "We just got to the point where [Google NER] wasn't sufficient to our needs."
It didn't help that the Google NER didn't provide solutions in Finnish, Swedish, or Norwegian; VainuLabs is a Finnish company. To be fair, Stanford NER neglected our Nordic neighbors, too. And English is the dominant Internet lingua franca and the most researched language in named-entity recognition circles.
Fortunately, Rasila's company already had data from about 120 million companies tucked safely into its ever-expanding database, so the raw training data was there. Rasila says acquiring that data is often the toughest part of the process.
"Whenever you look at any kind of machine learning program, I'd say 95 percent of the program is collecting the training set," Rasila said. "So, we had a really good head start."
Using NeuroNER, a named-entity recognition program based on neural networks, to create a clustering and classification layer on top of that database, and "huge amounts" of human workers from Amazon's Mechanical Turk crowdsourcing marketplace for cross validation, VainuLabs was able to convert that raw data into a massive, structured training set, Rasila explained. The company then employed deep learning techniques to create what he believes is the most accurate NER in the world today.
It's worth noting here that the company was founded by a team of machine learning scientists, data engineers, and strategists. The core team consists of PhDs in computer science with decades of experience focusing on machine learning. So really, an internally developed ML tool was kind of inevitable.
The technology is currently being used as a part of Vainu's company intelligence platform and offered as a part of its technology stack for corporate customers, but, as often happens, the effort to solve internal problems has led to the development of a tool with wider applications. And the company is considering making its internal solution available to wider audiences in upcoming releases.
"Beyond the original use case of collecting vast amounts of publicly available information about companies of the world, this technology could potentially be used for a number of tasks like searching companies in unorganized textual databases through corporate databases and e-mails," said Riko Nyberg, VainuLabs' Head of AI, in a statement.
"We've been considering opening an API or selling it separately," Rasila said. "It's something that I really hope happens."
So, how does VainuLabs' homegrown NER solution perform? According to the company, it kicks butt in the Nordic languages, with an 85.89 percent accuracy (F1) score. But it truly rocks in English. In fact, it edged Stanford NER in a test of overall accuracy using the English test set provided by the university: 94.20 percent to 92.99 percent.
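For readers unfamiliar with the metric, an F1 score is the harmonic mean of precision (how many predicted entities are correct) and recall (how many true entities were found). The Python sketch below shows the standard calculation over exact-match entity spans. The function name `entity_f1` and the sample spans are hypothetical, and the company has not said exactly how its own scores were computed.

```python
def entity_f1(gold, predicted):
    """Compute F1 over entities, counting only exact matches as correct.

    Each entity is a (start, end, label) tuple. This is the textbook
    definition, not necessarily the scoring script used in the article.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: character offsets plus an entity label.
gold = {(0, 5, "ORG"), (20, 26, "ORG"), (40, 47, "ORG")}
predicted = {(0, 5, "ORG"), (20, 26, "ORG"), (60, 66, "ORG")}

# Two of three predictions are right and two of three gold entities are found.
print(round(entity_f1(gold, predicted), 4))
```

Because F1 penalizes both missed entities and spurious ones, it is a stricter yardstick than a simple count of correct hits.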
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at email@example.com.