AI Bias: It's in the Data, Not the Algorithm -- Pure AI

Leading data scientist Cheryl Martin explains why and how bias found in AI projects can almost always be tracked back to the data, covers the top four types of issues that cause bias and shares steps data scientists can take to address bias issues.

By John K. Waters
07/26/2018

An estimated 85 percent of all artificial intelligence (AI) projects will deliver erroneous outcomes because of human bias in data, algorithms or the teams responsible for managing them, according to Gartner analysts. But if you dig into the concept of bias in machine learning, the problem can almost always be traced to the data, says Cheryl Martin.

Martin is Chief Data Scientist at Alegion, an Austin-based provider of human intelligence solutions for AI and machine learning initiatives. Before joining Alegion in 2018, Martin spent 13 years as a research scientist at the University of Texas at Austin, where she directed the Center for Content Understanding, as well as a research group focused on decision support for cyber security. Before that, she worked at NASA's Johnson Space Center on intelligent control for life support systems.

"Bias is one of those things we know is there and influencing outcomes," Martin told Pure AI. "But the concept of 'bias' isn't always clear. That word means different things to different people. It's important to define it in this context if we're going to mitigate the problem."

Martin has identified four types of bias:

1. Sample Bias/Selection Bias:
This type of bias rears its ugly head when the distribution of the training data fails to reflect the actual environment in which the machine learning model will be running.

"If the training data covers only a small set of things you're interested in," Martin explained, "and then you test it on something outside that set, it will get it wrong. It'll be 'biased' based on the sample it's given. The algorithm isn't wrong; it wasn't given enough different types of data to cover the space it's going to be applied in. That's a big factor in poor performance for machine learning algorithms. You have to get the data right."

The example Martin uses here is training machine learning models to control self-driving cars. "If you want the car to drive during the day and at night, but you build your training data based on daylight video only, there's bias in your data," she said. "You failed to sample across the entire space in which you want the car to perform, and that's a kind of bias."
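One way to catch this kind of gap before training is to compare the conditions present in the training set with the conditions the model is expected to handle. The Python snippet below is only a minimal sketch of that idea, not a method Martin described: the function name `coverage_gaps`, the condition labels, the clip counts and the 5 percent threshold are all hypothetical.

```python
from collections import Counter


def coverage_gaps(train_conditions, deploy_conditions, min_share=0.05):
    """Return deployment conditions that are missing or rare in the training data.

    min_share is an assumed cutoff: any condition whose share of the
    training data falls below it is reported as a gap.
    """
    counts = Counter(train_conditions)
    total = len(train_conditions)
    gaps = {}
    for condition in sorted(set(deploy_conditions)):
        share = counts[condition] / total if total else 0.0
        if share < min_share:
            gaps[condition] = share
    return gaps


# Hypothetical condition label for each training video clip.
train = ["day"] * 900 + ["dusk"] * 100
# Conditions the car is expected to handle once deployed.
deploy = ["day", "dusk", "night"]

print(coverage_gaps(train, deploy))  # {'night': 0.0}
```

A check like this flags only the conditions someone thought to label. It cannot reveal a gap along a dimension nobody recorded.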

2. Prejudices and Stereotypes:
Even if the data seems to have been selected correctly, errors may occur because of prejudices and stereotypes that emerge in the differentiation process. For example, a machine learning model designed to differentiate between the sexes that is trained on lots of pictures of men writing code and women in the kitchen can learn those stereotypes. The indiscriminate acceptance of any speech as equally appropriate can lead to the propagation of hate speech. This is an insidious type of bias that's hard to track, Martin said.

"If the data available is that there are more women homemakers and more men computer programmers, and the mathematical model the machine is building is actually taking that distribution into account," she said, "you end up with distributions you don't want to promote." A 50/50 distribution between males and females reflects the desired result in the absence of any other information, but the distribution in the data has a higher proportion of women associated with one class than men.

"As a woman computer programmer, I can say, unequivocally, that this can be a problem," she added. "We as data scientists need to recognize that the distribution might actually reflect that there are more of one class associated with a particular characteristic, in that it's true in the data, but it's not something you want to promote. There's a higher level of reasoning that you have to control for outside the distribution of the data."

The desired value judgments must be added in or results filtered by an external process, she said, because the math doesn't tell the machine learning algorithm how to differentiate. "There is some controversy over what is 'necessary' and I try to convey that the appropriateness of the decision is in the hands of the person/entity creating and using the model. If the purpose for which the model is being used requires it to stay true to actual data distributions, then it should."
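A first step toward that kind of external check is simply measuring how each class in the training data splits across groups. The Python sketch below illustrates the idea under stated assumptions: the record counts are made up, and the function names, the 50/50 target and the 10-point tolerance are hypothetical choices. Whether a flagged skew should be corrected is the judgment call Martin describes, which the code cannot make.

```python
from collections import Counter, defaultdict


def label_share_by_group(records):
    """Map each label to the fraction of its examples that belong to each group."""
    counts = defaultdict(Counter)
    for group, label in records:
        counts[label][group] += 1
    return {
        label: {g: n / sum(groups.values()) for g, n in groups.items()}
        for label, groups in counts.items()
    }


def skewed_labels(records, target=0.5, tolerance=0.1):
    """List labels where any group's share is more than `tolerance` from `target`."""
    shares = label_share_by_group(records)
    return sorted(
        label
        for label, groups in shares.items()
        if any(abs(share - target) > tolerance for share in groups.values())
    )


# Made-up (group, label) pairs standing in for annotated training images.
records = (
    [("woman", "homemaker")] * 80 + [("man", "homemaker")] * 20
    + [("woman", "programmer")] * 25 + [("man", "programmer")] * 75
    + [("woman", "teacher")] * 50 + [("man", "teacher")] * 50
)

print(skewed_labels(records))  # ['homemaker', 'programmer']
```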

3. Systematic Value Distortion:
This less common type of bias in the data occurs when a device returns measurements or observations that are imprecise -- say, a camera that's providing image data has a color filter that's off. The data skew the machine learning results.
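When a distortion is consistent, it can sometimes be estimated against a reference with known values and subtracted out. The Python sketch below assumes the simplest possible case, a constant additive offset on one color channel. The readings, the reference values and the function names are hypothetical, and a real sensor may drift or distort nonlinearly.

```python
from statistics import mean


def estimate_offset(measured, reference):
    """Estimate a constant offset as the mean of (measured - reference)."""
    return mean(m - r for m, r in zip(measured, reference))


def correct(values, offset):
    """Subtract the estimated offset from each raw reading."""
    return [v - offset for v in values]


# Hypothetical red-channel readings of a calibration target whose
# true values are known.
reference = [50, 100, 150, 200]
measured = [62, 111, 163, 212]

offset = estimate_offset(measured, reference)
print(offset)                       # 12
print(correct([120, 180], offset))  # [108, 168]
```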

"This is when your measuring device is causing your data to be systematically skewed in a way that doesn't represent reality," she explained.

4. Model Insensitivity:
The bias problem in machine learning stems primarily from biased data, Martin said, but there's one exception: "model insensitivity," which is the result of the way an algorithm is used for training on any set of data, even an unbiased set.

"Most of the people who are using models or are interested in how models perform without being in charge of actually creating them are seeing bias that arises from the data the algorithm was trained on," she explained. "However, given any set of data, model insensitivity bias can arise from the model itself and how it is trained."

Addressing the Bias
The onus for addressing the bias that results from these types of data issues rests largely on the shoulders of the data scientist, Martin said. Being aware of them is a start, but addressing them requires a better understanding of how an algorithm might be deployed, and what the target environment might be. And that, she said, comes only with experience.

One potential solution: using interdisciplinary teams.

"There's a lot of training that psychologists, social scientists, and political scientists receive for, say, administering a survey that includes margin of error based on the sampling," Martin said. "That's the type of technique we see in other disciplines that could go a long way toward getting the bias out of the data."
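For readers unfamiliar with the survey technique Martin mentions, the textbook margin of error for a sample proportion can be computed in a few lines. The Python sketch below uses the standard normal approximation and assumes simple random sampling and a 95 percent confidence level (z = 1.96). The function name and the sample figures are illustrative only and do not come from Martin.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p(1 - p) / n), which assumes
    simple random sampling and a reasonably large sample.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Illustrative figures: a proportion of 0.5 observed in 1,000 samples.
moe = margin_of_error(0.5, 1000)
print(f"{moe:.3f}")  # 0.031
```

Applied to training data, the same arithmetic gives a rough sense of how far an observed share, such as the fraction of night-time clips in a sample, might sit from the true share in the environment it was drawn from.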

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.