AI Bias: It's in the Data, Not the Algorithm
Leading data scientist Cheryl Martin explains why and how bias found in AI projects can almost always be traced back to the data, covers the top four types of issues that cause bias, and shares steps data scientists can take to address them.
        
        
An estimated 85 percent of all artificial intelligence (AI) projects will deliver erroneous outcomes because of human bias in data, algorithms or the teams responsible for managing them, according to Gartner analysts. But if you dig into the concept of bias in machine learning, the problem can almost always be traced to the data, says Cheryl Martin.
Martin is Chief Data Scientist at Alegion, an Austin-based  provider of human intelligence solutions for AI and machine learning  initiatives. Before joining Alegion in 2018, Martin spent 13 years as a  research scientist at the University of Texas at Austin, where she directed the  Center for Content Understanding, as well as a research group focused on  decision support for cyber security. Before that, she worked at NASA's Johnson  Space Center on intelligent control for life support systems. 
"Bias  is one of those things we know is there and influencing outcomes," Martin told Pure AI. "But the concept of 'bias'  isn't always clear. That word means different things to different people. It's  important to define it in this context if we're going to mitigate the problem."
Martin  has identified four types of bias:
1. Sample Bias/Selection Bias:
This  type of bias rears its ugly head when the distribution of the training data  fails to reflect the actual environment in which the machine learning model  will be running. 
"If  the training data covers only a small set things you're interested in," Martin explained,  "and then you test it on something outside that set, it will get it wrong.  It'll be 'biased' based on the sample it's given. The algorithm isn't wrong; it  wasn't given enough different types of data to cover the space it's going to be  applied in. That's a big factor in poor performance for machine learning  algorithms. You have to get the data right."
The example Martin uses here is training machine learning models to control self-driving cars. "If you want the car to drive during the day and at night, but you build your training data based on daylight video only, there's bias in your data," she said. "You failed to sample across the entire space in which you want the car to perform, and that's a kind of bias."
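To make that idea concrete, here is a minimal sketch (not from the article) of the kind of sanity check a data scientist might run before training: compare the distribution of a condition label in the training set against a sample from the target environment. The "day"/"night" tag, the field name and the tolerance threshold are hypothetical, chosen only to mirror Martin's self-driving example.

    from collections import Counter

    def condition_distribution(examples, key="condition"):
        """Return the share of each condition label (e.g., 'day'/'night') in a dataset."""
        counts = Counter(ex[key] for ex in examples)
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}

    def report_coverage_gap(train_examples, target_examples, tolerance=0.10):
        """Flag conditions common in the target environment but rare or absent in training."""
        train_dist = condition_distribution(train_examples)
        target_dist = condition_distribution(target_examples)
        for label, target_share in target_dist.items():
            train_share = train_dist.get(label, 0.0)
            if target_share - train_share > tolerance:
                print(f"Possible sample bias: '{label}' is {target_share:.0%} of the target "
                      f"environment but only {train_share:.0%} of the training data.")

    # Toy example: daylight-only training data vs. a day-and-night deployment environment.
    train = [{"condition": "day"}] * 95 + [{"condition": "night"}] * 5
    target = [{"condition": "day"}] * 60 + [{"condition": "night"}] * 40
    report_coverage_gap(train, target)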
2. Prejudices and  Stereotypes:
Even if the data seems to have been selected correctly, errors may occur because of prejudices and stereotypes that emerge in the differentiation process. For example, a machine learning model designed to differentiate between the sexes that receives lots of pictures of men writing code and women in the kitchen can learn stereotypes. The indiscriminate acceptance of any speech as equally appropriate can lead to the propagation of hate speech. This is an insidious type of bias that's hard to track, Martin said.
"If  the data available is that there are more women homemakers and more men  computer programmers, and the mathematical model the machine is building is  actually taking that distribution into account," she said, "you end up with distributions  you don't want to promote." A 50/50 distribution between males and females  reflects the desired result in the absence of any other information, but the  distribution in the data has a higher proportion of women associated with one  class than men. 
"As  a woman computer programmer, I can say, unequivocally, that this can be a  problem," she added. "We as data scientists need to recognize that the  distribution might actually reflect that there are more of one class associated  with a particular characteristic, in that it's true in the data, but it's not  something you want to promote. There's a higher level of reasoning that you  have to control for outside the distribution of the data."
The  desired value judgments must be added in or results filtered by an external  process, she said, because the math doesn't tell the machine learning algorithm  how to differentiate. "There is some controversy over what is 'necessary' and I  try to convey that the appropriateness of the decision is in the hands of the person/entity creating and using the  model. If the purpose for which the model is being used requires it to stay  true to actual data distributions, then it should." 
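One common (and only partial) mitigation for this kind of distributional bias is to reweight or resample the training data toward the distribution you actually want to promote rather than the one the data happens to contain. The sketch below is a generic illustration of that idea, not Martin's or Alegion's method; the attribute values and the 50/50 target are assumptions taken from her example.

    from collections import Counter

    def balancing_weights(labels, target_dist):
        """
        Compute per-example weights so that the weighted training distribution matches
        a desired target distribution (e.g., 50/50) instead of the skew in the raw data.
        """
        counts = Counter(labels)
        total = len(labels)
        weights = []
        for label in labels:
            observed_share = counts[label] / total
            weights.append(target_dist[label] / observed_share)
        return weights

    # Toy example: an 80/20 split in the data, a 50/50 distribution desired.
    labels = ["male"] * 80 + ["female"] * 20
    weights = balancing_weights(labels, {"male": 0.5, "female": 0.5})
    print(weights[0], weights[-1])  # ~0.625 for the over-represented class, 2.5 for the other

Many training libraries accept per-example weights like these, so the correction can be applied without discarding data; whether to apply it at all is exactly the value judgment Martin says sits outside the math.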
3. Systematic  Value Distortion:
  This  less common type of bias in the data occurs when a device returns measurements  or observations that are imprecise -- say, a camera that's providing image data  has a color filter that's off. The data skew the machine learning results.
"This  is when your measuring device is causing your data to be systematically skewed  in a way that doesn't represent reality," she explained. 
4. Model  Insensitivity:
  The  bias problem in machine learning stems primarily from biased data, Martin said,  but there's one exception: "model insensitivity," which is the result of the  way an algorithm is used for training on any set of data, even an unbiased set.
"Most  of the people who are using models or are interested in how models perform  without being in charge of actually creating them are seeing bias that arises  from the data the algorithm was trained on," she explained. "However, given any  set of data, model insensitivity bias can arise from the model itself and how  it is trained.  
Addressing the  Bias
The onus for addressing the bias that results from these types of data issues rests largely on the shoulders of the data scientist, Martin said. Being aware of them is a start, but addressing them requires a better understanding of how an algorithm might be deployed and what the target environment might be. And that, she said, comes only with experience.
One  potential solution: using interdisciplinary teams. 
"There's  a lot of training that psychologists, social scientists, and political  scientists receive for, say, administering a survey that includes margin of  error based on the sampling," Martin said. "That's the type of technique we see  in other disciplines that could go a long way toward getting the bias out of  the data."
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].