        Security Guru Gary McGraw on What's Needed To Secure Machine Learning Apps
        John K. Waters talks with McGraw about his new focus on securing the next generation of applications. 
        
        
By John K. Waters
03/11/2020
 
		
        It's been more than a decade since a group of software security mavens set out to create a "fact-based" set of best practices for developing and growing an enterprise-wide software security program, which came to be known as the Building Security In Maturity Model (BSIMM). It was the first maturity model for security initiatives created entirely from real-world data.
The ringleader of that group, cybersecurity guru Gary McGraw, has been pushing developers to take responsibility for building secure software for almost two decades. So I wasn't surprised when I learned that he'd rounded up another posse to chase down the facts about the security of machine learning (ML) systems -- and in the process, developed a new set of recommendations for building security in.
"I was seeing all this breathless coverage of security in machine learning," he told me, "but it was all related to attacks. It reminded me of what we saw in software security 25 years ago, when people were barely talking about it. It was clear that AI and machine learning had moved into the mainstream really fast -- and that no one had done a real risk analysis at the architectural level. So, we did one."
That work led McGraw and his colleagues (Harold Figueroa, Victor Shepardson and Richie Bonett) to publish the first-ever risk framework to guide development of secure ML ("An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning"), and to found the Berryville Institute of Machine Learning, a research think tank dedicated to the safe, secure, and ethical development of AI technologies. (McGraw lives in Berryville, VA.)
The paper is a guide for developers, engineers, designers, and anyone creating applications and services that use ML technologies. It focuses on providing advice on building security into ML systems from a security engineering perspective -- which involves "understanding how ML systems are designed for security, teasing out possible security engineering risks, and making such risks explicit," the authors explained. "We are also interested in the impact of including an ML system as part of a larger design. Our basic motivating question is: how do we secure ML systems proactively while we are designing and building them? This architectural risk analysis (ARA) is an important first step in our mission to help engineers and researchers secure ML systems."
"We need to do better work to secure our ML systems," they added, "moving well beyond attack-of-the-day and penetrate-and-patch towards real security engineering."
The paper, which is free to download, is a must-read. It includes a wealth of information and insights from experts in the field focused on the unique security risks of ML systems. It identifies 78 specific risks associated with a generic ML system and provides an interactive ML risk framework.
It also offers the researchers' Top Ten ML Security Risks. "These risks come in two relatively distinct flavors," they wrote, "both equally valid: some are risks associated with the intentional actions of an attacker; others are risks associated with an intrinsic design flaw. Intrinsic design flaws emerge when engineers with good intentions screw things up. Of course, attackers can also go after intrinsic design flaws, complicating the situation."
The list includes:
  Adversarial examples
    Probably the most commonly discussed attacks against machine learning have come to be known as adversarial examples. The basic idea is to fool a machine learning system by providing malicious input -- often involving very small perturbations -- that causes the system to make a false prediction or categorization. Though coverage and the resulting attention might be disproportionately large, swamping out other important ML risks, adversarial examples are very much real.
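    To make the idea concrete, here is a minimal sketch (my illustration, not the paper's) of a fast-gradient-sign-style perturbation against a toy logistic-regression model; the weights, input, and epsilon are invented for the example.

        # FGSM-style adversarial perturbation against a toy logistic regression.
        # Everything here (model weights, input, epsilon) is made up for illustration.
        import numpy as np

        w = np.array([2.0, -3.0, 1.5])   # "trained" model weights
        b = 0.5                          # bias term

        def predict_proba(x):
            return 1.0 / (1.0 + np.exp(-(x @ w + b)))

        x = np.array([0.2, -0.1, 0.4])   # a legitimate input
        y = 1                            # its true label

        # For logistic loss, the gradient with respect to the input is (p - y) * w.
        grad_x = (predict_proba(x) - y) * w

        # Nudge each feature a tiny amount in the direction that increases the loss.
        epsilon = 0.3
        x_adv = x + epsilon * np.sign(grad_x)

        print("clean score:      ", predict_proba(x))      # confidently class 1
        print("adversarial score:", predict_proba(x_adv))  # flipped below 0.5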
  Data poisoning
    Data play an outsized role in the security of an ML system. That's because an ML system learns to do what it does directly from data. If an attacker can intentionally manipulate the data being used by an ML system in a coordinated fashion, the entire system can be compromised. Data poisoning attacks require special attention. In particular, ML engineers should consider what fraction of the training data an attacker can control, and to what extent.
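    Here is a minimal synthetic sketch of that question (mine, not the paper's): flip the labels on a growing share of the training set and watch test accuracy degrade.

        # Label-flipping poisoning sketch on synthetic data; illustrative only.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        rng = np.random.default_rng(0)
        for poison_fraction in (0.0, 0.1, 0.3, 0.5):
            y_poisoned = y_train.copy()
            n_poison = int(poison_fraction * len(y_poisoned))
            idx = rng.choice(len(y_poisoned), size=n_poison, replace=False)
            y_poisoned[idx] = 1 - y_poisoned[idx]   # the attacker flips these labels
            model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
            print(f"poisoned {poison_fraction:.0%}: test accuracy {model.score(X_test, y_test):.3f}")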
  Online system manipulation
    An ML system is said to be "online" when it continues to learn during operational use, modifying its behavior over time. In this case, a clever attacker can nudge the still-learning system in the wrong direction on purpose through system input and slowly "retrain" the ML system to do the wrong thing. Note that such an attack can be both subtle and reasonably easy to carry out. This risk is complex, demanding that ML engineers consider data provenance, algorithm choice, and system operations in order to properly address it.
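    A rough sketch of the nudging (my example, on synthetic data): an online learner is updated in production with small batches of attacker-labeled points and slowly drifts away from correct behavior.

        # "Nudging" an online learner with mislabeled batches; illustrative only.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import SGDClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

        model = SGDClassifier(random_state=1)
        model.partial_fit(X_train, y_train, classes=np.array([0, 1]))  # initial training
        print("before manipulation:", round(model.score(X_test, y_test), 3))

        # The attacker repeatedly submits well-formed inputs with deliberately wrong labels.
        rng = np.random.default_rng(1)
        for _ in range(200):
            idx = rng.choice(len(X_train), size=10, replace=False)
            model.partial_fit(X_train[idx], 1 - y_train[idx])          # flipped labels
        print("after manipulation: ", round(model.score(X_test, y_test), 3))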
  Transfer learning attack
    In many cases in the real world, ML systems are constructed by taking advantage of an already-trained base model, which is then fine-tuned to carry out a more specific task. A transfer learning attack takes place when the base system is compromised (or otherwise unsuitable), making unanticipated behavior defined by the attacker possible.
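    One basic hygiene step, sketched below (my illustration; the file name and digest are placeholders), is to pin the exact base-model artifact you intend to fine-tune and verify it before training, so a swapped or tampered file is caught early.

        # Verify a pinned digest for the pretrained base model before fine-tuning.
        # WEIGHTS_PATH and EXPECTED_SHA256 are placeholders, not a real model.
        import hashlib
        import sys

        WEIGHTS_PATH = "base_model_weights.bin"
        EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

        def sha256_of(path, chunk_size=1 << 20):
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    digest.update(chunk)
            return digest.hexdigest()

        if sha256_of(WEIGHTS_PATH) != EXPECTED_SHA256:
            sys.exit("Base model does not match the pinned digest -- refusing to fine-tune.")
        print("Base model verified; proceeding to fine-tune.")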
  Data confidentiality
    Data protection is difficult enough without throwing ML into the mix. One unique challenge in ML is protecting sensitive or confidential data that, through training, are built right into a model. Subtle but effective extraction attacks against an ML system's data are an important category of risk.
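    One crude signal of how much a model has memorized (my sketch, on synthetic data, and far short of a real membership-inference attack): compare the model's confidence on its own training points with its confidence on points it has never seen.

        # Confidence gap between training and unseen data as a rough memorization signal.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

        model = RandomForestClassifier(random_state=2).fit(X_train, y_train)

        def mean_confidence(X_part, y_part):
            proba = model.predict_proba(X_part)
            return proba[np.arange(len(y_part)), y_part].mean()  # confidence in the true label

        print("confidence on training data:", round(mean_confidence(X_train, y_train), 3))
        print("confidence on unseen data:  ", round(mean_confidence(X_test, y_test), 3))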
  Data trustworthiness
    Because data play an outsized role in ML security, considering data provenance and integrity is essential. Are the data suitable and of high enough quality to support ML? Are sensors reliable? How is data integrity preserved? Understanding the nature of ML system data sources (both during training and during execution) is of critical importance. Data-borne risks are particularly hairy when it comes to public data sources (which might be manipulated or poisoned) and online models.
  Reproducibility
    When science and engineering are sloppy, everyone suffers. Unfortunately, because of its inherent inscrutability and the hyper-rapid growth of the field, ML system results are often under-reported, poorly described, and otherwise impossible to reproduce. When a system can't be reproduced and nobody notices, bad things can happen.
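    The fix starts with unglamorous bookkeeping. A minimal sketch (mine; the manifest file name is arbitrary): fix the random seeds a run depends on and record the library versions next to the results.

        # Fix seeds and record the environment alongside results; names are illustrative.
        import json
        import platform
        import random

        import numpy as np
        import sklearn

        SEED = 42
        random.seed(SEED)
        np.random.seed(SEED)
        # A deep-learning framework, if used, needs its own seed set as well.

        run_manifest = {
            "seed": SEED,
            "python": platform.python_version(),
            "numpy": np.__version__,
            "scikit-learn": sklearn.__version__,
        }
        with open("run_manifest.json", "w") as f:
            json.dump(run_manifest, f, indent=2)
        print(run_manifest)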
  Overfitting
    ML systems are often very powerful. Sometimes they can be too powerful for their own good. When an ML system "memorizes" its training data set, it will not generalize to new data, and is said to be overfitting. Overfit models are particularly easy to attack. Keep in mind that overfitting is possible in concert with online system manipulation and may happen while a system is running.
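    The standard early-warning sign is a gap between training and test performance, as in this small synthetic sketch (mine, not the paper's): an unconstrained decision tree nails its training data but generalizes worse than a constrained one.

        # Train/test gap as an overfitting signal; synthetic data, illustrative only.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=3)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

        for name, model in [("unconstrained tree", DecisionTreeClassifier(random_state=3)),
                            ("depth-limited tree", DecisionTreeClassifier(max_depth=4, random_state=3))]:
            model.fit(X_train, y_train)
            print(f"{name}: train {model.score(X_train, y_train):.3f}, "
                  f"test {model.score(X_test, y_test):.3f}")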
  Encoding integrity
    Data are often encoded, filtered, re-represented, and otherwise processed before use in an ML system (in most cases by a human engineering group). Encoding integrity issues can bias a model in interesting and disturbing ways. For example, encodings that include metadata may allow an ML model to "solve" a categorization problem by overemphasizing the metadata and ignoring the real categorization problem.
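    Here is a contrived sketch of that failure mode (my illustration, entirely synthetic): a stray "source" column that tracked the label during data collection lets the model score almost perfectly -- until the bookkeeping convention changes.

        # A metadata column that leaks the label; the model leans on it, then breaks.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(4)
        X, y = make_classification(n_samples=2000, n_features=10, random_state=4)

        meta = y.astype(float)                   # e.g., which pipeline produced each record
        X_leaky = np.column_stack([X, meta])

        X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=4)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        print("test accuracy, metadata intact:  ", round(model.score(X_te, y_te), 3))

        # In production the convention changes and the column is no longer informative.
        X_te_shifted = X_te.copy()
        X_te_shifted[:, -1] = rng.integers(0, 2, size=len(X_te))
        print("test accuracy, metadata shuffled:", round(model.score(X_te_shifted, y_te), 3))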
  Output integrity
    If an attacker can interpose between an ML system and the world, a direct attack on output may be possible. The inscrutability of ML operations (that is, not really understanding how they do what they do) may make an output integrity attack that much easier, since an anomaly may be harder to spot.
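    One common mitigation, sketched here (my example; the key and field names are placeholders), is to have the ML service sign each result so downstream consumers can detect tampering before they act on it.

        # Sign model outputs with an HMAC so consumers can verify them before acting.
        import hashlib
        import hmac
        import json

        SECRET_KEY = b"replace-with-a-key-from-a-secrets-manager"   # placeholder

        def sign_output(result: dict) -> dict:
            payload = json.dumps(result, sort_keys=True).encode()
            signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
            return {"result": result, "signature": signature}

        def verify_output(message: dict) -> bool:
            payload = json.dumps(message["result"], sort_keys=True).encode()
            expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
            return hmac.compare_digest(expected, message["signature"])

        message = sign_output({"input_id": "abc123", "label": "approved", "score": 0.91})
        print(verify_output(message))            # True: untouched

        message["result"]["label"] = "denied"    # an attacker interposes and edits the output
        print(verify_output(message))            # False: tampering detected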
    
 McGraw has been preaching the gospel of security through better software for nearly three decades -- at conferences, on college campuses, as an advisor to leading security companies, and in a long-running podcast. He wrote eight books on the subject, including the best-selling "White Hat/Black Hat Trilogy": Building Secure Software: How to Avoid Security Problems the Right Way, which he penned with security expert John Viega; Exploiting Software: How to Break Code, which he wrote with the semi-legendary Greg Hoglund; and his solo effort, Software Security: Building Security In.
A next step for the Berryville gang would be to take the insights they've gleaned working with a generic ML model and apply them to specific ML systems, McGraw said. "The other thing you could do is to think about what controls can be associated with those risks," he added, "so that you can put in place controls to manage those risks appropriately. There's a ton of work still to be done here."
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].