        Facebook AI Research Open Sources ML Framework for Online Speech Recognition
        
        
        
By John K. Waters, 01/15/2020
 
		
        
Facebook's AI Research (FAIR) group has open sourced its wav2letter@anywhere inference framework for online speech recognition, the company announced. The release builds on FAIR's earlier releases of wav2letter and wav2letter++.
Wav2letter@anywhere is a multithreaded, multiplatform library aimed at researchers, production engineers and students who need to quickly assemble trained deep neural network (DNN) modules for online inference.
"Online speech recognition" is the process of transcribing speech in real time from an input audio stream. It's the "real-time" aspect that typical automatic speech recognition (ASR) systems don't address, FAIR researchers Vineel Pratap and Ronan Collobert explained in a blog post. For applications such as live video captioning or on-device transcription, reducing the latency between the audio and the corresponding transcription is critical.
"Most existing online speech recognition solutions support only recurrent neural networks (RNNs)," they wrote. "For wav2letter@anywhere, we use a fully convolutional acoustic model instead, which results in a 3x throughput improvement on certain inference models and state-of-the-art performance on LibriSpeech."
The framework provides a streaming inference API that is efficient yet modular enough to handle various types of speech recognition models. It supports concurrent audio streams, which are necessary for high throughput at production scale. The API is also meant to be flexible enough to be used easily on different platforms (personal computers, iOS, Android, et cetera).
Written in C++, wav2letter@anywhere is part of the wav2letter++ repository. Its modular streaming API allows the framework to support various models, including recurrent neural networks (RNNs) and convolutional neural networks (CNNs), the latter being the faster option. It is a standalone library that can be embedded anywhere, the researchers said. It also uses efficient back ends, such as FBGEMM, and specific routines for iOS and Android.
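To make the idea of a modular streaming pipeline that serves several audio streams at once more concrete, here is a minimal Python sketch. It is not the wav2letter@anywhere API (which is C++); every name in it (`StreamingPipeline`, `chunked`, `normalize`, `energy`, `transcribe_stream`, `transcribe_many`) is hypothetical, and the two stages are stateless numeric stand-ins for real feature extraction and acoustic modeling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, Sequence

Chunk = List[float]
Stage = Callable[[Chunk], Chunk]


class StreamingPipeline:
    """Hypothetical chain of per-chunk stages (not the real wav2letter API)."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def run(self, chunks: Iterable[Chunk]) -> Iterator[Chunk]:
        # Emit one output per input chunk, so results become available
        # while the rest of the audio is still arriving.
        for chunk in chunks:
            for stage in self.stages:
                chunk = stage(chunk)
            yield chunk


def chunked(samples: Sequence[float], size: int) -> Iterator[Chunk]:
    """Split a sample buffer into consecutive chunks of at most `size`."""
    for start in range(0, len(samples), size):
        yield list(samples[start:start + size])


def normalize(chunk: Chunk) -> Chunk:
    """Stand-in for feature extraction: scale the chunk by its peak."""
    peak = max((abs(x) for x in chunk), default=0.0) or 1.0
    return [x / peak for x in chunk]


def energy(chunk: Chunk) -> Chunk:
    """Stand-in for an acoustic model: mean squared amplitude."""
    return [sum(x * x for x in chunk) / len(chunk)]


def transcribe_stream(samples: Sequence[float], chunk_size: int = 4) -> List[float]:
    """Run one stream through the pipeline, one chunk at a time."""
    pipeline = StreamingPipeline([normalize, energy])
    return [out[0] for out in pipeline.run(chunked(samples, chunk_size))]


def transcribe_many(streams: List[Sequence[float]], workers: int = 4) -> List[List[float]]:
    """Process several independent streams concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe_stream, streams))
```

Because each stage is just a function from chunk to chunk, a different model could be swapped in without touching the streaming loop. A real streaming model would also need to carry state, such as convolution context, from one chunk to the next, which this sketch leaves out.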
"From the beginning, it was developed with streaming in mind," Pratap and Collobert wrote, "unlike some alternatives that rely on generic inference pipeline, allowing us to implement an efficient memory allocation design."
"We have made extensive improvements since open-sourcing wav2letter++ a year ago," they added, "including beefing up decoder performance (10x speedup on seq2seq decoding); adding Python bindings for features, decoder, criterions, etc.; and better documentation. We believe wav2letter@anywhere represents another leap forward by enabling online speech recognition and significantly reducing the latency between audio and transcription. We are excited to share the open source framework with the community."
More information about wav2letter@anywhere is available in a paper and a wiki.
        
About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.