Facebook AI Research Open Sources ML Framework for Online Speech Recognition
- By John K. Waters
Facebook's AI Research (FAIR) group has open sourced its wav2letter@anywhere inference framework for online speech recognition, the company announced. This release builds on FAIR's previous releases of wav2letter and wav2letter++.
Wav2letter@anywhere is a multithreaded and multiplatform library aimed at researchers, production engineers and students who need to put together trained deep neural network (DNN) modules for online inference quickly.
"Online speech recognition" is the process of transcribing speech in real time from an input audio stream. It's the "real-time" aspect that's not addressed by typical Automatic Speech Recognition (ASR) systems, FAIR researchers Vineel Pratap and Ronan Collobert explained in a blog post. For applications such as live video captioning or on-device transcription, reducing the latency between the audio and the corresponding transcription is critical.
"Most existing online speech recognition solutions support only recurrent neural networks (RNNs)," they wrote. "For wav2letter@anywhere, we use a fully convolutional acoustic model instead, which results in a 3x throughput improvement on certain inference models and state-of-the-art performance on LibriSpeech."
The framework provides streaming API inference that is efficient yet modular enough to handle various types of speech recognition models. It supports concurrent audio streams, which are necessary for high throughput when performing tasks at production scale. The API is also designed to be flexible enough to be used easily on different platforms (personal computers, iOS, Android and so on).
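To make the idea of chunked streaming over concurrent audio streams concrete, here is a minimal Python sketch. It is not the wav2letter@anywhere API, which is written in C++ and not shown in this article. The class `ToyStreamingRecognizer`, the helper functions and the chunk size are all hypothetical, and the "model" only counts samples rather than recognizing speech.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List

# Hypothetical chunk size: 500 ms of audio at an assumed 16 kHz sample rate.
CHUNK_SAMPLES = 8000


def chunks(samples: List[float], size: int = CHUNK_SAMPLES) -> Iterator[List[float]]:
    """Yield fixed-size slices, imitating audio that arrives over time."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


class ToyStreamingRecognizer:
    """Hypothetical stand-in for a streaming model with per-stream state."""

    def __init__(self) -> None:
        self.samples_seen = 0

    def feed(self, chunk: List[float]) -> str:
        # A real acoustic model and decoder would run here. This toy
        # version only reports how much audio has been consumed so far.
        self.samples_seen += len(chunk)
        return f"<partial after {self.samples_seen} samples>"


def transcribe_stream(samples: List[float]) -> List[str]:
    """Process one audio stream chunk by chunk, collecting partial results."""
    recognizer = ToyStreamingRecognizer()  # one state object per stream
    return [recognizer.feed(chunk) for chunk in chunks(samples)]


def transcribe_concurrently(streams: List[List[float]], workers: int = 4) -> List[List[str]]:
    """Serve several independent streams in parallel worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input streams.
        return list(pool.map(transcribe_stream, streams))


if __name__ == "__main__":
    fake_streams = [[0.0] * 20000, [0.0] * 9000]
    for index, partials in enumerate(transcribe_concurrently(fake_streams)):
        print(index, partials)
```

The point of the sketch is the shape of the problem: each stream keeps its own state and emits a partial result as soon as a chunk arrives, rather than waiting for the whole recording, while a pool of workers handles many streams at once.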
Written in C++, wav2letter@anywhere is part of the wav2letter++ repository. It comes with a modular streaming API that allows the framework to support various models, including recurrent neural networks (RNNs) and the faster convolutional neural networks (CNNs). It is a standalone library that can be embedded anywhere, the researchers said. It also uses efficient back ends, such as FBGEMM, and specific routines for iOS and Android.
"From the beginning, it was developed with streaming in mind," Pratap and Collobert wrote, "unlike some alternatives that rely on generic inference pipeline, allowing us to implement an efficient memory allocation design."
"We have made extensive improvements since open-sourcing wav2letter++ a year ago," they added, "including beefing up decoder performance (10x speedup on seq2seq decoding); adding Python bindings for features, decoder, criterions, etc.; and better documentation. We believe wav2letter@anywhere represents another leap forward by enabling online speech recognition and significantly reducing the latency between audio and transcription. We are excited to share the open source framework with the community."
There's more information about wav2letter@anywhere available in a paper and a wiki.
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at firstname.lastname@example.org.