Oracle Open Sources its Tribuo Java-based Machine Learning Library
By John K. Waters
Oracle has released an in-house developed Java machine learning (ML) library to open source, the company announced this week. Called Tribuo, it was written in Java and runs on Java 8 or later. It was open sourced under an Apache 2.0 license, and it's available on GitHub.
Tribuo (the name is Latin for "assigning or apportioning values") was developed by the Oracle Labs Machine Learning Research Group to provide standard ML functionality in Java, including classification, clustering, anomaly detection, and regression algorithms. The library has data loading pipelines, text processing pipelines, and feature level transformations for operating on data once it's been loaded in. It's also got a full suite of evaluations for each of the supported prediction tasks.
"Unlike other systems, Tribuo knows what its inputs are, and can describe the range and type of each input," explained Adam Pocock, principal member of the Oracle Labs technical staff, in a blog post. "Each feature is named, so you can't confuse it for another feature just because the input processing system gave it the same id number (in fact in Tribuo you don't ever need to see its id number). This means a Tribuo Model knows when you've given it features it's never seen before, which is particularly useful when working with natural language processing. Tribuo's models also know what their outputs are, and those outputs are strongly typed. No more staring at a float wondering if it's a probability, a regressed value, or a cluster id; in Tribuo each of these is a separate type, and the model can describe the types and ranges it knows about."
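To make the idea of named, self-describing inputs concrete, here is a minimal plain-Java sketch. It is not Tribuo's API: the `FeatureDomain` class and its methods are hypothetical names invented for illustration, and the sketch only assumes that a model records each feature by name along with the range of values it saw during training.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not one of Tribuo's classes.
final class FeatureDomain {
    // Observed [min, max] range for each named feature, in first-seen order.
    private final Map<String, double[]> ranges = new LinkedHashMap<>();

    // Record one training-time observation of a named feature.
    void observe(String name, double value) {
        double[] range = ranges.get(name);
        if (range == null) {
            ranges.put(name, new double[] {value, value});
        } else {
            range[0] = Math.min(range[0], value);
            range[1] = Math.max(range[1], value);
        }
    }

    // Return the names in the input that were never seen during training.
    List<String> unknownFeatures(Map<String, Double> input) {
        List<String> unknown = new ArrayList<>();
        for (String name : input.keySet()) {
            if (!ranges.containsKey(name)) {
                unknown.add(name);
            }
        }
        return unknown;
    }

    // Describe every known feature by name and observed range.
    String describe() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, double[]> entry : ranges.entrySet()) {
            double[] range = entry.getValue();
            sb.append(entry.getKey())
              .append(" in [").append(range[0]).append(", ").append(range[1]).append("]\n");
        }
        return sb.toString();
    }
}
```

Because features are keyed by name rather than by position in an array, an input containing a word the model never saw in training can be flagged instead of being silently mapped onto some other column.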
Oracle Labs, which is the research and development branch of the company, has a long and venerable history. Formerly Sun Microsystems Laboratories, or Sun Labs, it was already two decades old when Oracle acquired Sun in 2010. The organization engages in academic research, advises Oracle's product managers, and builds proofs of concept and demo systems, Pocock told Pure AI.
"It's basically a pile of PhDs in computer science," he said. "Oracle product groups can come to us if they need our particular kind of expertise, and we're there to help."
Pocock, who is the lead developer on Tribuo, joined Oracle Labs in 2012 after earning his PhD from the University of Manchester in the UK. (His PhD thesis, "Feature Selection via Joint Likelihood," was awarded the BCS Distinguished Dissertation prize for 2013, a prize given to the best PhD thesis in Computer Science in the UK from the previous year.) He is a member of the Information Retrieval and Machine Learning group in the labs, which focuses on ML technologies.
Pocock and his colleagues started working on what would become Tribuo in 2016. "Java is very important to Oracle, and a lot of things are written in Java, and we wanted to integrate well with those things," he said. "But we found that there wasn't really a particularly good library fit for ML systems in Java. If you want to build a large distributed system, Apache Spark is great for that. But if you want to have a machine learning widget that lives inside every button in your UI, so you can learn from that user's input and predict what they're going to do, so you can make their life easier, then Spark isn't so great for that."
"We just found that there wasn't really anything in the marketplace that was a good fit," he added. "So we built it, and we began using it inside Oracle in about 2017."
The Oracle researchers were filling a "crucial gap between the expectations of an enterprise system and the features provided by most ML libraries," Pocock said.
"Large software systems want to use building blocks, which describe themselves and know when their inputs or outputs are invalid," he explained in his blog post. "In contrast, most ML libraries expect a pile of float arrays to train a model. Then at deployment time, they expect the input to be a float array, and they produce yet another float array as the predicted output. The description of what any of these arrays mean, or what the input/output floats should look like is left to another system, either a wiki, a bug tracker, or written as a code comment. We don't think developers want to add yet another database table per ML model just to explain what that array of output floats means."
"A lot of machine learning libraries are written in Python by academics who are not focused on the needs of software engineers who are dealing with enterprise-grade software," Pocock added during the interview. "We think machine learning libraries should be easy to interact with. It should expect objects and it should produce objects; with Java, you're in an object-oriented world. We think those objects should be self-describing. Models should know what kind of inputs are expected, and what kind of outputs it's going to produce. You should be able to ask it, and it should tell you."
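The contrast between a bare float and a typed, self-describing output can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not Tribuo's API: the `Output` interface and the `ClassProbability`, `RegressedValue`, and `ClusterAssignment` classes are hypothetical names invented here to show how separate types keep a probability, a regressed value, and a cluster id from being confused.

```java
import java.util.Locale;

// Hypothetical illustration only; these are not Tribuo's classes.
interface Output {
    // Every output can explain what it is, rather than being a bare float.
    String describe();
}

// A class label paired with a probability, validated at construction.
final class ClassProbability implements Output {
    final String label;
    final double probability;

    ClassProbability(String label, double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.label = label;
        this.probability = probability;
    }

    @Override
    public String describe() {
        return String.format(Locale.ROOT, "label '%s' with probability %.2f", label, probability);
    }
}

// A named real-valued prediction with no range restriction.
final class RegressedValue implements Output {
    final String name;
    final double value;

    RegressedValue(String name, double value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public String describe() {
        return String.format(Locale.ROOT, "regressed value '%s' = %.2f", name, value);
    }
}

// An integer cluster id; it cannot be mistaken for a probability.
final class ClusterAssignment implements Output {
    final int id;

    ClusterAssignment(int id) {
        this.id = id;
    }

    @Override
    public String describe() {
        return "cluster id " + id;
    }
}
```

With types like these, a caller that expects a probability cannot be handed a cluster id by accident; the compiler rejects the mismatch, and an out-of-range probability fails at construction rather than at some later point in the pipeline.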
Tribuo provides interfaces to ONNX Runtime, TensorFlow, and XGBoost, which allows models stored in ONNX format, or trained in TensorFlow and XGBoost, to be deployed alongside Tribuo's native models. "The ONNX model support is particularly exciting," Pocock said, "as it allows the deployment in Java of models trained using popular Python packages like scikit-learn and PyTorch."
The next step, of course, is to call for participation in the new Tribuo community. The project is accepting code contributions to Tribuo under the Oracle Contributor Agreement, and more details are available in the project's GitHub docs.
"We think there's a need for something like Tribuo, and we're hoping the community thinks so, too," Pocock said.
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at email@example.com.