Study: Data Silos Continue to Thwart Enterprise AI Projects

What's the most common obstacle currently facing enterprises struggling with their AI initiatives? According to the preliminary findings of a research study commissioned by Databricks, it's data silos -- which shouldn't be surprising. Data-related challenges are commonly reported by companies moving AI to production, and the issue was cited by 96 percent of organizations in this study.

"To derive value from AI, enterprises are dependent on their existing data and ability to iteratively do machine learning on massive data sets," said Databricks co-founder and CEO Ali Ghodsi, in a statement. "Today's data engineers and data scientists use numerous, disconnected tools to accomplish this, including a zoo of machine learning frameworks. Both organizational and technology silos create friction and slow down projects, becoming an impediment to the highly iterative nature of AI projects. Unified Analytics is the way to increase collaboration between data engineers and data scientists and unify data processing and AI technologies."

Databricks, which was founded by the team behind the original UC Berkeley project that would become Apache Spark, unveiled the preliminary results of the study last week at the annual Spark+AI Summit 2018 in San Francisco, along with a major update of its Spark-based unified analytics platform designed to unlock the data silos impeding enterprise AI innovation.

The list of new capabilities in the platform includes MLflow, an open source, cross-cloud framework designed to simplify the machine learning workflow, making it possible for organizations to package their code for reproducible runs, execute and compare hundreds of parallel experiments, leverage any hardware or software platform, and deploy models to production on a variety of serving platforms. The feature integrates with Apache Spark, scikit-learn, TensorFlow, and other open source ML frameworks.

The company developed MLflow, said Databricks CTO Matei Zaharia, because there is no standard framework for managing the machine learning lifecycle, which forces organizations to piece together point solutions and hire highly specialized talent to achieve AI.

"Everybody who has done machine learning knows that the machine learning development lifecycle is very complex," Zaharia said during his summit keynote. "There are a lot of issues that come up that you don't have in the normal software development lifecycle."

The company announced two other new features at the event: Databricks Runtime for ML, which simplifies distributed machine learning with pre-configured environments integrated with popular machine learning frameworks, such as TensorFlow, Keras, XGBoost and scikit-learn; and Databricks Delta, which extends Apache Spark to simplify data engineering, improving data reliability and performance at scale.

Billed as the largest event for the Apache Spark community, the Spark+AI Summit drew an estimated 4,000 data scientists, engineers and analytics leaders to San Francisco's Moscone Center.

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at