Build 2019: Cosmos DB Comes Alive for AI with Spark API

One of the killer features of Microsoft's Azure Cosmos DB is that it supports multiple APIs for different types of data. Whether you are storing JSON, key-value data or graph data, Cosmos DB has an engine to support your workloads. The one thing that has been missing is an engine that supports real machine learning or artificial intelligence (AI) capabilities. At Build 2019 this week, that changed when Microsoft introduced support for a Spark API for Azure Cosmos DB. This will enable Cosmos DB to serve as a platform for hybrid transactional and analytical processing (HTAP) workloads.

Spark: What Is It?
Apache Spark is an engine for big data and machine learning. It is often much faster than Hadoop MapReduce for big data processing, thanks to its use of in-memory computing and other optimizations. Spark also has a common set of APIs, including built-in machine learning capabilities, and supports R, Python and Scala. It is commonly implemented on both Azure and AWS through the managed Databricks service, which allows users to come up to speed very quickly.

Benefits of Cosmos and Spark Together
The new API is a full implementation of Spark, which allows you to query data in Cosmos DB containers and get the results back as Spark DataFrames. This comes combined with Cosmos DB's global data distribution and 99.999 percent uptime guarantee. Other benefits include faster time to insight for globally distributed users and data. It also greatly simplifies your analytics architecture by reducing data movement and placing both your data and analytics in a single place.
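
To make the "query a container, get a DataFrame back" idea concrete, here is a minimal sketch in Python. The announcement does not show the syntax of the new built-in API, so this sketch assumes the separate open-source azure-cosmosdb-spark connector instead. The format name and option keys (Endpoint, Masterkey, Database, Collection, query_custom) are that connector's, and the helper names, endpoint and key are hypothetical placeholders.

```python
# A sketch only: assumes the open-source azure-cosmosdb-spark connector is
# installed on the cluster. This is NOT the built-in Spark API announced at
# Build, whose syntax the article does not show.
COSMOS_FORMAT = "com.microsoft.azure.cosmosdb.spark"


def build_cosmos_read_config(endpoint, master_key, database, container, query=None):
    """Build the options dictionary passed to the connector (hypothetical helper)."""
    config = {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": container,
    }
    if query is not None:
        # Optional query, assumed to be pushed down to Cosmos DB by the connector.
        config["query_custom"] = query
    return config


def read_container(spark, config):
    """Load a Cosmos DB container through an existing SparkSession."""
    return spark.read.format(COSMOS_FORMAT).options(**config).load()
```

In a notebook that already has a SparkSession named spark, usage might be read_container(spark, build_cosmos_read_config(endpoint, key, "retail", "orders")), where "retail" and "orders" are made-up names. From there the result can be filtered, joined or fed to Spark's machine learning libraries like any other DataFrame.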

You can use the built-in Apache Spark runtimes for your AI processing. These include Spark MLlib, Microsoft Machine Learning for Spark, Azure ML and Cognitive Services.

In addition to machine learning and AI workloads, you can use the SQL or Cassandra APIs against Cosmos DB for your transactional processing needs. You can even use this arrangement to run extract, transform and load (ETL) workloads. Spark has become popular for this, and you can elastically scale your cluster to meet the demands of those workloads. This helps bridge the common gap between transactional and analytical systems, giving you faster time to insight.

Now with Notebooks
Much of the development work with Spark is done using Jupyter notebooks, a technology that allows queries and code to be annotated with Markdown and easily shared across teams. Microsoft recently added support to Azure Data Studio for SQL notebooks and, as part of the Spark announcement, has announced notebook support for Cosmos DB. These notebooks will be available in the Cosmos explorer, both in the Azure Portal and here. They will support Spark as well as the other Cosmos DB APIs.

Notebooks are a big win for Cosmos DB because they allow developers to share code with business analysts, which improves and speeds up development cycles. The annotation capabilities let users enhance queries and results with more context.

Etcd Support
If that wasn't enough, Cosmos DB is also adding support for the etcd API, which will allow Kubernetes storage to be persisted in Cosmos DB. This can be done without having to separately provision cloud storage, and Azure Kubernetes Service supports this configuration natively.

For more on Microsoft's Build 2019 announcements, read a roundup of the AI-related announcements here and Joey's coverage of Microsoft's data news here.

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Principal Consultant for Denny Cherry and Associates Consulting. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. Joey is the co-president of the Philadelphia SQL Server Users Group. He is a frequent speaker at PASS Summit, TechEd, Code Camps, and SQLSaturday events.