If you've been keeping up with Build, you are probably aware of the announcements around Azure Synapse Analytics. If you have invested, or are about to invest, in Databricks, the new Synapse Spark offering is likely to have grabbed your attention, and rightly so. Why is Microsoft putting yet another Spark offering on the table, and what does it mean for me?
A little over two years ago Microsoft announced the general availability of Microsoft Azure Databricks to great fanfare and promise. Spark was already offered on HDInsight, but Databricks was different. It was offered as a PaaS service, promising less configuration, faster performance, and a compelling collaborative notebook experience. Databricks was becoming a trusted brand, and providing it as a managed service on Azure seemed like a sensible move for both parties. Microsoft went into full marketing overdrive: they pitched it as the solution to almost every analytical problem and were keen to stress how well it integrated into the wider Azure data ecosystem. Reality soon started to follow, with tighter integration with AAD and Azure Data Factory. Databricks was pitched as the heart of the Azure Data Platform, sucking up data, transforming it and spitting it out, usually into a SQL Data Warehouse.
Not long after, it became clear that Azure Data Lake Analytics, an alternative Azure service, no longer had a place in Microsoft's future data strategy. This came much to the annoyance of many who had bet on the consumption-based SQL/.NET service. Instead, people were told to re-skill in Python and to join the Databricks party - or get left behind on a stagnating platform.
In February 2019 it was announced that Microsoft was part of a $250m investment in Databricks, appearing to reinforce its commitment to the platform.
So why has Microsoft now decided it is time for another Spark service, and what does this mean for Databricks?
Azure Synapse vs. Azure Databricks
Perhaps the relationship with Databricks meant that Microsoft could not innovate at the pace they wanted to. Databricks, after all, are keen to be seen as cloud agnostic and need to invest in areas that fulfil the greatest market need. On the other hand, why should Microsoft help Databricks be successful on their competitors' platforms?
Regardless of the why, Microsoft has decided not to incorporate Databricks within Azure Synapse Analytics, the greatest shake-up of their data offerings in the history of Azure. Instead there is a new managed Spark service. Azure Synapse Spark, known as Spark Pools, is based on Apache Spark and provides tight integration with other Synapse services.
Just like Databricks, Azure Synapse Spark comes with a collaborative notebook experience based on nteract, and .NET developers once again have something to cheer about, with .NET notebooks supported out of the box. Traditional Spark developers will be pleased to know that Spark Pools come pre-loaded with Anaconda, offering over 200 libraries for machine learning, data analysis and visualization. It also comes with support for Delta Lake 0.6.0.
But things start to get really interesting with the integration story. Getting up and running with Spark is seamless: AAD identity, access management and security have been baked in from the start. Mounting a Data Lake Storage account is no longer a pain. You simply specify the storage account when you create a Synapse workspace and it becomes immediately available for querying. Synapse Analytics has its own managed identity, making it easy and intuitive to manage access.
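As a rough illustration of what "no mounting" can look like from a notebook, here is a minimal Python sketch. It assumes the data sits in the workspace's Data Lake Storage Gen2 account and is addressed with a standard `abfss://` URI; the account, container and folder names are hypothetical, and `spark` stands for the session that a Synapse notebook provides. Treat it as a sketch under those assumptions rather than the recommended pattern.

```python
def abfss_path(account: str, container: str, relative_path: str = "") -> str:
    """Build an abfss:// URI for a file or folder in Data Lake Storage Gen2."""
    cleaned = relative_path.lstrip("/")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{cleaned}"


def read_parquet(spark, account: str, container: str, relative_path: str):
    """Read Parquet data straight from the lake by URI, with no mount step.

    `spark` is assumed to be the SparkSession a notebook already exposes.
    """
    return spark.read.parquet(abfss_path(account, container, relative_path))


# Hypothetical names, for illustration only.
example_path = abfss_path("mydatalake", "raw", "/sales/2020/")
print(example_path)
```

In a notebook, `read_parquet(spark, "mydatalake", "raw", "sales/2020/")` would then return a DataFrame, provided the identity running the notebook has been granted access to that storage account.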
The new unified Synapse Studio development experience offers tight integration with Spark. You can visually navigate your Data Lake Storage account and immediately start querying files through helpful context menus. You can even query Spark tables from the new SQL Serverless service. I'm sure we will see even more integration with Synapse services in the future, making teams more productive.
In terms of performance, I can only assume Microsoft engineers will be working hard to optimise for their specific infrastructure. Being unshackled from the constraints of running on multiple clouds, we may see Spark on Synapse outperforming Databricks soon.
So is Databricks on Azure dead? I very much doubt Azure Databricks is going anywhere soon. Microsoft has been burnt here before and I think they will be very careful to manage the messaging around this. If you have Spark environments spread across multiple clouds then Databricks is clearly a good choice. It's a great platform and it will continue to be a compelling option. However, if I were starting a Spark project on Azure today I would definitely start by looking at Azure Synapse Analytics. It already feels like the most productive and cohesive cloud Spark offering out there.
Want to get started with Synapse but not sure where to start?
We have also created a number of talks about Azure Synapse:
- Serverless data prep using SQL on demand and Synapse Pipelines
- Azure Synapse - On-Demand Serverless Compute and Querying
- Detecting Anomalies in IoT Telemetry with Azure Synapse Analytics
- Custom C# Spark Jobs in Azure Synapse
- Custom Scala Spark Jobs in Azure Synapse
Finally, if you are interested in more content about Azure Synapse, we have a dedicated editions page which collates all our blog posts.