Data Toboggan 2022
SQL Serverless allows you to add structured schemas over standard file formats in your data lake storage. For all the benefits this brings, one of the trade offs is that your schema is only enforced "on read", rather than "on write".
So if it's your Synapse Pipelines that are responsible for writing/updating the data that is stored in your data lake, then they should also be responsible for updating the schema of the SQL Serverless views that you use to query it.
In this session, James Broome will show a simple way to dynamically update SQL Serverless views from with a Synapse Pipeline so that they're always up to date with the data in your data lake.
Hello, my name is James Broome. I'm the Director of Engineering at a UK based technology consultancy called endjin. We're Microsoft Gold Partners for Cloud, for DevOps, for Data Platform, for Data & Analytics, and also Power BI partners. That means we do lots of work in Azure for our customers, specifically around data & analytics. And over the past couple of years, that's meant lots of work inside Azure Synapse.
For the next 10 minutes, I'm going to talk specifically about one feature of Synapse called Synapse SQL Serverless. I am going to talk about how you can use Synapse SQL Serverless to create structured views over your files in your data lake, and I'm going to take it a step further and show you how you can dynamically create those views using a combination of stored procedures and Synapse Pipelines. I'm going to talk about three specific use cases as to why you might need to consider doing that based on scenarios and experiences that we've had whilst delivering Syanpse projects into production over the past couple of years.
I'm going to start though by switching into Synapse Studio and reminding ourselves what Synapse SQL Serverless actually is...
Creating Synapse SQL Serverless views
So over here, we've got a CSV file. This is a file on our data lake. If we right click, we can do "New SQL Script" and do "SELECT TOP 100 ROWS" and we can hit run. And immediately we get some data back, which is the data inside our CSV file. Now what SQL Serverless has done here is its pre-populated a SQL script using the
OPENROWSET command to basically query over that CSV file. I haven't provisioned a database, I've just deployed a Synapse Workspace and SQL Serverless is a tool inside the Synapse Studio toolbox, if you like, it's just there out of the box. It's not actually a database, but it gives us the ability to query the files in a structured way using standard T-SQL syntax. So if you've done anything with SQL Server as your SQL DB, for example, this should all be very familiar.
Now, that's great, but what if we decide that we actually want to turn this structure we've got here into something a bit more permanent, we want create a view over that data. We can take it a step further, we can use this SQL syntax here and wrap it in a standard
CREATE VIEW statement. And we can run that using the same
OPENROWSET command in the middle, and if we navigate to our SQL Serverless pool instance and refresh our views, we can see that we now have an
Orders view, which we can query using very standard
SELECT * from our
[dbo.Orders] view, and we get the same data.
Reasons for creating views dynamically
Okay. Happy days. Now, that's great, but I talked about being able to create these views dynamically. Why might you want to do that? And there's three reasons that we've experienced in projects that we've delivered for our customers:
1. Lots of views to create
The first one is this. At the minute we've created one view over one file for Orders. What if our data lake contains lots and lots of different types of files? Imagine there's flat file extracts from our ERP or central back office system. We've got orders, we might have products and customers, reviews, returns, all sorts of different things. There may be tens, hundreds of different file types. So we don't want to manage very boilerplate code to create all these views over all these different types of data. So, we might want to use a convention-based approach to loop through all the kind of folders in our data lake and create views accordingly based on some kind of pattern match.
2. DevOps deployment needs to create views
The second reason is that an interesting point to note is that the schema in SQL Serverless views is enforced when the files are read. It's not enforced when the data is added to the lake - this isn't a database table, we're not trying to get data in to a specific schema. But when the view is created, which is when the file is queried, that's when the schema is basically being read and essentially enforced.
What that means is, if you have different environments - say you've got a dev environment, you've got a test and a prod environment, quite a standard DTAP topology - if we want to push our changes through those environments, we want to push our schema from Dev into Test into Prod, or just deploy into any other environment, we can't actually do that. We can't create these views until that data is in the place that we want it to be. So the second scenario is we might want to be able to create these views programmatically, once something else has happened before - e.g. once the ETL process has run and our data has been projected out, at that point we can then create the views.
So that means our end to end deployment process can't be done in one step. It needs to be done by deploying the initial infrastructure then running some kind of process, and then finally deploying the views once the data is in the right place.
3. The schema is evolving
And the third reason is actually similar, related to the fact that we we don't enforce our schema until the data is read. The third reason is that our schema is changing over time - we have an evolving schema. It maybe for all sorts of reasons, it may be completely expected, it may be during the development phase of your data platform. On day one, you may expect to know 10 columns in your CSV file, and then the next week, actually you need another 3. So being able to create these views dynamically, again, maybe at the end of the process once the data has been spit out at the other end, let's just recreate those views every time based on what's in there, it can be a very low maintenance time-saving solution. Basically you delegate the responsibility of creating the schema of the views to the thing that's creating the data projections that you want to create the views over as well.
All these three scenarios are things that we've experienced in real life. And the nice thing is, the approach I'm going to show you how to do this, applies to all of those scenarios. So let's go and look at that further...
Creating a parameterised, convention-based CREATE VIEW statement
So we've got our view here, but first of all let's just delete our Orders view. I've got a script to do that - "Drop View" - so if we start again that's now gone. Our view to create the orders is tightly coupled to our CSV file at the minute, it's just going to use that. But, what we can do is create a view that's not tightly coupled and is actually convention-based and parameterised.
So this bit here, what we have to actually do is wrap it in a stored procedure (and you can create stored procedures inside SQL Serverless) so we've got a stored procedure called
CreateSQLServerlessView and what I'm doing is just passing in a single character, which is
@Name, and that name is being passed into something we all know and love - a bit of dynamic SQL. The name gets passed into here, and we're basically building up dynamically the statement to create the view using the name. We've still got the
OPENROWSET command, and we pass in the name into the path in the data lake. So we're relying on the convention that our orders would live under an orders folder, products will live in a products folder, etc, etc. So if we had one view, or if we had a hundred views, the same thing would apply.
If we create this stored procedure, the stored procedure lives in SQL Serverless and executing it would just be as you expect. So if we run this, we now execute our stored procedure for Orders. And if we go back to our SQL Serverless pool, we now have exactly the same thing again.
So that's step one. Now we've got a convention-based store procedure, that means we can create a view based on any kind of folder path that we pass in. Clearly your own use cases might be different - you might want to change the convention and change the pattern matching, and you might need more parameters than just
@Name, but you get the principle of what we can do.
Let's drop this view again. And refresh this again so that it's gone.
Executing a SQL Serverless stored procedure from Synapse Pipelines
So, the second step is - now we've got the stored procedure, how do we execute it? And this is interesting because the Synapse API doesn't support being able to execute the SQL query. It supports pushing those SQL scripts into the Synapse workspace so that they are they to be triggered or to be run, but we can't actually execute it through the API. What we can do though, is use standard command line SQL tooling, like
sqlcmd, but also we could do something that's in-the-box in Synapse, we could use Synapse Pipelines.
So if we go over here, we've got a pipeline. The first pipeline I'm going to show you is quite simple - it has one activity, and it's a stored procedure activity. Now the stored procedure activity is using a Linked Service which I've already set up. The Linked Service is actually based on a SQL Server connection pointing to the endpoint for Synapse SQL Serverless that you can obtain from the the management pane in Azure when you've deployed the workspace. I'm just pointing to the Synapse endpoint and the database I care about, and using the Synapse Managed Identity to connect, and you can test that connection to prove it works.
Interestingly, there is an out-of-the box stored procedure activity for Synapse, but it only works with the dedicated SQL pools. So you want to do this with SQL Serverless, you have to go down this route. Once we've got that Linked Server set up though we can then call the stored procedure, which creates our SQL Serverless view. And in this example, I'm passing in one parameter, which is our
@Name parameter, and hard-coding it to "orders".
So if we trigger that from the sales pipeline, basically the effect is the same - the pipeline triggers the stored procedure passing in "orders" and we should end up with one view again. We can see that succeeded already, let's go back into our database, let's refresh the views again, and now we can see all the views again, looking exactly the same, there we go.
Let's drop that view one last time.
So that first pipeline is really useful for when you have an entire set of views. I passed in one parameter, but clearly you could extend that further because you could you could pass in a pipeline variable that had a whole list of strings - you could have 20 views you want to create, you could pass that in and loop through them and do it for each of them.
Using dynamic metadata from the data lake file system
Now we can extend that slightly further if I show you the next pipeline. If you want to do that dynamically based on the folder structure, we can extend the pipeline to use a metadata activity that's actually pointing to the data lake as the file system. And it's pointing, in this case, to the parent folder of where our orders folder was. It's getting all the child items, that's getting passed into a For Each activity, and the For Each activity is basically, using the parameterized notation, getting the child items for every folder in that parent folder. It's calling into the same "Create SQL Serverless View" activity. And in this case, it's dynamic because it's taking the
@item.Names. So basically it's going to read all the folders, and for each folder, call into the same activity and pass in the folder name and then call the stored procedure.
So if we run that one just to show how that works. We've only got one folder in the data lake so that the results should be the same. That's running, and is probably finished already. Here we go - it's in progress. If we dig in, we can see a bit more. There you go - we've got the folder list for each, so there's only one "Create SQL Serverless View". Let's go back into a data lake database, I didn't refresh it before, but it's still here.
So now we could trigger that at the end of something - we could now trigger over an entire data lake set of folders and create the view for each of them.
The final thing I want to show you is actually what might happen, as you've probably got some kind of ETL process before all of that, and that cannot happen until the end. So if we dig into this pipeline, the ETL process could be as complicated as as you like, it could be all sorts of stuff. In this case, I've just got a Copy activity. And then just to show you, I've got an If condition which might be that we want to choose whether to update the scheme or not, depending on whether we know that the scheme has evolved or not. And if that was true, we could then call into the same process. If not, then we could do something else. We could have some kind of logical gate as to whether we want the views to be dynamically created at the end.
So it's really that simple - using a combination of a stored procedure and a pipeline, we can dynamically create SQL Serverless views we use over our data lake, and I've explained three reasons why you might want to do that.
Just quickly before I wrap up, there's two things to be aware of - two gotchas if you like.
The first one is, if we look at our view, we're relying on Synapse to basically infer the schema, because we're not hard-coding a schema, and if you dig into this in slightly more detail, you can see it's not particularly optimized, especially when looking at the
VARCHAR columns, which default to
VARCHAR 8000. Clearly there's a potential performance impact of that. If your data is a lot smaller, you could extend that stored procedure quite easily to be a bit more granular in terms of what it's going to do based on the kind of data type you're expecting, so there'sways to tweak that.
The second thing is to talk about the schema evolution. If you are expecting your schema to change a lot, and you want to manage that properly, then your mileage may vary with this approach and you may want to start looking at something like Delta Lake and using Spark, because there's a lot more support for that type of scenario in that technology.
That brings us to the end of this talk, hopefully you found it useful. Hopefully you've realised how flexible and powerful the combination of SQL Serverless and dynamic view creation inside Synapse Pipelines can be. So with that, I'll just say, thanks for listening and have a great day.