1 year ago
#389305
Phil
Azure Synapse Serverless SQL Pools - how to optimize transformations using notebooks and load tables into ADLSG2
We use Synapse notebooks to perform data transformations and load the results into fact and dimension tables in our ADLS Gen2 data lake. We are disappointed with the time it takes to write the transformed tables to the data lake, and we don't know how to make the notebooks run faster or in parallel.
Some of the things we have tried are as follows:
- Co-located Synapse with our data lake
- Switched to premium data lake storage with solid-state drives
- Cleaned up cached temporary SQL views and DataFrames the moment they are no longer needed
- Used PySpark (Python) functions for better performance
- Used serverless SQL pools to minimize costs (we've tried small, medium, and large Spark pools)
- Used unmanaged external tables in Parquet format
- Tried nested notebooks, i.e. a driver notebook invoking child notebooks (a sketch of this pattern follows the list)
- Tried pipelines that invoke notebooks in parallel (we just learned this is not possible: https://learn.microsoft.com/en-us/azure/synapse-analytics/synapse-notebook-activity?tabs=classical)
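For reference, the nested-notebook attempt looked roughly like the sketch below: a driver notebook fanning out child notebook runs on a thread pool via mssparkutils.notebook.run. The notebook paths, worker count, and timeout are placeholders rather than our real values, and all of the child runs share the driver's Spark session, so they still compete for the same executors.

```python
# Sketch of the nested-notebook attempt: a driver notebook runs child notebooks
# concurrently on worker threads. mssparkutils.notebook.run() blocks until the
# child notebook finishes, so the ThreadPoolExecutor provides the concurrency.
# Paths, worker count, and timeout below are placeholders.
from concurrent.futures import ThreadPoolExecutor

from notebookutils import mssparkutils  # available on Synapse Spark pools

child_notebooks = [
    "/load_fact_equipment",   # placeholder notebook paths
    "/load_dim_equipment",
    "/load_dim_date",
]

def run_child(path: str) -> str:
    # second argument is the per-child timeout in seconds;
    # returns the child's exit value (set via mssparkutils.notebook.exit)
    return mssparkutils.notebook.run(path, 1800)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_child, child_notebooks))

print(results)
```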
We are not dealing with large amounts of data:
- Bronze - 110 tables (3 GB) ~1,000 blobs
- Silver - 14 tables (9 GB) ~16,000 blobs
- Gold - 66 tables (4 GB) ~10,000 blobs
In our SQL Managed Instance, the same transformations (using SQL stored procedures) complete in approximately 15 minutes. Using Synapse notebooks with premium data lake storage, we get the following timings to load the data model:
- Small Spark Pool (4 vCores/32 GB) 3-10 nodes - 38 minutes
- Medium Spark Pool (8 vCores/64 GB) 3-50 nodes - 30 minutes
- Large Spark Pool (16 vCores/128 GB) 3-70 nodes - 25 minutes
The cost of running the medium and large Spark pools is prohibitive. Our goal is to refresh our data model every half hour.
How can we optimize the notebooks? The bottleneck is the time it takes to write the tables to the data lake.
Is there any way to run the loading of the tables in parallel instead of sequentially? Should we abandon serverless SQL pools and use dedicated ones instead? Any suggestions would be greatly appreciated.
Thanks
UPDATE:
As @Veikko pointed out, the root cause of the poor performance is the stored-procedure-style transformation logic, not the time it takes to load the data into the data lake. The bulk of the time is spent loading the DataFrames from the data lake and performing the SQL transformations.
To test this, I ran two notebooks that loaded identical amounts of data into the data lake:
Notebook 1 (sketched below):
- defined 4 functions
- loaded 29 DataFrames into memory
- converted the DataFrames into temporary SQL views
- created 21 temporary SQL views that perform the transformations
- created the table in the data lake (fact_equipment)
- dropped the temporary SQL views and deleted the DataFrames from memory
Total time = 5 minutes
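For context, the overall shape of Notebook 1 is roughly the following. This is a sketch only: the storage paths, table names, and SQL are simplified placeholders, and the real notebook loads 29 source DataFrames and layers 21 transformation views.

```python
# Simplified shape of Notebook 1. Paths, table names, and the SQL are
# placeholders; the real notebook loads 29 source DataFrames and builds
# 21 transformation views before writing fact_equipment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver = "abfss://silver@ourlake.dfs.core.windows.net"  # placeholder container
gold = "abfss://gold@ourlake.dfs.core.windows.net"      # placeholder container

# 1) Load the source tables from the lake and expose them as temp SQL views
for name in ["equipment", "work_orders", "sites"]:       # 29 tables in reality
    spark.read.parquet(f"{silver}/{name}").createOrReplaceTempView(name)

# 2) Layer the transformations as temporary SQL views (21 in reality)
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW vw_fact_equipment AS
    SELECT e.equipment_id,
           e.site_id,
           COUNT(w.work_order_id) AS work_order_count
    FROM equipment e
    LEFT JOIN work_orders w ON w.equipment_id = e.equipment_id
    GROUP BY e.equipment_id, e.site_id
""")

# 3) Materialize the result as an unmanaged external Parquet table in gold
spark.sql("CREATE DATABASE IF NOT EXISTS gold")
(spark.table("vw_fact_equipment")
      .write.mode("overwrite")
      .option("path", f"{gold}/fact_equipment")
      .saveAsTable("gold.fact_equipment"))

# 4) Clean up: drop the temporary views so nothing stays cached
spark.catalog.dropTempView("vw_fact_equipment")
for name in ["equipment", "work_orders", "sites"]:
    spark.catalog.dropTempView(name)
```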
Notebook 2 (sketched below):
- defined 2 functions
- loaded 1 DataFrame into memory
- converted the DataFrame into a temporary SQL view
- created 1 temporary SQL view (select * from fact_equipment)
- created the table in the data lake (fact_equipment)
Total time = 1 minute
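Notebook 2, by contrast, is essentially a straight copy, which is why I read its one-minute runtime as the baseline cost of the write itself. Again, the paths and table names below are placeholders:

```python
# Simplified shape of Notebook 2: read one staged table, pass it through a
# trivial temp view, and write it back out as an external Parquet table.
# Paths and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver = "abfss://silver@ourlake.dfs.core.windows.net"  # placeholder container
gold = "abfss://gold@ourlake.dfs.core.windows.net"      # placeholder container

spark.read.parquet(f"{silver}/fact_equipment_staged") \
     .createOrReplaceTempView("fact_equipment_staged")

result = spark.sql("SELECT * FROM fact_equipment_staged")

spark.sql("CREATE DATABASE IF NOT EXISTS gold")
(result.write.mode("overwrite")
       .option("path", f"{gold}/fact_equipment")
       .saveAsTable("gold.fact_equipment"))
```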
So my primary issue is how to speed up the transformations, and my secondary issue is finding a way to run notebooks in parallel.
Tags: python, dataframe, apache-spark, jupyter-notebook, azure-synapse