1 year ago
#389305
Phil
Azure Synapse Serverless SQL Pools - how to optimize transformations using notebooks and load tables into ADLSG2
We use Synapse notebooks to perform data transformations and load the results into fact and dimension tables in our ADLS Gen2 data lake. We are disappointed with the time it takes to write the transformed tables to the data lake, and we don't know how to make the notebooks run faster or in parallel.
Some of the things we have tried are as follows:
- Co-located Synapse with our data lake
- Switched to premium data lake storage with solid-state drives
- Cleaned up cached temporary SQL views and DataFrames the moment they are no longer needed
- Used PySpark (Python) functions for better performance
- Used serverless SQL pools to minimize costs (we've tried small, medium, and large Spark pools)
- Used unmanaged external tables in Parquet format
- Tried nested notebooks, i.e. a driver notebook invoking child notebooks (a sketch of this pattern follows the list)
- Tried pipelines that invoke notebooks in parallel (we just learned this is not possible: https://learn.microsoft.com/en-us/azure/synapse-analytics/synapse-notebook-activity?tabs=classical)
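For reference, the nested-notebook attempt looked roughly like the sketch below: a driver notebook fanning out child notebook runs on a thread pool via mssparkutils.notebook.run. The notebook paths, worker count, and timeout are placeholders rather than our real values, and all of the child runs share the driver's Spark session, so they still compete for the same executors.

```python
# Sketch of the nested-notebook attempt: a driver notebook runs child notebooks
# concurrently on worker threads. mssparkutils.notebook.run() blocks until the
# child notebook finishes, so the ThreadPoolExecutor provides the concurrency.
# Paths, worker count, and timeout below are placeholders.
from concurrent.futures import ThreadPoolExecutor

from notebookutils import mssparkutils  # available on Synapse Spark pools

child_notebooks = [
    "/load_fact_equipment",   # placeholder notebook paths
    "/load_dim_equipment",
    "/load_dim_date",
]

def run_child(path: str) -> str:
    # second argument is the per-child timeout in seconds;
    # returns the child's exit value (set via mssparkutils.notebook.exit)
    return mssparkutils.notebook.run(path, 1800)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_child, child_notebooks))

print(results)
```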
We are not dealing with large amounts of data:
- Bronze - 110 tables (3 GB) ~1,000 blobs
- Silver - 14 tables (9 GB) ~16,000 blobs
- Gold - 66 tables (4 GB) ~10,000 blobs
In our SQL Managed Instance, the same transformations (using SQL stored procedures) complete in approximately 15 minutes. Using Synapse notebooks with premium data lake storage, we get the following timings to load the data model:
- Small Spark Pool (4 vCores/32 GB) 3-10 nodes - 38 minutes
- Medium Spark Pool (8 vCores/64 GB) 3-50 nodes - 30 minutes
- Large Spark Pool (16 vCores/128 GB) 3-70 nodes - 25 minutes
The cost of running the medium and large Spark pools is prohibitive. Our goal is to refresh our data model every half hour.
How can we optimize the notebooks? The bottleneck is the time it takes to write the tables to the data lake.
Is there any way to run the loading of the tables in parallel instead of sequentially? Should we abandon serverless SQL pools and use dedicated ones instead? Any suggestions would be greatly appreciated.
Thanks
UPDATE:
As @Veikko pointed out, the root cause of the poor performance is the stored-procedure-style transformation logic, not the time it takes to load the data into the data lake. The bulk of the time is spent loading the DataFrames from the data lake and performing the SQL transformations.
To test this, I ran two notebooks that loaded identical amounts of data into the data lake:
Notebook 1 (sketched below):
- defined 4 functions
- loaded 29 DataFrames into memory
- converted the DataFrames into temporary SQL views
- created 21 temporary SQL views that perform the transformations
- created the table in the data lake (fact_equipment)
- dropped the temporary SQL views and deleted the DataFrames from memory
Total time = 5 minutes
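For context, the overall shape of Notebook 1 is roughly the following. This is a sketch only: the storage paths, table names, and SQL are simplified placeholders, and the real notebook loads 29 source DataFrames and layers 21 transformation views.

```python
# Simplified shape of Notebook 1. Paths, table names, and the SQL are
# placeholders; the real notebook loads 29 source DataFrames and builds
# 21 transformation views before writing fact_equipment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver = "abfss://silver@ourlake.dfs.core.windows.net"  # placeholder container
gold = "abfss://gold@ourlake.dfs.core.windows.net"      # placeholder container

# 1) Load the source tables from the lake and expose them as temp SQL views
for name in ["equipment", "work_orders", "sites"]:       # 29 tables in reality
    spark.read.parquet(f"{silver}/{name}").createOrReplaceTempView(name)

# 2) Layer the transformations as temporary SQL views (21 in reality)
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW vw_fact_equipment AS
    SELECT e.equipment_id,
           e.site_id,
           COUNT(w.work_order_id) AS work_order_count
    FROM equipment e
    LEFT JOIN work_orders w ON w.equipment_id = e.equipment_id
    GROUP BY e.equipment_id, e.site_id
""")

# 3) Materialize the result as an unmanaged external Parquet table in gold
spark.sql("CREATE DATABASE IF NOT EXISTS gold")
(spark.table("vw_fact_equipment")
      .write.mode("overwrite")
      .option("path", f"{gold}/fact_equipment")
      .saveAsTable("gold.fact_equipment"))

# 4) Clean up: drop the temporary views so nothing stays cached
spark.catalog.dropTempView("vw_fact_equipment")
for name in ["equipment", "work_orders", "sites"]:
    spark.catalog.dropTempView(name)
```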
Notebook 2 (sketched below):
- defined 2 functions
- loaded 1 DataFrame into memory
- converted the DataFrame into a temporary SQL view
- created 1 temporary SQL view (select * from fact_equipment)
- created the table in the data lake (fact_equipment)
Total time = 1 minute
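Notebook 2, by contrast, is essentially a straight copy, which is why I read its one-minute runtime as the baseline cost of the write itself. Again, the paths and table names below are placeholders:

```python
# Simplified shape of Notebook 2: read one staged table, pass it through a
# trivial temp view, and write it back out as an external Parquet table.
# Paths and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver = "abfss://silver@ourlake.dfs.core.windows.net"  # placeholder container
gold = "abfss://gold@ourlake.dfs.core.windows.net"      # placeholder container

spark.read.parquet(f"{silver}/fact_equipment_staged") \
     .createOrReplaceTempView("fact_equipment_staged")

result = spark.sql("SELECT * FROM fact_equipment_staged")

spark.sql("CREATE DATABASE IF NOT EXISTS gold")
(result.write.mode("overwrite")
       .option("path", f"{gold}/fact_equipment")
       .saveAsTable("gold.fact_equipment"))
```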
So my primary issue is how to speed up the transformations, and my secondary issue is finding a way to run notebooks in parallel.
Tags: python, dataframe, apache-spark, jupyter-notebook, azure-synapse