SQL Like a Pro: Query Tuning #1
If you ever got fed up with queries that run for hours… this one is for you!
I started tuning SQL queries very early in my career, being that ETL scripts or business user queries. There was something highly rewarding about taking a SQL query runtime from 10 minutes to 30 seconds, and deep diving into how the database worked, understanding its core query planner behaviours and how to adjust the queries for performance was fascinating. As my career went on, tuning was always part of my work, and I can honestly say I must have tuned hundreds of queries, and saved hundreds of hours of compute.
As the data world move towards “pay for compute” models (snowflake, big query, redshift serverless, spark, …), tuning queries can now be measured in dollars and pounds, and that is even more motivating. Just this week, I tuned a Snowflake query from $3,000/year to $180/year, how cool is that!
The challenge with query tuning though is only a very small set of rules actually work everywhere… that made me think it would be more interesting to show you a few real world examples, rather than generic abstract rules which may work some times.
This will be a series of (anonymized) real world queries which I will tune, explain the reasoning behind it, and share the results with you.
DISCLAIMER: No two queries are the same, so if you came here looking for the holy book of query tuning, I’m afraid you will be disappointed… but don’t dispair just yet, read on! Because practice makes perfect.
Tuning is highly variable with the technology you are using, so I would like to start by defining the applicability of these techniques.
These techniques should be applicable to any database with:
- Distributed Data (data is stored is a distributed file system, either locally or remotely)
- Data is stored in Column Store, and compressed