SQL Like a Pro: Please Stop Using Distinct!!

Every time I see a “DISTINCT” I ask the same question: Why??

5 min readMar 27, 2022

Humans are funny, we tend to do all sorts of assumptions. One assumption I see again and again, is the concept that “DISTINCT” is some kind of black magic keyword which resolves all your “duplication” issues. What if I told you you’re probably wrong? Not only are you wrong, but you’re also burning large amounts of compute/money just to still be wrong in the end?

“DISTINCT”: An Expensive Keyword

If you think about it, for the database to guarantee there are no duplicates within a dataset, it needs to compare all the rows to each other, weeding out the repeating rows. This requires:

a lot of CPU — comparing large amounts of rows is hard work…
loads of memory to store all rows — all rows need to be compared to each other, which means you need them all available in memory (even if only a hash, depending on low level algorithm implementation)
reduced parallelism capabilities — even if you can parallelise part of this computation, it’s a hard problem to parallelise in an efficient way

“DISTINCT” is a really expensive keyword, and worst, it is misused most of the…

SQL Like a Pro: Please Stop Using Distinct!!

Every time I see a “DISTINCT” I ask the same question: Why??

“DISTINCT”: An Expensive Keyword

Written by Joao Marques @ Data Beyond Ltd