SQL Like a Pro: Please Stop Using Distinct!!
Every time I see a “DISTINCT” I ask the same question: Why??
Humans are funny, we tend to do all sorts of assumptions. One assumption I see again and again, is the concept that “DISTINCT” is some kind of black magic keyword which resolves all your “duplication” issues. What if I told you you’re probably wrong? Not only are you wrong, but you’re also burning large amounts of compute/money just to still be wrong in the end?
“DISTINCT”: An Expensive Keyword
If you think about it, for the database to guarantee there are no duplicates within a dataset, it needs to compare all the rows to each other, weeding out the repeating rows. This requires:
- a lot of CPU — comparing large amounts of rows is hard work…
- loads of memory to store all rows — all rows need to be compared to each other, which means you need them all available in memory (even if only a hash, depending on low level algorithm implementation)
- reduced parallelism capabilities — even if you can parallelise part of this computation, it’s a hard problem to parallelise in an efficient way
“DISTINCT” is a really expensive keyword, and worst, it is misused most of the…