SQL Like a Pro: Please Stop Using Distinct!!

Every time I see a “DISTINCT” I ask the same question: Why??

Joao Marques @ Data Beyond Ltd
5 min readMar 27, 2022
Photo by Joshua Hoehne on Unsplash

Humans are funny, we tend to do all sorts of assumptions. One assumption I see again and again, is the concept that “DISTINCT” is some kind of black magic keyword which resolves all your “duplication” issues. What if I told you you’re probably wrong? Not only are you wrong, but you’re also burning large amounts of compute/money just to still be wrong in the end?

“DISTINCT”: An Expensive Keyword

If you think about it, for the database to guarantee there are no duplicates within a dataset, it needs to compare all the rows to each other, weeding out the repeating rows. This requires:

  • a lot of CPU — comparing large amounts of rows is hard work…
  • loads of memory to store all rows — all rows need to be compared to each other, which means you need them all available in memory (even if only a hash, depending on low level algorithm implementation)
  • reduced parallelism capabilities — even if you can parallelise part of this computation, it’s a hard problem to parallelise in an efficient way

“DISTINCT” is a really expensive keyword, and worst, it is misused most of the…

--

--

Joao Marques @ Data Beyond Ltd

Just a technical Data Architect having too much fun building new data platforms! Want to support my writing? Why not BuyMeACoffee.com/joao.marques.db