SQL - Order of Execution Benefits

Hendrik Schultze
2 min readMar 16, 2021

Understanding how to optimize load processes in HQL/SQL for a better utilization of your cluster.

From time to time I dump some random knowledge to colleagues and I thought this is maybe also relevant for you : )

SQL is a highly structured language, which means it is important to stick to the rules. On the other hand, everything that is not forbidden — is allowed.

Usually, you will order a query like this:

  1. SELECT
  2. FROM
  3. JOIN or OUTER JOIN with ON
  4. WHERE
  5. GROUP BY and optionally HAVING
  6. ORDER BY

But the order of execution is slightly different:

  1. FROM
  2. JOIN or OUTER JOIN with ON
  3. WHERE
  4. GROUP BY and optionally HAVING
  5. SELECT
  6. ORDER BY

Since the FROM is executed before the SELECT we can use the data for multiple operations — at once!

Imagine you work with a big table, like user events, click or access logs and you want to extract data that can not be filtered out by pushing down predicates. This means for every time you want to extract a specific set of data, a full table scan is performed. That is quite expensive, slow, and can be avoided.

Let us populate two tables from the same source. The source tables contain 1000000000 entries.

So in total, we had to iterate twice over the same data source which is an unnecessary waste of resources.

We can optimize that, by merging both queries into one.

By making use of the order of execution, we half the amount of entries we iterate and therefore can save resources on our cluster.

You will notice that the data is still moved in a sequence way after the first steps are executed. To enable the parallel execution on a cluster, you have to enable parallel execution for Hive.

set hive.exec.parallel=true;

You can see, with this simple example already how you half the amount of processed data. I hope this will help anyone in the future. : )

--

--