SQL for Tableau Part 4: Order of Operations


Welcome back to our series on SQL for Tableau Users. If you missed the first three parts of the series, be sure to check them out here:


For part 4, I’m handing over the reigns to Robert Crocker, who will be discussing the SQL Order of Operations. Robert is a remote data visualization designer/developer, testing the hypothesis of remote productivity around the world for over a year now. Robert has worked with Fortune 500's, consulted for top Tableau partners, and many things in between. If data visualization is of interest to you, then reach out on Twitter @robcrok, find him on LinkedIn, or hang out in person at the upcoming Tableau and Tapestry conferences.

The Order of Operations
At this point in the series, I hope you can see and agree that SQL is amazing! It gives us the ability to answer just about any question you might want to throw at your database. This is also why those with SQL skills are so highly sought after! With a firm command of SQL, you have what it takes to freely explore your data stores.

Let’s say you are operating a food truck selling tacos.

With your SQL skills, you can find out which neighborhoods prefer El Pastor to Pollo, which combos are most common, which neighborhoods have a place in their heart for your sweet churros, and which like their tacos hot hot hot. I can’t tell you why I have gone on a Mexican food truck tangent, but I can’t stop. With SQL skills your questions don’t need to stop either.

Plus, SQL nearly reads like a sentence. How convenient?!?

SELECT tacos FROM the taco sales table WHERE spiciness is LIKE ‘Inferno’ AND price <= 5.00. Now, you might not ask your friends to tell you where to find cheap tacos truck with the spicy eats exactly like this, but your ask wouldn’t be far off. So, you almost speak SQL already.

SELECT taco,
  price
FROM taco_truck.sales
WHERE price <= 5.00

Tableau is similar in that you can easily create charts that look good by simply dragging and dropping pills onto different parts of the screen. Want to see sales by tacos in the form of a bar chart? Easy! Sales of our top three spiciest tacos over time? No problem!

However, when you questions become more complex, you’ll eventually need to dig deeper into Tableau or SQL…

Oh, you wanted to compare the market share of Carne Asada burritos at the three beaches you visited throughout the week? Or why not look at the 3 customers who have ordered the spiciest tacos across all locations? Why? Because they deserve a place on the wall of flame.

When we move that much data around, we have to be aware of exactly where those records are in the process and how they move through our SQL clauses. SQL is easy to read, but it can be confusing since the way we read it isn’t the same order as the way our computers process the code.

Rather than continuing to inflate this post with reference to the delicious cuisine south of the border and telling you why the way we usually think about SQL is wrong, I’d like set the record straight from the start.

It’s better that we get a good grasp of how the computer is processing your request before muddying the waters. So we are going to think deeply about how to persuade those processors to do exactly what you want in a language designed for that exact task. And what language would that be? SQL, duh!

So how does the computer interpret SQL?
Probably not the way you’d think.

The reason why understanding the order each clause is processed is so important is because the clauses are interdependent on one another. SQL is called a declarative language, which means you state what you want and the SQL engine will try to deliver your data as efficiently as possible.

Almost like placing an order at the taco truck. At the taco trucks, you don’t need to know what’s going on in the kitchen. You just tell the cashier at the counter what you’d like and the team does their best to deliver the delish in the most efficient way.

But you haven’t read this far to sit back and leave your taco delivery to chance. No, you want a tour of the kitchen. You’d like to open up your own taco truck. To do this you need to know how the magic happens under the hood.

To that end, SQL queries are processed in the following logical order:

1.  FROM
2.  JOIN
3.  WHERE
4.  GROUP BY
5.  <AGGREGATION>
6.  HAVING
7.  SELECT
8.  DISTINCT
9.  ORDER BY
10. TOP / LIMIT

I left a few less common clauses off the list, but if you nail these, you are in good shape. Unfortunately, it won’t do you much good to memorize this list. What we really need is to understand how this order came to be.

Thankfully, it is logical. For all you Type A’s like myself, this should relieve you of some anxiety.

Also note that we only need two of these clauses for a query, the FROM and the SELECT. The other clauses just increase the importance of your knowledge around how these clauses relate to one another.

FROM
This is the first and most important clause. Your query isn’t much good without telling the computer from which table to pull data.

JOIN
A clause needed for pulling data from multiple tables. Clearly important when adding other tables to the party. The computer wants all the raw materials loaded up before moving on to subsequent steps. Better to have more ingredients to choose from at the start rather than limit your creativity with nothing more than cheese, carne asada, and corn tortillas.

WHERE
Now that all our rows are loaded into memory, let’s get rid of those we don’t care about. The WHERE is your first chance to lighten the data load. This is done with a logical conditional. By logical condition, I mean those conditions that result in either a true or false. Often these are referred to as Booleans.

GROUP BY
Okay, here’s where things really get interesting. Adding a GROUP BY clause significantly changes what’s available to every subsequent clause on the list above. I know, we just started going through the list and it’s already getting complicated!

The biggest and most surprising impact I have learned is that only columns listed in the GROUP BY are available to columns further along in the process. No other columns from the table in your FROM or tables from your JOIN are available unless you put them in an aggregate function.

Besides being conscious of the important complication caused by the GROUP BY clause, I try to think of GROUP BY as grouping rows that don’t make sense to aggregate. If we isolated one column of names, for example, the result of the GROUP BY clause on that column would be a unique list of names.

<AGGREGATION>
This is where you can run those simple aggregate functions (typically in conjunction with a GROUP BY clause). You can read all about aggregates in part 3 of this series.

HAVING
This is basically a WHERE clause on the results of aggregations performed earlier. The WHERE clause allows you to filter the data from your FROM, but it’s done at the row level. HAVING lets you apply similar filters to data after SUMs, MINs, MAXs, AVGs, and other fun aggregates have been applied. Note: This is very similar to a filter you might create in Tableau based on an aggregated measure.

SELECT
Ha, you thought this was the first clause the computer executes! It’s okay. I was just as surprised as you when I found out. This clause is where we get to list the columns from our GROUP BY if we had one, or any columns from our FROM if we didn’t include the GROUP BY clause.

DISTINCT
The DISTINCT clause removes duplicate records. If you think about it, this coming so close to the end makes a lot of sense—I mean, how is the computer supposed to scan through and find duplicates before you have made your selection? It can’t. The DISTINCT clause has to follow the SELECT to know what rows are duplicates.

ORDER BY
Once all the processes above have complete we can FINALLY order what’s left.

TOP
We haven't introduced this concept yet, but TOP allows you to sample your data set. For example, the following will select the first 10 orders sorted by highest profit:

SELECT TOP(10) *
FROM dbo.Orders
ORDER BY Profit DESC;

Note that this can be written with or without the parentheses. It's also important to note that TOP is available in SQL Server only. To write the above statement in Oracle, you'd use the ROWNUM "pseuodocolumn" which basically just numbers the records. So the above query would be written as:

SELECT *
FROM dbo.Orders
WHERE ROWNUM <= 10
ORDER BY Profit DESC;

And, in MySQL, you'd use "LIMIT" as follows:

SELECT *
FROM dbo.Orders
ORDER BY Profit DESC
LIMIT 10;

Putting it into Practice
Let’s start with a simple query that displays our cities and their profit in descending order.

You can follow along by opening up Mode or SQL Server Management Studio and connecting to the database Ken has hosted for us. That query looks like…

SELECT City,
  Profit
FROM dbo.Orders
ORDER BY Profit DESC;

Now let’s look at the difference between the order we write (lexical) our clauses in are ordered and the order the SQL engine steps through them (logical).
 
Not much difference here, but the order is different even in the case of this very simple query.

What if we wanted to look at the total profit of each city rather than the individual rows?

Are query would look like…

SELECT City,
  SUM(Profit) AS Total_Profit
FROM dbo.Orders
GROUP BY City
ORDER BY Total_Profit DESC;

Now things get interesting and your new knowledge of SQL’s logical ordering comes into play.
Our FROM still comes first and the SELECT and ORDER BY come last, but the GROUP BY clause hops in front of the SELECT and we add an aggregation step.

One more simple example…let’s see which 3 cities are most profitable cities. That query will look like...

SELECT TOP(3) City,
  SUM(Profit) AS Total_Profit
FROM dbo.Orders
GROUP BY City
ORDER BY Total_Profit DESC;

We only added one additional clause but the way we write this and the order that the computer interprets the clauses differs.
Okay, the last example we are going to look at includes a join and a subquery. If you’d like to test yourself you can get a pen ready and try to sketch out how what the lexical and logical mapping will look like before scrolling down.

Here’s our query…

SELECT o.City,
  AVG(o.Profit) AS Average_Profit
FROM Orders AS o
JOIN(
  SELECT TOP(3) City,
    SUM(Profit) AS Total_Profit
  FROM dbo.Orders
  GROUP BY City
  ORDER BY Total_Profit DESC
  ) AS sub
  ON o.City = sub.City
GROUP BY o.City
ORDER BY o.City,
  Average_Profit;

Again, before scrolling down, see if you can write down the lexical order of our clauses on the left and the logical order the computer will process them all in on the right.
While the previous query might seem like a mess to unravel, it’s not so bad once you nail the order of operations.

What is new in this query is the idea of an outer and inner query. Again, I know these are new concepts, so don’t worry if it seems tricky. We’ll be fully unpacking the concept of inner queries or subqueries over the next couple of posts. If it isn’t clear in the diagram, just know that the order of operations we have learned will resolve itself and return a data set to be used in the JOIN clause. As soon as the JOIN is finished, we’ll continue as we have.

Here is a quick reference of all the clauses mapped from their lexical position to their logical.
Stay tuned to learn about the analytics capabilities of subqueries over the next couple posts.

Robert Crocker, October 14, 2018


2 comments:

  1. Hi Robert, thanks for this insightful post. I've been teaching myself SQL for the past few months and the explanation for order of operations is really valuable.

    I'm wondering in your last example how the syntax works within the sub-query and overall query. Would you start at the subquery and then work 'out' to the combined query, or group steps together?

    For instance, you have aggregations both in the sub-query and the joined dataset. Would SQL perform sum against the subquery first, and then average against the joined as the next step? Or would it pull from, join, group by, sum against the subquery and then repeat the steps for the joined data?

    In other words, do you look at the "whole" query and the subquery in parallel, where each order occurs against both at the same time, or would you apply all the order of operations to the subquery and then start over with the joined query?

    Thanks so much!
    Caitlin

    ReplyDelete
  2. The order of operations is going to beging at the outer most FROM and move to the JOIN from there. In this case that will lead you to the nested query where you'll start the order of operations over again. Once you resolve that inner query go pick back up with the outer query. Hope that helps :)

    ReplyDelete

Powered by Blogger.