Ads Top

SQL for Tableau Part 5: Subqueries

Welcome back to our series on SQL for Tableau Users. This week, Robert Crocker will be diving into subqueries...

In my opinion, you aren’t really having fun with SQL until you are writing subqueries. Subqueries answer much more complicated questions but they can look tricky. Lose focus while writing subqueries and who knows what you’ll end up with. We hope this walkthrough makes these powerful analytical queries more approachable so you can leverage them in your work, career, or side hobby.

If you have already set yourself up with SSMS, then you are ready to go. If you didn’t follow along with Ken’s setup in his first post, you could also try Mode Analytics. With Mode, you can easily connect to hosted databases. To connect to Ken's SQL database, follow this super simple process passing in the connection details below:

Host: ec2-52-14-205-70.us-east-2.compute.amazonaws.com
Port: 1433
Database Name: SupersStoreUS
Username: SQL
Password: SQL

Once connected you’ll be able to write some simple SQL…

SELECT *
FROM Orders;

The result of that query should look something like...


Awesome!

This isn’t a subquery, but taking a quick sample of your data is a great place to start in any case. Even in Tableau, we start by sampling new data connections.


The query Tableau writes to retrieve data from your data source is very similar to the query...

SELECT TOP(1000) *
FROM [DATE SOURCE];

This query we wrote in Mode to get us started says SELECT everything (denoted by the *) FROM the orders table.

SQL is a breeze, right?

Now let’s analyze some data...

 

As analysts, you know answers are rarely found on the surface. More often the process of getting to the bottom of your problem looks a lot more like asking why, why, why, why, why. Don’t worry, we won’t go too far down this rabbit hole, but you’ll at least see enough power in subqueries to want to explore their possibilities.

Let’s say you want to see a list of cities in descending order of profit? Said differently, you’d like to see of list cities, ordered from most profitable to least profitable. Go on, take a crack at it. I’ll wait…

If you don’t remember the clauses Ken introduced in his first post, here’s a link back to the basics.

Ready to look at the solution?

If written well your query will look something like this...

SELECT City,
  SUM(Profit) AS Total_Profit
FROM dbo.Orders
GROUP BY City
ORDER BY Total_Profit DESC;


This query will give us a nice list of cities and their total profit sorted in descending order. Of course, we could save the sorting step for Tableau, but it’s always nice to do a spot check before moving on to the next step. You’ll want to remove the ORDER BY before moving this code to production.

With this result, your manager might say something like, “Thanks for that, now what I’d REALLY like to see are the ten cities where we are pocketing the most cash.” You are then left to translate this human chatter into something interpretable by your computer. What your manager is really after are the ten most profitable cities.

Now it’s your turn to take a crack at writing this next query. C’mon she’s waiting...

Hint:
SELECT TOP(10)....
FROM
GROUP BY
ORDER BY

I’m sure you aced it. Here’s what I wrote:

SELECT TOP(10) City,
  SUM(Profit) AS Total_Profit
FROM dbo.Orders
GROUP BY City
ORDER BY Total_Profit DESC;


“Great! Oh, and I just had another thought”, your manager starts off again, “I wonder what the average daily profit looks like across the top three cities. Do you think you can have that done by the end of the day? It’d be nice to know what days are best to launch our campaigns”

“Of course”, you say in a cold sweat. Time to break that request down into its itty bitty pieces.

Before attacking this problem, let me explain why subqueries are perfect for queries that rely on data from multiple tables. With subqueries, we can ask for a result set from another table and join back to the main table so long as we have a common column between the tables with the data we need.

Let’s imagine the case where we have a customer table and a sales table and we want to find sales by customer. Your query would look something like...

SELECT Customer.Name AS Customer_Name,
    Customer.Address as Customer_Address,
    Customer.Zip AS Customer_Zip
    Sub.Total_Sales
FROM Customer
JOIN (
SELECT Customer_id,
SUM(Sales) AS Total_Sales
FROM sales
GROUP BY Customer_id
) AS Sub
ON Customer.Customer_id, = Sub.Customer_id;

In the example, we calculate Total_Sales from the sales table and pull all customer attributes from the customer table. The key here is that we leave ourselves a linking field, Customer_id. Without this common field, which is often called a key, no relationship can be made.

Now back to the question posed by your boss.

Ultimately what we want to show are those days that are lagging behind in the top three most profitable cities. Whether that’s a good question to ask isn’t your call. At this stage, you are just trying to grab the right data.

In order to get your tidy table you need to first find the top 3 most profitable cities AND THEN disaggregate profit across all days of the week. Each of those questions is a tidy table, but one depends on the other. This is where subqueries enter the scene.

TERMINOLOGY: Sometimes subqueries are called nested queries or inner queries. I’ll try to stay consistent, but I’ll probably hop around.

When writing queries like these, it is smart to start from the inside and work your way out. In this case, start with grabbing the top three most profitable cities. Luckily you have essentially written this query already. You just need to modify it to pull back the top three instead of the top ten.

That’ll look a little like…

SELECT TOP(3) City,
  SUM(Profit) AS Total_Profit
FROM dbo.Orders
GROUP BY City
ORDER BY Total_Profit DESC;


Now that we have the top three cities, we can use them to filter the result of our next query. We’ll create this filter via the JOIN. If you need a refresher on JOINs be sure to visit Ken’s previous post.

As we start to think about the final table to be used in our reporting, we can decide on which columns are needed to answer our managers' question. Those columns would be cities, days of the week, and the average profit. Oh, and it might be nice to order by city and days of the week. With that structure in mind, let’s take a look at how the SQL query. No need to cram all the logic in at once. We can take this one step at a time.

The only wrinkle is that we need to use a function to derive the day of the week. As Ken noted in his first post, functions often vary between different database systems and getting the day of the week is no exception. Here are the functions you’d need in the most common database platforms:

SQL Server:  DATENAME(dw, [COLUMN NAME]) AS Day_of_Week
Oracle:  TO_CHAR([Date], 'DAY') AS Day_of_Week
MySQL:  DAYNAME([Date]) AS Day_of_Week

What did you come up with? Something like this...

SELECT City,
  DATENAME(dw, [Order Date]) AS Day_of_Week,
  AVG(Profit) AS Average_Daily_Profit
FROM Orders
GROUP BY City,
  DATENAME(dw, [Order Date])
ORDER BY City,
  Day_of_Week;


Now we just need to make room in our outer query for the top three cities. Joining on the city name is going to filter the results of our outer query to only those cities we are targeting.

If we had a table that told us each cities rank by profit, then we would be able to use a single table. But, in the real world, databases usually lack this sort of intelligence. Therefore we need to do a bit of work and leverage subqueries. In this case, we are going to go find the top ranking cities ourselves. We’ll then leverage the results in our outer query.

Hint:
SELECT o.City,
  DATENAME(dw, [Order Date]) AS Day_of_Week,
  AVG(o.Profit) AS Average_Daily_Profit
FROM Orders AS o
JOIN ( ... ) AS sub
  ON ... = ...
GROUP ...,
  ...
ORDER BY ...,
  ...;

How’d you do? Here’s my final query:

SELECT o.City,
  DATENAME(dw, [Order Date]) AS Day_of_Week,
  AVG(o.Profit) AS Average_Daily_Profit
FROM Orders AS o
JOIN (
  SELECT TOP(3) City,
    SUM(Profit) AS Total_Profit
  FROM dbo.Orders
  GROUP BY City
  ORDER BY Total_Profit DESC
  ) AS sub
  ON o.City = sub.City
GROUP BY o.City,
  DATENAME(dw, [Order Date])
ORDER BY o.City,
  Day_of_Week;


I know that might have been tough, but it’s also fun if you don’t get to write these often. It’s always a pleasure to see good SQL in action.

Now, before you go running off to your manager with this raw table of data you should put your Tableau skills to good use. We are multi-talented professionals dammit! Visualize your work so your audience is delightfully informed.

Here’s what I got.


If she sticks to her gun, it will be easy for her to see that a campaign in NYC should be run on Saturday, mid-week in LA, and Sunday in Seattle.

At the end of the day, subqueries are just one table referencing another. There are countless permutations you can run through to get value out of these. In future posts, we’ll share suggestions on how to keep your code clean and introduce the concept of Common Table Expressions (CTEs). Keep an eye out for these in the coming weeks.

In the meantime, if you have any questions, feel free to reach out to me. You can find me on Twitter @robcrok or on LinkedIn.

Robert Crocker, November 10, 2018


1 comment:

  1. Challenging and Worth Taking! Amazing way of articulation. Thanks Ken & Rob..

    ReplyDelete

Powered by Blogger.