Ads Top

SQL for Tableau Part 1: The Basics


Data is rarely structured so perfectly that you can simply drag and drop your data into Tableau. Most of the time, you’ll need to perform some transformation or cleaning of that data. Data prep tools such as Tableau Prep, Alteryx, WhereScape, and any dozens of other ETL tools can make this much easier. And many of these tools provide a logical, graphical interface to build your workflows. But, under the covers, most of these tools are simply executing Structured Query Language commands to the database. So, understanding SQL can be quite valuable to a Tableau user. Just look at most Tableau job descriptions and you’ll often see a requirement for SQL.

I have always worked in Information Technology (as opposed to the “business” side) and I spent many years as a SQL Server DBA. I have also spent years building data warehouse models, so I have a pretty deep understanding of SQL. However, many Tableau analysts do not have similar backgrounds, so this series of blog posts will attempt to provide a primer to SQL, with a specific focus on Tableau users. That being said, I hope that it will also prove a useful guide to others as well. Just be aware that I’ll often refer back to Tableau throughout the series. If you’re okay with that, then read-on; if not, then there are lots of other SQL resources out there that may be more applicable to you.

Our Sample Data
To aid in your ability to practice what I’m showing here, I’ve set up a publicly accessible SQL Server instance on AWS and created a database based on the 2018.2 version of Tableau’s Superstore data. I should probably note that this instance is quite small (so that I can stay within AWS’s free tier) so it won’t be fast, but it should work fine for our purposes. To access the database, follow these steps:

1) If you do not already have it, install SQL Server Management Studio. Management Studio is a free GUI for managing SQL Server, but it’s also the primary tool for writing, testing, and deploying SQL.

2) Once installed, open SSMS and, in the “Connect to Server” dialog, enter the following:
         
Server Name: ec2-52-14-205-70.us-east-2.compute.amazonaws.com
Authentication: SQL Server Authentication
Login: SQL
Password: SQL

This user will grant you read-only access to the “SuperStoreUS” database only.

Relational Database Theory
Before getting into some SQL, it’s important to have a basic understanding of relational database theory (I’m specifically referring to “relational” databases here because of the new breed of “No SQL” database platforms which eschew the standards of traditional relational platforms. Many of them support some SQL, but not all. We’re going to focus on the traditional relational database management systems (RDBMS), which include platforms like Oracle, MySQL, PostgreSQL, and SQL Server.)

Relational databases organize data into “tables” (also called “relations”) consisting of rows and columns. These columns are also commonly referred to as “attributes” or “fields” (and sometimes more specific names like “dimensions” and “measures”). Rows are often called “records” or “tuples”. I’ll uses these terms somewhat interchangeably in this series. Each column in a database tends to have a specific data type—text, number, date, etc.—and length. This is one area where databases differ a bit from spreadsheets (and some No SQL platforms). Spreadsheets do not enforce specific data types, which means that tools like Tableau must sample the data and make a guess about the data type of each field. True databases’ data type enforcement leads to more consistent and reliable data.

Relational database tables tend to include a primary key (but not always) which includes one or more columns and uniquely identifies the records. Tables can also include other “indexes” which help to enhance the ability to quickly and efficiently retrieve data. If an appropriate index exists to assist with a lookup, then the database is able to quickly find the data in the index rather than scanning through the entire table. This is a key point when it comes to using SQL to retrieve results from a database because indexes can be critical to creating high-performing SQL. Look for a future post about optimizing your SQL where we’ll discuss this in a bit more detail.

Databases contain many other “objects” besides tables. Some of the key ones are as follows:

Views – A “view” of the data in a database. This can be just a slightly different way to look at a single table or a more complex combination of multiple tables through constructs such as joins and unions. But, views do not really contain their own data. They are just SQL that provides a different way to look at the data. Note: This is not entirely true as some platforms allow you to “materialize” a view such that it contains the actual results of the query.

Stored Procedures – SQL contains many of the constructs available in full-blown programming languages, which means that SQL developers can write complete data-driven programs. Often these programs will be encapsulated into a “Stored Procedure” which can be called through use of a simple SQL command. Stored procedures can take on multiple purposes. Some may only write to a database. Others may perform a series of tasks and, in the end, return some result set (more on this in a future post).

There are lots of other object types, such as user-defined functions, constraints, etc. but we won’t go into those here.

Database Management Systems (DBMS) tend to be installed on a server (or a “cluster” of multiple servers). The software can be installed multiple times on a single server, resulting in multiple separate “instances”. Each instance can contain multiple “databases.” And each database contains multiple tables and other objects (Note: This is SQL Server terminology; Oracle swaps the usage of “instance” and “database.”). In the example below, ec2-52-14-205-70.us-east-2.compute.amazonaws.com is the instance name, SuperstoreUS and Test are databases, and Orders, People, and Returns are tables within the SuperstoreUS database.



Structured Query Language (SQL)
Techopedia defines Structured Query Language as “a standard computer language for relational database management and data manipulation. SQL is used to query, insert, update and modify data. Most relational databases support SQL, which is an added benefit for database administrators (DBAs), as they are often required to support databases across several different platforms.” So, essentially, SQL is the language of databases. It is used for virtually every single database operation. Even GUIs like SQL Server Management Studio are simply executing SQL based on your button clicks.

Since 1986, the American National Standards Institute (ANSI) has maintained a specific set of standards for SQL. Most relational database platforms support this standard, but because it does not meet every possible need, individual vendors such as Oracle and Microsoft (SQL Server) have extended their own version of SQL within their platforms (Oracle calls their version “Procedural Language/Structured Query Language” or “PL/SQL” and Microsoft calls theirs “Transact SQL” or “T-SQL”). This is very important to remember as each database platform’s version of SQL will vary slightly. However, you will find that most of the constructs remain the same and, if you know SQL for one platform, it is relatively easy to adjust to another.

Due to my experience with SQL Server, I’ll be using it throughout this series. However, I’ll try to use standard ANSI SQL as much as possible and I’ll do my best to clearly note when I deviate from this standard.

SELECT/FROM
As noted above, you can do lots with SQL. But, in the use of Tableau, most often you will have little need for most of this functionality. Your main focus will be on retrieving data. For this, you’ll use a SELECT statement. Here’s the basic syntax:

SELECT <Field Names> FROM <Table Name>

For example, the following will retrieve the Order ID and Order Date from the Orders table:

SELECT [Order ID], [Order Date] FROM [Orders]


Note the use of square brackets. Strictly speaking, if no spaces or special characters existed in the field names, we wouldn’t need the square brackets. But, it’s good to get into the habit of including them around all objects and field names (and since Tableau does this as well, it should be a fairly easy habit to form).

As you can see, the list of fields you’re retrieving is just separated by a comma. You can include as many fields as you like. If you wish to simply retrieve all fields from the table, you can replace the field names with * as follows:

SELECT * FROM [Orders]


While this can be much easier to write, it is generally discouraged because it requires the database to work harder to return your data, resulting in slower execution times. We won’t be dealing with query optimization in this blog, but for now, just take my advice that it’s best to explicitly name only the fields you need, rather than using SELECT *.

A side note on table names: A database can contain different schemas. I won’t go into detail about the purposes of these different schemas, but just be aware that they can exist. You can also use SELECT statements to pull data from different databases within a single instance. With all of those variations, it is often best practice to prefix the table name with both the database and schema name. Our database is called “SuperstoreUS” and our schema is dbo (the default for SQL Server), so we could rewrite the our earlier SQL as:

SELECT [Order ID], [Order Date] FROM [SuperstoreUS].[dbo].[Orders] WHERE [Region]='East'

Because we’re working within a single database and schema, I’ll be dispensing with this more verbose syntax for the remainder of this blog. Just be aware that this syntax may be required in certain circumstances.

WHERE
We can also add criteria to our statement in order to filter our result set. For example, the following will retrieve the Order ID and Order Date from the Orders table, but only those orders in the East region.

SELECT [Order ID], [Order Date] FROM [Orders] WHERE [Region]='East'

Note the use of single quotes around text values. Tableau allows you to use either single or double-quotes, but in SQL, you must use single quotes only.

The criteria itself is a boolean expression (one that is either true or false). For example, [Region] = ‘East’ will either be true or false for each record in the table. The query above will only return those records where the expression evaluates to true. You can specify any number of criteria you like as long as the overall expression is boolean. Take a look at the following examples:

SELECT [Order ID], [Order Date] FROM [Orders] WHERE [Region]='East' AND [Segment]='Consumer'

SELECT [Order ID], [Order Date] FROM [Orders] WHERE [Region]='East' OR [Region]='West'

The first statement will only retrieve orders in the East region for customers in the Consumer segment. Since we want both of these to be true, we separate them with “AND”. The second statement includes an “OR” rather than “AND” so it will return any record where either of the criteria is true—either the East region or the West region. You’ll notice that this criteria is fairly similar to what you might put in an IF statement within a Tableau calculated field.

ORDER BY
In many cases, you may wish to sort your result set. For this, you’ll use an “ORDER BY” clause. ORDER BY is added to the end of the statement as shown below:

SELECT [Order ID], [Order Date] FROM [Orders] WHERE [Region]='East' ORDER BY [Row ID]

This will take the results and sort them ascending by the Row ID. Note that this field does not need to exist in the list of fields being retrieved. You can explicitly state the direction of the sort by adding ASC (ascending) or DESC (descending) to the end (if excluded, ascending will be used by default), such as:

SELECT [Order ID], [Order Date] FROM [Orders] WHERE [Region]='East' ORDER BY [Row ID] ASC

SELECT [Order ID], [Order Date] FROM [Orders] WHERE [Region]='East' ORDER BY [Row ID] DESC

Like your list of fields to be retrieved, your order by statement can also include multiple fields. For example:

SELECT [Order ID], [Order Date] FROM [Orders] WHERE [Segment]='Consumer' ORDER BY [Region] DESC, [Order ID] ASC

This statement will first sort the results descending by Region then ascending by Order ID. So, with the superstore data set, all the West region records will come first, followed by South, then East, then Central. Then, each region’s records will be sorted ascending by Order ID.

AS
In some cases, you may wish to rename your columns before returning the results or you may wish to create entirely new columns. This is where the “AS” clause comes into play. For example:

SELECT [Order ID], [Order Date], [Product ID] as [SKU] FROM [Orders] WHERE [Region]='East'

This will change the Product ID field’s name to SKU in your result set, as shown below:


This can be particularly useful in Tableau, especially if you are trying to replace some existing data source and need the field names to match up. It’s also very useful in UNIONs and many other use cases. And, of course, it’s nice to have the flexibility to change the field names without doing so manually in Tableau.

But AS can also be used to create entirely new “derived” fields. Here’s a simple example:

SELECT [Order ID], [Order Date], 1 as [Number of Records] FROM [Orders] WHERE [Region]='East'

This will create a new field called Number of Records, which simply contains a 1. This example is much like the “Number of Records” calculated field in Tableau and could be used for counting records.

But, these “derived columns” can also get much more complex, involving the use of mathematical expressions such as:

SELECT [Order ID], [Order Date], [Sales]/[Quantity] AS [UnitPrice] FROM [Orders] WHERE [Region]='East'

This example finds the unit prices by dividing the sales amount by the quantity.

Or we can introduce various “database functions”. For example:

SELECT [Order ID], [Order Date], [Product ID], LEFT([Product ID], 3) AS [Product3] FROM [Orders] WHERE [Region]='East'

This example creates a new field, Product3, which contains the first three characters of the Product ID field. There are many different built-in functions available within SQL, but this is an area where you start to see a lot of variation in different database systems. For example, most variations of SQL include a function that is similar to Tableau’s “MID” function, which extracts a portion of a string. In SQL Server, the function is SUBSTRING, but in Oracle, it’s SUBSTR. In PostgreSQL, it’s also SUBSTRING, but the parameters are different than SQL Server. So, we have to be very careful here. That being said, these functions can be extremely powerful and we’ll address them in some more detail in a future blog post.

Coming Soon
That’s all for this introductory blog post, but I’ll be writing more on this topic in upcoming blogs. Here are a few of the topics we’ll address:

• Joins & Unions
• Aggregation
• Retrieving Distinct Records
• Built-in Database Functions
• Sampling Your Data
• Pivoting and Unpivoting
• Sub-Queries
• Query Optimization
• Using Initial SQL in Tableau
• Passing Parameters to Custom SQL in Tableau
• Other Cool Stuff!!

As always, thanks for reading!!

Note: Part 2 (Combining Data via Joins and Unions) is now available!

Ken Flerlage, September 3, 2018

19 comments:

  1. Thank You sir For Giving such Good Information About SQL...

    ReplyDelete
  2. Thanks! Can't wait for the upcoming posts.

    ReplyDelete
  3. Thank you for sharing, very detailed explanation.

    ReplyDelete
  4. Can't Connect to SQL Server on AWS. Kindly Assist.

    ReplyDelete
    Replies
    1. What error are you getting? Are you trying to connect from a personal or work computer? If work, then it's likely that some firewall is blocking you.

      Delete
    2. I can't connect either. My error: SQL Server Provider "In order to ran the trace against SQL Server you must be a member of sysadmin fixed server role or have the ALERT TRACE permission

      Delete
    3. That doesn't make sense. Can you send me an email offline and we can try to troubleshoot? flerlagekr@gmail.com

      Delete
  5. It's really helpful...thank you for sharing this

    ReplyDelete
  6. echoing all previous comments - this is great and the clearest/cleanest explanation I've read so far! This page has been bookmarked and I will keep checking for the next installments. Thanks, Ken!

    ReplyDelete
  7. Thank Ken Very insightful, I come from the business side of things and I'm moving into the SQL side of analytics. Thank you for explaining it this way.. Good Job!

    ReplyDelete
  8. Thank you for your sharing. I am a junior Tableau user who also is interested in learning SQL simultaneous with Tableau. I feel lucky to find your blog. I would like to follow you from now on.

    ReplyDelete
  9. Very interesting post, looking forward to the forthcoming ones.

    As an intermediate user of both SQL and Tableau, I've often wondered whether new data prep tools I'm currently looking at like Tableau Prep or Alteryx are nothing else than a sort of SQL query builder in disguise for people without sql programming skills. My understanding is that in the end the do send sql queries to the db, right?

    The final goal should be to keep data preparation/blending as close as possible to where the data is stored (DBs, DWs DL, etc) for performance reasons mainly and making the final data set as simple/clean as possible for the visualization tool (tableau, powerbi, qlik, etc). As such using SQL as the db native language does makes a lot of sense to me. Having said that those new tools have their own pros. What's your view on that? Generally speaking do you see a value in using a combination of sql + prep tools or would you go for one or the other according to programming skills?

    ReplyDelete
    Replies
    1. I think a true ETL tool is an absolute necessity. Sure, you could write SQL to do all of it, but that means you've created all custom code, which has to be maintained. ETL tools allow you to put that into a visual workflow, which is much easier and more efficient to build, understand, and maintain. And there are lots of other advantages of ETL tools. One key one is that transferring data from one source to another cannot be easily done in pure SQL--ETL tools provide that capability from a variety of source platforms. Plus ETL tools often take away the heavy burden of minor tasks that you have to perform day in and day out. SQL is really important, but I believe that it's a poor alternative to a true ETL.

      Delete
  10. HI, cannot connect to the Server. The error shown is the following:

    TITLE: Connect to Server
    ------------------------------

    Cannot connect to ec2-52-14-205-70.us-east-2.compute.amazonaws.com.

    ------------------------------
    ADDITIONAL INFORMATION:

    A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 53)

    For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=53&LinkId=20476

    ------------------------------

    The network path was not found

    ------------------------------
    BUTTONS:

    OK
    ------------------------------

    ReplyDelete
    Replies
    1. Are you trying to connect from work? It is most likely a firewall that is blocking your connection.

      Delete
  11. Awesome! Very helpful, thanks Ken. Love the way you lay this out & explain it, and looking forward to reading the next ones in the series.

    ReplyDelete
  12. A BIG THANKS! for the Awesome article Ken. Well articulated! I worked out while i gave a read to this. This is Awesome! Thanks once again, Ken!

    ReplyDelete

Powered by Blogger.