
A Starter Kit for Text Analysis in Tableau


I was recently asked to visualize some survey results. I’m always excited to start a new project like this, so I eagerly jumped into the data. Unfortunately, I immediately found that almost every question was open-ended. There were virtually no multiple choice or numeric-answer questions at all. Thus, analyzing this data would require something a bit more qualitative than quantitative. It was then that I realized that I had absolutely no idea how to visualize this data and would need some help.

Working at a university, I knew there would be someone there who could help me understand how best to visualize this data. So I spoke to the folks in our Digital Scholarship team and they introduced me to an online software package called Voyant Tools. When you connect to the tool, it prompts you to upload your text, then allows you to do some basic analysis of that text.


The tool provides a variety of different ways to visualize the textual data, including word clouds, word counters, bubble charts, and many more. While I thought the tool was quite good, there were a few drawbacks for my specific use case. First, my data was tabular—it had a row for each survey response along with dozens of columns for each individual question. I could not find a way to get Voyant to load such a data file—it simply expected a large chunk of text. Second, the tool did not have the ability to build custom charts. Being accustomed to Tableau, I felt that this was critical. So, I decided to come up with a way to visualize this in Tableau. Thus was born what I’m calling my Tableau Text Analysis Starter Kit.

A Forewarning…
Before I dive into the starter kit, I want to say that text analysis is a large and complex field and what I’m sharing here only includes very basic capabilities. Additionally, other than doing a few visualizations that compare word usage (such as Word Usage in Sacred Texts), I’ve done almost no text analysis in my career. So please keep that in mind as you read the rest of this blog. My goal is simply to provide some tools to allow you to get started with text analysis in Tableau, along with some examples of potential charts you can create once you have data in a usable format.

Data Prep
With clean data, there are no limits to what can be created in Tableau, so the biggest challenge of this project is to break these large chunks of text into smaller pieces—pieces that can then be contextualized and, to a degree, analyzed using quantitative methods. I started out by considering a few different methods for data prep, including more traditional data prep/ETL tools and an online tool by Wheaton College called Lexos, which allows for in-depth cleaning and prep of data. Both of those options proved to have a couple of drawbacks, so I decided that I’d have no choice but to write some code using Python.

Warning: Though I have a background in programming, I’m an amateur with Python, so if you decide to look at my code, please keep that in mind!

So, after working on the code for a couple of weeks, I now have a Python script that does the following:

1) Breaks each text-based field into individual words (a row for each word).
2) Flags stop words—very commonly-used words such as the, a, is, etc.
3) Identifies each word’s stem—a sort of root word that is shared by multiple similar words (for further details, see Word Stems in English).
4) Breaks each text-based field into n-grams—segments of n contiguous words (the number of words, n, can be specified by the user).
5) Performs basic sentiment analysis for each word and each n-gram.
6) Groups words and n-grams into sections so that you can see how words and phrases are used over time.
7) Outputs a file for individual words and a file for n-grams. Each file links back to the original file via a key (so that you can join back to it in Tableau). Additionally, each file assigns a unique sequential identifier to each word/n-gram so that you can order them in your analysis.

The output files will look something like this:

Sample of Word File

Sample of N-Gram File

Note: The code makes heavy use of the Natural Language Toolkit (NLTK), including the SnowballStemmer for word stemming, VADER for sentiment analysis, and NLTK's list of common stop words.
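
If you're curious how those pieces fit together, here's a rough, simplified sketch of the core NLTK steps (just an illustration with a made-up sample sentence, not the full script):

# A simplified illustration of the NLTK steps described above (not the full script).
# Assumes the required NLTK data packages ("punkt", "stopwords", "vader_lexicon")
# have already been downloaded via nltk.download().
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.util import ngrams

text = "We the People of the United States, in Order to form a more perfect Union..."

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")
sia = SentimentIntensityAnalyzer()

# One row per word, with a stop word flag, stem, and sentiment score.
words = [w for w in word_tokenize(text) if w.isalpha()]
for word_id, word in enumerate(words, start=1):
    print(word_id, word, word.lower() in stop_words,
          stemmer.stem(word), sia.polarity_scores(word)["compound"])

# N-grams (here, n = 3) with a sentiment score for each phrase.
for gram in ngrams(words, 3):
    phrase = " ".join(gram)
    print(phrase, sia.polarity_scores(phrase)["compound"])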

See the “How To Use” section for details on exactly how to use the code with your own data.

On To Tableau
With the data now in a structure that makes it easier to visualize, we can bring it into Tableau. As the word and n-gram files only contain the text fields that were parsed, I’d recommend that you start by joining them back to your original file. From there, you can fairly easily build your charts. For example, we can, of course, create word clouds (for both individual words and for n-grams). Note: All of the below samples have been generated from the US Constitution.



But I’m not a fan of word clouds as I feel there are usually better ways to visualize such data. So, instead of word clouds, let’s just use a simple sorted bar chart. The following allows you to click on a word, which then highlights that word within the text so that you can see where it’s used.


And there are lots of other charts that can be created such as a bubble line chart, which I’ve based on a similar chart in Voyant, a barcode chart, line charts showing usage over time, etc. All of these can be easily created with the word and n-gram data from the Python program, with no extra data prep.

If you take some time to do a bit of additional data prep, you can create somewhat more advanced charts. For instance, you could create a tree diagram.


Or you could create a network diagram.


Or you could even create a sort of “circular link diagram” that shows word connections throughout the text.


Note: For details on how to create a tree diagram, see Jeff Shaffer’s blog, Node-Link Tree Diagram in Tableau; for details on how to create a network diagram, I recommend Chris Conn’s How to use Gephi to create Network Visualizations for Tableau.
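
To give you a sense of what that additional prep might involve, here's one possible approach (a rough sketch, not necessarily the exact process I used) for turning the word-level output into a simple edge list of adjacent word pairs, which you could then feed into Gephi or use for a link-style chart. The column names here (Word, Stop Word, Record ID) are illustrative and may not match the actual output file exactly:

import csv
from collections import Counter

# Read the word-level output, skipping stop words.
with open("Words.csv", newline="", encoding="utf-8") as f:
    rows = [r for r in csv.DictReader(f) if r["Stop Word"] == "False"]

# Count each pair of consecutive words (within the same record) as a connection.
edges = Counter()
for prev, curr in zip(rows, rows[1:]):
    if prev["Record ID"] == curr["Record ID"]:
        edges[(prev["Word"].lower(), curr["Word"].lower())] += 1

# Write an edge list that Gephi (or Tableau) can consume.
with open("Edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in edges.items():
        writer.writerow([source, target, weight])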

In order to make these different charts easily accessible, I’ve compiled them into a single Tableau Public workbook, which you can download and use as desired.


How To Use
OK, here’s how to use the Python program. For starters, you need to create an input file in CSV format. This file needs to have one field called Record ID, which contains a unique numeric ID. This will allow us to index the output files so you can easily join them back to the original file. The file must also include one or more text-based fields. The file can, of course, contain any other fields you desire—the program will simply ignore them.
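
For example, a valid input file might look something like this (the survey questions and responses below are made up purely for illustration; the Department column is there just to show that extra fields get ignored):

Record ID,Department,Question 1,Question 2
1,Finance,"The new dashboards save me a lot of time.","More training sessions, please."
2,IT,"Mostly positive, though the data refresh feels slow.","Better documentation would help."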

Once you have your input file ready to go, you’ll need to run the Python code. You have two options here. First, if you’re comfortable downloading the code and running it yourself, you can find the code on my GitHub account: Text.py. If you’re not comfortable with this, then I’ve compiled the code into an executable package which should run on pretty much any platform. Download the package, Text Analysis Package, onto your computer (Note: because the NLTK package is very big, this executable package is quite large as well—over 1 GB—so be patient while downloading). Once you’ve downloaded the package, unzip it to a location on your computer, then run the Text.exe program from within the dist folder.

When you run the program, it will show a simple GUI for collecting inputs.


It isn’t the prettiest GUI, but it works! You’ll now need to update each of the inputs as described on screen. Once you’ve updated the fields, click Submit. For files with limited text, the program should complete quickly; larger texts might take a bit longer. But, when it’s done, it will output two files to the same location as your input file, called Words.csv and NGrams.csv.

Finally, you can download my Tableau workbook, update the data sources to use these files, and start building!


A Couple of Examples
Before I wrap up, I just want to point you to a couple of recent examples that use some of these text analysis techniques. The first, by Robert Rouse, analyzes the text of The Acts of the Apostles. This is a truly amazing piece of work, so I highly recommend that you take a look. My second example, by Elijah Meeks, analyzes responses from the 2019 Data Visualization Community Survey. While this was not created in Tableau, it is an excellent example of how we can use these techniques with open-ended survey questions. In fact, his use of tree diagrams is what inspired me to include them in this starter kit.


Thanks for reading! If you have any thoughts or questions, please let me know in the comments section.

Note: As you'll see in the comments, there have been a number of people who've had trouble running the GUI. If you have this problem, please contact me--most of the time, it has a reasonable explanation, so I'm happy to work through the problem with you.

Ken Flerlage, September 28, 2019

8 comments:

  1. Hi Ken,
    Thanks for posting this! I'm getting KeyError: 'Record ID' when I run the Python program. One of my fields is called "Record ID"...any clue why this is happening?

    Thanks!

    Reply: I saw this the other day as well. For some reason, Python seems to see some extraneous characters in the first field name. Can you try moving Record ID to be the second column in your data file? If that doesn't work, please email me...flerlagekr@gmail.com
  2. Awesome guide! I'm having some problems running the exe and was wondering if you had any quick tips. Anaconda is installed, yet the exe does nothing upon clicking. The code also hangs when I give an input file, turning to not responding even on a small dataset. Any ideas, anyone?

    Reply: Thanks for the feedback. I'd like to figure out what's going on here so I can correct the problem. Any chance you could send me an email? flerlagekr@gmail.com
  3. Anyone have luck addressing the above? I'm experiencing the same issue: the .exe and .py file do nothing when executed via the Python desktop app or command prompt. Thanks in advance.

    Reply: I really want to address this problem so I can make sure others do not receive it. Can you email me at flerlagekr@gmail.com?
  4. Just went through your blog on word clouds...started downloading your Python package...hopefully it should work with the large data set I have...will keep you posted in case of any issues...

    One question I have is: is there no other way to work this directly through Tableau, without running Python or some other coding script?

    Reply: Not that I can think of. I try to do everything in Tableau Prep or Tableau Desktop as my programming skills aren't what they used to be, but this was a situation where I needed something more.
