Darnit, Data Munging

As I’ve continued my data analysis/visualization journey, I’ve bumped into the problem of “data munging.” This goes by various names, including wrangling, tidying, and cleaning. It’s commonly viewed as a major pain in the ass. That’s because, in many ways, it is.

The last few weeks I’ve been working on a project to visualize the Disproportionate Minority Contact (DMC) rates for the state of Minnesota (report here). However, in order to do this, I’ve had to take on a couple of prior data analysis tasks. On the surface, these seemed easy enough, but they’ve taken me quite a while to figure out:

  1. Find the total population by race of each neighborhood of Minneapolis.
  2. Find the highest race population for each neighborhood: For the base layer of my map, I want to show the DMC rate for the highest population of each neighborhood to develop a heat map of DMCs. In effect, I’m hoping this will viscerally depict the higher levels of surveillance/contact in neighborhoods with higher concentrations of people of color.
  3. Calculate the DMC rates for juveniles and adults, as well as totals.

I therefore had to find a dataset that showed a breakdown of crimes by race. The best (most granular) I could find was one for the entire state, and that was embedded in the FBI’s Uniform Crime Report – a PDF. So I first had to extract this data from the PDF. To do this, I use Tabula, which is an excellent little piece of software, to identify and extract the table.

Of course, Tabula is doing it’s best to pull out accurate data from a PDF table. But it didn’t work perfectly. I exported it to a CSV file and tried working with it in R, where I quickly realized that some of the data was being treated properly as numerical, others as factors. Looking through the CSV, I found that some of the numbers were in quotes and had strange spaces or commas in them (e.g. “4 3,123”). I first tried to force these into numerics using as.numeric() in R, but the spaces made this fail. Because it was a relatively small dataset, I ended up fixing all of these in Excel. In the future, I’d try csvkit or OpenRefine.

This finally worked. To work with this data, I ended up using the reshape and dplyr libraries in R (thanks to Andy Zieffler for some advice here). In a future post, I’ll cover what I ended up doing to reshape the data for my purposes, but I’m going to stick with munging tasks for now.

The next task was a harder one: calculate the DMC rates. This is actually computing a Relative Rate Index for arrests, which is calculated by the equation:

(Minority_Arrests / Minority_Population) / (White_Arrests / White_Population).

To do this, I had to add to the dataset above by also getting the total populations by race. This turns out to be easier said than done. Had I known where to look, I might have been able to finish this step quickly, however, it took me about two hours because I had to search through many different data sources. I ended up finding this dataset downloadable from the Census American FactFinder project, which gave me what I needed to calculate the total population of adults and juveniles.

However, given that this is a summary dataset, the CSV had only a single row and about a million different columns, of which I only needed a few. I used OpenRefine to import, modify, and re-export this CSV file into something usable by R.

Quick Tip: OpenRefine doesn’t work by default on the newest versions of Mac OS X. Apparently Apple did something to Java that requires a different kind of application signature by the software manufacturer to work properly. Typing the following in your command line BEFORE installing OpenRefine when in the folder you downloaded OpenRefine to should fix this problem:
xattr -rd com.apple.quarantine google-refine-2.5-r2407.dmg

I found this solution here: Making OpenRefine work on OS X Mountain Lion+

I had to modify the column headers because they were too long and descriptive – unwieldy to use in R.

Finally, I ended up with two usable CSV files. At least 3 hours later. The analysis took paltry time in comparison. Lesson learned: data munging is indeed time consuming.

The other somewhat annoying lesson is that a process like this is rarely replicable, in part because it goes through so many different pieces of software to get to this point. At some point, I’d like to go back and write a script that does this munging all in one place so my analysis can be reproduced elsewhere, but for the time being, that’ll have to wait.

Using the Qualtrics JavaScript API to Create Visualizations

If you haven’t used Qualtrics for surveys before, I highly recommend it. The University of Minnesota maintains a contract with the company and we have access to use it for free. It’s by far the most robust survey software I’ve ever used.

Today I needed to do some visualizations within Qualtrics for a client, based on the results of an individual survey. Rather than visualizing them for a report, however, the client wanted to visualize them within the survey, so that survey users could see the results.

Luckily, Qualtrics let’s you embed JavaScript within questions and draw in external libraries. I’ve been using Chart.js recently, so I decided to use that.  To import the library, open your survey editor and click “Look and Feel”:

Screen Shot 2015-02-06 at 10.16.14 AM

Then click “Advanced”, “Header –> edit”. From this window, click “Source”

Screen Shot 2015-02-06 at 10.17.30 AMCDNJS, which is a CDN (content distribution network) for JavaScript libraries, hosts a current copy of the Chart.js library. Copy the script tag, paste it in and click save:

Screen Shot 2015-02-06 at 10.19.12 AMNow save and return to your survey. The library should load flawlessly.

Now create a basic question and then create a following question made out of text. Put a page break in between.

Screen Shot 2015-02-06 at 10.20.51 AMOur next step is to find out the ID of the question(s) we want to draw from. You can do this easily by (1) editing the question text, (2) clicking “Piped Text”, then clicking “Survey Question”, finding the question you want, then selecting “Selected Choices Recode”. It should create something that looks like this:

Screen Shot 2015-02-06 at 10.21.36 AMGrab “${q://QID1/SelectedChoicesRecode}”.

Now on the left click the configuration wheel and click “Add JavaScript”. You should now see this:

Screen Shot 2015-02-06 at 10.23.36 AMCool. Now grab that text you pasted earlier and make it look a bit like this:

Screen Shot 2015-02-06 at 10.24.38 AMNow you are grabbing the piped text and turning it into a JS variable. We’re going to need to use parseInt() to convert the String into an Integer. But then we can use it to create a JavaScript chart, just like we did in this previous post.

Take a look at the final code:

Screen Shot 2015-02-06 at 10.27.36 AMOne final step before we’re done. We need to add our HTML canvas to the results. Go back to editing the question text and click “Rich Text Editor” then “Source”:

Screen Shot 2015-02-06 at 10.28.35 AMNow preview your survey. Enter in a choice and click “Next”. You should see a chart that looks like this:

Screen Shot 2015-02-06 at 10.29.59 AMAnd voila! You have charts in your Qualtrics results!

Oh, and there’s also a Qualtrics REST API. I’m a big fan.

Browser-based Data Manipulation with Miso.Dataset

This is the first in a series of posts about browser-based data manipulation and visualization.

Anyone hoping to jump on the web data visualization bandwagon is likely immediately intimidated when they start trying to figure out the nitty gritty of how to get various visualization libraries to work. Some like D3 and Google Charts have been around for ages. Other more opinionated and specialized libraries like Chartist are relatively new on the scene. The problem is that there’s about a million more. You can view and contribute to the one’s I’ve aggregated so far in my Data Science & Visualization Tools spreadsheet.

I’m writing this series of blog posts in the hopes that they can walk beginners through the various challenges of different steps in the process and different decision-points that might lead them to invest in a particular library.

So, you want to do some browser-based data viz? Cool – we’re going to figure it out together. Feel free to Tweet me questions along the way @alexfink. For reference, I’m not a professional developer, nor a professional quantitative data scientist. I consider myself first and foremost a critical qualitative researcher. Over the last few years, I’ve increasingly realized the importance of including quantitative data analysis in my work. Inspired by this book and some personal experiences, I’ve seen the ways that statistics and compelling data visualization can cause perspective shifts in important ways.

In my doctoral studies, I learned to do statistics in the open source programming language R. R is a relatively easy and robust language with a very large set of packages that allow for almost any kind of data analysis. R uses the command line of your operating system, but also has a graphical user interface called R Studio. I learned R and I liked R. But the interactive data visualizations – the ones being used all over the web – are built in HTML, CSS, and JavaScript (JS). These are the languages of the web and I wanted to be able to use them to do basic data analysis and visualization.

Let me be clear here. As far as I can tell, you are much better off doing regression analysis and other complex data analysis in R, Python or another statistical package. The data analysis libraries for JS aren’t nearly as robust. I won’t discuss most of that here (there’s loads of resources, including MOOCs and open access books that can help you learn this kind of analysis). The purpose of this series is to talk about taking a dataset from a CSV or Excel file and visualizing it online, using a variety of tools and libraries available.

Let’s get started with something basic: making a simple bar chart. Something like this:

Here’s the steps involved:

  1. Get a dataset to work with
  2. Make calculations on the data
  3. Choose a charting library
  4. Manipulate the data into a format that fits library specifications
  5. Use the library’s API to make a chart

That’s a lot of steps to make a chart you could’ve built in Excel in 5 minutes, right? I think so too. There are tools that help manage simple charts like this more easily in HTML/JS, but the more “opinionated” a tool (meaning the more it forces you to do things a certain way) the less flexibility you have in terms of what you can create. So learning to do all these steps will give you more power and flexibility in the future.

Step 1: Get a dataset

Get a dataset to work with. Preferably, find something in CSV format (Comma Separated Values). CSV is a standard format for data on the web. In this example, I’ll be using the VLSSpercapita.csv data set from this link. The first few rows of the dataset look like this:


As you can see, the column titles occupy the first row, separated by commas. The following rows match up to each of the columns with data, separated by commas. Technically, a CSV can use anything as a separator, but a comma is a broadly used standard.

Got this file downloaded? Great. Create a new folder on your computer and drag this file into it. Then create a new file entitled “index.html”. This is the web page that will display your chart.

Step 2: Make calculations

Most of the time, you aren’t handed data you can just visualize. You need to manipulate it in some way. For example, what kind of worthwhile visualization would I create out of this CSV with ~6,000 rows? Any visualization I can imagine would be sorta useless.

But we could create something meaningful if we manipulate the data a bit. For example, our data headings (and the data codebooks available on the above mentioned page) indicate that we have in Dollars the expenditure per capita of a surveyed household. We also know the Region in which this household resides. A very simple calculation might tell us what the average Dollars expended per capita in households in any given Region might be. This data proves to be useful by offering us a summary of an otherwise unmanageable amount of data.

So… how do we do this? This CSV appears to be sorted by Region, so we could go through and cut the document into pieces by Region. We’d quickly find out, however, that the document only appears to be sorted by Region. Which leaves us back where we started. So we need something that can manipulate the data and develop summary understandings with it. We could do this quite easily in R, but let’s say for funsies that we want to stay in JS (as I’d prefer to do). Unlike R, JS doesn’t have a built-in data summary function.

This means that we need to make one or use an already existing library. There are a number of libraries, like lodash/underscore that get us part way there by giving us great variable manipulators for JS. But they aren’t built to work specifically for data analysis. A great library I discovered that offers a minimalistic toolset for working with data in Javascript is Miso.Dataset. While relatively young, I found this project to provide the tools I needed to do some basic data manipulation in the browser.

So open up that index.html file in a text editor (or better yet, a code editor like Atom) and get hacking. Try this out to start with:

If you run this in a browser, you won’t see anything! Don’t worry about it. You’ve still gotta fix something. See the part that says <script src=”miso.ds.deps.min.0.4.1.js”></script>? Well, you need that to point to the Miso library. Which means we’ve got to download it. Download the Miso.Dataset JS file to your newly created folder. This link includes the dependencies you’ll need (and if that doesn’t make sense, don’t worry about it for now). To get it to work, you just need to make sure that src=”” statement points to a file by the same name that sits in the same folder as your index.html. That’s it – now you’ve got the Miso.Dataset library setup to run on your web site. If we refresh the index.html web page, you should still see nothing! But now, your code should be running well. Now you’ll see nothing because we haven’t done anything yet. At this point, all our code does is setup a basic HTML page, load the Miso.Dataset library, and prepare us to load the VLSSpercapita.csv file, noting that commas are going to be our delimiters. Well, that’s not very satisfying. But we’ve got the foundation in place for what we’re gonna do next. The next step is to really play around with this data – to group it by Region and then to get the mean Dollars of each Region. Here’s the code to update your index.html, and I’ll explain the rest at the bottom of it.
The top part of our code stayed the same, but then we added this .fetch() thing. What’s that? Well, you can read the documentation on the Miso website. In short, fetch is used to pull in a local or remote dataset. The way we’ve used it here, it pulls in the data asynchronously, meaning that it draws in the data over time and will continue to execute other functions that occur after it. However, we haven’t put anything after the fetch, but instead, inserted it into a success callback. This callback is executed once the fetch is successfully completed. That way we know the data has been fully loaded into the browser window. Don’t worry about this for now — it would mainly be an issue down the road when you are using a massive dataset or building a web site for the public to see. For now, it’s safe on our local computer and as long as we keep the rest of our workflow within the success callback, we’re golden.

I’m not going to explain the rest of the JS here – there’s loads of JS tutorials out there – but you can look up callbacks, event handlers, and anonymous functions if you want to understand a few things going on here a bit more deeply. The next thing we do with our data is log a couple of things to our browser console to show that things are working properly. You can access your browser console under “Tools -> Web Developer -> Toggle Tools” in Firefox and “Tools -> Developer” in Chrome. Where you see this:

console.log("Available Columns: " + this.columnNames());
console.log("There are " + this.length + " rows");

The text in quotes will be logged to the console. It will also show the names of the columns Miso.Dataset detected and the total number of rows.

Next is where the real work comes in. We now use Miso’s groupBy function. The first argument we provide is the name of the column we want to group by. The second is an array (as designated by []) of the other columns we want to include. We need to provide the latter part because we are in effect creating a whole new derivative dataset based on our initial dataset. There’s lots of language-specific reasons why we’re doing that, but this is also common to R and Python as well.

The next argument is an object with the attribute “method”, which in turn is handed an anonymous function. This is a lot of technical mumbo jumbo to basically say that we’re overriding the default action of the groupBy method for combining things into groups. The default is sum, but we don’t want a sum, we want an average! So we provide our own function here.

This function appears a bit complicated… and it is! It’s a complicated way to get an average across an array (again, something that would be quite easy to do in R or Python). But don’t worry about it. Instead, Google it! Whenever you get stuck at a point like this, try something like “javascript average across array” and you’ll find examples of what you’re looking for. I’m not going to explain our particular functionality here, because it dives too much into JS, but try looking at the docs for the lodash library if you want more information.

Finally, we log our new data to the console. Specifically, we are logging the Dollars column so we can see if our averaging worked out.

Step 3: Choose a charting library

This is a difficult choice because there are about a million. At this point, just pick one and give it a try. If you get really involved in this work, it will be worth exploring the many options and considering the various characteristics you might value: ease of use, scalability, speed, flexibility, responsiveness, capabilities, documentation, community, and so forth.

For the purposes of this post, I’m going to use ChartJS. ChartJS is a relatively robust set of responsive charts that use the Canvas element of HTML5 to paint out-of-the-box attractive graphs. It has great documentation, which makes it easy to get started. Once we do this quick example, you should be able to adapt it to making a line graph or a bar chart fairly easily.

Step 4 & 5: Finishing our graph

So, here’s the final code for your index.html page. Read through it and notice especially what’s changed. I’ve thrown in some comments (denoted by /* */ or //) that explain what’s going on, but I’ll explain it more detail after the code.

We haven’t added much here. In fact, the new code comes almost straight out of the ChartJS documentation. Basically, we’ve created a new object (denoted by {}) that has two attributes, “labels” and “datasets”, both of which are arrays (denoted by []). The labels, in order, denote the names that are attributed to the numbers in our CSV file (which I found by looking at the code book). If you look at the code book, you’ll notice they aren’t in order. That’s because the groups aren’t ordered sequentially by number! If you look again at your browser console, you’ll see that the console log of Region puts them in a different order. I simply matched the order of these to the right name for that region in the code book. But order is important here – the labels will match up only if we put them in the same order as the actual Regions are listed in the array output by the console.log().

The datasets array is where the magic happens. datasets is actually an array of objects, where each object represents one dataset. Each dataset object includes various options, in addition to the data itself. Here, I’ve included various color changes and a label. But the most important part is the data piece. Essentially, the data attribute takes an array of values. In the console.log we did of the grouped Dollars data, we saw this: Array [ 144.98576251455194, 219.60131914893623, 166.8890960451977, 195.5782360742705, 139.15073369565212, 350.4979667644188, 195.40419064748207 ]. This array is what we want to show up in our graph as our Y axis data, as it represents the averages of each region. So we use data: barData.column("Dollars").data again, which inputs the data we want into our graph.

We need to do two final things. We’ve now got data in the right format to go into our ChartJS library. This would look different if we decided to use a different library, depending on the format the library requires. To finish our chart in ChartJS, we now need to find the Canvas element we created earlier and ask ChartJS to actually create the chart. The final two calls do just that. The first identifies the Canvas element on the page, while the second uses the ChartJS library to map the chart.

… and viola! You’ve made a chart using entirely web technologies. Congratulations. I encourage you to play around with this — change the fill colors, labels, or if you get really creative, the way you are doing the data manipulation.

Ask questions. More to come.