As I’ve continued my data analysis/visualization journey, I’ve bumped into the problem of “data munging.” This goes by various names, including wrangling, tidying, and cleaning. It’s commonly viewed as a major pain in the ass. That’s because, in many ways, it is.
The last few weeks I’ve been working on a project to visualize the Disproportionate Minority Contact (DMC) rates for the state of Minnesota (report here). However, in order to do this, I’ve had to take on a couple of prior data analysis tasks. On the surface, these seemed easy enough, but they’ve taken me quite a while to figure out:
- Find the total population by race of each neighborhood of Minneapolis.
- Find the largest race population for each neighborhood: For the base layer of my map, I want to show the DMC rate for the largest racial group in each neighborhood, producing a heat map of DMCs. In effect, I’m hoping this will viscerally depict the higher levels of surveillance/contact in neighborhoods with higher concentrations of people of color.
- Calculate the DMC rates for juveniles and adults, as well as totals.
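The second task — picking out the largest racial group per neighborhood — can be sketched with dplyr (which I mention using below). The neighborhood names and numbers here are made up for illustration:

```r
library(dplyr)

# Hypothetical long-format data: one row per neighborhood/race pair
pop <- data.frame(
  neighborhood = c("Whittier", "Whittier", "Longfellow", "Longfellow"),
  race         = c("White", "Black", "White", "Asian"),
  population   = c(5200, 6100, 7800, 1200),
  stringsAsFactors = FALSE
)

# For each neighborhood, keep the row for the race with the largest population
largest <- pop %>%
  group_by(neighborhood) %>%
  slice(which.max(population)) %>%
  ungroup()
```

This leaves one row per neighborhood, which can then serve as the base layer of the map.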
I therefore had to find a dataset that showed a breakdown of crimes by race. The best (most granular) one I could find covered the entire state, and it was embedded in the FBI’s Uniform Crime Report – a PDF. So I first had to extract this data from the PDF. To do this, I used Tabula, an excellent little piece of software, to identify and extract the table.
Of course, Tabula does its best to pull accurate data out of a PDF table, but it didn’t work perfectly. I exported the table to a CSV file and tried working with it in R, where I quickly realized that some of the columns were being treated properly as numeric, while others were read in as factors. Looking through the CSV, I found that some of the numbers were quoted and had strange spaces or commas in them (e.g. “4 3,123”). I first tried to force these into numerics using as.numeric() in R, but the stray spaces made this fail. Because it was a relatively small dataset, I ended up fixing all of these in Excel. In the future, I’d try csvkit or OpenRefine.
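For what it’s worth, the spaces and commas could also have been stripped in R itself before converting. A minimal sketch, using a made-up column name and values shaped like the garbled ones above:

```r
# Hypothetical column of Tabula output; values look like "4 3,123"
arrests <- data.frame(total = c("4 3,123", "587", "1,204"),
                      stringsAsFactors = FALSE)

# Remove spaces and commas first, then convert --
# calling as.numeric() directly on these strings yields NA
arrests$total <- as.numeric(gsub("[ ,]", "", arrests$total))
```

With the junk characters gone, as.numeric() behaves as expected.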
This finally worked. To work with this data, I ended up using the reshape and dplyr libraries in R (thanks to Andy Zieffler for some advice here). In a future post, I’ll cover what I ended up doing to reshape the data for my purposes, but I’m going to stick with munging tasks for now.
The next task was a harder one: calculate the DMC rates. This is actually computing a Relative Rate Index for arrests, which is calculated by the equation:
(Minority_Arrests / Minority_Population) / (White_Arrests / White_Population).
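As a one-line R function, with illustrative (made-up) numbers — an RRI of 2 means minorities are arrested at twice the rate of whites:

```r
# Relative Rate Index for arrests, per the formula above
rri <- function(minority_arrests, minority_pop, white_arrests, white_pop) {
  (minority_arrests / minority_pop) / (white_arrests / white_pop)
}

rri(200, 10000, 500, 50000)  # 0.02 / 0.01 = 2
```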
To do this, I had to supplement the dataset above with total populations by race. This turned out to be easier said than done. Had I known where to look, I might have finished this step quickly; instead, it took me about two hours of searching through many different data sources. I ended up finding this dataset downloadable from the Census American FactFinder project, which gave me what I needed to calculate the total populations of adults and juveniles.
However, because this is a summary dataset, the CSV had only a single row and what felt like a million columns, of which I needed only a few. I used OpenRefine to import, modify, and re-export this CSV file into something usable in R.
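The same trimming could also be done directly in R by subsetting on column names. The column codes here are hypothetical stand-ins for the FactFinder-style headers, and the numbers are made up:

```r
# Hypothetical FactFinder-style extract: one row, far too many columns
census <- read.csv(text = "GEO.id2,HD01_VD01,HD01_VD02,HD02_VD01
27,5303925,4524062,0.1")

# Keep only the handful of columns needed for the RRI calculation
needed <- census[, c("GEO.id2", "HD01_VD01", "HD01_VD02")]
```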
Quick Tip: OpenRefine doesn’t work by default on the newest versions of Mac OS X. Apparently Apple changed something in Java that requires a different kind of application signature from the software’s developer. Running the following on the command line from the folder you downloaded OpenRefine to, BEFORE installing it, should fix this problem:
xattr -rd com.apple.quarantine google-refine-2.5-r2407.dmg
I found this solution here: Making OpenRefine work on OS X Mountain Lion+
I also had to shorten the column headers, since the originals were long and descriptive – unwieldy to use in R.
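Renaming can also be done after import with names(). The long header below is a hypothetical example of the FactFinder style, not the actual column name:

```r
# Hypothetical FactFinder-style header, far too long to type in R
pop <- read.csv(text = '"Estimate; SEX AND AGE - Total population"\n5303925',
                check.names = FALSE)

# Shorten to something usable
names(pop) <- "total_pop"
```

(check.names = FALSE keeps read.csv from mangling the long header before we replace it.)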
Finally, at least three hours in, I ended up with two usable CSV files. The analysis itself took paltry time in comparison. Lesson learned: data munging is indeed time consuming.
The other somewhat annoying lesson is that a process like this is rarely replicable, in part because it goes through so many different pieces of software to get to this point. At some point, I’d like to go back and write a script that does this munging all in one place so my analysis can be reproduced elsewhere, but for the time being, that’ll have to wait.