Categories
Just Data

Browser-based Data Manipulation with Miso.Dataset

This is the first in a series of posts about browser-based data manipulation and visualization.

Anyone hoping to jump on the web data visualization bandwagon is likely immediately intimidated when they start trying to figure out the nitty gritty of how to get various visualization libraries to work. Some like D3 and Google Charts have been around for ages. Other more opinionated and specialized libraries like Chartist are relatively new on the scene. The problem is that there’s about a million more. You can view and contribute to the one’s I’ve aggregated so far in my Data Science & Visualization Tools spreadsheet.

I’m writing this series of blog posts in the hopes that they can walk beginners through the various challenges of different steps in the process and different decision-points that might lead them to invest in a particular library.

So, you want to do some browser-based data viz? Cool – we’re going to figure it out together. Feel free to Tweet me questions along the way @alexfink. For reference, I’m not a professional developer, nor a professional quantitative data scientist. I consider myself first and foremost a critical qualitative researcher. Over the last few years, I’ve increasingly realized the importance of including quantitative data analysis in my work. Inspired by this book and some personal experiences, I’ve seen the ways that statistics and compelling data visualization can cause perspective shifts in important ways.

In my doctoral studies, I learned to do statistics in the open source programming language R. R is a relatively easy and robust language with a very large set of packages that allow for almost any kind of data analysis. R uses the command line of your operating system, but also has a graphical user interface called R Studio. I learned R and I liked R. But the interactive data visualizations – the ones being used all over the web – are built in HTML, CSS, and JavaScript (JS). These are the languages of the web and I wanted to be able to use them to do basic data analysis and visualization.

Let me be clear here. As far as I can tell, you are much better off doing regression analysis and other complex data analysis in R, Python or another statistical package. The data analysis libraries for JS aren’t nearly as robust. I won’t discuss most of that here (there’s loads of resources, including MOOCs and open access books that can help you learn this kind of analysis). The purpose of this series is to talk about taking a dataset from a CSV or Excel file and visualizing it online, using a variety of tools and libraries available.

Let’s get started with something basic: making a simple bar chart. Something like this:

Here’s the steps involved:

  1. Get a dataset to work with
  2. Make calculations on the data
  3. Choose a charting library
  4. Manipulate the data into a format that fits library specifications
  5. Use the library’s API to make a chart

That’s a lot of steps to make a chart you could’ve built in Excel in 5 minutes, right? I think so too. There are tools that help manage simple charts like this more easily in HTML/JS, but the more “opinionated” a tool (meaning the more it forces you to do things a certain way) the less flexibility you have in terms of what you can create. So learning to do all these steps will give you more power and flexibility in the future.

Step 1: Get a dataset

Get a dataset to work with. Preferably, find something in CSV format (Comma Separated Values). CSV is a standard format for data on the web. In this example, I’ll be using the VLSSpercapita.csv data set from this link. The first few rows of the dataset look like this:

ID,Dollars,Area,Region
1,184.33,Rural,5
2,62.73,Rural,5
3,119.13,Rural,5
4,76.61,Rural,5
5,97.46,Rural,5
6,132.09,Rural,5
7,99.86,Rural,5

As you can see, the column titles occupy the first row, separated by commas. The following rows match up to each of the columns with data, separated by commas. Technically, a CSV can use anything as a separator, but a comma is a broadly used standard.

Got this file downloaded? Great. Create a new folder on your computer and drag this file into it. Then create a new file entitled “index.html”. This is the web page that will display your chart.

Step 2: Make calculations

Most of the time, you aren’t handed data you can just visualize. You need to manipulate it in some way. For example, what kind of worthwhile visualization would I create out of this CSV with ~6,000 rows? Any visualization I can imagine would be sorta useless.

But we could create something meaningful if we manipulate the data a bit. For example, our data headings (and the data codebooks available on the above mentioned page) indicate that we have in Dollars the expenditure per capita of a surveyed household. We also know the Region in which this household resides. A very simple calculation might tell us what the average Dollars expended per capita in households in any given Region might be. This data proves to be useful by offering us a summary of an otherwise unmanageable amount of data.

So… how do we do this? This CSV appears to be sorted by Region, so we could go through and cut the document into pieces by Region. We’d quickly find out, however, that the document only appears to be sorted by Region. Which leaves us back where we started. So we need something that can manipulate the data and develop summary understandings with it. We could do this quite easily in R, but let’s say for funsies that we want to stay in JS (as I’d prefer to do). Unlike R, JS doesn’t have a built-in data summary function.

This means that we need to make one or use an already existing library. There are a number of libraries, like lodash/underscore that get us part way there by giving us great variable manipulators for JS. But they aren’t built to work specifically for data analysis. A great library I discovered that offers a minimalistic toolset for working with data in Javascript is Miso.Dataset. While relatively young, I found this project to provide the tools I needed to do some basic data manipulation in the browser.

So open up that index.html file in a text editor (or better yet, a code editor like Atom) and get hacking. Try this out to start with:

If you run this in a browser, you won’t see anything! Don’t worry about it. You’ve still gotta fix something. See the part that says <script src=”miso.ds.deps.min.0.4.1.js”></script>? Well, you need that to point to the Miso library. Which means we’ve got to download it. Download the Miso.Dataset JS file to your newly created folder. This link includes the dependencies you’ll need (and if that doesn’t make sense, don’t worry about it for now). To get it to work, you just need to make sure that src=”” statement points to a file by the same name that sits in the same folder as your index.html. That’s it – now you’ve got the Miso.Dataset library setup to run on your web site. If we refresh the index.html web page, you should still see nothing! But now, your code should be running well. Now you’ll see nothing because we haven’t done anything yet. At this point, all our code does is setup a basic HTML page, load the Miso.Dataset library, and prepare us to load the VLSSpercapita.csv file, noting that commas are going to be our delimiters. Well, that’s not very satisfying. But we’ve got the foundation in place for what we’re gonna do next. The next step is to really play around with this data – to group it by Region and then to get the mean Dollars of each Region. Here’s the code to update your index.html, and I’ll explain the rest at the bottom of it.
The top part of our code stayed the same, but then we added this .fetch() thing. What’s that? Well, you can read the documentation on the Miso website. In short, fetch is used to pull in a local or remote dataset. The way we’ve used it here, it pulls in the data asynchronously, meaning that it draws in the data over time and will continue to execute other functions that occur after it. However, we haven’t put anything after the fetch, but instead, inserted it into a success callback. This callback is executed once the fetch is successfully completed. That way we know the data has been fully loaded into the browser window. Don’t worry about this for now — it would mainly be an issue down the road when you are using a massive dataset or building a web site for the public to see. For now, it’s safe on our local computer and as long as we keep the rest of our workflow within the success callback, we’re golden.

I’m not going to explain the rest of the JS here – there’s loads of JS tutorials out there – but you can look up callbacks, event handlers, and anonymous functions if you want to understand a few things going on here a bit more deeply. The next thing we do with our data is log a couple of things to our browser console to show that things are working properly. You can access your browser console under “Tools -> Web Developer -> Toggle Tools” in Firefox and “Tools -> Developer” in Chrome. Where you see this:

console.log("Available Columns: " + this.columnNames());
console.log("There are " + this.length + " rows");

The text in quotes will be logged to the console. It will also show the names of the columns Miso.Dataset detected and the total number of rows.

Next is where the real work comes in. We now use Miso’s groupBy function. The first argument we provide is the name of the column we want to group by. The second is an array (as designated by []) of the other columns we want to include. We need to provide the latter part because we are in effect creating a whole new derivative dataset based on our initial dataset. There’s lots of language-specific reasons why we’re doing that, but this is also common to R and Python as well.

The next argument is an object with the attribute “method”, which in turn is handed an anonymous function. This is a lot of technical mumbo jumbo to basically say that we’re overriding the default action of the groupBy method for combining things into groups. The default is sum, but we don’t want a sum, we want an average! So we provide our own function here.

This function appears a bit complicated… and it is! It’s a complicated way to get an average across an array (again, something that would be quite easy to do in R or Python). But don’t worry about it. Instead, Google it! Whenever you get stuck at a point like this, try something like “javascript average across array” and you’ll find examples of what you’re looking for. I’m not going to explain our particular functionality here, because it dives too much into JS, but try looking at the docs for the lodash library if you want more information.

Finally, we log our new data to the console. Specifically, we are logging the Dollars column so we can see if our averaging worked out.

Step 3: Choose a charting library

This is a difficult choice because there are about a million. At this point, just pick one and give it a try. If you get really involved in this work, it will be worth exploring the many options and considering the various characteristics you might value: ease of use, scalability, speed, flexibility, responsiveness, capabilities, documentation, community, and so forth.

For the purposes of this post, I’m going to use ChartJS. ChartJS is a relatively robust set of responsive charts that use the Canvas element of HTML5 to paint out-of-the-box attractive graphs. It has great documentation, which makes it easy to get started. Once we do this quick example, you should be able to adapt it to making a line graph or a bar chart fairly easily.

Step 4 & 5: Finishing our graph

So, here’s the final code for your index.html page. Read through it and notice especially what’s changed. I’ve thrown in some comments (denoted by /* */ or //) that explain what’s going on, but I’ll explain it more detail after the code.

We haven’t added much here. In fact, the new code comes almost straight out of the ChartJS documentation. Basically, we’ve created a new object (denoted by {}) that has two attributes, “labels” and “datasets”, both of which are arrays (denoted by []). The labels, in order, denote the names that are attributed to the numbers in our CSV file (which I found by looking at the code book). If you look at the code book, you’ll notice they aren’t in order. That’s because the groups aren’t ordered sequentially by number! If you look again at your browser console, you’ll see that the console log of Region puts them in a different order. I simply matched the order of these to the right name for that region in the code book. But order is important here – the labels will match up only if we put them in the same order as the actual Regions are listed in the array output by the console.log().

The datasets array is where the magic happens. datasets is actually an array of objects, where each object represents one dataset. Each dataset object includes various options, in addition to the data itself. Here, I’ve included various color changes and a label. But the most important part is the data piece. Essentially, the data attribute takes an array of values. In the console.log we did of the grouped Dollars data, we saw this: Array [ 144.98576251455194, 219.60131914893623, 166.8890960451977, 195.5782360742705, 139.15073369565212, 350.4979667644188, 195.40419064748207 ]. This array is what we want to show up in our graph as our Y axis data, as it represents the averages of each region. So we use data: barData.column("Dollars").data again, which inputs the data we want into our graph.

We need to do two final things. We’ve now got data in the right format to go into our ChartJS library. This would look different if we decided to use a different library, depending on the format the library requires. To finish our chart in ChartJS, we now need to find the Canvas element we created earlier and ask ChartJS to actually create the chart. The final two calls do just that. The first identifies the Canvas element on the page, while the second uses the ChartJS library to map the chart.

… and viola! You’ve made a chart using entirely web technologies. Congratulations. I encourage you to play around with this — change the fill colors, labels, or if you get really creative, the way you are doing the data manipulation.

Ask questions. More to come.

One reply on “Browser-based Data Manipulation with Miso.Dataset”

Leave a Reply