Setosa blog


Using GNU Make for collaborative data visualization

Many people shudder at the thought of Makefiles. They can seem like archaic, overly complicated tools reserved for only the most hardcore Linux hackers. But you may be surprised at how well Make helps with managing large data files in a collaboration. This is particularly helpful when a group is working together on visualizations of large data sets.

Let's walk through a hypothetical scenario. Say you and your friends decide to hack on an open data visualization project that involves a large amount of data (large, but still small enough to fit on a hard drive). The data set you were given needs to be extensively modified and reformatted, but because it is so huge, you don't want to recommit the data every time it changes. Enter Make.

Make makes it easy to describe file dependencies. You tell Make "this is the 'recipe' to create this file", where a recipe may depend on files that are produced by other recipes. Let's use the simple example of a recipe to decompress a hypothetical compressed CSV file, data/records.csv.gz. (Each command line in a recipe must be indented with a tab character, not spaces.)

data/records.csv: data/records.csv.gz
	gunzip -c data/records.csv.gz > data/records.csv

Now, we can run make data/records.csv and make will run the commands in the recipe to create the data/records.csv file. Notice what happens if you try to run the command a second time. You'll get the following output:

make: `data/records.csv' is up to date.

Make is telling us there's no need to run the recipe again because the file already exists and none of the recipe's dependencies have changed. (Make can tell this by comparing last-modified times: the generated file is newer than every file it depends on.) If for whatever reason we changed the original compressed file, or if we deleted the generated data/records.csv file, Make would know to rerun the recipe the next time we ran make data/records.csv. If the data files in question are large, this is extremely helpful. It also comes in handy when we have other recipes that depend on data/records.csv.
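The check Make performs here can be sketched in a few lines of JavaScript (the language of the transform script used later in this post). This is only an illustration of the timestamp comparison, not how Make is actually implemented, and the function name needsRebuild is made up for the example:

```javascript
const fs = require('fs');

// Returns true when `target` should be rebuilt: either it does not exist yet,
// or at least one dependency was modified more recently than the target.
// Simplification: assumes every dependency already exists on disk.
function needsRebuild(target, dependencies) {
  if (!fs.existsSync(target)) {
    return true; // nothing has been generated yet
  }
  const targetTime = fs.statSync(target).mtimeMs;
  return dependencies.some((dep) => fs.statSync(dep).mtimeMs > targetTime);
}
```

For the rule above, needsRebuild('data/records.csv', ['data/records.csv.gz']) would be true when data/records.csv is missing or when the .gz file is newer, which are the two cases where Make reruns gunzip.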

Now say we want to perform some operation on the records.csv file. This operation cannot run if records.csv doesn't exist yet. In this case, Make is smart enough to ensure that the recipe for creating records.csv runs before our new recipe.

data/records.edited.csv: data/records.csv scripts/transform.js
	cat data/records.csv | node scripts/transform.js > data/records.edited.csv
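
To make the ordering concrete, here is a rough JavaScript sketch of how the two rules so far chain together. The rules table and the buildOrder function are hypothetical names for illustration, and the sketch ignores timestamps entirely; it only shows that dependencies are visited before the file that needs them:

```javascript
// Each generated file mapped to the files it depends on,
// mirroring the two Makefile rules shown so far.
const rules = {
  'data/records.csv': ['data/records.csv.gz'],
  'data/records.edited.csv': ['data/records.csv', 'scripts/transform.js'],
};

// Depth-first walk: visit every dependency before the target itself.
// Files without a rule (source files) are never added to the order.
function buildOrder(target, order = []) {
  for (const dep of rules[target] || []) {
    buildOrder(dep, order);
  }
  if (rules[target] && !order.includes(target)) {
    order.push(target);
  }
  return order;
}

// Logs data/records.csv first, then data/records.edited.csv.
console.log(buildOrder('data/records.edited.csv'));
```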

Now, if someone wants to add some other modification to the CSV file, they can add their changes to the scripts/transform.js script and rerun make data/records.edited.csv. Because the script is listed as a dependency, Make notices that it changed and regenerates the edited file. And if we deleted data/records.csv, Make would know to rerun that recipe before running the one for data/records.edited.csv.
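
For reference, a transform script of this kind might look roughly like the sketch below. The contents of scripts/transform.js are not part of this walkthrough, so everything here is an assumption: the cleanup step (trimming whitespace around each field) is just a stand-in for whatever edits your project needs, and the naive comma split would not handle quoted fields.

```javascript
const readline = require('readline');

// Hypothetical per-record edit: trim stray whitespace around every field.
// Assumes a simple CSV with no quoted fields or embedded commas.
function transformLine(line) {
  return line.split(',').map((field) => field.trim()).join(',');
}

// Stream the records one line at a time, so a large file
// never has to be loaded into memory all at once.
function run(input, output) {
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  lines.on('line', (line) => output.write(transformLine(line) + '\n'));
}

module.exports = { transformLine, run };
```

Adding run(process.stdin, process.stdout); as the last line would make it work with the pipe in the recipe above. A collaborator who wants a different edit then only has to change transformLine and commit that.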

For small files this might not seem very useful, but when collaborating on large files it can be very handy. Using this technique in combination with a version control tool like Git, you can commit just the original compressed data/records.csv.gz and the transform scripts. Then, anytime a collaborator wants to make a change to the records, they can edit and commit their change to the transform script instead of committing the changed data file directly. If they committed the data file itself, even minor formatting changes to each record would take up massive amounts of space in every commit. By committing only the transform operation, the Git history stays much more manageable and your collaborators don't have to wait endlessly to pull in your changes from master. They can instead quickly run git pull and then make data/records.edited.csv.

Lastly, if typing make data/records.edited.csv seems too verbose, you can add an additional rule with no commands of its own, like the following, and then simply run make data:

data: data/records.edited.csv

Make also has what are called phony targets. A phony target is not treated as a file, so its recipe runs every time you ask for it, regardless of whether a file with that name exists or how recently it was modified. This is convenient when you want special commands that always run across your project, such as cleaning up old or generated files.

clean:
	rm -f data/records.csv
	rm -f data/records.edited.csv

.PHONY: clean

The .PHONY: clean line at the end tells Make that clean isn't really a file, so the recipe won't be skipped in case a file named clean happens to exist in the directory already.

Now to clean up our project, we can run

make clean