Many people shudder at the thought of Makefiles. They can seem archaic, overly complicated tools reserved for only the most hardcore Linux hackers. But you may be surprised at how well they can help with managing large data files in collaboration. This is particularly helpful when working together on data visualizations of large data sets.
Let's walk through a hypothetical scenario. Say you and your friends decide to hack on an open data visualization project that involves large amounts of data (but still small enough to fit on a hard drive). The data set provided needs to be extensively modified and reformatted, but because it is so huge, you don't want to keep recommitting every change. Enter Make.
Make makes it easy to describe file dependencies. You tell Make "this is the 'recipe' to create this file," where a recipe may require additional sub-recipes. Let's use the simple example of a recipe to decompress a hypothetical compressed CSV file:
data/records.csv: data/records.csv.gz
	gunzip -c data/records.csv.gz > data/records.csv
Now, we can run make data/records.csv and Make will run the commands in the recipe to create the data/records.csv file. Notice what happens if you try to run the command a second time. You'll get the following output:
make: `data/records.csv' is up to date.
Make is telling us there's no need to run the recipe again, because the file already exists and none of the recipe's dependencies have changed. (Make can tell by comparing last-modified dates and seeing that the target is still newer than the compressed file it was built from.) If we changed the original compressed file for whatever reason, or deleted the generated data/records.csv file, Make would know to rerun the recipe the next time we ran make data/records.csv. If the data files in question are large, this is extremely helpful. But it also comes in handy when we have other recipes that depend on data/records.csv.
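Everything so far can be reproduced end to end with a short script. A minimal sketch, assuming GNU make and gzip are installed (the sample data and scratch directory are invented for illustration):

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"          # work in a scratch directory
mkdir -p data

# Stand-in for the real compressed data set.
printf 'id,name\n1,alice\n2,bob\n' | gzip > data/records.csv.gz

# The recipe from above; note the literal tab before the command line.
cat > Makefile <<'EOF'
data/records.csv: data/records.csv.gz
	gunzip -c data/records.csv.gz > data/records.csv
EOF

make data/records.csv   # first run executes the recipe
make data/records.csv   # second run reports the target is up to date
```

Deleting data/records.csv, or touching data/records.csv.gz, makes the target stale again, and the next run of make data/records.csv reruns the recipe.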
Now say we wanted to perform some operation on the records.csv file. This operation can't run if records.csv doesn't yet exist. In this case, Make is smart enough to ensure the recipe for creating records.csv runs before our new recipe:
data/records.edited.csv: data/records.csv scripts/transform.js
	cat data/records.csv | node scripts/transform.js > data/records.edited.csv
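As a sketch of how this chaining behaves, the following script sets up both recipes and asks only for the final file. To keep it runnable with standard tools alone, a sed one-liner stands in for the Node transform script; all file names are illustrative:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
mkdir -p data scripts
printf 'id,name\n1,alice\n' | gzip > data/records.csv.gz
printf 's/alice/ALICE/\n' > scripts/transform.sed

cat > Makefile <<'EOF'
data/records.csv: data/records.csv.gz
	gunzip -c data/records.csv.gz > data/records.csv

data/records.edited.csv: data/records.csv scripts/transform.sed
	sed -f scripts/transform.sed data/records.csv > data/records.edited.csv
EOF

# Asking for the edited file first: Make builds records.csv on the way.
make data/records.edited.csv

# After editing the transform script, rerunning make rebuilds only the
# edited file, since records.csv itself is still up to date.
touch scripts/transform.sed
make data/records.edited.csv
```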
Now, if someone wanted to add some other modification to the CSV file, they can add their changes to the scripts/transform.js script and rerun make data/records.edited.csv. And if we deleted data/records.csv, Make would know to rerun that recipe first, before running the one for data/records.edited.csv.
For small files this might not seem very useful, but when dealing with large files in collaboration, it can be very handy. Using this technique in combination with a version control tool like Git, you can commit just the original compressed data/records.csv.gz file and the transform scripts. Then, any time a collaborator wants to make a change to the records, they can edit and commit their change to the transform script instead of committing their changes to the data file directly. Committing the data file itself would cause minor formatting changes to each record to take up massive amounts of space in every Git commit. By committing only the transform operation, the Git history stays much more manageable and your collaborators don't have to wait endlessly to pull in your changes from master. They can instead quickly run git pull and then make to regenerate the data files locally.
Lastly, if typing make data/records.edited.csv seems too verbose, you can add an additional empty recipe like the following:
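A minimal sketch of such a shorthand (the target name `records` is an assumption; the rule lists a dependency but has no commands of its own):

```makefile
# Typing `make records` now builds data/records.edited.csv.
records: data/records.edited.csv
```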
Make also has what's called a phony recipe. The following creates a phony Make recipe that will always run, regardless of the last-modified dates of its dependencies. This is convenient when you want special commands that always run across your project, such as cleaning up old or generated files:
clean:
	rm data/records.csv
	rm data/records.edited.csv

.PHONY: clean
The .PHONY: clean line at the end tells Make that clean isn't really a file recipe and shouldn't be confused with one, in case a file named clean happens to be in the directory already.
Now, to clean up our project, we can run make clean.
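To see why the .PHONY line matters, here's a sketch in which a stray file named clean already sits in the project; without .PHONY, Make would consider the clean target up to date and skip its commands (paths follow the example above, sample data is invented):

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
mkdir -p data
printf 'id\n1\n' > data/records.csv
printf 'id\n1\n' > data/records.edited.csv
touch clean    # a stray file that shares the target's name

cat > Makefile <<'EOF'
clean:
	rm -f data/records.csv data/records.edited.csv

.PHONY: clean
EOF

make clean     # runs despite the stray "clean" file, because of .PHONY
```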