Using docker for reproducible computational publications

Summary: Producing reproducible computational analyses is a growing concern. Many scientists are starting to publish their code and data, but it is often still a big hassle for others to run that code and reproduce results. Docker allows you to publish not only the necessary code and data, but also the exact compute environment used. Below I discuss ways Docker could be used to improve research reproducibility.

Reproducible research is hard

Just open up the Internet and you will see tons of talk about reproducible research: check it out on twitter, Nature, all over the blogosphere, and at more than a few genomics startups. Clearly, there is growing concern about how we can make our analyses more reliable and transparent.

Encouragingly, some publications are starting to actually publish the code and data used for analysis and to generate figures. For example, many articles in Nature and in the new Elife journal (and others) provide a "source data" download link next to figures. Sometimes Elife might even have an option to download the source code for figures.

This is a great start, but there are still lots of problems: if I really want to rerun analyses from these papers, it is actually pretty annoying. I download the scripts and data, I realize I don't have all the required libraries. Or that it only runs on an Ubuntu version I don't have, or some paths are hard-coded to match the authors' machines. If it's not easy for me to run and doesn't run "out of the box" the chances that I will actually ever run most of these scripts is close to zero.

For published code and data to be useful, it not only has to be reproducible, but it also has to be easy to reproduce and interact with the data. Otherwise no one will ever touch it. Right now, we post code and data and say: "it worked for me, good luck." But then the burden is on the user to figure out how in the world to get this stuff to run. Really, most of this burden should be on the authors if they want other people to use what they published.

I recently started experimenting with Docker, which might be a solution to many of these problems. Basically, it is kind of like a lightweight virtual machine where you can set up a compute environment, including all the libraries, code, data, you need, in a single "image". That image can be distributed publicly and can seamlessly run on theoretically any major operating system. No need for the user to mess with installation, paths, etc. They just run the docker you provided, and everything is set up to run out of the box. Many others (e.g. here and here) have already started discussing this.

I think it'd be pretty awesome if when we published a paper, we also provide a Docker image that is already set up to run all the analyses and recreate all the figures in the study. Below I discuss how to use Docker to do this, show an example Docker to reproduce figures in Rstudio from a published paper, and talk about what I think are the promises and problems. I'd love to have a discussion about how people think this kind of approach can or cannot solve reproducibility issues.

Using Docker for reproducible analyses

Here I want to describe a little bit how the setup works.

Docker has a repository of images called Docker hub that is pretty analagous to github except it's for Docker images instead of for code. You can pull, push, track, and modify images and share them with other people. For instance I made an image to run an Rstudio server by pulling the base ubuntu:14.04 image, modifying it to include everything needed to run Rstudio, and making a new image out of that (which I deposited at mgymrek/docker-rstudio-server, the github repo it's built from has the same name mgymrek/docker-reproducibility-example). Everything to build the image is specified in a Dockerfile that you can view there, so you know exactly how I created that environment.

A very cool feature is that Docker can link up with your github account. So you can make a github repository with a Dockerfile in it specifying how to build an image and what scripts to add to it. Docker will automatically detect when you change that repository, create a Docker image using the instructions in your Dockerfile, and post it as a "trusted build" (see automated builds).

So my vision of how this would work for a publication is that every publication would have its own github repository. There you can store all the code you used with version control. You would then include a Dockerfile in this repository specifying how to set up the exact environment required. Then you would end up with an automatically built docker image on docker hub that anyone can download and run out of the box.

When you run a docker container, you can specify a command that it should run when you first start it up. (The docker documentation explains all about this.). For instance, if you want the user to be able to play around inside, they can start it up with the command /bin/bash. However, you can also provide a Docker that doesn't require users to do any playing around in the shell. For instance, in the example below, I set it up so that it automatically runs the web browser version of rstudio with all the code and data set up, so that all one needs to do is run the docker, go to the web browser, and start having fun.

But that is only one way to do things. You could do a similar setup where you use IPython notebook rather than rstudio. Or you could provide one Makefile that will make all of the figures and reproduce analyses with a single command. Instead of providing the data in the Docker, you can provide the data as a separate download that users can mount to the Docker when they run it (which I think is better anyway, since it makes sense to separate the code from the data).

Below is a working example of how I got this to work going the Rstudio route. But there are countless ways to set this up and I'd love to discuss different possibilities.

Example - reproducing Martin, et al. PLOS Genetics 2014

I wanted to go through an excercise of how it would actually work to provide a Docker to reproduce analyses of a publication. As an example, I am using code and data from Transcriptome Sequencing from Diverse Human Populations Reveals Differentiated Regulatory Architecture by Martin and Costa et al. recently published in PLOS Genetics. The authors published the code and data used for analysis and figures along with the manuscript. First of all, I want to give the Bustamante Lab huge kudos for this. Although many people are talking about publishing code and data, few people are actually providing it, and I was very happy to see this. I also want to point out that the code I use in the example below is modified from the code they published with the paper, and that I received permission from the authors to post it.

Even though the data and code were provided, it was not immediate smooth sailing: the code required many R packages that I do not have installed. Some scripts required downloading scripts from a previous publication. It wasn't immediately clear which scripts went with which figures. And I had to change some hard coded paths in the scripts. But eventually I got the scripts to run with minimal modifications, so the figures really were reproducible.

This took a fair amount of fighting and set up time to install all the things I was missing. I can bet that few people are going to be patient enough to go through this process. But these are exactly the problems that Docker can solve: I can set up this compute environment once, and then provide it to other people in such a way that it just runs, almost regardless of what computing environment they're using.

For my example, I chose two supplemental figures from this study (Figure S7B and figure S1) and made a Docker that provides everything needed to reproduce those in Rstudio. I chose these mostly because they were the first scripts I looked at that ran in under a couple minutes and that I could easily tell which figure the scripts belonged to. The Docker is loaded with a running installation of Rstudio server, and all the code, data, and dependencies needed to produce those figures. Each figure is in a separate script that you can open in Rstudio and run out of the box to reproduce what you see in the published manuscript.

Run the docker yourself

I assume you have some familiarity with Docker and have it installed (if not the tutorials on the site are great and show you how much fun it is).

The Docker for this example is published on Docker hub mgymrek/docker-reproducibility-example. To run it, simply run:

docker run -p 49000:8787 -d mgymrek/docker-reproducibility-example

(You can replace 49000 with another open port if you want.) Navigate in your web browser to http://0.0.0.0:49000. This will open up rstudio. The username and password are both “guest”. You can open files in the scripts/ directory and run them to reproduce figures S7B and S1 from Martin et al. (martin_etal_figS1.R and martin_etal_figS7B.R). Voila, you didn't have to install rstudio, you didn't have to install any R packages, you didn't have to configure any paths, and it should just work (although this is my first Docker attempt, so give me a break if it doesn't!).

Technical Notes

If you're running on Mac, you should have already installed, initialized, and started boot2docker using the instructions here. Instructions are the same as above, except instead of using "0.0.0.0" as the IP address, run:
```
boot2docker ip
```
to get the IP, then navigate in the browser to $IP:49000.
You might get an rstudio error telling you that the figure margins are too large if the plot area is too small (e.g. if you're using a small screen or if you squished the plot area to be smaller). In that case, enlarge the plot area and run again.
All the data required was provided in this docker. Ideally the data would be separate from the code in either another data container or mounted to a directory when you run docker. I didn't do that here because this issue doesn't allow mounting whole directories on OSX. If that is fixed I might update this.

Challenges

I am no expert, but I think Docker is great, and solves a lot of problems. That being said, there are still challenges:

"It works everywhere, except when it doesn't".
Supposedly this should work on most common operating systems. I did get this to run on my Mac (which because my Mac is old and dying and was out of disk apce was very painful, but otherwise it's easy to set up). There were some quirks/bugs to running it on Mac though, so still not totally smooth sailing. Also, several older Ubuntu distributions wouldn't work, or at least I gave up before they did. But any new-ish Linux/Mac/Windows system should work reasonably well.
How will this scale for studies working with huge datasets?
It's great to provide analyses that can be completely rerun by other researchers. But sometimes it's just not feasible for someone else to go rerun your analysis of 200TB of sequencing data that you ran across hundreds of nodes on the cloud. I don't know what the perfect reproducibility solution is for such huge studies, but maybe there is a partial solution. You can at least provide the code and computing environment that was used for the analyses. Ideally you would provide a subset of the data that another person can rerun and see that they get the same results. For instance, if your study processed 3,000 sequencing datasets, provide a Docker with the code that will allow someone to rerun the analysis on a single or several genomes. Also provide the processed datasets and the code used to create figures and downstream analyses.
Docker should not replace writing good code
One argument against the Docker solution is that you can still publish crappy code and distribute it in a docker. You could even supply just a binary file with no source code in a docker. But that wouldn't be very useful or in the spirit of reproducibility. Wouldn't a better solution be if you just made your code open source and able to compile and run easily on most reasonable systems?
I don't think Docker should replace the need to write good, clean easily runnable code. But I do think there is a tradeoff here between how much time a bioinformatician spends getting the code to run everywhere vs. how much time the user has to invest. The bioinformatician could spend a month making his or her code run smoothly everywhere, or could spend a day making it work once on a Docker, with probably a little bit of extra effort on the part of users to use Docker. But of course neither of these options should give an excuse to write bad code.
Related to this, I don't think Docker is the solution to everything. I think if you have some R/Python/etc. scripts that you used for analysis and to make figures, Docker is a great way to make them easily runnable by other people in such a way that does not require them to install a bunch of libraries, deal with setting up paths, etc. BUT, if you're publishing a tool for which you expect heavy duty use from a lot of people on their own datasets (e.g. GATK, bwa, Bowtie, etc.), it's not enough to put it in a Docker and have it run. If you want people to use your tool, it's going to have to run well on a variety of systems and you're going to have to put in that effort.

Conclusion

I would love to hear what others think and discuss this and other options for improving reproducibility. Like I mention above, I don't think Docker is the perfect solution, or that a single perfect solution exists, but I think it solves a lot of problems and is a step in the right direction.

Special thanks to Assaf Gordon and Alon Goren for all the great conversations about this.