Next-generation sequencing has led to a rapid increase in the volume of data generated for research. Core facilities and institutions typically have the computational resources to store and manage this data centrally; however, it is often beneficial to do local processing on the data sets. Unfortunately, it is uncommon to find the petabyte-scale storage that may be required for even moderate-sized sequencing studies within individual labs. To work around this limitation, individual specimens or subjects can be processed sequentially, but taking a manual approach can greatly increase the total processing time. To enable an automated processing workflow, we recently took advantage of the Luigi workflow library for Python and Docker containers to process samples from the TCGA project.
The main step requiring significant computational and storage capacity was somatic variant calling. For each specimen, we needed the germline and tumor sequence files (approximately 100-200 GB each) available for processing. These files were downloaded from CGHub with GeneTorrent, and variants were identified with the SomaticSniper software.
Application virtualization is quickly becoming the standard for deployment. In addition to providing an isolated, reproducible environment, Docker containers allowed us to run multiple specimens simultaneously. While we could not hold the entire TCGA data set at once, our hardware did support staggered processing of multiple specimens to take full advantage of our network, compute, and memory capacity. Every step of the workflow above ran in a Docker container, including the Luigi management process itself.
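To make the idea of staggered processing concrete, here is a minimal sketch in plain Python (not Luigi's own scheduler) that caps how many specimens are in flight based on available scratch space. The 200 GB per file figure is the upper end of the range quoted above; the scratch capacity, the function names, and the `process_one` callable are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def max_specimens_in_flight(scratch_gb, bam_gb=200, bams_per_specimen=2):
    """Return how many specimens' BAM files fit in scratch space at once."""
    limit = scratch_gb // (bam_gb * bams_per_specimen)
    if limit < 1:
        raise ValueError("scratch space cannot hold a single specimen")
    return limit


def run_staggered(specimen_ids, process_one, scratch_gb):
    """Process every specimen while keeping a bounded number in flight."""
    limit = max_specimens_in_flight(scratch_gb)
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # A new specimen starts as soon as a worker is free;
        # results come back in input order.
        return list(pool.map(process_one, specimen_ids))
```

With 2 TB of scratch space and the worst-case file size, this allows five specimens at a time. In practice the limit may also depend on network bandwidth and memory, which this sketch does not model.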
Luigi Workflow Management
Luigi is a workflow management library initially developed by Spotify. It can manage complex workflows and automatically re-queue failed tasks. The standard Luigi library worked well for this use case, but a bioinformatics-specific Luigi wrapper is also available (SciLuigi). To use our Docker-oriented approach, we deployed Luigi and our workflow to a master Docker container and shared the host's Docker socket with it through a mounted volume. This allowed our Luigi workflow to dynamically launch Docker containers on the host for each workflow task. For each specimen, the Docker-Luigi workflow would:
- Launch two cgDownload containers to download the germline and tumor sequence files
- Upon completion, run the SomaticSniper container to identify somatic variants
- Clean up the BAM files and process the next set of sequences
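The three steps above can be sketched in plain Python as follows. This is an illustration of the ordering only, not the Luigi code used for this work: in Luigi, each step would be a Task whose `requires()` method lists the steps it depends on. The image names, the arguments passed to each container, and the output file names are hypothetical, and the sketch assumes the data directory is mounted in the master container at the same path it has on the host.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical image names; substitute the images you actually build.
DOWNLOAD_IMAGE = "example/cgdownload"
SNIPER_IMAGE = "example/somaticsniper"


def docker_cmd(image, data_dir, *args):
    """Build a `docker run` command that mounts data_dir at /data."""
    # Containers started through the host's Docker socket resolve volume
    # paths on the host, so data_dir must be a host path.
    return ["docker", "run", "--rm", "-v", f"{data_dir}:/data", image, *args]


def process_specimen(specimen_id, data_dir, run=subprocess.check_call):
    """Download both BAMs, call somatic variants, then delete the BAMs."""
    data_dir = Path(data_dir)
    bams = {kind: f"{specimen_id}.{kind}.bam" for kind in ("germline", "tumor")}
    vcf = f"{specimen_id}.somatic.vcf"
    try:
        # Step 1: run the two downloads concurrently, one container each.
        # The container arguments are placeholders for the image's entrypoint.
        with ThreadPoolExecutor(max_workers=2) as pool:
            downloads = [
                pool.submit(run, docker_cmd(DOWNLOAD_IMAGE, data_dir,
                                            specimen_id, kind, f"/data/{name}"))
                for kind, name in bams.items()
            ]
            for job in downloads:
                job.result()  # re-raise any download failure

        # Step 2: variant calling starts only after both downloads finish.
        run(docker_cmd(SNIPER_IMAGE, data_dir, f"/data/{bams['tumor']}",
                       f"/data/{bams['germline']}", f"/data/{vcf}"))
    finally:
        # Step 3: free the scratch space before the next specimen.
        for name in bams.values():
            (data_dir / name).unlink(missing_ok=True)
    return data_dir / vcf
```

Deleting the BAM files in a `finally` block releases scratch space even when a step fails, at the cost of a repeat download if that specimen is retried. Passing `run` as a parameter makes it possible to exercise the ordering without a Docker daemon.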
This approach allowed us to efficiently identify somatic variants across a very large data set without purchasing the petabyte of local storage that the aligned sequence files would require. While reanalysis from the original BAM files would require repeat downloads, most of our current downstream analysis requires only the somatic variants, which take significantly less storage space.