Reproducibility has become one of the most pressing issues in biology and many other computationally driven research fields. The problem has been fuelled by the combined reliance on increasingly complex data analysis methods and the exponential growth of big data. The installation, deployment, and maintenance of computational data-analysis pipelines present an even more challenging picture because of the lack of community standards. Moreover, the effect of limited standards on reproducibility is amplified by the diverse range of computational platforms and configurations on which these applications are expected to run (workstations, clusters, HPC, clouds, etc.).
Software containers are gaining consensus as a solution to the problem of reproducibility of computational workflows. However, orchestrating large containerised workloads at scale and in a portable manner across different platforms and runtimes poses new challenges.
This presentation will give an introduction to Nextflow, a pipeline orchestration tool that has been designed to address exactly these issues. Nextflow is a computational environment that provides a domain-specific language (DSL) meant to simplify the implementation and deployment of complex, large-scale containerised workloads in a portable and replicable manner. It allows the seamless parallelisation and deployment of any existing application with minimal development and maintenance overhead, irrespective of the original programming language.
Paolo Di Tommaso is a computer scientist and bioinformatician. He has 20 years of experience as a software developer and architect. His main interests are parallel programming, HPC, cloud computing and containerisation technologies. He is an open-source advocate and the creator and project leader of the Nextflow workflow framework.
Author: Paolo Di Tommaso, Research Software Engineer, Center for Genomic Regulation (CRG), Spain
Date: 21st June 2018
Location: Barcelona Advanced Industry Park