Software optimizations make variant calling 8 to 16 times faster

The imec ExaScience Life Lab developed a software tool, elPrep5, for sequencing pipelines with variant calling. It produces results like those of established state-of-the-art genome analysis programs (SAMtools, Picard and GATK4®) but reduces the runtime by a factor of 8 to 16, for both whole-exome and whole-genome data.

Summary

ElPrep5 speeds up variant calling because it parallelizes and merges the execution of the sequencing pipeline steps while minimizing the number of data accesses to files.

This software platform is interesting for pharma companies that use omics data to find new drug targets; for diagnostic and sequencing labs that want to speed up the customer experience; and for hospitals that want to implement personalized medicine and want to run analyses overnight.

In this article, we show detailed results on runtimes, memory and disk space usage and scaling potential of the elPrep software. As an example, it was benchmarked on state-of-the-art infrastructure.

The cost of sequencing has gone down tremendously, yet sequencing is still not used in daily practice. One of the reasons is that processing the raw data into useful insights takes a long time (several days for whole-genome sequencing) and requires expensive resources (such as servers that need to be rented in a data center).

“The problem with current software for sequencing data and variant calling is that it is not structured in the best way,” explains Charlotte Herzeel from imec’s ExaScience Life Lab. “Every step in the pipeline (e.g. mark duplicates, base quality score recalibration or BQSR) is done by a separate software tool. Each tool requires input from and output to a memory source and can only be executed once the previous step has finished. With our new software platform, we rethought this process and considered the pipeline as a whole.”

Figure 1: The variant calling pipeline typically consists of several consecutive steps that each require in- and output to a memory source. The idea behind elPrep is to consider the pipeline as a whole and implement parallelization to speed up the process.

elPrep5: parallelization for more efficient variant calling

ElPrep5 is the final version of the software platform that imec’s ExaScience Life Lab developed for the sequencing pipeline. This final update also includes the variant calling step, which typically takes up a substantial part of the total runtime (38-80%). ElPrep is developed for Linux and written in Go (a programming language developed by Google). It is released both as an open-source project on GitHub and under a premium license with support.

“The software optimization strategy consists of merging the execution of multiple pipeline steps, parallelizing their execution, and avoiding file I/O,” explains Herzeel. “It produces results like established state-of-the-art genome analysis programs such as SAMtools, Picard and GATK4®. This is important from a user’s perspective as it allows elPrep5 to be used as a drop-in replacement for other popular tools.”
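
The idea of merging pipeline steps can be illustrated with a toy sketch. The Python snippet below is not elPrep's implementation (elPrep is written in Go); the record format and the step functions `drop_unmapped` and `tag_read` are hypothetical stand-ins. It contrasts running each step as a separate pass that writes an intermediate file with running all steps in a single pass over the records in memory.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for pipeline steps; real steps (e.g. marking
# duplicates) are far more involved.
def drop_unmapped(record):
    """Return None for unmapped reads, otherwise the record unchanged."""
    return record if record["mapped"] else None

def tag_read(record):
    """Return a copy of the record with an extra field added."""
    return {**record, "tagged": True}

STEPS = [drop_unmapped, tag_read]

def run_step_by_step(records, workdir):
    """One pass per step, with an intermediate file written after each step."""
    files_written = 0
    current = records
    for i, step in enumerate(STEPS):
        kept = [out for out in map(step, current) if out is not None]
        path = os.path.join(workdir, f"step{i}.json")
        with open(path, "w") as fh:
            json.dump(kept, fh)      # write the intermediate result to disk
        files_written += 1
        with open(path) as fh:
            current = json.load(fh)  # the next step reads it back
    return current, files_written

def run_merged(records):
    """Single pass: every record flows through all steps in memory."""
    result = []
    for record in records:
        for step in STEPS:
            record = step(record)
            if record is None:       # record was filtered out
                break
        else:
            result.append(record)
    return result

reads = [
    {"name": "r1", "mapped": True},
    {"name": "r2", "mapped": False},
    {"name": "r3", "mapped": True},
]

with tempfile.TemporaryDirectory() as tmp:
    stepwise, n_files = run_step_by_step(reads, tmp)
merged = run_merged(reads)
```

Both variants return the same records, but the step-by-step variant writes and re-reads one file per step, whereas the merged variant touches no intermediate files at all.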

Shifting up a gear: 8 to 16 times faster results

Several experiments were done to benchmark the efficiency of the elPrep5 software. “In a first experiment, a whole-exome sample was processed on a 96CPUx384GB server,” says Herzeel. “The data were processed with either the widely used GATK4® or the elPrep software. For GATK4®, two modes were used: the Java mode (the standard haplotype caller algorithm) and the Intel mode (an algorithm optimized for parallelization). For elPrep, two modes were tested as well: the filter mode (which loads all input data into RAM to avoid intermediate I/O to disk) and the sfm mode (which splits up the data by chromosomal regions). Overall, the filter mode is useful for smaller data sets but uses more RAM. Whether the filter mode can be used depends on the size of the input BAM in relation to the available RAM.”
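
The split-by-region idea behind the sfm mode can be sketched in a few lines. The snippet is a simplified illustration, not elPrep's actual algorithm: the tuple-based read format, the `process_region` function and the use of a thread pool are all assumptions made for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split_by_chromosome(reads):
    """Group (chromosome, position, name) reads by chromosome."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[0]].append(read)
    return dict(groups)

def process_region(region_reads):
    """Hypothetical per-region work: here, just sort by position."""
    return sorted(region_reads, key=lambda read: read[1])

def split_process_merge(reads, workers=4):
    """Split the reads, process each region independently, merge the results."""
    groups = split_by_chromosome(reads)
    chromosomes = sorted(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as its inputs
        processed = list(pool.map(process_region, (groups[c] for c in chromosomes)))
    merged = []
    for region in processed:
        merged.extend(region)
    return merged

reads = [
    ("chr2", 500, "r1"),
    ("chr1", 300, "r2"),
    ("chr1", 100, "r3"),
    ("chr2", 200, "r4"),
]
result = split_process_merge(reads)
```

In CPython, threads do not speed up CPU-bound work; the pool is only there to show that the regions can be handled independently of each other before the results are merged.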

The results are shown in figure 2 (whole-exome data) and figure 3 (whole-genome data).

Figure 2. Results from the benchmarking experiment on whole exome data, using both GATK4® and elPrep software.

On the whole exome data, the elPrep filter mode:

is 11 to 15 times faster than GATK4® (Java and Intel mode)
uses 4 times more RAM than GATK4® (Java and Intel mode)
uses almost 10 times less disk space than GATK4® (Java and Intel mode)

On the exome data, the elPrep sfm mode:

is 6 to 7.5 times faster than GATK4® (Java and Intel mode)
uses half the RAM of GATK4® (Java and Intel mode)
uses half the disk space of GATK4® (Java and Intel mode)

“In the context of a cloud setup, the filter mode is overall the cheapest and most efficient mode to process the data, because, even though it uses more RAM than the sfm mode, the runtime is reduced so much that it reduces the server rental cost,” explains Herzeel.

Figure 3. Results from the benchmarking experiment on whole genome data, using both GATK4® and elPrep software

For the whole genome data, a similar test was done. The only difference was that for elPrep only the sfm mode was used since the filter mode is not suitable for such large data sets.

On the genome data, the elPrep sfm mode:

is 16 times faster than GATK4® Java mode and 8.5 times faster than the Intel mode
uses 70% of the RAM of GATK4® (Java and Intel mode)
uses 70% of the disk space of GATK4® (Java and Intel mode)

“ElPrep 5 shows more speedup for whole-genome data (16x!) than for whole-exome data,” summarizes Herzeel. “This is because variant calling, which is included in elPrep 5, typically takes up a larger portion of the overall runtime of a pipeline for whole genome data. Hence more time is spent proportionally on variant calling for whole-genome data and there is more computation for elPrep to speed up.”
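
A back-of-the-envelope bound helps to see why including variant calling matters. Assuming a simple Amdahl's-law model (an assumption of this sketch, not a calculation taken from the benchmarks), if a fraction f of the pipeline runtime is left unaccelerated, the overall speedup can never exceed 1/f. The snippet plugs in the 38-80% runtime share quoted earlier for variant calling.

```python
def overall_speedup(unaccelerated_fraction, accelerated_speedup):
    """Amdahl's law: 1 / (f + (1 - f) / s).

    f is the share of runtime that is not sped up, s is the speedup
    applied to the remaining share.
    """
    f = unaccelerated_fraction
    return 1.0 / (f + (1.0 - f) / accelerated_speedup)

def max_overall_speedup(unaccelerated_fraction):
    """Limit of Amdahl's law as s grows without bound: 1 / f."""
    if not 0 < unaccelerated_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / unaccelerated_fraction

# Variant calling share of the total runtime, as quoted in the article.
bound_low_share = max_overall_speedup(0.38)
bound_high_share = max_overall_speedup(0.80)

print(f"variant calling = 38% of runtime: at most {bound_low_share:.2f}x")
print(f"variant calling = 80% of runtime: at most {bound_high_share:.2f}x")
```

Under this model, a pipeline that leaves variant calling untouched tops out at roughly 2.6x when that step accounts for 38% of the runtime and at 1.25x when it accounts for 80%. The larger the share of variant calling, the more an optimization of that step contributes to the overall result.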

The more resources, the faster you get (with elPrep)

Another experiment was done to study the ‘scaling’ potential of the elPrep software: how well does it perform in terms of reducing runtime when more hardware resources can be used. “If software scales well, the higher cost of renting servers with more computational resources can be compensated by the reduction in runtime,” comments Herzeel. “To do this test, the pipeline was executed on different servers with varying numbers of CPUs and RAM, using the widely-used GATK4® as reference.”

Results are shown in figure 4 (whole-exome) and figure 5 (whole-genome).

Figure 4. The runtime and cost for running the variant calling pipeline on whole-exome data, on a variety of servers.

For whole-exome data, elPrep (filter and sfm mode) scales very well. The runtime nearly halves for each increase of resources. Some concrete results that can be derived from the graphs:

the fastest elPrep run is in filter mode on 96CPUx384GB. This is about 10 times faster and 5 times cheaper than the fastest run with GATK4®, Intel mode on 48CPUx192GB.
the cheapest run is with GATK4®, Intel mode on 2CPUx8GB. However, while it is 4 times cheaper, it is more than 12 times slower than the fastest elPrep run. For two times the price of the cheapest GATK4® run, you get a run with elPrep, sfm mode on 8CPUx32GB, that is two times faster.
If the user prefers the output of the GATK4® Java mode, then it is cheaper and faster to use elPrep: the run on 32CPUx128GB with elPrep filter mode is slightly cheaper and almost nine times faster than the GATK4® Java mode run on 2CPUx8GB.

Figure 5. The runtime and cost for running the variant calling pipeline on whole-genome data, on a variety of suitable servers.

For whole-genome data, only the sfm mode of elPrep was used, and the software was only run on servers with enough RAM and disk space for this large amount of data. One interesting lesson from the graphs in figure 5 is that:

the elPrep run on 96CPUx384GB is cheaper than the GATK4® Java run on 8CPUx32GB because the elPrep run is almost 14 times faster. More specifically, the cost goes down from 45 to 32 dollars and the runtime from almost 80h to less than 6h.
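
The arithmetic behind this comparison can be checked with a short calculation. The sketch below uses the rounded figures quoted above (about 80h and 45 dollars versus about 6h and 32 dollars), so the derived numbers are approximate. The implied hourly rates are computed from those figures and are not the cloud provider's published prices.

```python
# Rounded figures quoted in the text; the exact measurements may differ.
gatk_hours, gatk_cost = 80.0, 45.0      # GATK4 Java mode on 8CPUx32GB
elprep_hours, elprep_cost = 6.0, 32.0   # elPrep sfm mode on 96CPUx384GB

speedup = gatk_hours / elprep_hours            # how much faster the elPrep run is
gatk_rate = gatk_cost / gatk_hours             # implied dollars per hour, small server
elprep_rate = elprep_cost / elprep_hours       # implied dollars per hour, large server
rate_ratio = elprep_rate / gatk_rate           # how much pricier the large server is per hour
cost_saving = 1.0 - elprep_cost / gatk_cost    # fraction of the total bill saved

print(f"speedup:            {speedup:.1f}x")
print(f"implied rate GATK:  {gatk_rate:.2f} dollars/h")
print(f"implied rate elPrep:{elprep_rate:.2f} dollars/h")
print(f"rate ratio:         {rate_ratio:.1f}x")
print(f"cost saving:        {cost_saving:.0%}")
```

With these rounded inputs the larger server comes out roughly nine to ten times more expensive per hour, but because the run is more than 13 times faster, the total bill is still lower. Since the elPrep run took less than 6h, the actual speedup is somewhat higher than the value computed here.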

Next to servers from a large cloud provider, a test was also done on a system using Pure Storage infrastructure with a different storage architecture (FlashBlade). Results are shown in Figure 6.

Figure 6. Results of an experiment with elPrep running on both cloud (blue) and Pure Storage (green) infrastructure (using Intel and AMD-based servers). The servers are ranked on amount of RAM and CPUs. With the cloud provider the data resides on their default ‘hot’ data storage, whereas with Pure Storage, the data resides on a Flashblade server connected to the compute servers. The two graphs on the left represent analysis of whole-exome data while the outer right is done on whole-genome data.

Herzeel: “The experiment shows that the elPrep experiments on the Pure Storage infrastructure scale similarly to the cloud benchmarks. This suggests that the FlashBlade storage solution of Pure Storage performs at least as well as the cloud provider’s storage solutions.”

Conclusion

ElPrep, and its latest update elPrep5 in particular, is software for the analysis of sequencing data, including variant calling. It can be used as a replacement for established state-of-the-art genome analysis programs such as SAMtools, Picard and GATK4®.

“Our benchmark experiment shows that elPrep 5 speeds up the pipeline execution by a factor of 8 to 16 compared to GATK4®,” concludes Herzeel. “Concretely, elPrep 5 executes the variant calling pipeline in less than 6h for a whole-genome sample, and needs less than 8 minutes for a whole-exome sample. ElPrep achieves these speedups using algorithmic innovations and parallelization, runs on regular CPU-based servers without specialized accelerators, and uses fewer RAM and disk resources.”

Want to know more?

Read the blogpost from Pure Storage, discussing the use of elPrep on their infrastructure. In this blog, Yves Mahieu, EMEA healthcare and life sciences director at Pure Storage states: “The match between imec and Pure Storage is logical because elPrep and FlashBlade share the same DNA: to simplify and accelerate high-performance workloads by building intelligent accelerators on top of open industry standards.”
In the 2021 paper ‘Multithreaded variant calling in elPrep 5’ you can find more details on this software platform and the benchmarks.
The ExaScience Life Lab is an imec lab focused on providing software solutions for data-intensive high-performance computing problems, primarily in the life sciences domain. It focuses on large-scale machine learning for pharmaceutical companies, DNA sequencing software for hospitals and pharmaceutical companies, assay image feature extraction, advanced biostatistics and data analytics.

Charlotte Herzeel

Senior Researcher at imec, Exascience Lab

Charlotte Herzeel received her PhD in Computer Science from the Free University of Brussels in 2010. She then worked as a post-doctoral researcher at the Intel ExaScience Lab, where her research focused on parallel programming, task scheduling and software resilience. In 2013 she joined imec as a researcher at the ExaScience Life Lab, where she works on optimizing sequencing software and machine learning applications through parallel programming techniques. Major results include the open-source software packages elPrep and Halvade, which are published in highly ranked journals in the bioinformatics community. Since its publication, elPrep has been used at multiple hospitals, bioinformatics and pharmaceutical companies, as well as third-party sequencing platforms. Her research interests include software optimisation, software engineering, parallel and distributed programming, algorithms, machine learning, and bioinformatics.

Published on:

2 December 2021
