Update (January 4, 2022): The original benchmark didn’t set RAM properly and didn’t vary the BAM depth; both are fixed now.
samtools and sambamba are probably the most popular SAM/BAM handling tools. Once your analysis setup is in place, it all comes down to speed. How do they compare in the most common tasks (view, index, sort, sort -n)? And how does that change with multithreading? Here is my simple summary.
I used human transcriptomic BAMs (100k, 1M and 2M aligned reads) and human genomic BAMs of the same sizes. Everything ran on Ubuntu 16.04 on an SSD (read 560 MB/s, write 530 MB/s). Each run was timed with the time command, and I recorded the real (wall-clock) time.
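The real value reported by bash's time has the form 0m5.881s, which is awkward to compare or average. A minimal sketch of converting it to plain seconds follows. The helper name real_to_seconds is hypothetical and was not part of the original benchmark.

```shell
# real_to_seconds: convert a `time` value such as "0m5.881s" to seconds.
# Hypothetical helper for post-processing; not part of the original benchmark.
real_to_seconds() {
    # Split on "m", strip the trailing "s", then compute minutes*60 + seconds.
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Possible use in bash (time reports on stderr, hence the braces and 2>&1):
# real=$( { time samtools view -h test.bam > /dev/null; } 2>&1 | awk '/^real/ { print $2 }' )
# real_to_seconds "$real"
```

For example, real_to_seconds 0m5.881s prints 5.881.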
command | samtools (100k) | sambamba (100k) | samtools (1M) | sambamba (1M) | samtools (2M) | sambamba (2M) |
---|---|---|---|---|---|---|
index ; 1 thread | 0m0.107s | 0m0.029s | 0m0.718s | 0m0.666s | 0m1.613s | 0m1.423s |
index ; 12 threads | 0m0.022s | 0m0.048s | 0m0.182s | 0m0.203s | 0m0.347s | 0m0.341s |
view ; 1 thread | 0m0.075s | 0m0.069s | 0m2.815s | 0m0.985s | 0m5.881s | 0m1.947s |
view ; 12 threads | 0m0.077s | 0m0.084s | 0m0.651s | 0m0.786s | 0m1.230s | 0m1.432s |
sort ; 1 thread | 0m0.165s | 0m1.015s | 0m6.622s | 0m9.074s | 0m14.919s | 0m18.542s |
sort ; 12 threads | 0m0.184s | 0m0.912s | 0m1.024s | 0m2.932s | 0m2.336s | 0m4.160s |
sort -n ; 1 thread | 0m0.163s | 0m0.190s | 0m6.659s | 0m5.268s | 0m15.683s | 0m12.074s |
sort -n ; 12 threads | 0m0.210s | 0m0.081s | 0m1.087s | 0m0.896s | 0m2.683s | 0m1.943s |

Table 1: samtools and sambamba comparison – transcriptomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

command | samtools (100k) | sambamba (100k) | samtools (1M) | sambamba (1M) | samtools (2M) | sambamba (2M) |
---|---|---|---|---|---|---|
index ; 1 thread | 0m0.148s | 0m0.083s | 0m0.983s | 0m0.855s | 0m1.902s | 0m1.716s |
index ; 12 threads | 0m0.024s | 0m0.016s | 0m0.202s | 0m0.171s | 0m0.424s | 0m0.334s |
view ; 1 thread | 0m0.319s | 0m0.099s | 0m3.867s | 0m1.126s | 0m7.615s | 0m2.235s |
view ; 12 threads | 0m0.075s | 0m0.072s | 0m0.663s | 0m0.743s | 0m1.355s | 0m1.521s |
sort ; 1 thread | 0m0.812s | 0m0.815s | 0m8.434s | 0m8.919s | 0m17.051s | 0m17.989s |
sort ; 12 threads | 0m0.132s | 0m0.123s | 0m1.522s | 0m1.268s | 0m2.918s | 0m2.459s |
sort -n ; 1 thread | 0m1.013s | 0m0.888s | 0m11.729s | 0m9.909s | 0m23.510s | 0m19.239s |
sort -n ; 12 threads | 0m0.175s | 0m0.135s | 0m1.950s | 0m1.328s | 0m3.833s | 0m2.766s |

Table 2: samtools and sambamba comparison – genomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

Is there a clear winner? I wouldn’t say so. It depends on whether you have a transcriptomic or genomic BAM, how many threads you use, and which command (or combination of commands) you run.
sambamba seems to be faster in by-name sorting (-n) and in genomic sorting in general (single-threaded genomic coordinate sorting is close to a tie). samtools is faster in coordinate sorting of transcriptomic BAMs. samtools is also faster in multi-threaded view, while sambamba is faster in single-threaded view. sambamba is slightly faster at indexing bigger BAMs, while indexing of smaller BAMs is 50/50. See Table 3 and Table 4 for recommendations (based on the largest tested BAM).
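Each of these comparisons is a ratio of two wall-clock times. As a rough illustration, the sketch below computes that ratio for two values taken from the transcriptomic table (single-thread view on 2M reads: 5.881 s for samtools, 1.947 s for sambamba). The helper name speedup is hypothetical.

```shell
# speedup: ratio of two wall-clock times in seconds (first / second).
# A value above 1 means the second tool was faster.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Single-thread view, 2M transcriptomic reads: samtools time / sambamba time
speedup 5.881 1.947
```

With these inputs the ratio is about 3, so sambamba's single-thread view is roughly three times faster on that file.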
transcriptome | sort | sort -n | view | index |
---|---|---|---|---|
1 thread | samtools | sambamba | sambamba | sambamba |
12 threads | samtools | | samtools | sambamba |

Table 3: samtools and sambamba recommendation – transcriptomic BAM.

genome | sort | sort -n | view | index |
---|---|---|---|---|
1 thread | | sambamba | sambamba | sambamba |
12 threads | | | samtools | sambamba |

Table 4: samtools and sambamba recommendation – genomic BAM.

Note: It is important to realize that samtools sort -m sets RAM per thread, while sambamba sort -m sets total RAM. By default, samtools uses 768M per thread while sambamba uses 2GB in total.
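To give both tools the same memory budget, the sambamba total has to be the samtools per-thread value multiplied by the thread count. A minimal sketch of that arithmetic, with the hypothetical helper name sambamba_total_mem and whole gigabytes assumed:

```shell
# sambamba_total_mem: total memory to pass to `sambamba sort -m` so that it
# matches `samtools sort -@ THREADS -m PER_THREAD_GB` (whole gigabytes assumed).
sambamba_total_mem() {
    threads=$1
    per_thread_gb=$2
    echo "$(( threads * per_thread_gb ))G"
}

sambamba_total_mem 12 2   # prints 24G
sambamba_total_mem 1 2    # prints 2G
```

This is why the 12-thread runs pair samtools sort -m 2G with sambamba sort -m 24G.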
samtools v1.11
sambamba v0.8.0
# sorts (samtools -m is per thread, sambamba -m is total)
threads=1; input=test.bam; output=foo.bam
time samtools sort -@ "$threads" -m 2G "$input" > "$output"
time samtools sort -@ "$threads" -m 2G -n "$input" > "$output"
time sambamba sort -t "$threads" -m 2G -o "$output" "$input"
time sambamba sort -t "$threads" -m 2G -n -o "$output" "$input"
# 12 threads x 2G per thread for samtools = 24G total for sambamba
threads=12
time samtools sort -@ "$threads" -m 2G "$input" > "$output"
time samtools sort -@ "$threads" -m 2G -n "$input" > "$output"
time sambamba sort -t "$threads" -m 24G -o "$output" "$input"
time sambamba sort -t "$threads" -m 24G -n -o "$output" "$input"
# index and view (index writes its .bai file itself, so stdout is not redirected)
threads=1; output=foo.sam
time samtools index -@ "$threads" "$input"
time samtools view -@ "$threads" -h "$input" > "$output"
time sambamba index -t "$threads" "$input"
time sambamba view -t "$threads" -h "$input" > "$output"
threads=12
time samtools index -@ "$threads" "$input"
time samtools view -@ "$threads" -h "$input" > "$output"
time sambamba index -t "$threads" "$input"
time sambamba view -t "$threads" -h "$input" > "$output"