Is sambamba faster than samtools?

Update (January 4, 2022): I hadn’t set RAM properly and hadn’t varied the BAM depth; both are fixed now.

samtools and sambamba are probably the most popular SAM/BAM handling tools. Once your analysis setup is in place, it all comes down to speed. How do they compare on the most common tasks (view, index, sort, sort -n)? And how does that change with multithreading? Here is my simple summary.

I used a human transcriptomic BAM (100k, 1M and 2M aligned reads) and a human genomic BAM at the same depths. Everything ran on Ubuntu 16.04 on an SSD (read 560 MB/s, write 530 MB/s), using the time command and taking the real (wall-clock) time.
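
For illustration, each cell in the tables below boils down to a loop like this (a minimal sketch; the BAM names tx_100k.bam etc. are placeholders, not the exact script I ran):

# Hypothetical timing loop; the "real" line of the time output goes into the tables
for input in tx_100k.bam tx_1M.bam tx_2M.bam; do
    for threads in 1 12; do
        time samtools view -@ $threads -h $input > /dev/null
    done
done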

| command | samtools (100k) | sambamba (100k) | samtools (1M) | sambamba (1M) | samtools (2M) | sambamba (2M) |
|---|---|---|---|---|---|---|
| index; 1 thread | 0m0.107s | 0m0.029s | 0m0.718s | 0m0.666s | 0m1.613s | 0m1.423s |
| index; 12 threads | 0m0.022s | 0m0.048s | 0m0.182s | 0m0.203s | 0m0.347s | 0m0.341s |
| view; 1 thread | 0m0.075s | 0m0.069s | 0m2.815s | 0m0.985s | 0m5.881s | 0m1.947s |
| view; 12 threads | 0m0.077s | 0m0.084s | 0m0.651s | 0m0.786s | 0m1.230s | 0m1.432s |
| sort; 1 thread | 0m0.165s | 0m1.015s | 0m6.622s | 0m9.074s | 0m14.919s | 0m18.542s |
| sort; 12 threads | 0m0.184s | 0m0.912s | 0m1.024s | 0m2.932s | 0m2.336s | 0m4.160s |
| sort -n; 1 thread | 0m0.163s | 0m0.190s | 0m6.659s | 0m5.268s | 0m15.683s | 0m12.074s |
| sort -n; 12 threads | 0m0.210s | 0m0.081s | 0m1.087s | 0m0.896s | 0m2.683s | 0m1.943s |

Table 1: samtools and sambamba comparison – transcriptomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

| command | samtools (100k) | sambamba (100k) | samtools (1M) | sambamba (1M) | samtools (2M) | sambamba (2M) |
|---|---|---|---|---|---|---|
| index; 1 thread | 0m0.148s | 0m0.083s | 0m0.983s | 0m0.855s | 0m1.902s | 0m1.716s |
| index; 12 threads | 0m0.024s | 0m0.016s | 0m0.202s | 0m0.171s | 0m0.424s | 0m0.334s |
| view; 1 thread | 0m0.319s | 0m0.099s | 0m3.867s | 0m1.126s | 0m7.615s | 0m2.235s |
| view; 12 threads | 0m0.075s | 0m0.072s | 0m0.663s | 0m0.743s | 0m1.355s | 0m1.521s |
| sort; 1 thread | 0m0.812s | 0m0.815s | 0m8.434s | 0m8.919s | 0m17.051s | 0m17.989s |
| sort; 12 threads | 0m0.132s | 0m0.123s | 0m1.522s | 0m1.268s | 0m2.918s | 0m2.459s |
| sort -n; 1 thread | 0m1.013s | 0m0.888s | 0m11.729s | 0m9.909s | 0m23.510s | 0m19.239s |
| sort -n; 12 threads | 0m0.175s | 0m0.135s | 0m1.950s | 0m1.328s | 0m3.833s | 0m2.766s |

Table 2: samtools and sambamba comparison – genomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

Is there a clear winner? I wouldn’t say so. It depends on whether you have a transcriptomic or genomic BAM, how many threads you use, and which command (or combination of commands) you run.

sambamba seems to be faster at sorting by name (-n). samtools is faster at sorting transcriptomic BAMs by coordinate; for genomic coordinate sorting, samtools wins single-threaded while sambamba wins multi-threaded. samtools is also faster in multi-threaded view, while sambamba is clearly faster in single-threaded view. sambamba is slightly faster at indexing bigger BAMs; indexing of smaller BAMs is a toss-up. See Table 3 and Table 4 for recommendations (based on the largest tested BAM).

| transcriptome | sort | sort -n | view | index |
|---|---|---|---|---|
| 1 thread | samtools | sambamba | sambamba | sambamba |
| 12 threads | samtools | sambamba | samtools | sambamba |

Table 3: samtools and sambamba recommendation – transcriptomic BAM.

| genome | sort | sort -n | view | index |
|---|---|---|---|---|
| 1 thread | samtools | sambamba | sambamba | sambamba |
| 12 threads | sambamba | sambamba | samtools | sambamba |

Table 4: samtools and sambamba recommendation – genomic BAM.

Note: it is important to realize that samtools sort (-m) sets RAM per thread while sambamba sort (-m) sets total RAM. By default, samtools uses 768M per thread while sambamba uses 2G total. That is why the 12-thread commands below give samtools -m 2G (12 × 2G = 24G total) but sambamba -m 24G.

samtools v1.11
sambamba v0.8.0

# sorts
threads=1; input=test.bam; output=foo.bam
time samtools sort -@ $threads -m 2G $input > $output
time samtools sort -@ $threads -m 2G -n $input > $output
time sambamba sort -t $threads -m 2G -o $output $input
time sambamba sort -t $threads -m 2G -n -o $output $input

threads=12
# samtools -m is per thread (12 x 2G = 24G total); sambamba -m is total, hence 24G
time samtools sort -@ $threads -m 2G $input > $output
time samtools sort -@ $threads -m 2G -n $input > $output
time sambamba sort -t $threads -m 24G -o $output $input
time sambamba sort -t $threads -m 24G -n -o $output $input

# index and view
threads=1; output=foo.sam
time samtools index -@ $threads $input # index writes $input.bai; nothing goes to stdout
time samtools view -@ $threads -h $input > $output
time sambamba index -t $threads $input
time sambamba view -t $threads -h $input > $output

threads=12
time samtools index -@ $threads $input
time samtools view -@ $threads -h $input > $output
time sambamba index -t $threads $input
time sambamba view -t $threads -h $input > $output

Cleaning big git repositories

I like to make a semi-automatic git repository for every analysis I’m working on. Because they are semi-automatic (= I don’t add files one by one but rather exclude the files I don’t want to track), it sometimes happens that I accidentally add a huge file. And because the semi-automatic commits run from crontab, the huge file gets buried deep in the git commit tree. If you happen to be lazy like me, it’s worth checking the sizes of your git repos from time to time (or just looking in /var/mail/username for git repo size errors). This is the easiest way to scan & clean repos I have found so far.
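
For context, the semi-automatic part is nothing more than a cron job committing everything not excluded by .gitignore. A sketch of such a crontab entry (the path and schedule are placeholders, not my actual setup):

# crontab -e; note that % must be escaped as \% inside crontab
0 2 * * * cd /home/yourusername/projects/analysis && git add -A && git commit -m "auto-commit $(date +\%F)" >/dev/null 2>&1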

Don’t waste your time with the git filter-branch command (or its --index-filter variant) and just go straight for BFG Repo-Cleaner.

Install BFG Repo-Cleaner:

# Get Scala (for bfg)
$ cd ~/tools/
$ wget https://github.com/sbt/sbt/releases/download/v1.5.5/sbt-1.5.5.zip -O sbt-1.5.5.zip
$ unzip sbt-1.5.5.zip
$ mv sbt sbt-1.5.5
$ cd sbt-1.5.5/
$ ln -s $(pwd)/sbt ~/bin/sbt # Make link to your bin

# Install bfg; follow the instructions https://github.com/rtyley/bfg-repo-cleaner/blob/master/BUILD.md
$ cd ~/tools/
$ wget https://github.com/rtyley/bfg-repo-cleaner/archive/refs/tags/v1.14.0.tar.gz -O bfg-repo-cleaner-1.14.0.tar.gz
$ tar xvzf bfg-repo-cleaner-1.14.0.tar.gz
$ cd bfg-repo-cleaner-1.14.0/
$ sbt # <- start the sbt console
sbt> bfg/assembly # <- downloads dependencies, runs the tests, builds the jar

# Make a link in your bin directory:
# Paste this (without the ">" symbol) into ~/bin/bfg
$ nano ~/bin/bfg
#   > #!/bin/bash
#   > java -jar /home/yourusername/tools/bfg-repo-cleaner-1.14.0/bfg/target/bfg-1.14.0-unknown.jar "$@"
$ chmod +x ~/bin/bfg # Make it executable

# Test all is working as it should
$ bfg --help

You will also need a git object size scanner. The easiest is to use either the main idea from here or the method from the comment by kenorb (that’s what I am using).
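
For reference, the kenorb method boils down to a git alias that walks every object in the history and sorts the blobs by size. A sketch along those lines for your ~/.gitconfig (the exact alias from the comment may differ slightly):

[alias]
    # List the 20 largest blobs anywhere in the history (size in bytes, then path)
    big-files = "!git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | awk '/^blob/ {print $3, $4}' | sort -rn | head -n 20"

Running git big-files inside a repo then prints the biggest offenders; that’s the command used below.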

Once you have BFG and the size scanner in place and you know they work, you can start cleaning. Be extremely careful, because the following steps remove objects/directories from the git repo completely. Once they are gone, they cannot be brought back to life.

# Find big .gits
$ cd ~/
$ find . -type d -name ".git" -prune -exec du -sh {} \;

# Locate the big repo and check what are the big files
$ cd ~/projects/directory-with-huge-git-repo/
$ git big-files

# Check the big files (or directories), add them to .gitignore so they are not added again in the future, and commit the changes
$ nano .gitignore
$ git add .
$ git commit -m "Updated .gitignore before cleaning with BFG"

# Repack blobs for faster cleaning
$ git gc --prune=now --aggressive

And now you are ready for the BFG cleaning. Again, be extremely careful!

# Remove files bigger than x megabytes
$ bfg --strip-blobs-bigger-than 10M .git
# or remove files by name (quote the pattern so the shell doesn't expand it)
$ bfg --delete-files 'test.*.tab' .git
# or folders
$ bfg --delete-folders '*bckp' .git

# Clean and finalize the deletion
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

# Add changes, commit and push to master
$ git add .
$ git commit -m "Commit after cleaning with BFG"
$ git push -f origin master

# Confirm the repo is nice and small
$ du -sh .git