bedtools merge book-end merge

I never paid too much attention to this (I should have), but TIL that, by default, bedtools merge merges not only overlapping regions but also book-ended regions.

From the bedtools merge manual: -d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.

What does this mean? With the default (-d 0) settings, bedtools merge will also merge the following regions:

# Two regions with book-end relationship - they are next to each other like books:
12345678
AAAA
    BBBB
# Will get merged with the default bedtools merge -d 0 settings into:
12345678
AAAABBBB

To avoid merging book-ended regions, you have to change the default to -d -1. With that setting, the example above will no longer be merged and bedtools merge will merge only truly overlapping regions.
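
To illustrate, a minimal sketch with made-up coordinates (the BED file and its contents are mine, not from the original example):

# two book-ended intervals (BED is 0-based, half-open): bases 1-4 and 5-8
printf "chr1\t0\t4\nchr1\t4\t8\n" > regions.bed
# default (-d 0): the book-ended intervals are merged into chr1 0 8
bedtools merge -i regions.bed
# -d -1: only truly overlapping intervals are merged, so both intervals stay
bedtools merge -d -1 -i regions.bed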

Note: bedtools merge is smart enough to correctly merge VCF files with the END INFO field. This is useful, for example, when working with VCFs with structural variants, such as CNVs. You don't have to convert the VCF into BED, merge, and convert back to VCF.
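
In other words, you should be able to feed the (position-sorted) SV VCF straight to the merge; a hypothetical example (the file names are made up):

# merge CNV calls directly from the VCF (bedtools accepts BED/GFF/VCF as -i input)
bedtools merge -i cnv_calls.vcf > merged_cnvs.bed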

Extract a single sample from multi-sample VCF

Sometimes you want to subset/extract just a single sample from a large multi-sample VCF. In my case, we wanted to get per-sample VCFs from the 1000 Genomes Project.

As there is always more than one way to skin a cat, I compared vcftools and bcftools. It turned out that bcftools is ~50% faster than vcftools, and both produce identical output.

## Get a large VCF
wget ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz && tabix ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz

## Decide on a sample to extract
SAMPLE="NA18941"

## Extract and recode (keep only variants relevant for this sample) VCF for this specific sample without changing anything else
# bcftools
time bcftools view --samples $SAMPLE --min-ac=1 --no-version --no-update ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz > $SAMPLE.vcf
real    0m4.738s
user    0m4.717s
sys     0m0.017s

# vcftools
time vcftools --gzvcf ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz --indv $SAMPLE --non-ref-ac-any 1 --recode --recode-INFO-all --out $SAMPLE
real    0m7.505s
user    0m7.491s
sys     0m0.012s

# Compare bcftools and vcftools output
diff NA18941.vcf NA18941.recode.vcf
<empty>

bcftools is much faster (~50%) and much “smarter” than vcftools. By “smarter” I mean bcftools can process vcf.gz directly without any changes to the command. In vcftools, you have to switch from --vcf to --gzvcf. Also, vcftools has a default output name of $SAMPLE.recode.vcf, whereas bcftools writes to stdout by default (so you can pipe it to bgzip or similar).
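
For example, something along these lines should give you a compressed, indexed per-sample VCF in one go (my sketch, not part of the original benchmark):

## Extract one sample and bgzip/index the result directly
bcftools view --samples $SAMPLE --min-ac=1 --no-version --no-update ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz | bgzip > $SAMPLE.vcf.gz
tabix -p vcf $SAMPLE.vcf.gz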

Is stop codon part of CDS? Yes, but GTF format is “special”

Recently, I was going through a mix of gene annotation GTFs from different sources and had to fix them. One of the differences was the annotation of stop codons. More specifically, whether the stop codon was part of the annotated CDS (coding sequence).

I had to refresh my biology knowledge: biologically speaking, the stop codon is part of the CDS. Although it doesn't code for any amino acid, it definitely codes for something (the signal to stop translation) rather than nothing. Simple, isn't it? It would have been if everybody followed the rules, right, GTF format?

In GTF format, the stop codon is not part of the CDS by definition. In GFF, on the other hand, it is (definition). This can be especially confusing if you convert GFF to GTF (I strongly recommend using AGAT for any GTF to GFF and GFF to GTF conversions, even though it's written in Perl) and the source GFF doesn't explicitly list stop codon lines. In that case, the stop codon will stay included in the CDS even after the GFF to GTF conversion, and you'll have to clip the last three nucleotides from every CDS, the last coding exon, and sometimes the gene and transcript.
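
If you end up doing the clipping yourself, something along these lines could be a starting point. This is just my rough sketch (file names are made up), not the code I used: it only shortens the terminal CDS of each transcript, assumes the stop codon does not span an intron, and leaves exon/gene/transcript coordinates untouched.

# clip the stop codon (3 bp) from the terminal CDS of every transcript in a GTF
# where the stop codon is included inside the CDS (two passes over the same file)
awk 'BEGIN{FS=OFS="\t"}
    # first pass: remember the terminal CDS coordinate per transcript
    NR==FNR {
        if ($3=="CDS") {
            match($9, /transcript_id "[^"]+"/)
            tid=substr($9, RSTART+15, RLENGTH-16)
            if ($7=="+") { if (!(tid in last)  || $5 > last[tid])  last[tid]=$5 }
            if ($7=="-") { if (!(tid in first) || $4 < first[tid]) first[tid]=$4 }
        }
        next
    }
    # second pass: clip 3 bp off that terminal CDS (exon/gene/transcript lines are left as-is)
    {
        if ($3=="CDS") {
            match($9, /transcript_id "[^"]+"/)
            tid=substr($9, RSTART+15, RLENGTH-16)
            if ($7=="+" && $5==last[tid])  $5-=3
            if ($7=="-" && $4==first[tid]) $4+=3
        }
        print
    }' annot.gtf annot.gtf > annot.stop_clipped.gtf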

If you want to learn more about all the GTFs, GFFs, and other Fs, check out this excellent knowledge hub on the NBIS GitHub page. Another nice discussion is this one on Biostars.

Saving space with SLOW5/BLOW5 (and POD5) compression of FAST5 files

Edit (01/24/2023): As mentioned in Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing, POD5 is another file format that is supposed to be more efficient than the good ol’ FAST5. It is being developed directly by Nanopore, so you can probably expect to see it more often in the near future. I will add it to the comparison when I find the time to do it.

It took me some time to get around to testing the new compression of the bulky FAST5 files with slow5tools (Fast nanopore sequencing data analysis with SLOW5). Today, I finally almost ran out of space on our server, so it was the right time. And I was pleasantly surprised.

Installing on a regular Linux x86_64 architecture is very simple:

$ VERSION=v0.4.0
$ wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
$ ./slow5tools

I tested it on an older (2018) MinION direct RNA run with ~4k FAST5 files. The original size of the directory is 435M. Compression to BLOW5 (the binary version of SLOW5) is done with:

$ slow5tools f2s indir -d outdir_blow5

The compressed BLOW5 directory was 210M and the conversion took ~1s with the default number of processes (change it with -p INT). That's ~50% of space saved!
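
For example (the number of processes here is arbitrary):

$ slow5tools f2s indir -d outdir_blow5 -p 8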

There were some warnings, though:

[search_and_warn::WARNING] slow5tools-v0.4.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.

So it seems the conversion/compression removes some structure from the original FAST5. Let's see what happened by converting it back to FAST5:

$ slow5tools s2f outdir_blow5 -d indir_decompressed

The decompressed (BLOW5->FAST5) directory has 323M, so it's not a complete 1:1 conversion. The authors (here, here) say that FAST5 has a lot of wasted space which is removed during the compression. Also, SLOW5/BLOW5 doesn't store the basecalled sequence (ONT stopped putting the basecalled sequence into FAST5 in their latest MinKNOW anyway; source is the same link), but my testing set didn't have them to begin with. Also, slow5tools merged all the individual input reads into 8 bigger FAST5 files.

But let's see what's changed. We can simply list the input and the output with the HDF5 tools:

$ h5ls indir/DESKTOP_DO3S5TU_20180420_FAH85823_MN27347_sequencing_run_RNA_Control_73142_read_100_ch_110_strand.fast5 # One original read
Analyses                 Group
PreviousReadInfo         Group
Raw                      Group
UniqueGlobalKey          Group
$ h5ls indir_decompressed/0.fast5 | head -5 # One of the decompressed files
read_00c47597-464e-4263-94a1-b8794353d553 Group
read_0126ca44-9355-44a2-ac2a-5cb6b29715ff Group
read_012cffa1-d978-49f9-8746-91f501379dfd Group
read_0264ff46-f1d7-4983-a351-061dfe793b2d Group
read_032a1da5-9a9d-418a-831f-d17a3f7c2ca9 Group

We see that some fields from the original FAST5 are missing. This might be a problem if you have some uncommon or additional fields in your FAST5 that you would like to preserve. You will lose them with the compression.
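
If you want to check in detail what your own files would lose, a recursive listing of one read before and after the round trip should reveal it (just a suggestion on my side; h5ls -r walks all groups and datasets):

$ h5ls -r indir/DESKTOP_DO3S5TU_20180420_FAH85823_MN27347_sequencing_run_RNA_Control_73142_read_100_ch_110_strand.fast5
$ h5ls -r indir_decompressed/0.fast5 | grep read_00c47597 # all objects under one read group in the merged, decompressed file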

But let's see how it looks if we basecall the original and the decompressed FAST5 files. Run Guppy (I only have a very old version available, 3.3.2+739c7e5):

$ guppy_basecaller \
    -x "auto" \
    --flowcell FLO-MIN106 \
    --kit SQK-RNA002 \
    --hp_correct on \
    --records_per_fastq 0 \
    --u_substitution off \
    --trim_strategy none \
    --input_path indir/ \
    --save_path indir/guppy/ \
    --recursive \
    --compress_fastq \
    --fast5_out

Basecalling of the original FAST5 directory took 12712ms. Basecalling of the decompressed directory took 12043ms. Size of the original basecalled output directory was 576M and of the decompressed 466M. That’s still a ~20% difference.

I then ran a simple fastqc on the basecalled fastq files and they were identical (good!). I am not going to publish the outputs here. You'll have to trust me (and test it yourself, of course).

To sum it up: slow5tools can save a lot of space if you are, like me, sitting on 50 MinION runs. But you have to be careful with the lost fields. Also, I haven't tested it with bulk FAST5 files. I strongly recommend reading the closed issues at the slow5tools GitHub; a lot of the missing things are explained there. If you don't care about them, BLOW5 is the way to go!

SLOW5 format is not only about compression! The format can also improve the processing of the Nanopore reads/files. Check out the publication for more information.

Is sambamba faster than samtools?

Update (January 4, 2022): I didn't set the RAM properly and didn't have variable BAM depth; both are fixed now.

samtools and sambamba are probably the most popular SAM/BAM handling tools. And once you are all set with your analysis setup, it all comes down to speed. How do they compare in the most common tasks (view, index, sort, sort -n)? And how does it change with multithreading? Here is my simple summary.

I used a human transcriptomic BAM (100k, 1M and 2M aligned reads) and the same for a human genomic BAM. Everything ran on Ubuntu 16.04 on an SSD (read 560MB/s, write 530MB/s), using the time command and taking the real time.

| command             | samtools (100k) | sambamba (100k) | samtools (1M) | sambamba (1M) | samtools (2M) | sambamba (2M) |
|---------------------|-----------------|-----------------|---------------|---------------|---------------|---------------|
| index; 1 thread     | 0m0.107s        | 0m0.029s        | 0m0.718s      | 0m0.666s      | 0m1.613s      | 0m1.423s      |
| index; 12 threads   | 0m0.022s        | 0m0.048s        | 0m0.182s      | 0m0.203s      | 0m0.347s      | 0m0.341s      |
| view; 1 thread      | 0m0.075s        | 0m0.069s        | 0m2.815s      | 0m0.985s      | 0m5.881s      | 0m1.947s      |
| view; 12 threads    | 0m0.077s        | 0m0.084s        | 0m0.651s      | 0m0.786s      | 0m1.230s      | 0m1.432s      |
| sort; 1 thread      | 0m0.165s        | 0m1.015s        | 0m6.622s      | 0m9.074s      | 0m14.919s     | 0m18.542s     |
| sort; 12 threads    | 0m0.184s        | 0m0.912s        | 0m1.024s      | 0m2.932s      | 0m2.336s      | 0m4.160s      |
| sort -n; 1 thread   | 0m0.163s        | 0m0.190s        | 0m6.659s      | 0m5.268s      | 0m15.683s     | 0m12.074s     |
| sort -n; 12 threads | 0m0.210s        | 0m0.081s        | 0m1.087s      | 0m0.896s      | 0m2.683s      | 0m1.943s      |

Table 1: samtools and sambamba comparison – transcriptomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

| command             | samtools (100k) | sambamba (100k) | samtools (1M) | sambamba (1M) | samtools (2M) | sambamba (2M) |
|---------------------|-----------------|-----------------|---------------|---------------|---------------|---------------|
| index; 1 thread     | 0m0.148s        | 0m0.083s        | 0m0.983s      | 0m0.855s      | 0m1.902s      | 0m1.716s      |
| index; 12 threads   | 0m0.024s        | 0m0.016s        | 0m0.202s      | 0m0.171s      | 0m0.424s      | 0m0.334s      |
| view; 1 thread      | 0m0.319s        | 0m0.099s        | 0m3.867s      | 0m1.126s      | 0m7.615s      | 0m2.235s      |
| view; 12 threads    | 0m0.075s        | 0m0.072s        | 0m0.663s      | 0m0.743s      | 0m1.355s      | 0m1.521s      |
| sort; 1 thread      | 0m0.812s        | 0m0.815s        | 0m8.434s      | 0m8.919s      | 0m17.051s     | 0m17.989s     |
| sort; 12 threads    | 0m0.132s        | 0m0.123s        | 0m1.522s      | 0m1.268s      | 0m2.918s      | 0m2.459s      |
| sort -n; 1 thread   | 0m1.013s        | 0m0.888s        | 0m11.729s     | 0m9.909s      | 0m23.510s     | 0m19.239s     |
| sort -n; 12 threads | 0m0.175s        | 0m0.135s        | 0m1.950s      | 0m1.328s      | 0m3.833s      | 0m2.766s      |

Table 2: samtools and sambamba comparison – genomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

Is there a clear winner? I wouldn't say so. It depends on whether you have a transcriptomic or genomic BAM, how many threads you use, and which command (or combination of commands) you use.

sambamba seems to be faster in by-name sorting (-n) and in genomic sorting in general. samtools is faster in by-coordinate transcriptomic sorting. samtools is also faster in multi-threaded view, while sambamba is faster in single-threaded view. sambamba is slightly faster in indexing of bigger BAMs, while indexing of smaller BAMs is 50/50. See Table 3 and Table 4 for recommendations (based on the largest tested BAM).

| transcriptome | sort     | sort -n  | view     | index    |
|---------------|----------|----------|----------|----------|
| 1 thread      | samtools | sambamba | sambamba | sambamba |
| 12 threads    | samtools | sambamba | samtools | sambamba |

Table 3: samtools and sambamba recommendation – transcriptomic BAM.

| genome     | sort     | sort -n  | view     | index    |
|------------|----------|----------|----------|----------|
| 1 thread   | sambamba | sambamba | sambamba | sambamba |
| 12 threads | sambamba | sambamba | samtools | sambamba |

Table 4: samtools and sambamba recommendation – genomic BAM.

Note: It is important to realize samtools sort sets RAM per thread while sambamba sort (-m) sets total RAM. By default, samtools uses 768M per thread while sambamba uses 2GB total.

samtools v1.11
sambamba v0.8.0

# sorts
threads=1; input=test.bam; output=foo.bam
samtools sort -@ $threads -m 2G $input > $output
samtools sort -@ $threads -m 2G -n $input > $output
sambamba sort -t $threads -m 2G -o $output $input
sambamba sort -t $threads -m 2G -n -o $output $input

threads=12
samtools sort -@ $threads -m 2G $input > $output
samtools sort -@ $threads -m 2G -n $input > $output
sambamba sort -t $threads -m 24G -o $output $input
sambamba sort -t $threads -m 24G -n -o $output $input 

# index and view
threads=1; output=foo.sam
time samtools index -@ $threads $input > $output
time samtools view -@ $threads -h $input > $output
time sambamba index -t $threads $input > $output
time sambamba view -t $threads -h $input > $output

threads=12
time samtools index -@ $threads $input > $output
time samtools view -@ $threads -h $input > $output
time sambamba index -t $threads $input > $output
time sambamba view -t $threads -h $input > $output

Speed up deepTools computeMatrix

I like using deepTools in my analyses, but it might get very slow if you work with big(ger) files. Yes, it offers multithreading, but that often doesn't help, and using a lot of threads already killed our server once (it wasn't me, I swear).

One of the slowest parts I have encountered is the computeMatrix module. If your references (-R) are more than just a couple of hundred lines, it will get veeeery slow. After a few days of frustration waiting to get the matrix, I decided this was not the way to go.

It seems computeMatrix doesn't scale well with an increasing number of reference positions. It turned out to be much more efficient to split the references into smaller chunks, calculate a matrix for each, and then merge them together. A simple table of the runtimes looks like this:

| Number of positions in -R | Time / seconds |
|---------------------------|----------------|
| 1,000                     | 4              |
| 5,000                     | 20             |
| 10,000                    | 58             |
| 25,000                    | 275            |

Fig 1. Runtime of deepTools computeMatrix on different numbers of reference (-R) positions. The original reference file was randomly sampled 3 times (shuf -n number_of_positions) for each number of positions and the runtimes were averaged.

It seems that, in this experiment, the sweet spot is somewhere around 5,000 positions, but I think it might be different in every experiment. Since my references have around 100,000-150,000 positions, splitting them into chunks of 5,000 doesn't overload the system much.

The final code looks like this:

#!/bin/bash
#
# Speed up deepTools computeMatrix by splitting the references into smaller chunks and then merging the matrices together
#
positions=5000
threads=12
rnd=$RANDOM
# split reference into chunks by number of lines
split -l $positions ref.bed ref.chunks${rnd}
for chunk in ref.chunks${rnd}*; do
    # Rename the name column (4) in the BED to avoid potential problems with deepTools
    # naming, which might happen if the reference position names are not unique
    name=$(basename $chunk)
    name=${name##*.}
    cat $chunk | awk -v name=$name 'BEGIN {FS = "\t"; OFS = "\t"} {print $1,$2,$3,name,$5,$6}' > tmp.$rnd && mv tmp.$rnd $chunk
done
# calculate matrix for each chunk
for chunk in ref.chunks${rnd}*; do
    computeMatrix reference-point \
        --referencePoint TSS \
        -R $chunk \
        -S input.bw \
        -b 500 -a 500 \
        --skipZeros \
        --missingDataAsZero \
        --binSize 10 \
        --averageTypeBins median \
        --numberOfProcessors $threads \
        --outFileName ${chunk}.gz
done
# merge the chunks back to one file
computeMatrixOperations rbind -m ref.chunks${rnd}*.gz -o ref.matrix.gz && rm ref.chunks${rnd}*.gz
# make heatmaps
plotHeatmap \
    -m ref.matrix.gz \
    --sortUsing mean \
    --averageTypeSummaryPlot mean \
    --missingDataColor "#440154" \
    --colorMap viridis \
    --zMax 100 \
    --linesAtTickMarks \
    --refPointLabel "TSS" \
    --heatmapHeight 20 \
    --heatmapWidth 10 \
    --dpi 300 \
    --outFileName ref.png
rm ref.chunks${rnd}*