bedtools merge merges book-ended regions by default

I never paid too much attention to this (I should have), but TIL that bedtools merge, by default, merges not only overlapping regions but also book-ended regions.

From the bedtools merge manual: -d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.

What does this mean? With the default (-d 0) settings, bedtools merge will also merge the following regions:

# Two regions with book-end relationship - they are next to each other like books:
12345678
AAAA
BBBB
# Will get merged with the default bedtools merge -d 0 settings into:
12345678
AAAABBBB

To avoid merging book-ended regions, you have to change the default to -d -1. With that setting, the example above will no longer get merged and bedtools merge will only merge truly overlapping regions.
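Here is a minimal sketch of the two behaviours, assuming a sorted BED file regions.bed holding the two book-ended features from the diagram above (A = positions 1-4, B = positions 5-8, i.e. 0-based BED intervals 0-4 and 4-8):

$ cat regions.bed
chr1    0       4
chr1    4       8

# Default (-d 0): book-ended features get merged into one region
$ bedtools merge -i regions.bed
chr1    0       8

# -d -1: only truly overlapping features are merged, book-ended ones stay separate
$ bedtools merge -i regions.bed -d -1
chr1    0       4
chr1    4       8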

Note: bedtools merge is smart enough to correctly merge VCF files with an END INFO field. This is useful, for example, when working with VCFs containing structural variants, such as CNVs. You don’t have to convert the VCF into BED, merge, and convert back to VCF.
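For illustration, a hedged sketch of what that looks like (cnvs.vcf is a hypothetical, position-sorted SV/CNV VCF; the output of merge is a set of merged intervals):

$ bedtools merge -i cnvs.vcf > merged_regions.bed   # per the note above, the END INFO field defines the record ends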

Extract a single sample from a multi-sample VCF

Sometimes you want to subset/extract just a single sample from a large multi-sample VCF. In my case, we wanted to get per-sample VCFs from the 1000 Genomes Project.

As there is always more than one way to skin a cat, I compared vcftools and bcftools. It turned out bcftools is ~50% faster than vcftools and that both produce identical output.

## Get a large VCF
wget ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz && tabix ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz

## Decide on a sample to extract
SAMPLE="NA18941"

## Extract and recode (keep only variants relevant for this sample) VCF for this specific sample without changing anything else
# bcftools
time bcftools view --samples $SAMPLE --min-ac=1 --no-version --no-update \
    ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz > $SAMPLE.vcf
real    0m4.738s
user    0m4.717s
sys     0m0.017s

# vcftools
time vcftools --gzvcf ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz \
    --indv $SAMPLE --non-ref-ac-any 1 --recode --recode-INFO-all --out $SAMPLE
real    0m7.505s
user    0m7.491s
sys     0m0.012s

# Compare bcftools and vcftools output
 diff NA18941.vcf NA18941.recode.vcf
<empty>

bcftools is much faster (~50%) and much “smarter” than vcftools. By “smarter” I mean bcftools can process vcf.gz directly without any changes to the command; in vcftools, you have to switch from --vcf to --gzvcf. Also, vcftools writes to the default output name $SAMPLE.recode.vcf, whereas bcftools writes to stdout by default (so you can pipe it to bgzip or similar, as sketched below).
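For example, a quick sketch reusing the $SAMPLE variable from above: the bcftools output can go straight into bgzip and be indexed, without an intermediate uncompressed VCF.

bcftools view --samples $SAMPLE --min-ac=1 --no-version --no-update \
    ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz \
    | bgzip > $SAMPLE.vcf.gz && tabix -p vcf $SAMPLE.vcf.gz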

Is the stop codon part of the CDS? Yes, but GTF format is “special”

Recently, I was going through a mix of gene annotation GTFs from different sources and had to fix them. One of the differences was the annotation of stop codons. More specifically, whether the stop codon was part of the annotated CDS (coding sequence).

I had to refresh my biology knowledge: biologically speaking, the stop codon is part of the CDS. Although it doesn’t code for any amino acid, it still codes for something (the termination signal) rather than nothing. Simple, isn’t it? It would have been if everybody followed the rules, right, GTF format?

In the GTF format, the stop codon is not part of the CDS by definition. In GFF, on the other hand, it is (definition). It can be especially confusing if you convert GFF to GTF (I strongly recommend using AGAT for any GTF to GFF and GFF to GTF conversions, even though it’s written in Perl) and the source GFF doesn’t explicitly list stop codon lines. In that case, the stop codon will still be included in the CDS even after the GFF to GTF conversion. You’ll then have to clip the last three nucleotides from every CDS, the last coding exon, and sometimes the gene and transcript.
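For completeness, a hedged sketch of the AGAT conversions I mean (script names from recent AGAT releases; exact options may differ between versions):

# GFF -> GTF
agat_convert_sp_gff2gtf.pl --gff annotation.gff -o annotation.gtf
# GTF/GFF -> normalized GFF3
agat_convert_sp_gxf2gxf.pl --gxf annotation.gtf -o annotation.gff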

If you want to learn more about all the GTFs, GFFs, and other Fs, check out this excellent knowledge hub on the NBIS GitHub page. Another nice discussion is this one on Biostars.

htmlwidgets::saveWidget Error: pandoc document conversion failed with error 97

I always use htmlwidgets::saveWidget to save my Plotly html plots in R. For some unknown reason, I started to get:

Could not find data file templates/file6efb40d38e3c.html                                        
Error: pandoc document conversion failed with error 97                                       
Execution halted   

I first thought I had some problem with the temp dir, so I tried adding dir.create(tempdir()), but that wasn’t it. After some more digging, I found this GitHub issue and an answer from cpsievert, which solved the issue.

Instead of using file=outputdir/filename.html, which is how I usually export/save everything in R, I had to split it into file=filename.html and libdir=outputdir. The error disappeared and I could happily save my beautiful Plotly html plots.

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: openSUSE Leap 15.0

Matrix products: default
BLAS/LAPACK: /home/joppelt/Miniconda3/envs/riboseq/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] htmlwidgets_1.5.3

loaded via a namespace (and not attached):
[1] compiler_3.6.3    htmltools_0.5.1.1 digest_0.6.27     rlang_0.4.10

bedtools bamtobed doesn’t handle mismatch at read end well

I have PARE-Seq and RIBO-Seq experiments where I really care about the exact position of the first mapped nucleotide. The analysis I do relies on the great bedtools bamtobed. I was just wondering what happens if the aligner, by any chance, reports a mismatch (1X), an insertion (1I), or a soft clip (1S) at the mapped 5′ end.

Example SAM:

$ samtools view in.bam
match	256	NC_001133.9	549	0	26M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI
oneInser	256	NC_001133.9	549	0	1I25M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI
oneMism	256	NC_001133.9	549	0	1X25M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI
oneSoft	256	NC_001133.9	549	0	1S25M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI

will result in:

$ bedtools bamtobed -i in.bam
NC_001133.9     548     574     match   0       +
NC_001133.9     548     573     oneInser        0       +
NC_001133.9     548     574     oneMism        0       +
NC_001133.9     548     573     oneSoft 0       +

Insertions and soft clips are handled properly: they are not reported as part of the match. However, a mismatch (1X) at the 5′ end is reported as a match. That makes sense inside the read, but it doesn’t make sense at the ends (either the 5′ or 3′ mapped end). So be careful and check your mappings.
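A quick, hedged way to see whether this affects your data is to count alignments whose CIGAR starts with a one-base X, I, or S (note that for reverse-strand reads the mapped 5′ end corresponds to the end of the CIGAR string, which this simple sketch ignores):

$ samtools view in.bam | awk '$6 ~ /^1[XIS]/' | wc -l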

Saving space with SLOW5/BLOW5 (and POD5) compression of FAST5 files

Edit (01/24/2023): As mentioned in Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing, POD5 is another file format that is supposed to be more efficient than the good ol’ FAST5. It is being developed directly by Nanopore, so you can probably expect to see it more often in the near future. I will add it to the comparison when I find the time to do it.

It took me some time to test the new compression of the bulky FAST5 with slow5tools (Fast nanopore sequencing data analysis with SLOW5). Today, I finally almost ran out of space on our server so it was the right time. And I was pleasantly surprised.

Installing on a regular Linux x86_64 architecture is very simple:

$ VERSION=v0.4.0
$ wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
$ ./slow5tools

I tested it on an older (2018) MinION direct RNA run with 4k single-read FAST5 files. The original size of the directory was 435M. Conversion to BLOW5 (the binary version of SLOW5) goes:

$ slow5tools f2s indir -d outdir_blow5

The compressed BLOW5 directory was 210M, and the conversion took ~1s with the default number of processes (change it with -p INT). That’s ~50% of the space saved!
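The sizes above are just a simple directory comparison:

$ du -sh indir outdir_blow5
435M    indir
210M    outdir_blow5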

There were some warnings, though:

[search_and_warn::WARNING] slow5tools-v0.4.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.

So it seems the conversion/compression removes some structure from the original FAST5. Let’s see what happens if we convert it back to FAST5:

$ slow5tools s2f outdir_blow5 -d indir_decompressed

The decompressed (BLOW5->FAST5) directory is 323M, so it’s not a complete 1:1 conversion. The authors (here, here) say that FAST5 has a lot of wasted space which is removed during the compression. Also, SLOW5/BLOW5 doesn’t store the basecalled sequence (ONT stopped putting the basecalled sequence into FAST5 in their latest MinKNOW anyway – source is the same link), but my testing set didn’t have it to begin with. Also, slow5tools merged all the individual input reads into 8 bigger FAST5 files.

But let’s see what’s changed. We can simply list the input and the output with the HDF5 tools:

$ h5ls indir/DESKTOP_DO3S5TU_20180420_FAH85823_MN27347_sequencing_run_RNA_Control_73142_read_100_ch_110_strand.fast5 # One original read
Analyses                 Group
PreviousReadInfo         Group
Raw                      Group
UniqueGlobalKey          Group
$ h5ls indir_decompressed/0.fast5 | head -5 # One of the decompressed files
read_00c47597-464e-4263-94a1-b8794353d553 Group
read_0126ca44-9355-44a2-ac2a-5cb6b29715ff Group
read_012cffa1-d978-49f9-8746-91f501379dfd Group
read_0264ff46-f1d7-4983-a351-061dfe793b2d Group
read_032a1da5-9a9d-418a-831f-d17a3f7c2ca9 Group

We see that some fields from the original FAST5 are missing. This might be a problem if you have some uncommon or additional fields in your FAST5 that you would like to preserve. You will lose them with the compression.
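If you want to check exactly which groups/attributes you would lose, a simple (hedged) approach is to list the full HDF5 tree of a read before and after the round trip and compare by eye; -r makes h5ls recursive:

$ h5ls -r indir/DESKTOP_DO3S5TU_20180420_FAH85823_MN27347_sequencing_run_RNA_Control_73142_read_100_ch_110_strand.fast5
$ h5ls -r indir_decompressed/0.fast5 | less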

But let’s see how it looks if we basecall the original and the decompressed FAST5 files. Run Guppy (I only have a very old version available, 3.3.2+739c7e5):

$ guppy_basecaller \
    -x "auto" \
    --flowcell FLO-MIN106 \
    --kit SQK-RNA002 \
    --hp_correct on \
    --records_per_fastq 0 \
    --u_substitution off \
    --trim_strategy none \
    --input_path indir/ \
    --save_path indir/guppy/ \
    --recursive \
    --compress_fastq \
    --fast5_out

Basecalling of the original FAST5 directory took 12712ms. Basecalling of the decompressed directory took 12043ms. Size of the original basecalled output directory was 576M and of the decompressed 466M. That’s still a ~20% difference.

I then ran a simple fastqc on the basecalled fastq files and they were identical (good!). I am not going to publish the outputs here; you’ll have to trust me (and test it yourself, of course).

To sum it up – slow5tools can save a lot of space if you are like me with 50 MinION runs. But you have to be careful with the lost fields. Also, I haven’t tested it with bulk FAST5 files. I strongly recommend reading the closed issues tab at slow5tools GitHub. A lot of missing things are explained there. If you don’t care about them, BLOW5 is the way to go!

SLOW5 format is not only about compression! The format can also improve the processing of the Nanopore reads/files. Check out the publication for more information.

Differential expression testing of genes with variable expression and/or large test groups

During regular differential gene expression testing, the null hypothesis is usually that there is no difference between the tested groups (log2 fold-change equals zero). Of course, this is almost never exactly true, but in practice we accept it and say the groups are approximately the same.

However, there are many situations where this doesn’t hold. The simplest one: the groups are so different that, for most genes, the difference deviates from logFC = 0. This can happen when the tested groups are very large, highly variable, or the applied treatment is very harsh. The question then becomes: what level of difference is high enough to be biologically significant? In other words, how different do the two groups (or genes, for example) have to be to be really different?

One possible solution is the use of the so-called banded hypothesis test, where the null hypothesis is not exact equality, but that the difference is below a threshold. Wolfgang Huber has a nice Twitter thread about it here, and the cited DESeq2 publication Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 (section Hypothesis tests with thresholds on effect size). edgeR has its own implementation of a very similar concept, as highlighted by Davis McCarthy here, and the cited edgeR publication Testing significance relative to a fold-change threshold is a TREAT.

The method, if I dare to simplify it drastically, is based on testing whether the differential expression in an RNA-Seq/microarray experiment is greater than a given (biologically meaningful) threshold. A change has to be of sufficient magnitude to be considered biologically significant, which helps to guard against small but consistent changes reaching statistical significance in large datasets.

The problem with an incorrect null hypothesis might be one of the issues in the publication Exaggerated false positives by popular differential expression methods when analyzing human population samples, where they evaluated the suitability of edgeR and DESeq2 for testing differential expression in population-sized groups. Of course, in this case the classic null hypothesis doesn’t make any sense and is broken from the very beginning: most of the genes will not have logFC = 0. If you scroll to the methods and check the code, you can see they used the standard edgeR/DESeq2 commands, which are not designed for such tests. I believe that if they had used the threshold tests above, they would have come to completely different conclusions.

The problems with this paper are also summarized in Michael Love’s Tweet, highlighting issues with the experimental design and the execution itself.

Changing font in R plots (Arial Narrow)

To my surprise, changing fonts in R plots is not a trivial task. It’s even worse if you are asked to change the font to one of the Windows-licensed fonts and you are on Linux. I personally think it’s easier and faster to change the font afterwards in a graphics editor, but some people don’t know how (or don’t want to know how).

First, I tried the extrafont package. Even with fixes for “No FontName. Skipping”, “Error in grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y”, and trying to install Windows fonts in Ubuntu, I still didn’t get the desired results.

The best (and, for me, the only working) solution was to use the showtext package (as found here). You will need to download the font you want in the .ttf format (.ttc or .otf also work). For me, it was Arial Narrow from here.

> # Install showtext package
> install.packages("showtext")
> # Load the library and install the font
> library("showtext") 
> font_add(family = "arialn", regular = "~/Downloads/arialn.ttf")
> # Before plotting, "activate" showtext
> showtext_auto()
> # Do some fancy plots, add theme(text = element_text(family = "arialn")) to all the features where you want to use this font if you are using ggplot2, and save to PDF
> # Deactivate showtext if you no longer need it
> showtext_auto(FALSE)

And that’s it! You’ll have nice plots with the font you want.

Or, you can choose any of the free Google fonts from here. If I hadn’t found Arial Narrow, I would have gone with Archivo Narrow. In the case of Google fonts, you don’t need to download anything.

> # Get the font from Google fonts
> font_add_google("Archivo Narrow", "archivon")

See more examples in the showtext readme.

But one important thing to remember: this will not store the text in the PDF as text, but as curves. If you then import the PDF into a graphics editor, it won’t be easy to edit the text. You would have to convert the curves back to text (somehow) and then edit it as text.

Is sambamba faster than samtools?

Update (January 4, 2022): I hadn’t set the RAM properly and didn’t test variable BAM depths; this is fixed now.

samtools and sambamba are probably the most popular SAM/BAM handling tools. And once you are all set with your analysis setup, it all comes down to speed. How do they compare in the most common tasks (view, index, sort, sort -n)? And how does that change with multithreading? Here is my simple summary.

I used a human transcriptomic BAM (100k, 1M, and 2M aligned reads) and the same depths for a human genomic BAM. Everything ran on Ubuntu 16.04 on an SSD (read 560 MB/s, write 530 MB/s), timed with the time command and reporting the real time.

reads                 100k                     1M                       2M
command               samtools | sambamba      samtools | sambamba      samtools  | sambamba
index; 1 thread       0m0.107s | 0m0.029s      0m0.718s | 0m0.666s      0m1.613s  | 0m1.423s
index; 12 threads     0m0.022s | 0m0.048s      0m0.182s | 0m0.203s      0m0.347s  | 0m0.341s
view; 1 thread        0m0.075s | 0m0.069s      0m2.815s | 0m0.985s      0m5.881s  | 0m1.947s
view; 12 threads      0m0.077s | 0m0.084s      0m0.651s | 0m0.786s      0m1.230s  | 0m1.432s
sort; 1 thread        0m0.165s | 0m1.015s      0m6.622s | 0m9.074s      0m14.919s | 0m18.542s
sort; 12 threads      0m0.184s | 0m0.912s      0m1.024s | 0m2.932s      0m2.336s  | 0m4.160s
sort -n; 1 thread     0m0.163s | 0m0.190s      0m6.659s | 0m5.268s      0m15.683s | 0m12.074s
sort -n; 12 threads   0m0.210s | 0m0.081s      0m1.087s | 0m0.896s      0m2.683s  | 0m1.943s
Table 1: samtools and sambamba comparison – transcriptomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

reads                 100k                     1M                        2M
command               samtools | sambamba      samtools  | sambamba      samtools  | sambamba
index; 1 thread       0m0.148s | 0m0.083s      0m0.983s  | 0m0.855s      0m1.902s  | 0m1.716s
index; 12 threads     0m0.024s | 0m0.016s      0m0.202s  | 0m0.171s      0m0.424s  | 0m0.334s
view; 1 thread        0m0.319s | 0m0.099s      0m3.867s  | 0m1.126s      0m7.615s  | 0m2.235s
view; 12 threads      0m0.075s | 0m0.072s      0m0.663s  | 0m0.743s      0m1.355s  | 0m1.521s
sort; 1 thread        0m0.812s | 0m0.815s      0m8.434s  | 0m8.919s      0m17.051s | 0m17.989s
sort; 12 threads      0m0.132s | 0m0.123s      0m1.522s  | 0m1.268s      0m2.918s  | 0m2.459s
sort -n; 1 thread     0m1.013s | 0m0.888s      0m11.729s | 0m9.909s      0m23.510s | 0m19.239s
sort -n; 12 threads   0m0.175s | 0m0.135s      0m1.950s  | 0m1.328s      0m3.833s  | 0m2.766s
Table 2: samtools and sambamba comparison – genomic BAM. 100k – 100,000 aligned reads, 1M – 1,000,000 aligned reads, 2M – 2,000,000 aligned reads.

Is there a clear winner? I wouldn’t say so. It depends on whether you have a transcriptomic or genomic BAM, how many threads you use, and which command (or combination of commands) you run.

sambamba seems to be faster at sorting by name (-n) and at genomic sorting in general. samtools is faster at sorting a transcriptomic BAM by coordinates. samtools is also faster at multi-threaded view, while sambamba is faster at single-threaded view. sambamba is slightly faster at indexing bigger BAMs, while indexing of smaller BAMs is a toss-up. See Table 3 and Table 4 for recommendations (based on the largest tested BAM).

transcriptome   sort       sort -n    view       index
1 thread        samtools   sambamba   sambamba   sambamba
12 threads      samtools   sambamba   samtools   sambamba
Table 3: samtools and sambamba recommendation – transcriptomic BAM.
genome          sort       sort -n    view       index
1 thread        sambamba   sambamba   sambamba   sambamba
12 threads      sambamba   sambamba   samtools   sambamba
Table 4: samtools and sambamba recommendation – genomic BAM.

Note: It is important to realize samtools sort sets RAM per thread while sambamba sort (-m) sets total RAM. By default, samtools uses 768M per thread while sambamba uses 2GB total.

samtools v1.11
sambamba v0.8.0

# sorts
threads=1; input=test.bam; output=foo.bam
samtools sort -@ $threads -m 2G $input > $output
samtools sort -@ $threads -m 2G -n $input > $output
sambamba sort -t $threads -m 2G -o $output $input
sambamba sort -t $threads -m 2G -n -o $output $input

threads=12
samtools sort -@ $threads -m 2G $input > $output
samtools sort -@ $threads -m 2G -n $input > $output
sambamba sort -t $threads -m 24G -o $output $input
sambamba sort -t $threads -m 24G -n -o $output $input 

# index and view
threads=1; output=foo.sam
time samtools index -@ $threads $input > $output
time samtools view -@ $threads -h $input > $output
time sambamba index -t $threads $input > $output
time sambamba view -t $threads -h $input > $output

threads=12
time samtools index -@ $threads $input > $output
time samtools view -@ $threads -h $input > $output
time sambamba index -t $threads $input > $output
time sambamba view -t $threads -h $input > $output

Cleaning big git repositories

I like to make semi-automatic git repositories for every analysis I am working on. Because they are semi-automatic (= I don’t add files one by one, but rather exclude the files I don’t want to track), it sometimes happens that I accidentally add a huge file. And because I then put my semi-automatic git commits into crontab, the huge file gets buried deep in the git commit tree. If you happen to be lazy like me, it’s worth checking the sizes of your git repos from time to time (or just looking in /var/mail/username for git repo-size errors). This is the easiest way to scan & clean repos I have found so far.

Don’t waste your time with git filter-branch or git filter-index commands and just go straight for BFG Repo-Cleaner.

Install BFG Repo-Cleaner:

# Get Scala (for bfg)
$ cd ~/tools/
$ wget https://github.com/sbt/sbt/releases/download/v1.5.5/sbt-1.5.5.zip -O sbt-1.5.5.zip
$ unzip sbt-1.5.5.zip
$ mv sbt sbt-1.5.5
$ cd sbt-1.5.5/
$ ln -s $(pwd)/sbt ~/bin/sbt # Make link to your bin

# Install bfg; follow the instructions https://github.com/rtyley/bfg-repo-cleaner/blob/master/BUILD.md
$ cd ~/tools/
$ wget https://github.com/rtyley/bfg-repo-cleaner/archive/refs/tags/v1.14.0.tar.gz -O bfg-repo-cleaner-1.14.0.tar.gz
$ tar xvzf bfg-repo-cleaner-1.14.0.tar.gz
$ cd bfg-repo-cleaner-1.14.0/
$ sbt # <- start the sbt console
sbt: bfg/assembly # <- download dependencies, run the tests, build the jar

# Make a wrapper script in your bin directory:
# Paste this (without the ">" symbol) into ~/bin/bfg
$ nano ~/bin/bfg
#   > #!/bin/bash
#   > java -jar /home/yourusername/tools/bfg-repo-cleaner-1.14.0/bfg/target/bfg-1.14.0-unknown.jar "$@"
$ chmod +x ~/bin/bfg # Make it executable

# Test all is working as it should
$ bfg --help

You will also need a git object size scanner. The easiest is to either use the main idea from here or the method from the comment by kenorb (that’s what I am using; it is where the git big-files command used below comes from).
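As a hedged sketch of what such a scanner does (the exact one-liner from the linked comment may differ), this pipeline lists the 20 largest blobs ever committed to the current repo:

# List the largest blobs in the repository history (size in bytes, then path)
$ git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '$1 == "blob" {print $3, $4}' \
    | sort -rn \
    | head -n 20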

Once you have BFG and the size scanner and you know they work, you can start cleaning. Be extremely careful, because the following steps will remove the objects/directories from the git repo completely. Once they are gone, they cannot be brought back to life.

# Find big .gits
$ cd ~/
$ find . -type d -name ".git" -prune -exec du -sh {} \;

# Locate the big repo and check what are the big files
$ cd ~/projects/directory-with-huge-git-repo/
$ git big-files

# Check the big files (or directories), add them to .gitignore so they are not added again in the future, and commit the changes
$ nano .gitignore
$ git add .
$ git commit -m "Updated .gitignore before cleaning with BFG"

# Repack blobs for faster cleaning
$ git gc --prune=now --aggressive

And now you are ready for the BFG cleaning. Again, be extremely careful!

# Remove files bigger than x megabytes
$ bfg --strip-blobs-bigger-than 10M .git
# or remove files by name (quote the globs so the shell doesn't expand them)
$ bfg --delete-files 'test.*.tab' .git
# or folders
$ bfg --delete-folders '*bckp' .git

# Clean and finalize the deletion
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

# Add changes, commit and push to master
$ git add .
$ git commit -m "Commit after cleaning with BFG"
$ git push -f origin master

# Confirm the repo is nice and small
$ du -sh .git