bedtools merge book-end merge

I never paid much attention to this (I should have), but TIL that by default bedtools merge merges not only overlapping regions but also book-ended ones.

From the bedtools merge manual: "-d: Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged."

What does this mean? With the default (-d 0) settings, bedtools merge will also merge the following regions:

# Two regions with a book-end relationship - they sit next to each other like books on a shelf:
12345678
AAAA
    BBBB
# Will get merged with the default bedtools merge -d 0 settings into:
12345678
AAAABBBB

To avoid merging book-ended regions, you have to override the default with -d -1. The example above will then no longer be merged, and bedtools merge will merge only truly overlapping regions (those sharing at least 1 bp).
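To make the difference between -d 0 and -d -1 concrete, here is a minimal Python sketch of the merging rule. It is a simplified model and not the actual bedtools implementation: the function name merge_intervals is made up for illustration, and it assumes 0-based, half-open BED coordinates, where two features are merged when the gap between them (next start minus current end) is at most d.

```python
def merge_intervals(intervals, d=0):
    """Merge (chrom, start, end) intervals given in 0-based, half-open BED coordinates.

    Simplified model of the -d rule (an assumption, not the bedtools source):
    a feature joins the current merged block when
    feature_start - block_end <= d.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        same_chrom = bool(merged) and merged[-1][0] == chrom
        if same_chrom and start - merged[-1][2] <= d:
            # Extend the current block; keep the furthest end seen so far.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(block) for block in merged]


# The AAAA/BBBB example: A covers bases 1-4, B covers bases 5-8.
regions = [("chr1", 0, 4), ("chr1", 4, 8)]

print(merge_intervals(regions, d=0))   # book-ended regions are merged
print(merge_intervals(regions, d=-1))  # book-ended regions stay separate
```

In this model the gap between book-ended features is exactly 0, so they pass the d=0 test but fail the d=-1 test, which only a real overlap of at least 1 bp can satisfy.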

Note: bedtools merge is smart enough to correctly merge VCF files that carry an END INFO field. This is useful, for example, when working with VCFs of structural variants, such as CNVs. You don’t have to convert the VCF into BED, merge, and convert back to VCF.

Is the stop codon part of the CDS? Yes, but the GTF format is “special”

Recently, I was going through a mix of gene annotation GTFs from different sources and had to fix them. One of the differences was the annotation of stop codons. More specifically, whether the stop codon was part of the annotated CDS (coding sequence).

I had to refresh my biology knowledge and, biologically speaking, the stop codon is part of the CDS. Although it doesn’t code for any amino acid, it is definitely coding for something rather than nothing. Simple, isn’t it? It would have been if everybody followed the rules, right, GTF format?

In the GTF format, the stop codon is by definition not part of the CDS. In GFF, on the other hand, it is (definition). This can be especially confusing if you convert GFF to GTF and the source GFF doesn’t explicitly list stop codon lines. In that case, the stop codon will still be included in the CDS after the GFF to GTF conversion. (I strongly recommend using AGAT for any GTF to GFF and GFF to GTF conversions, even though it’s written in Perl.) You’ll have to clip the last three nucleotides from every CDS, the last coding exon, and sometimes the gene and transcript.
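The CDS clipping step can be sketched in Python as below. This is a rough illustration under stated assumptions, not a complete fixer: clip_stop_codon is a hypothetical helper, it expects the CDS segments of a single transcript as 1-based, inclusive (start, end) pairs as in GTF, and it handles only the CDS, not the exon, transcript, or gene lines. It also assumes you have already checked that the last three coding bases really are a stop codon.

```python
def clip_stop_codon(cds_segments, strand, n=3):
    """Remove the last n coding bases from one transcript's CDS segments.

    cds_segments: list of (start, end) pairs, 1-based inclusive (GTF style).
    strand: "+" or "-". On the minus strand the 3' end of the CDS is the
    segment with the lowest coordinates.
    Returns the clipped segments sorted by start coordinate.
    """
    segs = sorted(cds_segments)
    # Index of the segment holding the 3' end of the coding sequence.
    idx = 0 if strand == "-" else -1
    while n > 0 and segs:
        start, end = segs[idx]
        length = end - start + 1
        if length <= n:
            # The stop codon spans a splice junction: drop this segment
            # entirely and keep clipping the neighbouring one.
            segs.pop(idx)
            n -= length
        elif strand == "-":
            segs[idx] = (start + n, end)
            n = 0
        else:
            segs[idx] = (start, end - n)
            n = 0
    return segs


print(clip_stop_codon([(101, 200), (301, 400)], "+"))
print(clip_stop_codon([(101, 200), (301, 400)], "-"))
```

The loop is there because a stop codon can be split across two exons, in which case a final CDS segment of one or two bases disappears entirely and the remainder is clipped from the previous segment.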

If you want to learn more about all the GTFs, GFFs, and other Fs, check out this excellent knowledge hub on the NBIS GitHub page. Another nice discussion is this one on Biostars.