Is the stop codon part of the CDS? Yes, but the GTF format is “special”

Recently, I was going through a mix of gene annotation GTFs from different sources and had to fix them. One of the differences was the annotation of stop codons: more specifically, whether the stop codon was part of the annotated CDS (coding sequence).

I had to refresh my biology knowledge, and biologically speaking, the stop codon is part of the CDS. Although it doesn’t code for any amino acid, it definitely codes for something: the end of translation. Simple, isn’t it? It would have been if everybody followed the rules, right, GTF format?

In the GTF format, the stop codon is not part of the CDS by definition. In GFF, on the other hand, it is, again by definition. This can be especially confusing if you convert GFF to GTF (I strongly recommend using AGAT for any GTF to GFF and GFF to GTF conversions even though it’s written in Perl) and the source GFF doesn’t explicitly list stop codon lines. In that case, the stop codon will still be included in the CDS after the GFF to GTF conversion. You’ll have to clip the last three nucleotides from every CDS, the last coding exon, and sometimes the gene and transcript. “Last” here means in transcript orientation, so on the minus strand the three nucleotides come off the lowest-coordinate end.
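To make the clipping step concrete, here is a minimal Python sketch for the CDS part only. It is not what AGAT does internally; the function name `trim_stop_codon` and the input shape (a list of 1-based, inclusive `(start, end)` CDS segments for a single transcript) are my own assumptions for illustration. It also assumes you already know that the CDS really ends with a complete stop codon; it does not look at the sequence, so incomplete or partial CDSs need to be filtered out beforehand.

```python
def trim_stop_codon(cds_segments, strand, stop_len=3):
    """Remove the last `stop_len` coding nucleotides from one transcript's CDS.

    cds_segments: list of (start, end) tuples, 1-based and inclusive,
                  as in GTF/GFF (hypothetical input shape for this sketch).
    strand:       "+" or "-".
    Returns the trimmed segments sorted by genomic coordinate.
    """
    # Order segments 5'->3' in transcript orientation.
    ordered = sorted(cds_segments, reverse=(strand == "-"))

    remaining = stop_len
    trimmed = []
    # Walk from the 3' end, because that is where the stop codon sits.
    for start, end in reversed(ordered):
        length = end - start + 1
        if remaining >= length:
            # The whole segment belongs to the stop codon
            # (stop codon split across segments), so drop it.
            remaining -= length
            continue
        if remaining > 0:
            if strand == "+":
                end -= remaining    # 3' end is the highest coordinate
            else:
                start += remaining  # 3' end is the lowest coordinate
            remaining = 0
        trimmed.append((start, end))

    return sorted(trimmed)


if __name__ == "__main__":
    print(trim_stop_codon([(100, 200), (300, 400)], "+"))
    print(trim_stop_codon([(100, 200), (300, 400)], "-"))
```

The only subtle cases are the minus strand, where the clipped end is the segment start rather than the segment end, and a stop codon that is split across two CDS segments, where the short final segment disappears entirely.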

If you want to learn more about all the GTFs, GFFs, and other Fs, check out this excellent knowledge hub on the NBIS GitHub page. Another nice discussion is this one on Biostars.
