Two of the projects which I am working on contain UMIs in order to remove the PCR duplicates (one is CLIP-Seq, the other one is targeted RNA-Seq, both single-end sequencing).
The way I usually do this is that I first extract UMIs from the reads to a separate file (seqtk), then, using the reads with UMIs removed, I align them to a genome (STAR) and get both genome and transcriptome bams. Then, I add the UMIs as flag (fgbio AnnotateBamWithUmis) and remove or mark duplicates (picard UmiAwareMarkDuplicatesWithMateCigar).
I blindly trusted picard it will remove all the duplicates but then we (my colleague Kamila and I) observed something strange – the duplicates were successfully removed from the genome bams but not from the transcriptome bam. We clearly saw reads with identical mapping positions and identical UMIs (and identical sequences as well) left in the “filtered” bam. We were wondering why this happened.
I tried to extract only a smaller part of the bam and run picard again. Surprisingly, it removed a few additional duplicates but definitely not all of them. I thought it might be because of some strange bam sorting because of the number of transcript references in the bam header. I tried to re-sort the bam a million times but nothing worked. We tried to use a “simple” picard MarkDuplicateswithout UMIs (it should be even more aggressive), but again nothing. I tried to use fgbio GroupReadsByUmibut that one only supports only paired-end sequencing.
Update:fgbio GroupReadsByUminow (from 2020, ehm) supports single-end read as well.
In the end, I ended up with UMI-tools (umi_tools dedup) which works well for both genome and transcriptome bams. Now, I would like to replace the UMI extraction and fgbio bam annotation with umi_tools extract, put the UMI to the reads name and modify the duplicate removal. Unfortunately, umi_tools is extremely slow.
Still, I am not sure why picard didn’t remove the duplicates. My guess would be it has something to do with a lot of references in the bam header (one for each isoform ~200k), which is, for some reason, problematic for picard. But that’s just my guess. I didn’t manage to confirm or disprove this. If someone has some ideas please drop me a message.
P.S. A nice summary of UMIs and how to remove them is here at the UMI-tools.

