Removing UMI duplicates from transcriptome alignment

Two of the projects which I am working on contain UMIs in order to remove the PCR duplicates (one is CLIP-Seq, the other one is targeted RNA-Seq, both single-end sequencing).

The way I usually do this is that I first extract UMIs from the reads to a separate file (seqtk), then, using the reads with UMIs removed, I align them to a genome (STAR) and get both genome and transcriptome bams. Then, I add the UMIs as flag (fgbio AnnotateBamWithUmis) and remove or mark duplicates (picard UmiAwareMarkDuplicatesWithMateCigar).

I blindly trusted picard it will remove all the duplicates but then we (my colleague Kamila and I) observed something strange – the duplicates were successfully removed from the genome bams but not from the transcriptome bam. We clearly saw reads with identical mapping positions and identical UMIs (and identical sequences as well) left in the “filtered” bam. We were wondering why this happened.

I tried to extract only a smaller part of the bam and run picard again. Surprisingly, it removed a few additional duplicates but definitely not all of them. I thought it might be because of some strange bam sorting because of the number of transcript references in the bam header. I tried to re-sort the bam a million times but nothing worked. We tried to use a “simple” picard MarkDuplicateswithout UMIs (it should be even more aggressive), but again nothing. I tried to use fgbio GroupReadsByUmibut that one only supports only paired-end sequencing.

Update:fgbio GroupReadsByUminow (from 2020, ehm) supports single-end read as well.

In the end, I ended up with UMI-tools (umi_tools dedup) which works well for both genome and transcriptome bams. Now, I would like to replace the UMI extraction and fgbio bam annotation with umi_tools extract, put the UMI to the reads name and modify the duplicate removal. Unfortunately, umi_tools is extremely slow.

Still, I am not sure why picard didn’t remove the duplicates. My guess would be it has something to do with a lot of references in the bam header (one for each isoform ~200k), which is, for some reason, problematic for picard. But that’s just my guess. I didn’t manage to confirm or disprove this. If someone has some ideas please drop me a message.

P.S. A nice summary of UMIs and how to remove them is here at the UMI-tools.

Reference genome patches

I always wondered whether the genome patches (NCBI, Ensembl) are actually useful for something. To be honest, I always used the primary assembly to avoid “problems” with haplotype information (also described by Mark Ziemann here or in this question at Biostars, additional information here by Heng Li) but I have never looked at the patches up until now.

Recently my colleague asked me to locate part of a mRNA isoform which she cannot find in the reference genome (orange part). However, she can find the first part of the sequence (green part) in the reference genome (hg38) and the full length in the reference transcripts.

>mysterious_sequence
TATTTAGCCGCCAAGTTGGATAAAAATCCAAATCAGGTCTCAGAAAGATTCCAGCAGCTAATGAAGCTCTTTGAAAAGTCAAAATGCAGATAAGTT

So I BLASTed the sequence at the NCBI and indeed there is a full match with mRNA isoforms (Figure 1). But if I tried to locate the mysterious orange part in the reference genome I didn’t get anything.

transcripts
Figure 1: BLAST to transcript variants.

Then, I looked a bit lower to other hits and viola – the orange part was successfully BLASTed to p12 HG2121_PATCH patch of the reference genome (Figure 2). At the same time, it gave me a hit to the primary assembly where I can see the same “feature” as she saw initially – orange part of the sequence was missing.

genome
Figure 2: BLAST to the genome patch (top) and primary assembly (bottom).

The mystery was solved and I can start to think about including the patches to my favourite reference genome. Of course, I assume there will be a lot of other issues but the good thing is the patches shouldn’t change the chromosome coordinates.

Disclaimer

During the time as I have collected my notes and hints, I always tried to save as much used resource as possible. I am sure I have missed some. If you spot any of your figures, tables or text just let me know and I will add you to the used resources. I am not trying to steal anything from anyone but rather collect some useful information.

Also, some of the notes and hints I will transfer to this blog might be already outdated. I will try to update them as much as possible but sometimes (but really only sometimes) I am lazy and I will just put the things here as I have them in my beautiful .odt notebook.

The Journey Begins

After a long time storing all my notes and hints in an ugly text file I have decided to give them a nice look and open them for everyone. The blog will contain my notes, ideas, and summaries mainly from bioinformatics which are highly subjective and might not, and probably are not, always correct.

A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it. — Albert Einstein.