Reference genome patches

I have always wondered whether the genome patches (NCBI, Ensembl) are actually useful for anything. To be honest, I have always used the primary assembly to avoid “problems” with haplotype information (also described by Mark Ziemann here, in this question at Biostars, and with additional information here by Heng Li), and I had never looked at the patches until now.

Recently a colleague asked me to locate part of an mRNA isoform that she could not find in the reference genome (orange part). She could, however, find the first part of the sequence (green part) in the reference genome (hg38) and the full-length sequence in the reference transcripts.

>mysterious_sequence
TATTTAGCCGCCAAGTTGGATAAAAATCCAAATCAGGTCTCAGAAAGATTCCAGCAGCTAATGAAGCTCTTTGAAAAGTCAAAATGCAGATAAGTT

So I BLASTed the sequence at NCBI, and indeed there is a full match to mRNA isoforms (Figure 1). But when I tried to locate the mysterious orange part in the reference genome, I got no hit.
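For a quick sanity check without BLAST, the "green part is found, orange part is missing" observation can be reproduced with a plain exact-match search. The sketch below is only a crude stand-in for an alignment: it reports how many leading bases of a query occur exactly in a reference sequence on either strand. The function names and the toy sequences are hypothetical and made up for illustration; they are not the real hg38 or patch sequences.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def longest_matching_prefix(query, reference):
    """Length of the longest prefix of `query` found exactly in `reference`.

    Both strands are checked. Exact matching only: no mismatches, no gaps,
    so this is far less sensitive than BLAST.
    """
    query = query.upper()
    forward = reference.upper()
    reverse = reverse_complement(forward)

    # If a prefix of length k occurs, every shorter prefix occurs too,
    # so the answer can be found by binary search on the prefix length.
    lo, hi = 0, len(query)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        prefix = query[:mid]
        if prefix in forward or prefix in reverse:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Toy data (hypothetical): the "primary" sequence carries only the start
# of the query, the "patch" sequence carries all of it.
toy_query = "TATTTAGCCGCCAAGTTGG"
toy_primary = "GGGG" + "TATTTAGCCG" + "TTTT"
toy_patch = "GGGG" + "TATTTAGCCGCCAAGTTGG" + "TTTT"

matched_in_primary = longest_matching_prefix(toy_query, toy_primary)
matched_in_patch = longest_matching_prefix(toy_query, toy_patch)
print(matched_in_primary, matched_in_patch, len(toy_query))
```

If the matched length is shorter than the query for the primary assembly but equals the full query length for another scaffold, that is the same pattern as in the search described here.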

Figure 1: BLAST to transcript variants.

Then I looked a bit further down at the other hits and voilà: the orange part matched the HG2121_PATCH patch (p12) of the reference genome (Figure 2). The same search also gave a hit to the primary assembly, where I could see the same “feature” she saw initially: the orange part of the sequence was missing.

Figure 2: BLAST to the genome patch (top) and primary assembly (bottom).

The mystery was solved, and I can start thinking about including the patches in my favourite reference genome. Of course, I assume there will be a lot of other issues, but the good thing is that the patches shouldn’t change the chromosome coordinates of the primary assembly.
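A first step towards working with patches could be simply keeping track of which records in a FASTA file belong to the primary assembly and which do not. The sketch below splits records by name. The naming rules are an assumption on my part: Ensembl-style names starting with `CHR_` (such as `CHR_HG2121_PATCH`) and UCSC-style names ending in `_fix` or `_alt` are treated as non-primary. Check them against the headers of the file you actually use. The helper names and the toy records are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def looks_non_primary(name):
    """Guess from the name whether a record is a patch or alternate scaffold.

    Assumed conventions, not guaranteed for every source:
    Ensembl-style 'CHR_...' names and UCSC-style '..._fix' / '..._alt' names.
    """
    lowered = name.lower()
    return lowered.startswith("chr_") or lowered.endswith(("_patch", "_fix", "_alt"))


def split_records(lines):
    """Return two dicts: primary records and patch/alternate records."""
    primary, extra = {}, {}
    for name, seq in read_fasta(lines):
        (extra if looks_non_primary(name) else primary)[name] = seq
    return primary, extra


# Toy FASTA (hypothetical sequences; chrX_TOY1v1_fix is a made-up name).
toy_fasta = [
    ">1 toy primary chromosome",
    "ACGTAC",
    "GTAC",
    ">CHR_HG2121_PATCH toy patch",
    "GGCCGG",
    ">chrX_TOY1v1_fix",
    "TTAA",
]
primary, extra = split_records(toy_fasta)
print(sorted(primary), sorted(extra))
```

Keeping the two groups separate makes it easy to align against the primary assembly first and consult the patches only for sequences that remain unexplained, as happened with the sequence above.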
