bedtools bamtobed doesn’t handle mismatch at read end well

I have PARE-Seq and RIBO-Seq experiments where I really care about the exact position of the first mapped nucleotide. The analysis relies on the great bedtools bamtobed. I was just wondering what happens if the aligner, by any chance, reports a mismatch (1X), an insertion (1I), or a softclip (1S) at the mapped 5′ end.

Example SAM:

$ samtools view in.bam
match	256	NC_001133.9	549	0	26M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI
oneInser	256	NC_001133.9	549	0	1I25M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI
oneMism	256	NC_001133.9	549	0	1X25M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI
oneSoft	256	NC_001133.9	549	0	1S25M	*	0	0	TTGCATCAGCGGTCTATACCCTGTGC	:?@D+4=2=20@<EHIIIIIIIGEHI

will result in:

$ bedtools bamtobed -i in.bam
NC_001133.9     548     574     match   0       +
NC_001133.9     548     573     oneInser        0       +
NC_001133.9     548     574     oneMism        0       +
NC_001133.9     548     573     oneSoft 0       +

Insertion and softclipping are handled properly – they are not reported as part of the match. However, a mismatch (1X) at the 5′ end is reported as a match. That makes sense inside the read, but it doesn’t make sense at the ends (either the 5′ or 3′ mapped end). So be careful and check your mappings.
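If you want a quick way to see how often this happens in your own data, a rough check (just samtools + awk, nothing bedtools-specific) is to count alignments whose CIGAR starts or ends with a softclip (S), insertion (I), or mismatch (X); note that for minus-strand alignments the read’s 5′ end is the right end of the CIGAR:

samtools view in.bam \
    | awk '$6 ~ /^[0-9]+[SIX]/ {s++} $6 ~ /[0-9]+[SIX]$/ {e++} END {print "CIGAR starts with S/I/X:", s+0, "  ends with S/I/X:", e+0}'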

Alternative to GNU parallel

From time to time I find myself in a situation where I feel a particular step in my analysis could run faster. If the software/command doesn’t support multithreading, you have several options for speeding it up. The easiest is to use GNU parallel or xargs. However, GNU parallel is Perl-based, which doesn’t always work well in virtual environments, and xargs is not very flexible.

After I lost my patience with making GNU parallel work in all my environments, I searched for an alternative. The best solution ended up being rush (Wei Shen; GitHub). It’s inspired by gargs (Brent Pedersen; GitHub) and written in Go, so you won’t have the compatibility problems you can hit with GNU parallel.

The user interface is very similar to GNU parallel and so far it has been working very well. It also has some nice features such as resuming failed jobs, setting variables (similar to awk), removing suffixes/replacing strings, etc.
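For example, a tiny sketch of the replacement strings as I read them from the rush help ({} is the input line, {.} should be the input with its last extension removed – double-check against your rush version; the *.sam files are hypothetical):

# Convert a set of SAM files to BAM, 4 jobs at a time.
ls *.sam | rush -j 4 'samtools view -b {} > {.}.bam' --verbose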

For me personally, the most convenient approach is still to echo the commands I want to run in parallel into a file and then run rush on that file (one command per line):

threads=5

> cmds.txt # Empty previous cmds file

# Prepare multiplications of numbers 1-10 by a thousand
for i in {1..10}; do
    echo "echo \"$i*1000\" | bc"
done > cmds.txt

# Run commands in parallel with rush and save completed commands
cat cmds.txt | rush '{}' -j $threads --verbose -c -C cmds-done.txt

P.S. I just found out the author of GNU parallel lists a lot of alternatives with a detailed description of the differences here.

Trim UTRs from exons in GTF

I needed to get ‘clean’ coding exons from a GTF file. In Ensembl, annotated exons include both the coding sequence and the UTRs. Sometimes they overlap, sometimes they are identical, and sometimes they are completely independent. An example of an overlapping exon and UTR is shown below.

seqnames  start   end     width  strand  source          type             score  phase  gene_id
1         966502  966614  113    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         966502  966531  30     +       ensembl_havana  five_prime_utr   NA     NA     ENSG00000187583
1         974442  975008  567    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         974576  975008  433    +       ensembl_havana  three_prime_utr  NA     NA     ENSG00000187583

You can see that both the 5′ and 3′ UTRs overlap an exon. Now, in my case I would like a heatmap of read coverage over the coding part of the gene (a meta-gene, to be more precise). If I use the ‘raw’ annotation and take only rows annotated as exons, I will get a mixture of the coding parts of the exons and the UTRs, and that’s not what I want.

For me, the easiest way to trim the UTRs is to use R with the great GenomicFeatures, rtracklayer and data.table packages. Assuming you have an Ensembl (or GENCODE) GTF (it might work for others as well), you can simply load your GTF into R, get the coordinates of the exons with the UTRs trimmed off, and move on with your work.

library("GenomicFeatures")

txdb <- makeTxDbFromGFF("in.gtf", format = "gtf") # make the annotation db
cds <- cdsBy(txdb, by="tx", use.names=TRUE) # get only the coding parts of the exons = remove UTRs

GenomicFeatures, however, doesn’t preserve the other GTF fields, which I personally like to keep. I will therefore load the GTF once more and merge it with the coding exons to keep all the extra information.

library("data.table")
library("rtracklayer")

gtf <- rtracklayer::import("in.gtf", format="gtf")
gtf.dt <- as.data.table(gtf)
gtf.dt.exon <- gtf.dt[type=="exon"] # get only exon

cds.dt <- as.data.table(cds)
cds.dt <- cds.dt[, c("group_name", "start", "end", "exon_rank")] # keep only columns we need

gtf.dt.exon[, new:=paste(transcript_id, exon_number, sep=".")] # make replacement id
cds.dt[, new:=paste(group_name, exon_rank, sep=".")] # make replacement id

gtf.dt.exon <- merge(gtf.dt.exon, cds.dt, by = "new", all.x = T) # merge the tables keeping all.x to be able to debug and check
      
gtf.dt.exon <- na.omit(gtf.dt.exon, cols="group_name") # remove exons which do not have an alternative in coding exons
gtf.dt.exon[, start.x := start.y] # update start/end with the cds coordinates
gtf.dt.exon[, end.x := end.y] # note: the width column is not recomputed and still holds the original exon width
setnames(gtf.dt.exon, c("start.x", "end.x"), c("start", "end")) # rename
gtf.dt.exon[, c("group_name", "start.y", "end.y", "exon_rank", "new") := NULL] # clean

gtf.dt.exon[, type:="coding_exon"] # change exon column to coding_exon

gtf.out <- rbindlist(list(gtf.dt.exon, gtf.dt)) # add new lines to the original gtf
gtf.out <- gtf.out[order(seqnames, start)] # sort

gtf.out.granges <- makeGRangesFromDataFrame(gtf.out, keep.extra.columns=TRUE, ignore.strand=FALSE) # make the granges for easier export
rtracklayer::export(gtf.out.granges, "out.gtf") # export final gtf
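A quick sanity check on the exported file is to count the feature types (plain shell; out.gtf is the file written above):

grep -v "^#" out.gtf | cut -f3 | sort | uniq -c   # coding_exon should now appear next to exon, the UTRs, etc.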

And here you go – the nice and clean coding parts of the exons (type = coding_exon, i.e. column 3 of the GTF) exported together with the original GTF:

seqnames  start   end     width  strand  source          type             score  phase  gene_id
1         966502  966614  113    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         966502  966531  30     +       ensembl_havana  five_prime_utr   NA     NA     ENSG00000187583
1         966532  966614  113    +       ensembl_havana  coding_exon      NA     NA     ENSG00000187583
1         974442  974575  567    +       ensembl_havana  coding_exon      NA     NA     ENSG00000187583
1         974442  975008  567    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         974576  975008  433    +       ensembl_havana  three_prime_utr  NA     NA     ENSG00000187583

I have then tested this modified GTF in DeepTools (3.5.0) and it works just fine. DeepTools loves to complain about GTFs if it sees something fishy (…a nice joke for those who know the DeepTools logo).

IGV batch snapshot from command line

Right now I am processing a lot of public sequencing datasets from SRA. As always in this situation, proper QC is necessary to unravel all the adaptor sequences, UMIs, … especially for datasets where we care about precise 5′/3′ ends. In my case, it’s Ribo-Seq, and we want to look at the 5′ ends in great detail.

For a long time, my QC pipeline was missing a detailed alignment check. I mean really seeing the alignment itself in bases, not just summary statistics. This would help me spot extensive and repetitive softclipping, mismatches, etc., which can point to a preprocessing error. Unless I wanted to open each BAM in IGV manually, I needed something more automated. I knew IGV has a batch mode, but I had never looked into it in detail.

To make IGV snapshots automatically, without opening the IGV GUI for every BAM, you need to get xvfb-run first. xvfb-run creates a fake X11 display, which allows you to launch IGV from the command line without opening its GUI (inspiration from here, which links to the IGV Snapshot Automator tool). Once you have this, you can start the fun. It’s possible to do it without xvfb-run, but then the script opens the IGV GUI every time you make a snapshot, which is not always possible (for example, some SSH connections don’t allow X11 forwarding).

Then, you have to make a batch (~config) file which instructs IGV what to do. A simple example is available at IGV. I would recommend two snapshots targeted at a region of a housekeeping gene (or genes) – one which captures an exon–exon boundary, to see a splicing event, and one with a more zoomed-in look, to see the alignment at single-base resolution. For human with the Ensembl genome I am using this one:

genome /home/jan/data/hg38/genome/genome.fa
load /home/jan/projects/test/samples/test/test.bam
snapshotDirectory /home/jan/projects/test/samples/test/Screenshots/
preference SAM.SHOW_SOFT_CLIPPED true
goto 12:112022058-112022719
sort position
snapshot splice.png
goto 12:112022150-112022250
sort position
snapshot zoom.png
exit

It instructs IGV to load the reference genome (can be a fasta), load my BAM, set the snapshot directory and load a temporary preference; then it goes to a specific position, takes a snapshot and saves it under a specific name, goes to another location and does the same, and exits IGV.

The IGV manual doesn’t say much about the preference option, nor does it list the available options (or I haven’t found them). The preferences let you adjust the behaviour of IGV the same way you would in the Preferences tab of the IGV GUI. The easiest way to see the possible settings is to read $HOME/igv/prefs.properties, which lists all the preferences you have changed manually in the IGV GUI. Then you just have to guess what is what. Some options are listed at the IGV GitHub, but the easiest is to copy your favourite preferences and just use those.

The next trick is to replace the regular java execution with xvfb-run. For example, you can just copy the regular IGV launch script at IGV/igv.sh and replace the line

exec java -showversion --module-path="${prefix}/lib" -Xmx16g \

with the line

xvfb-run --auto-servernum --server-num=1 java -showversion --module-path="${prefix}/lib" -Xmx16g \

And the last step is to launch IGV itself. Let’s say you have saved the batch/config file as igv-batch.config and the modified launch script as igv-batch.sh; then you can just execute:

sh igv-batch.sh -b igv-batch.config
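Since I need this for many BAMs, I generate one batch file per sample from a template. A rough sketch (the paths and the igv-batch.template file are hypothetical – it is the batch file above with BAMFILE and SNAPDIR placeholders instead of the hard-coded paths):

for bam in /home/jan/projects/test/samples/*/*.bam; do   # hypothetical BAM layout
    sample=$(basename "$bam" .bam)
    snapdir=$(dirname "$bam")/Screenshots
    mkdir -p "$snapdir"
    sed -e "s|BAMFILE|$bam|" -e "s|SNAPDIR|$snapdir|" igv-batch.template > "igv-batch.$sample.config"
    sh igv-batch.sh -b "igv-batch.$sample.config"
done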

Now, you can enjoy your beautiful IGV snapshots.

Part of a splice junction with a general overview. You can see there is some softclipping at the beginning and the end of the reads.
After zooming in to single-base resolution, we can see there are 4 additional nucleotides at both ends which need to be trimmed (or clipped) away.

Minimap2 transcriptome mapping

I like using Minimap2 for our Nanopore RNA sequencing a lot. I tried deSALT and GraphMap, but Minimap2 seems to work better for noisy direct RNA-Seq ONT data.

I usually align the reads to both genome and transcriptome reference. However, the transcriptome mapping didn’t behave as I expected. For the ONT RNA-Seq data the Minimap2 manual led me to use:

minimap2 -a -x map-ont -k 12 -p 1 -u f --secondary=yes ref.fa.mmi reads.fastq.gz

The map-ont preset doesn’t allow spliced alignment of the reads, -k 12 helps with the read noise, -p 1 outputs only equally scoring alignments (one as primary, the rest as secondary mappings), and --secondary=yes allows secondary alignments to be output. The -u f parameter should force forward-strand-only mapping. Again, from the manual: For Iso-seq, Direct RNA-seq and traditional full-length cDNAs, it would be desired to apply -u f to force minimap2 to consider the forward transcript strand only. I was therefore expecting to see forward-strand mappings only, since I used a transcriptome fasta as the reference. However, this was not the case. A simple scan with samtools view -f 16 showed a reasonable number of reverse-strand mappings. What?! I thought it was some kind of feature of the data and that Minimap2 was doing everything correctly – that the reads mapped to the reverse strand simply didn’t map anywhere else, so it put them on the reverse strand as a last resort. This all changed with our latest spike-in only (SIRV) experiment.

In this case I knew there should not be any reverse-strand mapping whatsoever. But to my surprise, the command above gave me quite a lot of reverse-strand mappings. So I dug a bit deeper into where the problem was, and I found it. By accident I found a Ubuntu Minimap2 manual (later on I also found the official one here) where it says:

-u CHAR How to find canonical splicing sites GT-AG - f: transcript strand; b: both strands; n: no attempt to match GT-AG [n]

So in reality, -u f has nothing to do with forward-strand-only mapping. Or at least not in the way I would expect. It only controls how minimap2 looks for the predefined canonical GT-AG splice sites relative to the read/transcript strand, so it matters only for spliced genomic alignments, nowhere else. Whoops. The parameter I was looking for turned out to be --for-only, which, again per the Ubuntu manual:

--for-only Only map to the forward strand of the reference sequences. For paired-end reads in the forward-reverse orientation, the first read is mapped to forward strand of the reference and the second read to the reverse strand.

After removing the -u f and replacing it with --for-only I finally got the alignments I was hoping for, and absolutely no alignments on the reverse strand (the full corrected command is shown below). Note: there is also a --rev-only option. It should be possible to find these options in the command-line Minimap2 manual if you use:

man ./minimap2.1   # "for detailed description of these and other advanced command-line options"

but this never worked for me, since I install Minimap2 using Conda. On the other hand, I didn’t expect the brief built-in help not to mention it. You will also find hardly anything about it at the Minimap2 GitHub.
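For reference, the full command I ended up with is simply the one above with -u f swapped for --for-only:

minimap2 -a -x map-ont -k 12 -p 1 --for-only --secondary=yes ref.fa.mmi reads.fastq.gz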

Obviously, this is not an issue of Minimap2, because it is in the manual after all. But you have to really go looking for the manual in order to find it, which I don’t think is ideal. Hopefully this will help you figure out why you are not getting the alignments you expect (if you have been as ignorant about this as I was).

samtools depth doesn’t count secondary mappings (samtools<1.10)

Recently, I have fallen in love with samtools depth. It’s fast, easy to use and does exactly what I want it to do – get coverage of individual bases in the BAM file.

After some experimenting and figuring out why the heck I was missing some covered regions, I found out that the default behavior of samtools depth is NOT to count any of the following: secondary mappings, QC-fail reads, and read duplicates.

Luckily, in the new version of samtools (>=1.10) you can enable counting of these mappings/reads by setting -g and then either listing the flag names separated by commas or giving the flag value as a hex or decimal number.

For example, to report all positions and also count secondary, QC-fail, and duplicate mappings/reads:

samtools-1.10 depth -a -d 0 -g SECONDARY,QCFAIL,DUP in.bam

Note: Brent Pedersen wrote mosdepth (publication here) which should be faster than samtools depth especially for large mappings (such as genome). From the GitHub: “per-base depth about 2x as fast samtools depth–about 25 minutes of CPU time for a 30X genome.“.

Edit: I have very briefly tested mosdepth and it doesn’t seem to be that much faster. I would even say samtools depth was faster in my hands. The test set was a transcriptome BAM file.
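If you want to try mosdepth yourself, a minimal invocation looks roughly like this (per-base output should be on by default; the prefix and thread count are placeholders – check mosdepth --help for the exact flags):

mosdepth -t 4 sample-prefix in.bam   # should write sample-prefix.per-base.bed.gz among other files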

Convert gene symbols to UCSC id

This is a never-ending story – converting one type of gene ID to another. We have UCSC, RefSeq, Ensembl, and who knows what other types of IDs.

To convert between gene symbols (~gene names) and UCSC IDs (knownGene), we can use one of the many R tools or just use UCSC’s public MySQL server.

For example, to get the gene symbols for a set of UCSC IDs with a single MySQL command, do:

assembly='mm9' ## Species and assembly you want to query
ucsc_ids="'uc007aet.1', 'uc007aeu.1', 'uc007aev.1'" ## IDs to convert
mysql --user=genome --host=genome-mysql.cse.ucsc.edu --disable-auto-rehash -e "SELECT DISTINCT knownGene.name, kgXref.geneSymbol FROM $assembly.knownGene INNER JOIN $assembly.kgXref ON knownGene.name = kgXref.kgID WHERE knownGene.name IN ($ucsc_ids);" > ucsc2symbols.txt
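And for the direction the heading promises – gene symbols to UCSC IDs – the same join works with the filter flipped to kgXref.geneSymbol (the symbols below are just example mouse genes):

assembly='mm9' ## Species and assembly you want to query
symbols="'Sox17', 'Lypla1'" ## Example gene symbols to convert
mysql --user=genome --host=genome-mysql.cse.ucsc.edu --disable-auto-rehash -e "SELECT DISTINCT kgXref.geneSymbol, knownGene.name FROM $assembly.knownGene INNER JOIN $assembly.kgXref ON knownGene.name = kgXref.kgID WHERE kgXref.geneSymbol IN ($symbols);" > symbols2ucsc.txt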

The answer was originally posted by dariober at seqanswers.

Useful one-liners collection

Update: To see other one-liners, please check out my gist collection. It’s easier to make a small gist rather than write a “long” post here.

Finally, I have decided to put together a collection of useful one-liners before I forget them all. I will start slowly and add more on-the-fly. It will be a mixture of Bash, R, Perl, Python, and who knows what else.

Command line to remap a position-sorted bam (from a Heng Li tweet)

samtools collate -uOn128 old-pos-srt.bam tmpxyz | samtools fastq - | bwa mem -pt16 ref.fa - | samtools sort --threads=4 -m4G -o new-pos-srt.bam

The -uOn128 stands for -u (uncompressed BAM output), -O (output to stdout) and -n INT (number of temporary files [128]). For the bwa mem command, -pt16 stands for -p (first query file consists of interleaved paired-end sequences) and -t INT (number of threads [16]). So be careful, as this assumes you have paired-end reads.

You might have to specify TMPDIR if your BAM is big, or if you have access to a quick SSD disk. samtools fastq can also copy read tags if you add -t (copy RG, BC and QT tags to the FASTQ header line) and/or -T TAGLIST (copy arbitrary tags to the FASTQ header line).
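A rough sketch of carrying the tags through the remap (not tested here; samtools fastq -t and bwa mem -C are the documented flags, the scratch path is hypothetical):

tmp=/path/to/fast/scratch/tmpxyz   # hypothetical scratch prefix for samtools collate temporary files
samtools collate -uOn128 old-pos-srt.bam $tmp \
    | samtools fastq -t - \
    | bwa mem -C -pt16 ref.fa - \
    | samtools sort --threads=4 -m4G -o new-pos-srt.bam
# samtools fastq -t copies the RG/BC/QT tags into the FASTQ header line and bwa mem -C
# appends that comment back to the SAM records; you may still need a matching @RG header
# line (e.g. via bwa mem -R) for the RG tags to be fully usable downstream.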

Note for bwa mem – there is a new version coming up called bwa-mem2, which should produce the same alignments as bwa mem but be ~80% faster. Right now (May 30, 2019), it doesn’t have a stable release yet, but if you like bwa mem it should be something to keep your eye on.

Get the number of reads (mapped) from BAM file

I always forget this and always have to remind myself how to do this simple thing, so I’d rather write it down. If you want to get the number of mapped reads from a BAM file, you cannot simply use samtools view -F 4 in.bam | wc -l or samtools flagstat in.bam, because that would give you the number of mappings, not the number of mapped reads. To get the real number of mapped reads, you have to do (taken from here):

samtools view -F 4 in.bam | cut -f 1 | sort | uniq | wc -l

This will work for both paired-end and single-end reads. If you have paired-end data and would like to count only reads where both ends are mapped, replace -F 4 with -F 12, as shown below.
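For a paired-end BAM where you want to count only reads with both mates mapped, that becomes:

samtools view -F 12 in.bam | cut -f 1 | sort | uniq | wc -l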

Update: I have recently found that there is an even easier and, most importantly, faster solution (source here). You can use:

samtools view -F 4 in.bam | cut -f 1 | awk '!x[$0]++' | wc -l

Compared to the sort | uniq version this is almost 50% faster.

On a BAM file with 149,909,118 input reads, sort | uniq takes:
real 1m8.240s
user 2m57.097s
sys 0m18.001s


while awk '!x[$0]++' takes:
real 0m42.857s
user 2m30.028s
sys 0m18.179s

Replace multiple spaces (>=1) using sed

Countless times I needed to use my favorite sed to replace multiple spaces with tab. And countless times, I forgot how I did it the last time.

To keep it in my external memory, here we go:

text="hello     world"
echo "$text" | sed 's/  */X/g'

results in:

helloXworld

Please note there are two spaces before the * in the pattern. With only one space, the pattern ` *` matches zero or more spaces, so it matches (an empty string) at every position and the replacement ends up between every character.

I often use it in conjunction with uniq -c | sed 's/  */\t/g' | sed 's/^\t//g' to get a nicely formatted (= tab-separated, in my case) list of counts:

text=" 2   a\n1  c\n3 b"
echo -e "$text" | sed 's/  */\t/g' | sed 's/^\t//g' | sort -k1n
#1       c
#2       a
#3       b

The original post is here.

Note: If you don’t quote $text, echo will strip multiple spaces to just one.

text="hello     world"
echo $text
#hello world
echo "$text"
#hello     world

Insert line to the beginning of the file (first line)

Another handy sed command is inserting a line at the beginning of a file. I use it quite often to add a header to a table.

To add the line, just do:

echo -e "a\nb" | sed '1i transcript_id\tlength' 

You can also combine it with replacing the first line:

echo -e "a\nb" | tail -n+2 | sed '1i transcript_id\tlength'

Inspiration was found here.

Making correct RNA references

Often, I need a nice RNA reference for alignments or similar tasks. What sometimes happens is that the reference sequence you download from a database (for example hairpin.fa from miRBase, or tRNAs from GtRNAdb) contains the literal RNA sequence – uracils instead of thymines. By definition, this is a correct RNA sequence. But if you then do an alignment the same way you are used to doing alignments to a genome reference, you might get some unexpected results, often caused by the Us in the sequence. You will still get some alignments from sequences that happen to have fewer Us than the allowed number of mismatches, but the alignment results will definitely be biased. My rule no. 1 when I start to work with any RNA-based reference is to make sure it has Ts and not Us, and I also like to have it in one-line fasta format.

For example for miRBase miRNA hairpin reference:

$ wget ftp://mirbase.org/pub/mirbase/21/hairpin.fa.gz # Get the sequence
$ zcat hairpin.fa.gz | perl -pe '/^>/ ? print "\n" : chomp' | tail -n +2 | sed '/^[^>]/ y/uU/tT/' > hairpin.oneline.fasta # Convert to one-line fasta and replace all Us for Ts
$ grep -A1 "^>hsa" hairpin.oneline.fasta | sed '/^--$/d' > hsa_hairpin.fasta # Get only human hairpin sequences
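A quick check that no Us survived the conversion (a simple grep; the file name is the one created above):

$ grep -v "^>" hsa_hairpin.fasta | grep -c "[uU]"   # should print 0
$ grep -c "^>" hsa_hairpin.fasta                    # number of hairpin sequences in the file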

Removing UMI duplicates from transcriptome alignment – more on Picard

This is a follow-up on my previous post. I was still wondering why Picard doesn’t remove duplicates from my RNA-Seq genome or transcriptome bam. I still haven’t sorted out the main problem, but one of the issues I have “discovered” is that Picard ignores not-primary (secondary) alignments (SAM flag 256). When the input is sorted by coordinate (classic samtools sort), it looks only at primary alignments (more precisely, at anything that is not a not-primary alignment) and removes duplicates from those. If you use only primary alignments, for example in variant calling, you would be fine. But if you would like to use multi-mapped reads, for example in RNA-Seq analysis using RSEM, you might not be getting the deduplicated bam file you would like – the duplicates “formed” by secondary alignments would still be present.

If you want to remove duplicates from secondary alignments as well, you have to sort the bam by queryname (= read name). They actually say this in the Picard MarkDuplicates manual, but you can easily miss it. This is what they say:

The program can take either coordinate-sorted or query-sorted inputs, however the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.

Picard MarkDuplicates

When I tried to re-sort the bam file by queryname using samtools sort -n, I got an error from Picard saying “bam file not sorted properly by queryname” (or something like that). The reason is that samtools sort -n uses natural sort by default, while Picard’s queryname order requires ASCII (lexicographic) sorting. You can get there with a plain ASCII sort as well – something along the lines of samtools view file.bam | env LC_ALL=C sort | samtools view -b > file-sorted.bam (as suggested here), although you have to take care of the header yourself – and it might just work. Or simply use Picard SortSam with SORT_ORDER=queryname and you are fine. Then, in Picard MarkDuplicates, add ASSUME_SORT_ORDER=queryname and it SHOULD remove the duplicates from the secondary alignments as well. The problem is that it still doesn’t. I can still see a lot of alignments which are duplicates but remain in the bam file… But it’s slightly better, and fewer obvious duplicates are kept in the bam file than before.
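For reference, a minimal sketch of that route (Picard invoked here via the picard wrapper script; adjust to java -jar picard.jar … if that is how you run it):

picard SortSam I=in.bam O=in.qsort.bam SORT_ORDER=queryname
picard MarkDuplicates I=in.qsort.bam O=in.qsort.dedup.bam M=in.dedup-metrics.txt \
    ASSUME_SORT_ORDER=queryname REMOVE_DUPLICATES=true
# as described above, even this still left some secondary-alignment duplicates in my hands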

The mystery of Picard MarkDuplicates is still not solved, and it just might not be suitable for RNA-Seq alignments at all.