Alternative to GNU parallel

From time to time I find myself in a situation where I feel a particular step in my analysis could run faster. If the software/command doesn’t support multithreading, you have several options for speeding it up. The easiest is to use GNU parallel or xargs. However, GNU parallel is Perl-based, which doesn’t always work well in virtual environments, and xargs is not very flexible.

After I lost my patience with making GNU parallel work in all my environments, I searched for an alternative. The best solution ended up being rush (Wei Shen; GitHub). It’s based on gargs (Brent Pedersen; GitHub) and written in Go, so you won’t have the compatibility problems you get with GNU parallel.

The user interface is very similar to GNU parallel and so far it has been working very well. It also has some nice features such as resuming failed jobs, setting variables (similar to awk), removing suffixes/replacing strings, etc.
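For example, a quick sketch of the placeholder syntax (hypothetical file names; check rush --help for the exact placeholder list): {.} strips the last suffix of the input, and -c/-C let a killed run resume by skipping commands already recorded as finished.

# Decompress a set of files in parallel; sample.fastq.gz becomes sample.fastq
ls *.fastq.gz | rush 'zcat {} > {.}' -j 4 -c -C gunzip-done.txt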

For me personally, the most convenient way is still to echo the commands to run in parallel into a file and then run rush on this file (one command per line):

threads=5

> cmds.txt # Empty previous cmds file

# Prepare multiplications of numbers 1-10 by a thousand
for i in {1..10}; do
    echo "echo \"$i*1000\" | bc"
done > cmds.txt

# Run commands in parallel with rush and save completed commands
cat cmds.txt | rush '{}' -j $threads --verbose -c -C cmds-done.txt

P.S. I just found out that the author of GNU parallel lists a lot of alternatives, with a detailed description of the differences, here.

Speed up grep search

I am a simple man and I like simple solutions. Whenever I need to subset a file which doesn’t have a nice column structure, I use grep. Often, I need to search for a list of strings; to be more specific, to subset a GTF file by a list of transcript IDs.

A regular grep -f transcripts.txt annot.gtf is very slow (transcripts.txt contains the list of strings to search for in annot.gtf). There are several approaches to speeding up the search. One approach is to split transcripts.txt or annot.gtf and then run a for loop. Another one is to run xargs (same link). However, this doesn’t work well with GTF; for some reason it was splitting individual lines, creating nonsense results. The next option is to use an alternative grep implementation, such as ripgrep, but I wanted to stay with the regular grep. The option I ended up with is a combination of grep and parallel. Since I search only for plain ASCII strings and not a regular expression, I can add export LC_ALL=C and -w -F.

threads=6

export LC_ALL=C
cat transcripts.txt | parallel -j $threads grep -w -F {} annot.gtf > annot.sub.gtf

Of course, you can combine splitting the files (split -l) with parallel or similar, as sketched below. I haven’t done any extensive testing, but I would assume I/O would be the limit.
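If you want to go that route, a minimal sketch could look like this (the chunk size of 1,000 and the patterns.chunk prefix are arbitrary; export LC_ALL=C from above still applies):

# Split the list of strings into chunks and run one grep per chunk;
# GNU parallel buffers the output per job, so the lines won't get mixed up
split -l 1000 transcripts.txt patterns.chunk
ls patterns.chunk* | parallel -j $threads "grep -w -F -f {} annot.gtf" > annot.sub.gtf
rm patterns.chunk*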

Speed up deepTools computeMatrix

I like using deepTools in my analyses but it might get very slow if you work with big(er) files. Yes, it offers multithreading, but that often doesn’t help and using a lot of threads already killed our server once (it wasn’t me, I swear).

One of the slowest parts I have encountered is the computeMatrix module. If your references (-R) are more than just a couple of hundred lines, it gets veeeery slow. After a few days of frustration and waiting for the matrix, I decided this was not the way to go.

It seems computeMatrix doesn’t scale well with an increasing number of reference positions. It turned out to be much more efficient to split the references into smaller chunks, calculate the matrices and then merge them together. A simple table of the runtimes looks like this:

Number of positions in -R    Time / seconds
1,000                        4
5,000                        20
10,000                       58
25,000                       275

Fig 1. Runtime of deepTools computeMatrix on different numbers of reference (-R) positions. The original reference file was randomly sampled 3 times (shuf -n number_of_positions) for each number of positions and the runtimes were averaged.

It seems that, in this experiment, the sweet spot is somewhere around 5,000 positions, but I think it might be different in every experiment. Since my references have around 100,000-150,000 positions, splitting them into chunks of 5,000 doesn’t overload the system much.

The final code looks like this:

#!/bin/bash
#
# Speed up deepTools computeMatrix by splitting the references into smaller chunks and then merging the matrices together
#
positions=5000
threads=12
rnd=$RANDOM
# Split the reference into chunks by number of lines
split -l $positions ref.bed ref.chunks${rnd}
for chunk in ref.chunks${rnd}*; do
    # Rename the name column (4) in the bed to avoid potential problems with deepTools naming, which might happen if the reference position names are not unique
    name=$(basename $chunk)
    name=${name##*.}
    cat $chunk | awk -v name=$name 'BEGIN {FS = "\t"; OFS = "\t"} {print $1,$2,$3,name,$5,$6}' > tmp.$rnd && mv tmp.$rnd $chunk
done
# Calculate the matrix for each chunk
for chunk in ref.chunks${rnd}*; do
    computeMatrix reference-point \
        --referencePoint TSS \
        -R $chunk \
        -S input.bw \
        -b 500 -a 500 \
        --skipZeros \
        --missingDataAsZero \
        --binSize 10 \
        --averageTypeBins median \
        --numberOfProcessors $threads \
        --outFileName ${chunk}.gz
done
# Merge the chunks back into one file
computeMatrixOperations rbind -m ref.chunks${rnd}*.gz -o ref.matrix.gz && rm ref.chunks${rnd}*.gz
# Make the heatmap
plotHeatmap \
    -m ref.matrix.gz \
    --sortUsing mean \
    --averageTypeSummaryPlot mean \
    --missingDataColor "#440154" \
    --colorMap viridis \
    --zMax 100 \
    --linesAtTickMarks \
    --refPointLabel "TSS" \
    --heatmapHeight 20 \
    --heatmapWidth 10 \
    --dpi 300 \
    --outFileName ref.png
rm ref.chunks${rnd}*

Strikethrough text in R plots

I wanted to add strikethrough text to an R ggplot2 plot and I thought I would have to play with font settings. But I have found this nice hint that you don’t actually have to do that. All you need is to add a combining strikethrough character after each character of the text (labels, in my case) and ggplot2 will happily render it for you. The trick is the following:

> strk <- stringr::str_replace_all("strikethrough", "(?<=.)", "\u0336")
> strk
[1] "s̶t̶r̶i̶k̶e̶t̶h̶r̶o̶u̶g̶h̶"

In reality it looks much nicer than the stupid copy-paste you see above.

Remove common part of multiple strings in R

In my work, I often get a long list of sample names which are way too long. And we all know plots don’t like long sample names. The easiest fix is to remove the common part of the sample names and create a shorter version. By removing the common part you keep only the unique part of the sample name, which can still be used to identify the samples.

# Make a function which splits the input vector of strings (name_vect) by a separator (sepa) and returns only unique strings separated by the same separator
rename_samples <- function(name_vect, sepa) {
  samp_names.tmp <- t(as.data.frame(strsplit(name_vect, sepa, fixed = T)))
  ind <- lapply(apply(samp_names.tmp, 2, unique), length) != 1
  samp_names.tmp <- as.data.frame(samp_names.tmp[, ind])
  samp_names <- as.character(interaction(samp_names.tmp, sep = sepa))
  return(samp_names)
}
# Prepare a vector of long strings which have common parts which can be stripped
samples <- c("ath.RNASeq.MiwiKO.P30.1",
"ath.RNASeq.MiwiKO.P30.2",
"ath.RNASeq.MiwiKO.P20.1",
"ath.RNASeq.MiwiWT.P30.1",
"ath.RNASeq.MiwiWT.P30.2",
"ath.RNASeq.MiwiWT.P20.1")
# Apply the function
rename_samples(samples, ".")
[1] "MiwiKO.P30.1" "MiwiKO.P30.2" "MiwiKO.P20.1" "MiwiWT.P30.1" "MiwiWT.P30.2" "MiwiWT.P20.1"

And voila – the initial long sample names were shortened and made much nicer and cleaner.

Note: this will work only if the sample names have the same naming style. This means they have to have the same number of blocks separated by the sepa character.

stdin and stdout decision in Python

The easiest way I have come up with in Python to automatically decide between a defined input file or stdin, and a defined output file or stdout, is:

import sys

# If intab (input) is defined, read it; if not assume stdin
if intab:
    f = open(intab, "r")
else:
    f = sys.stdin

# If otab (output) is defined, write to it; if not assume stdout
if otab:
    fout = open(otab, "w")
else:
    fout = sys.stdout

# Do whatever you want 
for line in f:
    print(line)
    fout.write(line)

# And close files, just to be safe
if f is not sys.stdin:
    f.close()

if fout is not sys.stdout:
    fout.close()

There might be something easier but this is simple and obvious.

Count number of mappings (get unique and multi mappings)

I have always struggled with unique mapping filtering. This is mainly because every aligner/mapper marks unique mappings differently. Some aligners assign a specific MAPQ (STAR: 255), some assign a SAM tag (bwa aln: XT:A:U), and some just say how likely the alignment is to be unique (bwa mem: MAPQ 0 = definitely not unique, the higher the MAPQ the more likely the alignment is unique).
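For example, for STAR the usual shortcut is a plain MAPQ filter (a sketch, with star.bam being whatever your STAR output is called; the 255 cutoff is specific to STAR and won’t transfer to other aligners):

# STAR gives uniquely mapped reads MAPQ 255; -q 255 keeps only those alignments
samtools view -b -q 255 star.bam > star.unique.bam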

Although many people do not support the concept of a unique alignment, I am a simple man and I like to know the number of mappings per read. Especially with long-read sequencing, where we might want to separate reads with strong support for a single mapping position from those that could be mapped to multiple regions.

To make this simple and aligner/mapper-independent, the solution is quite simple – count how many times we see a single read (by name) mapped in a SAM file. To do this I made a very simple Python script which does the counting and adds the number of alignments as a new SAM tag.

sam-counts-maps.py (very beta version) counts the number of alignments per read name. So far it cannot handle paired-end reads and chimeric reads, but I would love to extend it soon.

The use is rather simple:

$ samtools view -h test.bam | python sam-counts-maps.py --tag X0 | samtools view -bh - | samtools sort - > out.bam

The script cannot handle BAM files because I didn’t want to use pysam, to keep it more compatible (and I don’t like using pysam to read/write SAM/BAM because it likes to modify the files randomly).

The full options are:

usage: sam-count-maps.py [-h] [-i INPUT] [-o OUTPUT] [-t TAG] [-w]

Count number of mappings per read.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input SAM file. Default: stdin
  -o OUTPUT, --output OUTPUT
                        Output SAM file. Default: stdout
  -t TAG, --tag TAG     New tag with number of mappings to be added. Default: X0.
  -w, --overwrite       Do you want to overwrite if --tag already exists?
                        Default: Don't overwrite.

Then, you can just grep for the number of alignments using the added tag. For example, to get unique alignments:

samtools view -h out.bam | grep -w -E "^@[A-Z][A-Z]|X0:i:1" | samtools view -bh - > unique.bam
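And the complementary set, the multi-mapped reads, can be pulled out by inverting the match (same X0 tag as above; the header lines don’t contain the tag, so grep -v passes them through untouched):

samtools view -h out.bam | grep -v -w "X0:i:1" | samtools view -bh - > multi.bam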

Trim UTRs from exons in GTF

I needed to get ‘clean’ coding exons from a GTF file. In Ensembl, annotated exons include both coding exons and UTRs. Sometimes they overlap, sometimes they are identical, and sometimes they are completely independent. An example of an overlapping exon and UTR is shown below.

seqnames  start   end     width  strand  source          type             score  phase  gene_id
1         966502  966614  113    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         966502  966531  30     +       ensembl_havana  five_prime_utr   NA     NA     ENSG00000187583
1         974442  975008  567    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         974576  975008  433    +       ensembl_havana  three_prime_utr  NA     NA     ENSG00000187583

You can see that both 5′ and 3′ UTRs overlap an exon. Now, in my case I would like to have a heatmap of read coverage over the coding part of the gene (meta-gene, to be more precise). If I use the ‘raw’ annotation and take only rows annotated as exons, I will get a mixture of coding parts of the exons as well as UTRs, and that’s not what I want.

For me, the easiest way to trim the UTRs is to use R with the great GenomicFeatures, rtracklayer and data.table packages. Assuming you have an Ensembl (or GENCODE) GTF (it might work for others as well), you can simply load your GTF into R, get the coordinates of the exons with UTRs trimmed, and move on with your work.

library("GenomicFeatures")

txdb = makeTxDbFromGFF("in.gtf", format = "gtf") # make db
cds <- cdsBy(txdb, by="tx", use.names=T) # get only coding part of the exons = remove UTRs

GenomicFeatures, however, doesn’t preserve the other GTF fields, which I personally like to keep. I will therefore load the GTF once more and merge it with the coding exons to keep all the extra information.

library("data.table")
library("rtracklayer")

gtf <- rtracklayer::import("in.gtf", format="gtf")
gtf.dt <- as.data.table(gtf)
gtf.dt.exon <- gtf.dt[type=="exon"] # get only exon

cds.dt <- as.data.table(cds)
cds.dt <- cds.dt[, c("group_name", "start", "end", "exon_rank")] # keep only columns we need

gtf.dt.exon[, new:=paste(transcript_id, exon_number, sep=".")] # make replacement id
cds.dt[, new:=paste(group_name, exon_rank, sep=".")] # make replacement id

gtf.dt.exon <- merge(gtf.dt.exon, cds.dt, by = "new", all.x = T) # merge the tables keeping all.x to be able to debug and check
      
gtf.dt.exon <- na.omit(gtf.dt.exon, cols="group_name") # remove exons which do not have an alternative in coding exons
gtf.dt.exon[, start.x := start.y] # update start/end by the cds
gtf.dt.exon[, end.x := end.y]
setnames(gtf.dt.exon, c("start.x", "end.x"), c("start", "end")) # rename
gtf.dt.exon[, c("group_name", "start.y", "end.y", "exon_rank", "new") := NULL] # clean

gtf.dt.exon[, type:="coding_exon"] # change exon column to coding_exon

gtf.out <- rbindlist(list(gtf.dt.exon, gtf.dt)) # add new lines to the original gtf
gtf.out <- gtf.out[order(seqnames, start)] # sort

gtf.out.granges <- makeGRangesFromDataFrame(gtf.out, keep.extra.columns=TRUE, ignore.strand=FALSE) # make the GRanges for easier export
rtracklayer::export(gtf.out.granges, "out.gtf") # export final gtf

And here you go – the nice and clean coding parts of the exons (column 3 = coding_exon) exported together with the original GTF:

seqnames  start   end     width  strand  source          type             score  phase  gene_id
1         966502  966614  113    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         966502  966531  30     +       ensembl_havana  five_prime_utr   NA     NA     ENSG00000187583
1         966532  966614  113    +       ensembl_havana  coding_exon      NA     NA     ENSG00000187583
1         974442  974575  567    +       ensembl_havana  coding_exon      NA     NA     ENSG00000187583
1         974442  975008  567    +       ensembl_havana  exon             NA     NA     ENSG00000187583
1         974576  975008  433    +       ensembl_havana  three_prime_utr  NA     NA     ENSG00000187583
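A quick sanity check of the exported file from the shell, just to eyeball the new coding_exon rows (out.gtf as exported above):

awk -F "\t" '$3 == "coding_exon"' out.gtf | head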

I have then tested this modified GTF in deepTools (3.5.0) and it works just fine. deepTools loves to complain about GTFs if it sees something fishy (…nice joke for those who know the logo of deepTools).

IGV batch snapshot from command line

Right now I am processing a lot of public sequencing datasets from SRA. As always in this situation, proper QC is necessary to unravel all the adaptor sequences, UMIs, … especially for datasets where we care about precise 5′/3′ ends. In my case, it’s Ribo-Seq and we want to look at the 5′ ends in great detail.

For a long time, my QC pipeline was missing a detailed alignment check. I mean really seeing the alignment itself in bases, not just some summary statistics. This would help me to spot extensive and repetitive softclipping, mismatches, etc., which can point to a preprocessing error. Unless I wanted to open each BAM in IGV manually, I needed something more automated. I knew IGV has a batch mode but I had never gone into detail.

To make IGV snapshots automatic without opening the IGV GUI for every BAM, you need to get xvfb-run first. xvfb-run creates a fake X11 display which allows you to launch IGV from the command line without opening its GUI (inspiration from here, which links to the IGV Snapshot Automator tool). Once you have this, you can start the fun. It’s possible to do it without xvfb-run, but then the script would open the IGV GUI every time you try to make a snapshot, which is not always possible (for example, some ssh connections don’t allow X11 forwarding).

Then, you have to make your batch (~config) file which instructs IGV what to do. A simple example is available at IGV. I would recommend two snapshots targeted at some region of a housekeeping gene – one which captures an exon-exon boundary to see a splicing event, and one with a more detailed, zoomed-in look to see the alignment at single-base resolution. For human with the Ensembl genome I am using this one:

genome /home/jan/data/hg38/genome/genome.fa
load /home/jan/projects/test/samples/test/test.bam
snapshotDirectory /home/jan/projects/test/samples/test/Screenshots/
preference SAM.SHOW_SOFT_CLIPPED true
goto 12:112022058-112022719
sort position
snapshot splice.png
goto 12:112022150-112022250
sort position
snapshot zoom.png
exit

It instructs IGV to load the user reference genome (can be fasta), load my BAM, set the snapshot directory, load a temporary preference, then go to a specific position, take a snapshot and save it with a specific name, go to another location and do the same, and exit IGV.

The IGV manual doesn’t say much about the preference option, nor does it list the available options (or I haven’t found them). Preferences let you adjust the behaviour of IGV in the same way as you do in the Preferences tab of the IGV GUI. The easiest way to see the possible settings is to read $HOME/igv/prefs.properties. This lists all the preferences you have changed manually in the IGV GUI. Then you just have to guess what is what. Some options are listed at the IGV GitHub, but the easiest is to copy your favourite preferences and just use those.
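For example, to see what the alignment-track settings are called (a small sketch; as far as I can tell the file stores preferences as simple KEY=value lines and only contains the ones you have changed at some point):

grep "^SAM\." $HOME/igv/prefs.properties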

The next trick is to replace the regular java execution with xvfb-run. For example, you can just copy-paste the regular IGV launch script at IGV/igv.sh and replace the line

exec java -showversion --module-path="${prefix}/lib" -Xmx16g \

with the line

xvfb-run --auto-servernum --server-num=1 java -showversion --module-path="${prefix}/lib" -Xmx16g \

And the last step is to launch IGV itself. Let’s say you have saved the batch/config file as igv-batch.config and the modified launch script as igv-batch.sh; then you can just execute:

sh igv-batch.sh -b igv-batch.config
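Since the whole point was to check many BAMs from SRA at once, I wrap this in a small loop which writes one batch file per BAM and runs them one by one. This is just a rough sketch with made-up paths (samples/*/alignment.bam); the regions are the same as in the example above:

regions="12:112022058-112022719 12:112022150-112022250"

for bam in samples/*/alignment.bam; do
    dir=$(dirname $bam)
    batch=$dir/igv-batch.config
    mkdir -p $dir/Screenshots

    # Write the batch file for this BAM
    echo "genome /home/jan/data/hg38/genome/genome.fa" > $batch
    echo "load $bam" >> $batch
    echo "snapshotDirectory $dir/Screenshots/" >> $batch
    echo "preference SAM.SHOW_SOFT_CLIPPED true" >> $batch

    i=1
    for region in $regions; do
        echo "goto $region" >> $batch
        echo "sort position" >> $batch
        echo "snapshot region${i}.png" >> $batch
        i=$((i+1))
    done
    echo "exit" >> $batch

    # Run IGV headless through the modified launch script
    sh igv-batch.sh -b $batch
done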

Now, you can enjoy your beautiful IGV snapshots.

Part of a splice junction in a general overview. You can see there is some softclipping at the beginning and the end of the reads.
After zooming in to single-base resolution, we can see there are 4 additional nucleotides at both ends which need to be trimmed (or clipped) away.

Minimap2 transcriptome mapping

I like using Minimap2 for our Nanopore RNA sequencing a lot. I tried deSALT and GraphMap, but Minimap2 seems to work better for noisy direct RNA-Seq ONT data.

I usually align the reads to both the genome and the transcriptome reference. However, the transcriptome mapping didn’t behave as I expected. For ONT RNA-Seq data, the Minimap2 manual led me to use:

minimap2 -a -x map-ont -k 12 -p 1 -u f --secondary=yes ref.fa.mmi reads.fastq.gz

The map-ont preset doesn’t allow spliced alignment of the reads, -k 12 helps with the read noise, -p 1 outputs only alignments with the same score (one as primary, the rest as secondary mappings) and --secondary=yes allows secondary alignments in the output. The parameter in question, -u f, should force forward-strand-only mapping. Again, from the manual: For Iso-seq, Direct RNA-seq and traditional full-length cDNAs, it would be desired to apply -u f to force minimap2 to consider the forward transcript strand only. Therefore I was expecting to see forward-strand mappings only, since I used a transcriptome fasta as the reference. However, this was not the case. A simple scan with samtools view -f 16 showed a reasonable number of mappings. What?! I thought it was some kind of feature of the data and that Minimap2 was doing everything correctly, but that somehow the reads mapped to the reverse strand did not map anywhere else, so it put them on the reverse strand as a last resort. This all changed with our latest spike-in-only (SIRV) experiment.

In this case I knew there should not be any reverse-strand mapping whatsoever. But to my surprise, the command above gave me quite a lot of reverse-strand mappings. So I dug a bit deeper into where the problem was, and I found it. By accident I found the Ubuntu Minimap2 manual (later on I also found the official one here) where it says:

-u CHAR How to find canonical splicing sites GT-AG - f: transcript strand; b: both strands; n: no attempt to match GT-AG [n]

So in reality, -u f has nothing to do with forward-strand-only mapping. Or at least not in the way I would expect. It basically looks for canonical splice sites, but only on the same strand as the transcript. So it makes sense only for spliced genomic alignments, nowhere else. Whoops. The parameter I was looking for turned out to be --for-only, which, again according to the Ubuntu manual, says:

--for-only Only map to the forward strand of the reference sequences. For paired-end reads in the forward-reverse orientation, the first read is mapped to forward strand of the reference and the second read to the reverse stand.

After removing -u f and replacing it with --for-only I finally got the alignments I was hoping for, and absolutely no alignments on the reverse strand (the corrected command is shown below). Note: there is also a --rev-only option. It should be possible to find these options in the command-line Minimap2 manual if you use:

man ./minimap2.1' for detailed description of these and other advanced command-line options. 

but this never worked for me since I install Minimap2 using Conda. On the other hand, I didn’t expect that the brief help message wouldn’t mention it. You will also find hardly anything about it at the Minimap2 GitHub.
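For reference, the corrected command (the same input files as above) should then be:

minimap2 -a -x map-ont -k 12 -p 1 --for-only --secondary=yes ref.fa.mmi reads.fastq.gz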

Obviously, this is not an issue with Minimap2, because it is in the manual after all. But you have to really look for the manual in order to find it, which I don’t think is ideal. Hopefully this will help you figure out why you are not getting your expected alignments (if you are as ignorant as me).