Speed up deepTools computeMatrix

I like using deepTools in my analyses, but it can get very slow when you work with big(ger) files. Yes, it offers multithreading, but that often doesn't help, and using a lot of threads already killed our server once (it wasn't me, I swear).

One of the slowest parts I have encountered is the computeMatrix module. If your reference (-R) file is more than a couple of thousand lines, it will get veeeery slow. After a few days of frustration and waiting for the matrix, I decided this was not the way to go.

It seems computeMatrix doesn't scale well with an increasing number of reference positions. It turned out to be much more efficient to split the references into smaller chunks, calculate a matrix for each chunk, and then merge the matrices together. A simple table of the runtimes looks like this:

Number of positions in -R    Time (seconds)
1,000                        4
5,000                        20
10,000                       58
25,000                       275

Table 1. Runtime of deepTools computeMatrix for different numbers of reference (-R) positions. For each size, the original reference file was randomly subsampled three times (shuf -n number_of_positions) and the runtimes were averaged.
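
If you want to reproduce the timing on your own data, a minimal sketch could look like this (ref.bed and input.bw are the same placeholder file names used in the script below):

for n in 1000 5000 10000 25000; do
  shuf -n "$n" ref.bed > "ref.sub${n}.bed"    # random subsample of the reference
  echo "timing ${n} positions"
  time computeMatrix reference-point \
    -R "ref.sub${n}.bed" \
    -S input.bw \
    -b 500 -a 500 \
    --outFileName "ref.sub${n}.gz"
done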

It seems that in this experiment the sweet spot is somewhere around 5,000 positions, but I think it might be different in every experiment. Since my references have around 100,000-150,000 positions, splitting them into chunks of 5,000 positions (20-30 chunks) doesn't overload the system much.
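
To check in advance how many chunks a given reference would produce, a quick one-liner (again assuming ref.bed) is:

echo $(( ($(wc -l < ref.bed) + 4999) / 5000 ))    # ceiling division: number of 5,000-line chunks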

The final code looks like this:

#!/bin/bash
#
# Speed up deepTools computeMatrix by splitting the references into smaller chunks and then merging the matrices together
#
positions=5000
threads=12
rnd=$RANDOM    # random tag so chunk files from parallel runs don't collide
# split reference into chunks by number of lines
split -l $positions ref.bed ref.chunks${rnd}
for chunk in ref.chunks${rnd}*; do
  # Replace the name column (4) in the bed file with the chunk name to avoid potential
  # problems with deepTools naming, which might happen if the reference position names are not unique
  name=$(basename "$chunk")
  name=${name##*.}    # strip everything up to the last dot, e.g. ref.chunks12345aa -> chunks12345aa
  awk -v name="$name" 'BEGIN {FS = OFS = "\t"} {print $1,$2,$3,name,$5,$6}' "$chunk" > tmp.$rnd && mv tmp.$rnd "$chunk"
done
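# At this point every bed line carries its chunk name in column 4, e.g. (hypothetical coordinates):
# chr1    1000    2000    chunks12345aa    .    +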
# calculate matrix for each chunk
for chunk in ref.chunks${rnd}*; do
  computeMatrix reference-point \
    --referencePoint TSS \
    -R $chunk \
    -S input.bw \
    -b 500 -a 500 \
    --skipZeros \
    --missingDataAsZero \
    --binSize 10 \
    --averageTypeBins median \
    --numberOfProcessors $threads \
    --outFileName ${chunk}.gz
done
# merge the chunk matrices back into one file (the glob expands in the same
# alphabetical order in which split named the chunks, so the original row order is kept)
computeMatrixOperations rbind -m ref.chunks${rnd}*.gz -o ref.matrix.gz && rm ref.chunks${rnd}*.gz
# make heatmaps
plotHeatmap \
  -m ref.matrix.gz \
  --sortUsing mean \
  --averageTypeSummaryPlot mean \
  --missingDataColor "#440154" \
  --colorMap viridis \
  --zMax 100 \
  --linesAtTickMarks \
  --refPointLabel "TSS" \
  --heatmapHeight 20 \
  --heatmapWidth 10 \
  --dpi 300 \
  --outFileName ref.png
# clean up the chunked reference bed files
rm ref.chunks${rnd}*
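
If you want to sanity-check the merged matrix before plotting, computeMatrixOperations can print its sample and region-group information:

computeMatrixOperations info -m ref.matrix.gz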