Saving space with SLOW5/BLOW5 (and POD5) compression of FAST5 files

Edit (01/24/2023): As mentioned in Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing, POD5 is another file format that is supposed to be more efficient than the good ol’ FAST5. It is being developed directly by Nanopore, so you can probably expect to see it more often in the near future. I will add it to the comparison when I find the time to do it.

It took me some time to test the new compression of the bulky FAST5 with slow5tools (Fast nanopore sequencing data analysis with SLOW5). Today, I finally almost ran out of space on our server so it was the right time. And I was pleasantly surprised.

Installing on a regular Linux x86_64 architecture is very simple:

$ VERSION=v0.4.0
$ wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
$ ./slow5tools

I tested it on older (2018) MinION direct RNA FAST5 files with 4k FAST5 files. The original size of the directory is 435M. Compression to BLOW5 (binary version of SLOW5) goes:

$ slow5tools f2s indir -d outdir_blow5

The compressed BLOW5 directory was 210M and took ~1s with the default number of processes (change it with -p INT). That’s ~50% of space saved!

There were some warnings, though:

[search_and_warn::WARNING] slow5tools-v0.4.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.

So it seems the conversion/compression removes some structure from the original FAST5. Let’s see what happened. Let’s convert it back to FAST5 with:

$ slow5tools s2f outdir_blow5 -d indir_decompressed

The decompressed (BLOW5->FAST5) directory has 323M. So it’s not a complete 1:1 conversion. The authors (here, here) say that FAST5 has a lot of wasted space which is removed during the compression. Also, SLOW5/BLOW5 doesn’t store the basecalled sequence (ONT stopped putting the basecalled sequence to FAST5 in their latest MinKNOW anyway – source is the same link) but my testing set didn’t have them to begin with. Also, slow5tools merged all the individual input read into 8 bigger FAST5 files (.

But let’s see what’s changed. We can just simply list the input and the output with HDF5 tools:

$ h5ls indir/DESKTOP_DO3S5TU_20180420_FAH85823_MN27347_sequencing_run_RNA_Control_73142_read_100_ch_110_strand.fast5 # One original read                                                                                                                                   
$ Analyses                 Group
$ PreviousReadInfo         Group
$ Raw                      Group
$ UniqueGlobalKey          Group
$ h5ls indir_fast5/0.fast5 | head -5 # One of the decompressed files
$ read_00c47597-464e-4263-94a1-b8794353d553 Group
$ read_0126ca44-9355-44a2-ac2a-5cb6b29715ff Group
$ read_012cffa1-d978-49f9-8746-91f501379dfd Group
$ read_0264ff46-f1d7-4983-a351-061dfe793b2d Group
$ read_032a1da5-9a9d-418a-831f-d17a3f7c2ca9 Group

We see that some fields from the original FAST5 are missing. This might be a problem if you have some uncommon or additional fields in your FAST5 that you would like to preserve. You will lose them with the compression.

But let’s see how it looks if we basecall the original and the decompressed FAST5 files. Run Guppy (I only have a very old version available 3.3.2+739c7e5):

$ guppy_basecaller \
    -x "auto" \
    --flowcell FLO-MIN106 \
    --kit SQK-RNA002 \
    --hp_correct on \
    --records_per_fastq 0 \
    --u_substitution off \
    --trim_strategy none \
    --input_path indir/ \
    --save_path indir/guppy/ \
    --recursive \
    --compress_fastq \
    --fast5_out

Basecalling of the original FAST5 directory took 12712ms. Basecalling of the decompressed directory took 12043ms. Size of the original basecalled output directory was 576M and of the decompressed 466M. That’s still a ~20% difference.

I then ran a simple fastqc on the basecalled fastq files and they were identical (good!). I am not going to publish here the outputs. You’ll have to trust me (and test it yourself, of course).

To sum it up – slow5tools can save a lot of space if you are like me with 50 MinION runs. But you have to be careful with the lost fields. Also, I haven’t tested it with bulk FAST5 files. I strongly recommend reading the closed issues tab at slow5tools GitHub. A lot of missing things are explained there. If you don’t care about them, BLOW5 is the way to go!

SLOW5 format is not only about compression! The format can also improve the processing of the Nanopore reads/files. Check out the publication for more information.

Minimap2 transcriptome mapping

I like using Minimap2 for our Nanopore RNA sequencing a lot. I tried deSALT and and GraphMap but Minimap2 seems to work better for noisy direct RNA-Seq ONT data.

I usually align the reads to both genome and transcriptome reference. However, the transcriptome mapping didn’t behave as I expected. For the ONT RNA-Seq data the Minimap2 manual led me to use:

minimap2 -a -x map-ont -k 12 -p 1 -u f --secondary=yes ref.fa.mmi reads.fastq.gz

The map-ont doesn’t allow to splice the reads, -k 12 helps with the read noise, -p 1 outputs only same scoring alignments (one as primary, rest to secondary mappings) and --secondary=yes allows to output secondary alignments. The highlighted parameter -u f should force forward strand only mapping. Again, from the manual: For Iso-seq, Direct RNA-seq and tranditional full-length cDNAs, it would be desired to apply -u f to force minimap2 to consider the forward transcript strand only. Therefore I was expecting to see forward strand mapping only since I used transcriptome fasta as a reference. However, this was not a case. A simple scan with samtools view -f 16 showed a reasonable number of mappings. What?! I thought it’s some kind of feature of the data and that Minimap2 is doing everything correctly but somehow the reads mapped to the reverse strand do not map anywhere else so it puts them to the reverse strand as a last resort. This all changed with our latest spike-in only (SIRV) experiment.

In this case I knew there should not be any reverse strand mapping whatsoever. But to my surprise, the command above gave me quite a lot of reverse strand mapping. So I dug a bit deeper on where it the problem and I found it. By accident I found a Ubuntu Minimap2 manual (later on I also found the official one here) where it says:

-u CHAR How to find canonical splicing sites GT-AG - f: transcript strand; b: both strands; n: no attempt to match GT-AG [n]

So in reality, -u f has nothing do to with forward strand mapping only. Or at least not in the way I would expect it. It basically looks for predefined splice sites but only in the same direction as is the read. So it makes sense only in genomic spliced alignments, nowhere else. Whoops. The parameter I was looking for turned out to be --for-only which again, by the Ubuntu manual says:

--for-only Only map to the forward strand of the reference sequences. For paired-end reads in the forward-reverse orientation, the first read is mapped to forward strand of the reference and the second read to the reverse stand.

After removing the -u f and replacing it with --for-only I finally got the alignments I was hoping for and absolutely no alignments on the reverse strand. Note: There is also --rev-only option. It should be possible to find this option in the command-line Minimap2 manual if you use:

man ./minimap2.1' for detailed description of these and other advanced command-line options. 

but this never worked for me since I install Minimap2 using Conda. On the other hand, I didn’t expect the brief manual wouldn’t mention it. You will also find hardly anything at the Minimap2 GitHub.

Obviously, this is not an issue of Minimap2 because it is in the manual after all. But, you have to really look for the manual in order to find it which I think it’s not the best. Hopefully this will help you to figure out why you are not getting your expected alignments (and you are the same ignorant as me).