Saving space with SLOW5/BLOW5 (and POD5) compression of FAST5 files

Edit (01/24/2023): As mentioned in Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing, POD5 is another file format that is supposed to be more efficient than the good ol’ FAST5. It is being developed directly by Nanopore, so you can probably expect to see it more often in the near future. I will add it to the comparison when I find the time to do it.
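If you want to try POD5 in the meantime, Nanopore publishes a pod5 Python package with a converter. A minimal sketch, assuming the pip package name and the convert subcommand as shown in the pod5 docs (I haven’t benchmarked it yet):

$ pip install pod5
$ pod5 convert fast5 indir/*.fast5 --output converted.pod5 # convert single-read FAST5s into one POD5 file (CLI per pod5 docs; untested here)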

It took me some time to get around to testing the new compression of the bulky FAST5 with slow5tools (Fast nanopore sequencing data analysis with SLOW5). Today I finally almost ran out of space on our server, so it was the right time. And I was pleasantly surprised.

Installing on a regular Linux x86_64 architecture is very simple:

$ VERSION=v0.4.0
$ wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
$ ./slow5tools
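The last command prints the usage. A quick sanity check and an optional PATH tweak (plain shell; I’m assuming the --version flag from the slow5tools README):

$ ./slow5tools --version # check the binary runs and report its version
$ export PATH="$PWD:$PATH" # make slow5tools callable from anywhere in this session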

I tested it on an older (2018) MinION direct RNA run with ~4k single-read FAST5 files. The original size of the directory is 435M. Conversion to BLOW5 (the binary version of SLOW5) is done with:

$ slow5tools f2s indir -d outdir_blow5

The conversion took ~1s with the default number of processes (change it with -p INT), and the resulting BLOW5 directory was 210M. That’s ~50% of space saved!
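For reference, this can be measured with plain du and time, with the worker count raised via the -p option mentioned above (8 is just an example):

$ time slow5tools f2s indir -d outdir_blow5 -p 8 # -p sets the number of conversion processes
$ du -sh indir outdir_blow5 # compare on-disk sizes of the original and the BLOW5 directory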

There were some warnings, though:

[search_and_warn::WARNING] slow5tools-v0.4.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.

So it seems the conversion/compression removes some structure from the original FAST5. Let’s see what happened by converting it back to FAST5:

$ slow5tools s2f outdir_blow5 -d indir_decompressed

The decompressed (BLOW5->FAST5) directory is 323M, so it’s not a complete 1:1 conversion. The authors (here, here) say that FAST5 has a lot of wasted space which is removed during the compression. Also, SLOW5/BLOW5 doesn’t store the basecalled sequence (ONT stopped putting the basecalled sequence into FAST5 in their latest MinKNOW anyway – source is the same link), but my testing set didn’t have it to begin with. Finally, slow5tools merged all the individual input reads into 8 bigger multi-read FAST5 files.
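A quick way to convince yourself that no reads were dropped in the round trip is to compare read counts on both sides; a sketch using only ls and h5ls (each top-level group in a multi-read FAST5 is one read):

$ ls indir/*.fast5 | wc -l # number of original single-read files
$ for f in indir_decompressed/*.fast5; do h5ls "$f"; done | wc -l # read groups across the 8 merged files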

But let’s see what’s changed. We can simply list the input and the output with HDF5 tools:

$ h5ls indir/DESKTOP_DO3S5TU_20180420_FAH85823_MN27347_sequencing_run_RNA_Control_73142_read_100_ch_110_strand.fast5 # One original read
Analyses                 Group
PreviousReadInfo         Group
Raw                      Group
UniqueGlobalKey          Group
$ h5ls indir_decompressed/0.fast5 | head -5 # One of the decompressed files
read_00c47597-464e-4263-94a1-b8794353d553 Group
read_0126ca44-9355-44a2-ac2a-5cb6b29715ff Group
read_012cffa1-d978-49f9-8746-91f501379dfd Group
read_0264ff46-f1d7-4983-a351-061dfe793b2d Group
read_032a1da5-9a9d-418a-831f-d17a3f7c2ca9 Group

We see that the structure has changed: the single-read files were merged into multi-read files, and some fields from the original FAST5 are gone. This might be a problem if you have some uncommon or additional fields in your FAST5 that you would like to preserve. You will lose them with the compression.
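If you want to see exactly which fields and attributes survive the round trip, standard HDF5 tools can dump both sides for comparison; a sketch (h5ls -r lists the full tree, h5dump -A prints attributes only):

$ h5ls -r indir/DESKTOP_DO3S5TU_20180420_FAH85823_MN27347_sequencing_run_RNA_Control_73142_read_100_ch_110_strand.fast5 # full tree of one original read
$ h5dump -A indir_decompressed/0.fast5 | less # all attributes of one decompressed multi-read file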

But let’s see how it looks if we basecall the original and the decompressed FAST5 files. Run Guppy (I only have a very old version available, 3.3.2+739c7e5):

$ guppy_basecaller \
    -x "auto" \
    --flowcell FLO-MIN106 \
    --kit SQK-RNA002 \
    --hp_correct on \
    --records_per_fastq 0 \
    --u_substitution off \
    --trim_strategy none \
    --input_path indir/ \
    --save_path indir/guppy/ \
    --recursive \
    --compress_fastq \
    --fast5_out

Basecalling of the original FAST5 directory took 12712 ms; basecalling of the decompressed directory took 12043 ms. The original basecalled output directory was 576M, the decompressed one 466M. That’s still a ~20% difference.

I then ran a simple fastqc on the basecalled fastq files and the reports were identical (good!). I am not going to publish the outputs here. You’ll have to trust me (and test it yourself, of course).
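If you want a stronger check than eyeballing FastQC reports, you can checksum the records directly. Guppy may write reads in a different order, so sort them first; a sketch with standard tools (paste - - - - flattens each 4-line FASTQ record onto one line; I’m assuming the decompressed run was basecalled into indir_decompressed/guppy/):

$ zcat indir/guppy/*.fastq.gz | paste - - - - | sort | md5sum # checksum of the original basecalls
$ zcat indir_decompressed/guppy/*.fastq.gz | paste - - - - | sort | md5sum # should match if the reads are identical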

To sum it up: slow5tools can save a lot of space if you, like me, have 50 MinION runs lying around. But you have to be careful about the lost fields. Also, I haven’t tested it with bulk FAST5 files. I strongly recommend reading the closed issues at the slow5tools GitHub; a lot of the missing things are explained there. If you don’t care about them, BLOW5 is the way to go!

The SLOW5 format is not only about compression! It can also speed up the processing of Nanopore reads/files. Check out the publication for more information.
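One concrete example: BLOW5 can be indexed for fast random access to individual reads, which is what makes parallel downstream tools efficient. A sketch with subcommands from the slow5tools README (the read ID is just one taken from the listing above):

$ slow5tools merge outdir_blow5 -o reads.blow5 # merge the per-file BLOW5s into a single file
$ slow5tools index reads.blow5 # build the random-access index
$ slow5tools get reads.blow5 00c47597-464e-4263-94a1-b8794353d553 # fetch a single read record by its ID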