bedtools merge book-end merge

I never paid too much attention to this (I should have), but TIL that bedtools merge default is not only to merge overlapping regions but also to merge book-ended regions.

From the bedtools merge manual: -d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.

What does this mean? With the default (-d 0) settings, bedtools merge will also merge the following regions:

# Two regions with book-end relationship - they are next to each other like books:
12345678
AAAA
BBBB
# Will get merged with the default bedtools merge -d 0 settings into:
12345678
AAAABBBB

To avoid merging book-ended regions, you have to change the default to -d -1. The example above will not get merged anymore and bedtools merge will merge only truly overlapping regions.

Note: bedtools merge is smart enough to correctly merge VCF files with END info field. This is useful, for example, when working with VCFs with structural variants, such as CNVs. You don’t have to convert the VCF into BED, merge, and convert back to VCFs.

Extract a single sample from multi-sample VCF

Sometimes you want to subset/extract just a single sample from a large multi-sample VCF. In my case, we wanted to get per-sample VCF from the 1000 Genome Project.

As there is always more than one way to skin a cat, I compared vcftools and bcfools. It turned out bcftools is ~50% faster than vcftools and then both produce identical output.

## Get a large VCF
wget ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz && tabix ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz

## Decide on a sample to extract
SAMPLE="NA18941"

## Extract and recode (keep only variants relevant for this sample) VCF for this specific sample without changing anything else
# bcftools
time bcftools view --samples $SAMPLE --min-ac=1         --no-version --no-update ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz > $SAMPLE.vcf
real    0m4.738s
user    0m4.717s
sys     0m0.017s

# vcftools
time vcftools --gzvcf ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz         --indv $SAMPLE --non-ref-ac-any 1 --recode --recode-INFO-all --out $SAMPLE
real    0m7.505s
user    0m7.491s
sys     0m0.012s

# Compare bcftools and vcftools output
 diff NA18941.vcf NA18941.recode.vcf
<empty>

bcftools is much faster (~50%) and much “smarter” than vcftools. By “smarter” I mean bcftools can directly process vcf.gz without any changes in the code. In vcftoolf, you have to change from --vcf to --gzvcf. Also, the vcftools has default output name $SAMPLE.recode.vcf whereas bcftools has stdout by default (so you can pipe it to bgzip or similar).