variants – Jan's Blog of Bioinformatics Bits

Sometimes you want to subset/extract just a single sample from a large multi-sample VCF. In my case, we wanted to get per-sample VCF from the 1000 Genome Project.

As there is always more than one way to skin a cat, I compared vcftools and bcfools. It turned out bcftools is ~50% faster than vcftools and then both produce identical output.

## Get a large VCF
wget ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz && tabix ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz

## Decide on a sample to extract
SAMPLE="NA18941"

## Extract and recode (keep only variants relevant for this sample) VCF for this specific sample without changing anything else
# bcftools
time bcftools view --samples $SAMPLE --min-ac=1         --no-version --no-update ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz > $SAMPLE.vcf
real    0m4.738s
user    0m4.717s
sys     0m0.017s

# vcftools
time vcftools --gzvcf ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz         --indv $SAMPLE --non-ref-ac-any 1 --recode --recode-INFO-all --out $SAMPLE
real    0m7.505s
user    0m7.491s
sys     0m0.012s

# Compare bcftools and vcftools output
 diff NA18941.vcf NA18941.recode.vcf
<empty>

bcftools is much faster (~50%) and much “smarter” than vcftools. By “smarter” I mean bcftools can directly process vcf.gz without any changes in the code. In vcftoolf, you have to change from --vcf to --gzvcf. Also, the vcftools has default output name $SAMPLE.recode.vcf whereas bcftools has stdout by default (so you can pipe it to bgzip or similar).

Tag: variants

Extract a single sample from multi-sample VCF