Sometimes you want to subset/extract just a single sample from a large multi-sample VCF. In my case, we wanted to get per-sample VCF from the 1000 Genome Project.
As there is always more than one way to skin a cat, I compared vcftools
and bcfools
. It turned out bcftools
is ~50% faster than vcftools
and then both produce identical output.
## Get a large VCF
wget ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz && tabix ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz
## Decide on a sample to extract
SAMPLE="NA18941"
## Extract and recode (keep only variants relevant for this sample) VCF for this specific sample without changing anything else
# bcftools
time bcftools view --samples $SAMPLE --min-ac=1 --no-version --no-update ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz > $SAMPLE.vcf
real 0m4.738s
user 0m4.717s
sys 0m0.017s
# vcftools
time vcftools --gzvcf ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz --indv $SAMPLE --non-ref-ac-any 1 --recode --recode-INFO-all --out $SAMPLE
real 0m7.505s
user 0m7.491s
sys 0m0.012s
# Compare bcftools and vcftools output
diff NA18941.vcf NA18941.recode.vcf
<empty>
bcftools
is much faster (~50%) and much “smarter” than vcftools
. By “smarter” I mean bcftools
can directly process vcf.gz
without any changes in the code. In vcftoolf
, you have to change from --vcf
to --gzvcf
. Also, the vcftools
has default output name $SAMPLE.recode.vcf
whereas bcftools
has stdout by default (so you can pipe it to bgzip
or similar).