
I have VCF files (SNPs & indels) for WGS on 100 samples, but I want to only use a specific subset of 10 of the samples. Is there a relatively easy way to pull out only the 10 samples, while still keeping all of the data for the entire genome?

I have a script that pulls out regions of the whole genome for all 100 samples, so something similar that pulls out those regions for only the 10 samples I want would be ideal.

user438383
KLuc
    Welcome. Do you want 10 random samples or 10 specific samples? Also, are you talking about VCF files containing SNPs/indels? Why do you want to subsample your VCF file? We like details; details make questions understandable and answerable. You can edit your question and add all the details. – Kamil S Jaron Feb 07 '18 at 16:08

4 Answers


Bcftools has sample/individual filtering as an option for most of the commands. You can subset individuals by using the -s or -S option:

-s, --samples [^]LIST

Comma-separated list of samples to include or exclude if prefixed with "^". Note that in general tags such as INFO/AC, INFO/AN, etc are not updated to correspond to the subset samples. bcftools view is the exception where some tags will be updated (unless the -I, --no-update option is used; see bcftools view documentation). To use updated tags for the subset in another command one can pipe from view into that command.

-S, --samples-file FILE

File of sample names to include or exclude if prefixed with "^". One sample per line. See also the note above for the -s, --samples option. The command bcftools call accepts an optional second column indicating ploidy (0, 1 or 2) or sex (as defined by --ploidy, for example "F" or "M"), and can parse also PED files. If the second column is not present, the sex "F" is assumed. With bcftools call -C trio, PED file is expected. File formats examples:

sample1    1
sample2    2
sample3    2

or

sample1    M
sample2    F
sample3    F

or a .ped file (here is shown a minimum working example, the first column is ignored and the last indicates sex: 1=male, 2=female):

ignored daughterA fatherA motherA 2
ignored sonB fatherB motherB 1

Example usage:

bcftools view -s sample1,sample2 file.vcf > filtered.vcf
bcftools view -S sample_file.txt file.vcf > filtered.vcf

See the bcftools manpage for more information.
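For the situation in the question (a fixed list of samples, kept genome-wide or restricted to regions), a minimal sketch might look like the following. The file names, sample names, and region are placeholders rather than anything from the question, and the region step assumes the input is bgzip-compressed and indexed. The bcftools calls are guarded so the snippet does nothing harmful if bcftools or the input file is missing.

```shell
# Hypothetical file names; replace with your own.
in_vcf="all_samples.vcf.gz"      # bgzip-compressed, indexed multi-sample VCF
keep_list="keep_samples.txt"     # one sample name per line
out_vcf="subset.vcf.gz"

# Write the list of samples to keep (placeholder names).
printf '%s\n' sample1 sample2 sample3 > "$keep_list"

if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    # Keep only the listed samples, genome-wide, as compressed VCF.
    bcftools view -S "$keep_list" -Oz -o "$out_vcf" "$in_vcf"
    bcftools index -t "$out_vcf"

    # List the sample names that ended up in the output.
    bcftools query -l "$out_vcf"

    # Same subset, restricted to one region (needs an index on the input).
    bcftools view -S "$keep_list" -r chr1:1-1000000 -Oz -o subset.chr1.vcf.gz "$in_vcf"
fi
```

If a name in the list is not present in the VCF header, bcftools view stops with an error; the --force-samples option turns that into a warning.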

gringer

In addition to the answer from @gringer, there is a bcftools plugin called split that can do this. It adds the ability to write single-sample VCFs, with a file name specified for each sample.

$ bcftools +split

About: Split VCF by sample, creating single-sample VCFs.

Usage: bcftools +split [Options]
Plugin options:
   -e, --exclude EXPR              exclude sites for which the expression is true (applied on the outputs)
   -i, --include EXPR              include only sites for which the expression is true (applied on the outputs)
   -k, --keep-tags LIST            list of tags to keep. By default all tags are preserved
   -o, --output DIR                write output to the directory DIR
   -O, --output-type b|u|z|v       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
   -r, --regions REGION            restrict to comma-separated list of regions
   -R, --regions-file FILE         restrict to regions listed in a file
   -S, --samples-file FILE         list of samples to keep with second (optional) column for basename of the new file
   -t, --targets REGION            similar to -r but streams rather than index-jumps
   -T, --targets-file FILE         similar to -R but streams rather than index-jumps
Examples:
   # Split a VCF file
   bcftools +split input.bcf -Ob -o dir

   # Exclude sites with missing or hom-ref genotypes
   bcftools +split input.bcf -Ob -o dir -i'GT="alt"'

   # Keep all INFO tags but only GT and PL in FORMAT
   bcftools +split input.bcf -Ob -o dir -k INFO,FMT/GT,PL

   # Keep all FORMAT tags but drop all INFO tags
   bcftools +split input.bcf -Ob -o dir -k FMT

So if you had the following samples file samples.tsv

sample1    sample1
sample2    sample2

You can run it and get the following

$ bcftools +split -S samples.tsv -o outdir in.vcf
$ ls outdir
sample1.vcf sample2.vcf

Without the second column, you would just get a single VCF with the two samples in it (as you would with view)

Michael Hall

You can use the GATK's SelectVariants tool with the -sn flag.

E.g.

gatk SelectVariants -V input.vcf -R reference.fasta -sn Sample_01 -O sample.vcf

You may use the -sn flag several times so as to select several samples, or use it to point to a file containing a sample name on every line.
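As a rough sketch of both variants with the GATK4 launcher: the file and sample names below are placeholders, and the .args extension on the list file is an assumption based on my understanding of how GATK4 reads list-valued arguments from a file, so check it against the documentation for your version. The gatk calls are guarded so nothing runs if gatk or the inputs are missing.

```shell
# Hypothetical file names and sample names; replace with your own.
ref="reference.fasta"
in_vcf="input.vcf"

# Assumption: GATK4 treats the -sn value as a list file only if it ends in .args
printf '%s\n' Sample_01 Sample_02 > keep_samples.args

if command -v gatk >/dev/null 2>&1 && [ -f "$in_vcf" ] && [ -f "$ref" ]; then
    # Repeat -sn once per sample...
    gatk SelectVariants -R "$ref" -V "$in_vcf" -sn Sample_01 -sn Sample_02 -O two_samples.vcf

    # ...or pass the list file to -sn.
    gatk SelectVariants -R "$ref" -V "$in_vcf" -sn keep_samples.args -O listed_samples.vcf
fi
```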

Gx1sptDTDa

From the old VCFtools manual:

https://vcftools.sourceforge.net/man_latest.html

and

https://manpages.debian.org/testing/vcftools/vcf-subset.1.en.html

cat in.vcf | vcf-subset -r -t indels -e -c SAMPLE1 > out.vcf

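Here -c gives the sample column(s) to keep, -e excludes rows that contain no variants, -r replaces excluded types with the reference allele, and -t takes a comma-separated list of variant types to include (here, indels).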

ii4 unsafe
    This is part of vcftools, which is pretty much obsolete. I don't understand why you point to the ubuntu man page when vcftools has the closer-to-source manual. – Ram RS Sep 05 '23 at 15:24