I'm doing some analysis and need to subset a large VCF file (~8 GB gzipped) to the variants that fall within a BED interval and whose rsID is in a given list.
Unfortunately, both of my usual tools for this (SnpSift and bedtools) are taking way too long or failing due to memory issues, both on my local computer and on a remote server.
Do you know of any other options, or have suggestions to speed up this process?
Here are the commands I use:
bedtools intersect -a <myvcf>.vcf.gz -b <myinterval>.bed -wa | \
java -Xmx10g -jar snpSift.jar filter --set <myrsid>.txt "ID in SET[0]"
or
gzcat <myvcf>.vcf.gz | \
java -Xmx10g -jar snpSift.jar intervals <mybed>.bed | \
java -Xmx10g -jar snpSift.jar filter --set <myrsid>.txt "ID in SET[0]"
The bedtools command usually fails for an unknown reason, and SnpSift runs for over 6 hours even when given 10 GB of RAM. My local machine has 8 GB of RAM, but the server has 32 GB.
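To make the goal concrete, the result both pipelines above aim for can be sketched as a single streaming pass over the VCF. This is only a minimal sketch under stated assumptions, not a tested replacement for either tool: `filter_vcf` is a hypothetical helper name, the BED file is assumed to be small enough to check linearly for each record, the rsID list is assumed to hold one ID per line, and the VCF ID column is assumed to hold a single rsID.

```shell
# Hypothetical helper (not part of bedtools or SnpSift).
# Usage: filter_vcf <vcf.gz> <bed> <rsid list>
# Streams the VCF once; only the rsID list and BED intervals are held in memory.
filter_vcf() {
  vcf_gz=$1; bed=$2; rsids=$3
  gzip -dc "$vcf_gz" | awk -F'\t' -v bed="$bed" -v ids="$rsids" '
    BEGIN {
      # Load the rsID list into a lookup table (assumes one ID per line).
      while ((getline line < ids) > 0) keep[line] = 1
      # Load the BED intervals (0-based start, end-exclusive).
      while ((getline line < bed) > 0) {
        split(line, f, "\t")
        n++; chr[n] = f[1]; start[n] = f[2] + 0; stop[n] = f[3] + 0
      }
    }
    # Pass header lines through unchanged.
    /^#/ { print; next }
    # Keep a record only if its ID is listed and its 1-based POS lies in an interval.
    ($3 in keep) {
      for (i = 1; i <= n; i++)
        if ($1 == chr[i] && $2 > start[i] && $2 <= stop[i]) { print; next }
    }'
}
```

It would be called as `filter_vcf <myvcf>.vcf.gz <myinterval>.bed <myrsid>.txt`, with the filtered VCF written to standard output. Checking the rsID first means the interval loop only runs for records that are already candidates.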