
I am trying to read a VCF file using Spark.

Spark 3.0

spark.read.format("com.databricks.vcf").load("vcfFilePath")

Error:

java.lang.ClassNotFoundException: Failed to find data source: com.databricks.vcf. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:674)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: com.databricks.vcf.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:648)
  at scala.util.Try$.apply(Try.scala:213)
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:648)
  at scala.util.Failure.orElse(Try.scala:224)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:648)
  ... 52 more

I have tried this with Spark locally on Ubuntu, and also in a Databricks environment. Can you folks help me with this?

Raptor0009

1 Answer


On Databricks (as Alex mentioned) you have to use the Databricks Genomics Runtime (see the picture below).

[Image: Databricks Genomics Runtime selected in the cluster runtime menu]
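Once a Genomics Runtime cluster is attached, the VCF reader is (as far as I know) already registered under the short name vcf, so no extra package is needed. Using the path from your question:

df = spark.read.format("vcf").load("vcfFilePath")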

If you want to work with VCF files in Spark on your local machine, you have to add the Glow package manually; this package contains the VCF reader. The official Glow documentation describes the required steps in detail.

For PySpark locally, the instructions are something like this:

# Install pyspark
pip install pyspark==3.0.1
# Install Glow
pip install glow.py
# Start PySpark with the Glow Maven package
pyspark --packages io.projectglow:glow-spark3_2.12:0.6.0

In the Python shell:

import glow
glow.register(spark)
df = spark.read.format('vcf').load(path)
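If you prefer to start Spark from a plain Python script rather than the pyspark shell, the same Maven coordinate can also be attached programmatically. A sketch, assuming a fresh session (the app name is just a placeholder):

from pyspark.sql import SparkSession
import glow

# spark.jars.packages only takes effect when the session is first created
spark = (SparkSession.builder
         .appName("vcf-demo")  # placeholder
         .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:0.6.0")
         .getOrCreate())

glow.register(spark)
df = spark.read.format('vcf').load(path)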

To load the example from the PDF document that you mentioned, make sure to replace the spaces with tabs, otherwise you will get a malformed header exception. The VCF format requires the header line and each record to be tab-delimited.
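As a minimal sketch of what a well-formed file looks like, the snippet below writes a tiny sites-only VCF with tab-delimited columns (the record values are adapted from the spec's example) and reads it back:

# '\t'.join guarantees the header and record are tab-delimited
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"])
record = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS", "."])
with open("/tmp/minimal.vcf", "w") as f:
    f.write("##fileformat=VCFv4.2\n")
    f.write(header + "\n")
    f.write(record + "\n")

df = spark.read.format('vcf').load("/tmp/minimal.vcf")
df.printSchema()  # Glow maps the columns to fields like contigName, start, names, ...
df.show()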

Bram
  • I'm using Spark locally, but I took the example file from this path (http://samtools.github.io/hts-specs/VCFv4.2.pdf). I'm getting the below error: Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: unknown column name 'CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002'; it does not match a legal column header name. – Raptor0009 Jan 10 '21 at 03:31
  • Which example do you mean? Did you copy the example from the PDF document? That doesn't work because it contains spaces and not tabs. Columns and fields need to be delimited by tabs. If you replace the spaces by tabs, then the example from the PDF works. E.g. like I did here: https://pastebin.com/Ap7HkvMN – Bram Jan 10 '21 at 07:30
  • Yes @Bram, I copied the example from the PDF. But I corrected it with tabs and it's working now. – Raptor0009 Jan 10 '21 at 08:07