
I'm using Glow and trying to read a VCF from S3. The standard way to do it would be:

# Start pyspark with Glow
./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.1.2 \
  --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec
import glow
spark = glow.register(spark)

# Attempt to read VCF
df = spark.read.format("com.databricks.vcf").load('s3://bucket/temp.vcf')
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
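From what I can tell, that UnsupportedFileSystemException is a Spark/Hadoop problem rather than a Glow one: nothing on the classpath registers a filesystem for the "s3" scheme. My guess (untested, so treat it as a sketch) is that I need the hadoop-aws connector and the s3a:// scheme, something like the following, where the hadoop-aws version is a placeholder that has to match the Hadoop build bundled with Spark, and credentials are assumed to come from the usual environment variables or instance profile:

# Start pyspark with Glow plus the S3A connector (hadoop-aws version is a guess)
./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.1.2,org.apache.hadoop:hadoop-aws:3.3.1 \
  --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

import glow
spark = glow.register(spark)

# note s3a:// instead of s3://
df = spark.read.format("vcf").load('s3a://bucket/temp.vcf')

If someone can confirm whether that is the intended setup, that alone would help.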

So no luck with the plain s3:// path. I know the smart_open package can read from S3, so I give that a try.

from smart_open import open
df = spark.read.format("vcf").load(open('s3://bucket/temp.vcf','r'))
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
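As far as I can tell, load() only accepts a path string (or a list of paths), so handing it a Python file object is what triggers that ClassCastException. The best I can come up with using smart_open is staging the file on local disk and loading the local copy, roughly as below (paths are placeholders, and this only helps on a single node or a shared filesystem), which defeats the point of reading straight from S3:

from smart_open import open as s3_open

# workaround sketch: copy the VCF from S3 to local disk, then load the local path
local_path = '/tmp/temp.vcf'
with s3_open('s3://bucket/temp.vcf', 'rb') as src, open(local_path, 'wb') as dst:
    dst.write(src.read())

df = spark.read.format("vcf").load('file://' + local_path)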

So passing a file handle fails too. Finally, I try boto3, as suggested by this post, but no luck.

import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')      # bucket name only, no s3:// prefix
obj = bucket.Object(key='any.vcf')
response = obj.get()                # returns a dict (metadata plus a StreamingBody)
df = spark.read.format("vcf").load(response)
py4j.protocol.Py4JJavaError: An error occurred while calling o182.load.
: java.lang.ClassCastException: java.util.HashMap cannot be cast to java.lang.String
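If I'm reading the error right, obj.get() returns a plain dict (hence the HashMap in the exception), so it can't be handed to load() either. The only boto3 route I can see is again downloading to a local file first, along these lines (file names are placeholders):

import boto3

# workaround sketch: download the object, then read the local copy
s3 = boto3.resource('s3')
s3.Bucket('mybucket').download_file('any.vcf', '/tmp/any.vcf')

df = spark.read.format("vcf").load('file:///tmp/any.vcf')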

Also, my situation is similar to this question: Can't read a VCF file through spark, except that I also need to read from S3, which is why the solutions given there don't work for me.

Anyway, has anyone found a way to read VCFs from S3 with spark.read? Thanks for any help.

Jeff
