I have multiple zip files, each containing two types of files (A.csv and B.csv):

/data/jan.zip --> contains A.csv & B.csv
/data/feb.zip --> contains A.csv & B.csv

I want to read the contents of every A.csv file inside all of the zip files using PySpark. This is what I have so far:

 textFile = sc.textFile("hdfs://<HDFS loc>/data/*.zip")

Can someone tell me how to get the contents of A.csv files into an RDD?

zero323
Munesh

1 Answer

Spark cannot read zip archives directly, so read each zip file as a whole and extract the CSV files it contains yourself:

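import org.apache.spark.input.PortableDataStream

// binaryFiles yields one (path, stream) pair per zip file
val files = sc.binaryFiles("file:///path/to/files/*.zip")
files.flatMap({case (name, content) =>
  unzip(content)
})

// Takes the raw archive stream and returns the lines of the wanted CSV entries
def unzip(content: PortableDataStream): List[String] = {
  ...
}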
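
Since the question asks for PySpark, the same idea might look roughly like the sketch below. It is only an illustration under some assumptions: the helper names `read_member_lines` and `a_csv_rdd` are made up for this example, the CSV files are assumed to be UTF-8 encoded, and each zip file is assumed to be small enough to fit in memory on a single executor, because `binaryFiles` hands over every archive as one block of bytes.

```python
import io
import zipfile


def read_member_lines(zip_bytes, member_name="A.csv"):
    """Return the text lines of every archive entry whose base name is member_name."""
    lines = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for entry in archive.namelist():
            # Compare base names so that entries stored in a subfolder also match.
            if entry.rsplit("/", 1)[-1] == member_name:
                # Assumption: the CSV files are UTF-8 encoded.
                lines.extend(archive.read(entry).decode("utf-8").splitlines())
    return lines


def a_csv_rdd(sc, pattern):
    """Build an RDD of A.csv lines from all zip files matching pattern."""
    # In PySpark, binaryFiles yields one (path, bytes) pair per file.
    return sc.binaryFiles(pattern).flatMap(lambda kv: read_member_lines(kv[1]))
```

With an existing SparkContext `sc`, this would be called as `a_csv_rdd(sc, "hdfs://<HDFS loc>/data/*.zip")`. The header line of each A.csv ends up in the RDD as well, so it may need to be filtered out afterwards.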
Ramineni Ravi Teja