49

I have a directory of directories on HDFS, and I want to iterate over the directories. Is there any easy way to do this with Spark using the SparkContext object?

maasg
Jon
  • you mean 'iterate' like get the list of sub-directories and files within? or getting all files across all subdirectories? – maasg Nov 19 '14 at 19:23
  • Iterate as in list all the sub-directories. Each subdirectory contains a bunch of text files that I want to process in different ways. – Jon Nov 19 '14 at 19:27

9 Answers

56

You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true)

And with Spark...

FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)

Edit

It's worth noting that good practice is to get the FileSystem that is associated with the Path's scheme.

path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
drew moore
Mike Park
  • really nice! [I had this question](http://stackoverflow.com/questions/34738296/spark-spark-submit-jars-arguments-wants-comma-list-how-to-declare-a-directory/35550151#35550151), granted, I guess this wouldn't work in the original spark-submit call – JimLohse Feb 23 '16 at 13:46
  • How can I create a list of the files using the RemoteIterator this creates? – horatio1701d Jan 27 '18 at 13:58
45

Here's a PySpark version, if anyone is interested:

    hadoop = sc._jvm.org.apache.hadoop

    fs = hadoop.fs.FileSystem
    conf = hadoop.conf.Configuration() 
    path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')

    for f in fs.get(conf).listStatus(path):
        print(f.getPath(), f.getLen())

In this particular case I get the list of all files that make up the disc_mrt.unified_fact Hive table.

Other methods of the FileStatus object, such as getLen() to get the file size, are described in the Hadoop API documentation for Class FileStatus.
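Since the question asks for the sub-directories rather than the files, the same listStatus call can be applied recursively. Below is a minimal sketch of that idea, not a tested production routine. It assumes only that `fs` offers listStatus(path) and that each status offers isDirectory() and getPath(), as Hadoop's FileStatus does. The names `walk_directories`, `FakeStatus`, `FakeFileSystem` and the sample tree are hypothetical stand-ins so that the example runs without a cluster.

```python
def walk_directories(fs, root):
    """Yield every sub-directory under root, depth-first.

    fs is expected to offer listStatus(path); each returned status is
    expected to offer isDirectory() and getPath().
    """
    for status in fs.listStatus(root):
        if status.isDirectory():
            yield status.getPath()
            yield from walk_directories(fs, status.getPath())


# --- Hypothetical in-memory stand-ins (not part of Hadoop or PySpark) ---
class FakeStatus:
    def __init__(self, path, is_dir):
        self._path, self._is_dir = path, is_dir

    def getPath(self):
        return self._path

    def isDirectory(self):
        return self._is_dir


class FakeFileSystem:
    def __init__(self, tree):
        self._tree = tree  # maps a directory path to its direct children

    def listStatus(self, path):
        # A child counts as a directory if it has its own entry in the tree.
        return [FakeStatus(p, p in self._tree) for p in self._tree.get(path, [])]


tree = {
    "/data": ["/data/2014", "/data/readme.txt"],
    "/data/2014": ["/data/2014/11", "/data/2014/part-0.txt"],
    "/data/2014/11": [],
}
dirs = list(walk_directories(FakeFileSystem(tree), "/data"))
print(dirs)
```

With PySpark, the intent is that `fs` would be the object returned by fs.get(conf) and `root` the hadoop.fs.Path built in the snippet above; that wiring is an assumption and is not exercised here.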

Tagar
21
import org.apache.hadoop.fs.{FileSystem, Path}

FileSystem.get(sc.hadoopConfiguration).listStatus(new Path("hdfs:///tmp")).foreach(x => println(x.getPath))

This worked for me.

Spark version 1.5.0-cdh5.5.2

ozw1z5rd
  • This worked fine for me, for a single folder. Is there some way to get this to run at the level of the parent folder, and get all files in all subfolders? That would be VERY helpful/useful for me. – ASH Jun 28 '19 at 15:03
2

This did the job for me:

FileSystem.get(new URI("hdfs://HAservice:9000"), sc.hadoopConfiguration).listStatus( new Path("/tmp/")).foreach( x => println(x.getPath ))
Vincent Claes
2

@Tagar didn't say how to connect to a remote HDFS, but this answer did:

URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration


fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())

status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))

for fileStatus in status:
    print(fileStatus.getPath())
Mithril
1

Scala FileSystem (Apache Hadoop Main 3.2.1 API)

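    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.ListBuffer

    // Build a FileSystem bound to the target cluster
    val fileSystem: FileSystem = {
        val conf = new Configuration()
        conf.set("fs.defaultFS", "hdfs://to_file_path")
        FileSystem.get(conf)
    }

    // `path` is the directory to list (defined elsewhere); `false` means no recursion
    val files = fileSystem.listFiles(new Path(path), false)
    val filenames = ListBuffer[String]()
    // listFiles returns a RemoteIterator, so drain it manually
    while (files.hasNext) filenames += files.next().getPath().toString()
    filenames.foreach(println(_))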
Bryce
oetzi
1

I had some issues with the other answers (such as 'JavaObject' object is not iterable), but this code works for me:

# spark_context is an existing SparkContext; path is the directory to list
hadoop_fs = spark_context._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(spark_context._jsc.hadoopConfiguration())
i = fs.listFiles(hadoop_fs.Path(path), False)  # False: do not recurse
while i.hasNext():
  f = i.next()
  print(f.getPath())
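The hasNext()/next() loop above is also the way around the 'JavaObject' object is not iterable error, and it answers the earlier comment asking how to turn the RemoteIterator into a list. It can be wrapped in a small generator so the result behaves like any Python iterable. This is only a sketch: `drain_remote_iterator` and `FakeRemoteIterator` are hypothetical names, and the fake class stands in for the real iterator so the example runs without Spark.

```python
def drain_remote_iterator(remote_iter):
    """Yield each element of a Java-style iterator exposing hasNext()/next()."""
    while remote_iter.hasNext():
        yield remote_iter.next()


class FakeRemoteIterator:
    """Hypothetical stand-in for the iterator returned by fs.listFiles(...)."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def hasNext(self):
        return self._pos < len(self._items)

    def next(self):
        item = self._items[self._pos]
        self._pos += 1
        return item


paths = list(drain_remote_iterator(FakeRemoteIterator(["/tmp/a.txt", "/tmp/b.txt"])))
print(paths)
```

With a real cluster, the idea would be to pass the object returned by fs.listFiles(...) instead of the fake and call getPath() on each element.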
Hodza
0

You can also try globStatus:

import java.net.URI

// `url` is the HDFS location to match (defined elsewhere); globStatus also accepts glob patterns
val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration).globStatus(new org.apache.hadoop.fs.Path(url))

for (urlStatus <- listStatus) {
  println("urlStatus get Path:" + urlStatus.getPath())
}
Jaap
Nitin
0

You can use the code below to walk down from a parent HDFS directory level by level, returning only the sub-directories found at the third level. This is useful if you need to list all directories created by partitioning the data (in the code below, three columns were used for partitioning):

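import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)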

def rememberDirectories(fs: FileSystem, path: List[Path]): List[Path] = {
  val buff = new ListBuffer[LocatedFileStatus]()

  path.foreach(p => {
    val iter = fs.listLocatedStatus(p)
    while (iter.hasNext()) buff += iter.next()
  })

  buff.toList.filter(p => p.isDirectory).map(_.getPath)
}

@tailrec
def getRelevantDirs(fs: FileSystem, p: List[Path], counter: Int = 1): List[Path] = {
  val levelList = rememberDirectories(fs, p)
  if(counter == 3) levelList
  else getRelevantDirs(fs, levelList, counter + 1)
}
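For PySpark users, the same level-by-level descent might be sketched as follows. This is an illustration under assumptions, not a port verified against a cluster: it uses listStatus instead of listLocatedStatus, takes the depth as a parameter, and `directories_at_depth`, `FakeStatus`, `FakeFileSystem` and the sample partition layout are all hypothetical names and data.

```python
def directories_at_depth(fs, roots, depth):
    """Return the directories found exactly `depth` levels below `roots`."""
    level = list(roots)
    for _ in range(depth):
        # Replace the current level with the sub-directories of each entry.
        level = [
            status.getPath()
            for parent in level
            for status in fs.listStatus(parent)
            if status.isDirectory()
        ]
    return level


# --- Hypothetical in-memory stand-ins (not part of Hadoop or PySpark) ---
class FakeStatus:
    def __init__(self, path, is_dir):
        self._path, self._is_dir = path, is_dir

    def getPath(self):
        return self._path

    def isDirectory(self):
        return self._is_dir


class FakeFileSystem:
    def __init__(self, tree):
        self._tree = tree  # maps a directory path to its direct children

    def listStatus(self, path):
        return [FakeStatus(p, p in self._tree) for p in self._tree.get(path, [])]


tree = {
    "/t": ["/t/y=2014", "/t/_SUCCESS"],
    "/t/y=2014": ["/t/y=2014/m=11"],
    "/t/y=2014/m=11": ["/t/y=2014/m=11/d=19", "/t/y=2014/m=11/d=20"],
    "/t/y=2014/m=11/d=19": ["/t/y=2014/m=11/d=19/part-0"],
    "/t/y=2014/m=11/d=20": [],
}
partitions = directories_at_depth(FakeFileSystem(tree), ["/t"], 3)
print(partitions)
```

Keeping the depth as an argument avoids hard-coding the number of partition columns.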
Michael Heil