--archives, --files, --py-files, sc.addFile, and sc.addPyFile are quite confusing. Can someone explain them clearly?
The last two are explicitly from the SparkContext object, and the first 3 are from the terminal (though I can't seem to find a reference to archives in the submitting applications documentation) – OneCricketeer Jun 28 '16 at 03:04
1 Answer
These options are truly scattered all over the place.
In general, add your data files via --files or --archives and your code files via --py-files. The latter are added to the Python search path (cf. here), so you can import and use them.
As you can imagine, the CLI arguments are actually handled by the addFile and addPyFile functions (cf. here).
Behind the scenes, pyspark invokes the more general spark-submit script. You can add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
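To see why a .zip passed to --py-files becomes importable, here is a minimal local sketch of the underlying Python mechanism: an archive placed on sys.path can be imported from directly. This is only an analogy run in a single process, not Spark's own code, and the names deps.zip, shipped_mod and double are made up for illustration.

```python
import importlib
import os
import sys
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "deps.zip")  # hypothetical archive name

# Build a zip holding one module, as you might before passing it to --py-files.
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("shipped_mod.py", "def double(x):\n    return 2 * x\n")

# Putting the archive itself on sys.path makes its modules importable.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()
shipped_mod = importlib.import_module("shipped_mod")

result = shipped_mod.double(21)
print(result)  # 42
```

The same idea applies to a single .py file, except that its containing directory is what ends up on the path.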
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
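The path#alias convention described above can be illustrated with a small helper. This is a sketch of the naming rule only, not Spark's or YARN's actual parser; split_alias is a hypothetical function, and falling back to the file's base name when no # is given is an assumption about the default.

```python
import os


def split_alias(spec):
    """Split 'path#alias' into (path to upload, name the application sees)."""
    path, _, alias = spec.partition("#")
    # Assumption: without an explicit alias, the file keeps its own base name.
    return path, alias or os.path.basename(path)


print(split_alias("localtest.txt#appSees.txt"))  # ('localtest.txt', 'appSees.txt')
print(split_alias("/tmp/data.csv"))              # ('/tmp/data.csv', 'data.csv')
```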
addFile(path): Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
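As a rough sketch of how the two SparkContext methods fit together from code, the function below ships one data file with addFile and one module with addPyFile, then uses both inside a task. It assumes PySpark is installed and a local master is acceptable; the names ship_and_read, lookup.txt, shipped_helper and shout are invented for the example. SparkFiles.get is used to find the local copy of a file added with addFile.

```python
import os
import tempfile


def ship_and_read(master="local[2]"):
    """Ship a data file and a helper module, then use both inside tasks."""
    # Imported lazily so the sketch can be read without PySpark installed.
    from pyspark import SparkContext, SparkFiles

    workdir = tempfile.mkdtemp()
    data_path = os.path.join(workdir, "lookup.txt")         # hypothetical data file
    code_path = os.path.join(workdir, "shipped_helper.py")  # hypothetical module
    with open(data_path, "w") as f:
        f.write("hello from the driver\n")
    with open(code_path, "w") as f:
        f.write("def shout(s):\n    return s.strip().upper()\n")

    sc = SparkContext(master, "addfile-sketch")
    try:
        sc.addFile(data_path)    # data: located later through SparkFiles.get
        sc.addPyFile(code_path)  # code: importable inside tasks

        def task(_):
            import shipped_helper  # resolved from the file added above
            with open(SparkFiles.get("lookup.txt")) as fh:
                return shipped_helper.shout(fh.readline())

        return sc.parallelize(range(2), 2).map(task).collect()
    finally:
        sc.stop()
```

Calling ship_and_read() on a machine with PySpark should return one upper-cased line per partition. The equivalent command-line form would pass the data file to --files and the module to --py-files.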
Why is there no `addArchives(path)` function? How can I add archives from code? – xiaobing Jan 15 '19 at 03:48
Very nice answer, sir. It might be more complete with a SparkFiles.get reference (https://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.SparkFiles), since it is not very clear how it relates to the other things you described or how it should properly be used alongside them. – ciurlaro Apr 02 '21 at 10:39