I am using a Spark Databricks cluster and want to add a customized Spark configuration.
There is Databricks documentation on this, but I am not getting any clue about what changes I should make and how. Can someone please share an example of how to configure the Databricks cluster?
Is there any way to see the default Spark configuration for a Databricks cluster?
Stark
1 Answer
To fine-tune Spark jobs, you can provide custom Spark configuration properties in the cluster configuration:
- On the cluster configuration page, click the Advanced Options toggle.
- Click the Spark tab.
Alternatively, when you configure a cluster using the Clusters API, set Spark properties in the spark_conf field of the Create cluster request or Edit cluster request, as sketched below.
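For illustration only, a minimal Create cluster request body might look like the sketch below; spark_conf is the field the Clusters API expects, while the cluster name, Spark version, node type, worker count, and the specific property values are placeholder assumptions.

{
  "cluster_name": "custom-conf-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "spark_conf": {
    "spark.executor.memory": "4g",
    "spark.sql.sources.partitionOverwriteMode": "DYNAMIC"
  }
}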
To set Spark properties for all clusters, create a global init script:
%scala
// Write a global init script to DBFS; at cluster startup it creates a
// custom Spark driver config file that sets the desired property.
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh", """
|#!/bin/bash
|
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
|[driver] {
|  "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
|}
|EOF
""".stripMargin, true)
Reference: Databricks - Spark Configuration
Example: You can pick any Spark configuration you want to test. Here I want to specify "spark.executor.memory 4g", and the custom configuration looks like this.
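In the Spark config box (under Advanced Options > Spark), properties are entered one per line as space-separated key/value pairs; a sketch, where the second line is just an extra illustrative property:

spark.executor.memory 4g
spark.sql.sources.partitionOverwriteMode DYNAMIC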
After the cluster is created, you can check the result of the custom configuration.
CHEEKATLAPRADEEP-MSFT
- This is what I mentioned in the question: "There is Databricks documentation". But I wanted to know what Spark configuration to add and how. – Stark Nov 04 '19 at 07:43
- Hey @Stark, you can check out the example provided in the answer. Do let me know if any help is needed. – CHEEKATLAPRADEEP-MSFT Nov 04 '19 at 08:20
- I am facing an OOM issue, so I thought I should make some changes in the cluster config. The OOM comes after executing the Spark job 10 or more times. I am running the pipeline on the same data, but sometimes it fails. https://stackoverflow.com/questions/58640218/databricks-spark-java-lang-outofmemoryerror-gc-overhead-limit-exceeded-i – Stark Nov 04 '19 at 09:00
- Any idea what exactly I should do to resolve the issue? – Stark Nov 04 '19 at 09:01
- Hi @Stark, have you tried the above example "spark.executor.memory 4g" and executed the Spark job? – CHEEKATLAPRADEEP-MSFT Nov 04 '19 at 09:06
- I can see 10.8 GB for each executor, so I did not update it. – Stark Nov 04 '19 at 09:14
- Is there any way to see the default Spark configuration in the Databricks cluster? – Stark Nov 04 '19 at 09:25
- Run this Scala code in a notebook to get all the default Spark configuration: val configMap = spark.conf.getAll, then configMap.foreach(println) (see the sketch after these comments). – CHEEKATLAPRADEEP-MSFT Nov 04 '19 at 09:48
- Should I execute this directly in the Databricks notebook? – Stark Nov 04 '19 at 09:50
- Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/201824/discussion-between-cheekatlapradeep-msft-and-stark). – CHEEKATLAPRADEEP-MSFT Nov 04 '19 at 09:52
- @CHEEKATLAPRADEEP-MSFT Sir, can you please advise me on this problem? https://stackoverflow.com/questions/62094327/how-to-use-databricks-job-spark-configuration-spark-conf?noredirect=1#comment109830651_62094327 – Pavan_Obj May 30 '20 at 21:45
- The OOM issue might be because one of the executors is getting heavily loaded due to data skewness (worth checking for hotspots in the worker nodes). May I know which Spark version you are using in Databricks? – Anish Sarangi Oct 21 '21 at 17:37
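A minimal Scala sketch of the command suggested in the comments above, for running in a Databricks notebook; it lists the cluster's effective Spark configuration (the sorting and formatting are just for readability):

%scala
// Print every Spark configuration property visible to this session,
// sorted by key, as "key = value" lines.
val configMap = spark.conf.getAll
configMap.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }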