While trying to get a small proof-of-concept Python application running on Apache Spark, I ran into the following problem:
Following https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html, I tried to ship all packages my application needs (including pandas) as a packed conda environment. However, I still get this error:
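One sanity check I can suggest documenting here (a hypothetical helper, not part of the tutorial) is to confirm that pandas actually ended up inside pyspark_conda_env.tar.gz before submitting:

```python
import tarfile

def archive_has_package(archive_path, package):
    """Return True if the packed conda env contains site-packages/<package>."""
    needle = f"site-packages/{package}"
    with tarfile.open(archive_path, "r:gz") as tar:
        return any(needle in name for name in tar.getnames())

# Example (assuming the archive sits in the current directory):
#   archive_has_package("pyspark_conda_env.tar.gz", "pandas")
```

If this returns False, the problem is in how the environment was packed rather than in Spark.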
Traceback (most recent call last):
  File "/etc/python/spark.quickpoc.py", line 51, in <module>
    main(spark)
  File "/etc/python/spark.quickpoc.py", line 7, in main
    import pandas
ModuleNotFoundError: No module named 'pandas'
I use the following shell script (as shown in the tutorial):
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment spark.quickpoc.py
and import pandas as usual:
def main(spark):
    import json
    import pandas
    ...
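To narrow the problem down, an assumed debugging step (not something I have in the script yet) would be to log which interpreter the driver and the executors actually use, to see whether PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON take effect:

```python
import sys

# Which interpreter runs the driver? If PYSPARK_DRIVER_PYTHON took effect,
# this shows the interpreter that executes main(spark).
print("driver python:", sys.executable)

def worker_python(_):
    """Return the interpreter path as seen inside an executor task."""
    import sys
    return sys.executable

# With the existing SparkSession from main(spark), the executor side could be
# checked like this (left commented out, since it needs a running cluster):
#   print("executor python:",
#         spark.sparkContext.parallelize([0], 1).map(worker_python).collect())
```

If the driver path is not ./environment/bin/python, the import of pandas fails before the executors are even involved.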
Apache Spark itself runs in Docker Desktop containers, defined by the following docker-compose.yml:
version: "3.7"
services:
  spark:
    image: docker.io/bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    volumes:
      - .spark/quick_poc/spark.quickpoc.py:/etc/python/spark.quickpoc.py
      - .spark/pyspark_conda_env.tar.gz:/etc/python/pyspark_conda_env.tar.gz
      - .spark/quick_poc/submit_app.sh:/etc/python/submit_app.sh
  spark-worker:
    image: docker.io/bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
Thank you for your help,
Rumo