While trying to get a small proof-of-concept Python application running on Apache Spark, I ran into the following problem:
Following https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html, I tried to ship all packages my application needs (including pandas) as a packed conda environment. However, I still get this error:
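One sanity check I can suggest documenting here (a hypothetical helper, not part of the tutorial) is to confirm that pandas actually ended up inside pyspark_conda_env.tar.gz before submitting:

```python
import tarfile

def archive_has_package(archive_path, package):
    """Return True if the packed conda env contains site-packages/<package>."""
    needle = f"site-packages/{package}"
    with tarfile.open(archive_path, "r:gz") as tar:
        return any(needle in name for name in tar.getnames())

# Example (assuming the archive sits in the current directory):
#   archive_has_package("pyspark_conda_env.tar.gz", "pandas")
```

If this returns False, the problem is in how the environment was packed rather than in Spark.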
Traceback (most recent call last):
  File "/etc/python/spark.quickpoc.py", line 51, in <module>
    main(spark)
  File "/etc/python/spark.quickpoc.py", line 7, in main
    import pandas
ModuleNotFoundError: No module named 'pandas'
I use the following shell script (as shown in the tutorial):
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment spark.quickpoc.py
and import pandas as usual:
def main(spark):
    import json
    import pandas
    ...
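To narrow the problem down, an assumed debugging step (not something I have in the script yet) would be to log which interpreter the driver and the executors actually use, to see whether PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON take effect:

```python
import sys

# Which interpreter runs the driver? If PYSPARK_DRIVER_PYTHON took effect,
# this shows the interpreter that executes main(spark).
print("driver python:", sys.executable)

def worker_python(_):
    """Return the interpreter path as seen inside an executor task."""
    import sys
    return sys.executable

# With the existing SparkSession from main(spark), the executor side could be
# checked like this (left commented out, since it needs a running cluster):
#   print("executor python:",
#         spark.sparkContext.parallelize([0], 1).map(worker_python).collect())
```

If the driver path is not ./environment/bin/python, the import of pandas fails before the executors are even involved.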
Apache Spark itself runs in Docker Desktop containers, defined by the following docker-compose.yml:
version: "3.7"
services:
  spark:
    image: docker.io/bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    volumes:
      - .spark/quick_poc/spark.quickpoc.py:/etc/python/spark.quickpoc.py
      - .spark/pyspark_conda_env.tar.gz:/etc/python/pyspark_conda_env.tar.gz
      - .spark/quick_poc/submit_app.sh:/etc/python/submit_app.sh
  spark-worker:
    image: docker.io/bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
Thank you for your help,
Rumo