
I tried to import another Python file into my current PySpark program using SparkContext, but it raised an error saying that multiple SparkContexts cannot run at once. So I am now using a SparkSession to import my Python file instead. My code is:

from pyspark.sql import SparkSession
import os

spark = SparkSession.builder.appName('Recommendation_system').getOrCreate()
txt = spark.addFile('engine.py')
dataset_path = os.path.join('Musical_Instruments_5.json')
app = create_app(txt, dataset_path)

I am getting the following error:

AttributeError: 'SparkSession' object has no attribute 'addFile'

What is the correct way to import a Python file using SparkSession?

Neha patel

2 Answers


You should use the 'addFile' method of the class:

  pyspark.SparkContext

API reference
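Since a SparkSession exposes its underlying context through the `sparkContext` attribute, this can be wrapped in a small helper. A minimal sketch, assuming the `engine.py` filename from the question; `import_via_spark` and `module_name` are hypothetical helper names, not part of any Spark API:

```python
import os
import importlib.util


def module_name(pyfile):
    # 'engine.py' -> 'engine'
    return os.path.splitext(os.path.basename(pyfile))[0]


def import_via_spark(spark, pyfile):
    """Distribute pyfile with SparkContext.addFile and import it
    as a module on the driver. 'spark' is a SparkSession."""
    # Imported lazily so the helper only needs pyspark when called
    from pyspark import SparkFiles

    # addFile lives on the SparkContext, reached through the session
    spark.sparkContext.addFile(pyfile)

    # SparkFiles.get resolves the local copy of the distributed file
    local_path = SparkFiles.get(os.path.basename(pyfile))

    # Load the module from that path
    spec = importlib.util.spec_from_file_location(module_name(pyfile), local_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

With this, the code from the question would become `engine = import_via_spark(spark, 'engine.py')` followed by `app = create_app(engine, dataset_path)`.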

computatma

The answer to this question may depend on whether Spark is running in client or cluster mode, as mentioned in this SO answer. For PySpark, an optimal solution is to add the --py-files flag to the environment variable PYSPARK_SUBMIT_ARGS, which works in either case. This can be done by pointing it to your file as follows:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--py-files "/path/to/file/engine.py" pyspark-shell'

You can even specify the path to a .zip containing multiple files, as mentioned in the official Spark documentation here.
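One subtlety worth noting: the variable must be set before the first SparkSession (and hence the JVM) is created, because the submit args are read at startup. A sketch of the ordering, reusing the placeholder path from above:

```python
import os

# Must happen BEFORE the first SparkSession/SparkContext is created,
# since PYSPARK_SUBMIT_ARGS is read when the JVM starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--py-files "/path/to/file/engine.py" pyspark-shell'

# Only now create the session (requires pyspark to be installed):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName('Recommendation_system').getOrCreate()
```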

That works when using PySpark from a notebook environment, for example. A more general solution is to set the option spark.submit.pyFiles in the Spark config file spark-defaults.conf; this also works when running your job with spark-submit from the command line. Check the Spark configuration options here for more information.
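For example, assuming Spark's default config location ($SPARK_HOME/conf/spark-defaults.conf) and the placeholder path from above, the entry could look like:

```
spark.submit.pyFiles  /path/to/file/engine.py
```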