
I'm trying to run a script in the pyspark environment, but so far I haven't been able to.

How can I run a script like python script.py but in pyspark?

Daniel Rodríguez

6 Answers


You can do: ./bin/spark-submit mypythonfile.py

Running Python applications through the pyspark shell is not supported as of Spark 2.0.
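
For reference, a minimal sketch of what mypythonfile.py could contain (the app name and the toy job are placeholders); spark-submit does not predefine sc or spark, so the script has to create its own session:

from pyspark.sql import SparkSession

# Create a session; spark-submit supplies the master and deploy settings.
spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext  # classic SparkContext, for RDD-style code

# A tiny job to confirm the submission works.
rdd = sc.parallelize(range(100))
print(rdd.count())

spark.stop()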

Ulas Keles

  • Thanks for the answer, can you tell me how to do it in Windows? – Daniel Rodríguez Oct 13 '16 at 19:17
  • @DanielRodríguez It should be the same. The Spark folder you downloaded should have a `spark-submit` file. – OneCricketeer Oct 13 '16 at 19:25
  • It tells me that 'sc' is not defined, and when I run spark-submit after opening pyspark it throws an invalid syntax error. – Daniel Rodríguez Oct 13 '16 at 19:49
  • It sounds like you haven't initialized an 'sc' variable with SparkContext(). Take a look at this page if you haven't already done so: https://spark.apache.org/docs/0.9.0/python-programming-guide.html. It's hard to tell what you might be doing wrong without seeing your code. – Ulas Keles Oct 13 '16 at 20:27

pyspark 2.0 and later execute the script file named in the PYTHONSTARTUP environment variable, so you can run:

PYTHONSTARTUP=code.py pyspark

Compared to the spark-submit answer, this is useful for running initialization code before the interactive pyspark shell starts.
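
As an illustration, code.py might hold whatever setup you want in every session (the names below are made up); the shell's built-in spark and sc objects should already be available when the startup file runs:

# code.py - executed by the pyspark shell before the prompt appears
from pyspark.sql import functions as F  # common alias for interactive work

# A small demo DataFrame to experiment with; 'df' is a placeholder name.
df = spark.range(1000)
print('Startup done, demo DataFrame df has', df.count(), 'rows')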

Jussi Kujala

Just spark-submit mypythonfile.py should be enough.

Selva

You can execute "script.py" as follows:

pyspark < script.py

or

# if you want to run pyspark in yarn cluster
pyspark --master yarn < script.py
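
Since piping into pyspark feeds the lines to the interactive shell, the predefined sc and spark objects can be used directly; a minimal sketch of such a script.py (the job itself is a placeholder):

# script.py - run line by line inside the pyspark shell,
# where sc and spark are already defined
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())  # prints 10
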
Arun Annamalai

The existing answers are right (that is, use spark-submit), but some of us might want to just get started with a SparkSession object, as you would have inside the pyspark shell.

So, in the PySpark script to be run, first add:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('yarn') \
    .appName('pythonSpark') \
    .enableHiveSupport() \
    .getOrCreate()

Then use spark.conf.set('conf_name', 'conf_value') to adjust runtime configuration, as sketched below; note that resource settings such as executor cores and memory have to be fixed before the session starts.
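
A short sketch of both styles (the property names are standard Spark settings; the values are only illustrative):

# Runtime options can be changed on a live session:
spark.conf.set('spark.sql.shuffle.partitions', '50')

# Resource settings have to be set before the session exists,
# e.g. on the builder:
spark = SparkSession.builder \
    .master('yarn') \
    .appName('pythonSpark') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .enableHiveSupport() \
    .getOrCreate()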

Ani Menon

Spark provides a command to execute an application file, whether it is written in Scala or Java (packaged as a JAR), Python, or R. The command is:

$ spark-submit --master <url> <SCRIPTNAME>.py
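
For example, to run locally with four worker threads (the script name here is a placeholder):

$ spark-submit --master local[4] script.py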

I'm running Spark on a 64-bit Windows system with JDK 1.8.
