
I want to access the values of a particular column from a dataset that I've read from a CSV file. The data is stored in a PySpark RDD, which I want to convert into a DataFrame. I am using the code below:

from pyspark.sql import SQLContext
sqlc=SQLContext(sc)
df=sc.textFile(r'D:\Home\train.csv')
df=sqlc.createDataFrame(df)

but it throws the following error:

Can not infer schema for type: <class 'str'>

The first 2 rows of df are:

['"id","product_uid","product_title","search_term","relevance"',
 '2,100001,"Simpson Strong-Tie 12-Gauge Angle","angle bracket",3']

I think the first row is causing this problem. Moreover, I want to create a DataFrame that stores the values from the 2nd row to the last (not the first row, because it will be the header). How can I achieve this? I've searched for it but could not find any solution. Thanks in advance.
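For reference, the error message seems to say that each element of the RDD is a plain str, which createDataFrame cannot infer a schema from; it apparently wants tuples or Row objects instead. A minimal sketch of the difference, reusing values from the sample rows above:

rdd = sc.parallelize([(2, 100001, "Simpson Strong-Tie 12-Gauge Angle")])
df = sqlc.createDataFrame(rdd, ["id", "product_uid", "product_title"])  # rows are tuples, so the schema can be inferred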


1 Answer


To read a CSV file into a Spark DataFrame you should use spark-csv: https://github.com/databricks/spark-csv

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
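Because header='true' consumes the first row as column names, this also takes care of skipping the header. You can then read a particular column straight off the DataFrame; a small usage sketch, with column names taken from the question's sample data:

df.select('product_title').show()
values = df.select('relevance').rdd.map(lambda r: r[0]).collect()  # raw values of one column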

How to use spark-csv if you are using pyspark directly from the terminal: instead of calling

$SPARK_HOME/bin/pyspark

you have to use

$SPARK_HOME/bin/pyspark --packages com.databricks:spark-csv_2.11:1.4.0

and then use the code above.
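If you would rather not pull in the extra package, you can also skip the header and parse the lines yourself on the plain RDD. Below is a minimal sketch, assuming the file path from the question and using Python's csv module so the quoted fields survive; note that every column comes out as a string:

import csv

lines = sc.textFile(r'D:\Home\train.csv')
header = lines.first()                       # first row is the header
data = lines.filter(lambda l: l != header)   # keep rows 2..last

def parse(line):
    # parse one CSV line, respecting the quoted fields
    return next(csv.reader([line]))

columns = next(csv.reader([header]))
df = sqlContext.createDataFrame(data.map(parse), columns)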

If you are using ipython + findspark, you'll have to modify your PYSPARK_SUBMIT_ARGS (before starting ipython)

export PYSPARK_SUBMIT_ARGS="--master local[4] --packages com.databricks:spark-csv_2.11:1.4.0 pyspark-shell"
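and then initialise Spark from inside ipython; a rough sketch, assuming findspark is installed:

import findspark
findspark.init()  # locate SPARK_HOME and add pyspark to sys.path

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()          # picks up the --packages from PYSPARK_SUBMIT_ARGS
sqlContext = SQLContext(sc)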
  • Can you tell me how I can use them with pyspark on Windows? I am new to pyspark, btw. – Ishan Aug 06 '16 at 06:36
  • It's not really much different on Windows. The arguments to pyspark are still the same; you'll just have a slightly different way of setting the suggested environment variable. Possibly check this question for more, or post a separate question about running pyspark under Windows. – Brian Cline Aug 07 '16 at 23:02
  • I am using iPython with Spark. Do I have to create an environment variable PYSPARK_SUBMIT_ARGS? And whenever I start pyspark using the command pyspark --packages com.databricks:spark-csv_2.11:1.4.0 and then use 'sc', it shows spark is not defined. But normally when I start pyspark, it does not show any error regarding 'sc'. – Ishan Aug 08 '16 at 04:26