
I want to access the values of a particular column from a dataset that I've read from a CSV file. The data is stored in a PySpark RDD, which I want to convert into a DataFrame. I am using the code below:

from pyspark.sql import SQLContext
sqlc=SQLContext(sc)
df=sc.textFile(r'D:\Home\train.csv')
df=sqlc.createDataFrame(df)

but it throws the following error:

Can not infer schema for type: <class 'str'>

The first 2 rows of df are:

['"id","product_uid","product_title","search_term","relevance"',
 '2,100001,"Simpson Strong-Tie 12-Gauge Angle","angle bracket",3']

I think the first row is causing this problem. Moreover, I want to create a DataFrame that stores the values from the 2nd row to the last (not the first row, because it will be the header). How can I achieve this? I've searched for it but could not find any solution. Thanks in advance.
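For reference, the error message seems to say that each element of the RDD is a plain str, which createDataFrame cannot infer a schema from; it apparently wants tuples or Row objects instead. A minimal sketch of the difference, reusing values from the sample rows above:

rdd = sc.parallelize([(2, 100001, "Simpson Strong-Tie 12-Gauge Angle")])
df = sqlc.createDataFrame(rdd, ["id", "product_uid", "product_title"])  # rows are tuples, so the schema can be inferred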


1 Answer


To read a CSV file into a Spark DataFrame you should use spark-csv: https://github.com/databricks/spark-csv

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
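Because header='true' consumes the first row as column names, this also takes care of skipping the header. You can then read a particular column straight off the DataFrame; a small usage sketch, with column names taken from the question's sample data:

df.select('product_title').show()
values = df.select('relevance').rdd.map(lambda r: r[0]).collect()  # raw values of one column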

How to use spark-csv if you are using pyspark directly from the terminal: instead of calling

$SPARK_HOME/bin/pyspark

you have to use

$SPARK_HOME/bin/pyspark --packages com.databricks:spark-csv_2.11:1.4.0

and then use the code above.
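If you would rather not pull in the extra package, you can also skip the header and parse the lines yourself on the plain RDD. Below is a minimal sketch, assuming the file path from the question and using Python's csv module so the quoted fields survive; note that every column comes out as a string:

import csv

lines = sc.textFile(r'D:\Home\train.csv')
header = lines.first()                       # first row is the header
data = lines.filter(lambda l: l != header)   # keep rows 2..last

def parse(line):
    # parse one CSV line, respecting the quoted fields
    return next(csv.reader([line]))

columns = next(csv.reader([header]))
df = sqlContext.createDataFrame(data.map(parse), columns)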

If you are using ipython + findspark, you'll have to modify your PYSPARK_SUBMIT_ARGS (before starting ipython)

export PYSPARK_SUBMIT_ARGS="--master local[4] --packages com.databricks:spark-csv_2.11:1.4.0 pyspark-shell"
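and then initialise Spark from inside ipython; a rough sketch, assuming findspark is installed:

import findspark
findspark.init()  # locate SPARK_HOME and add pyspark to sys.path

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()          # picks up the --packages from PYSPARK_SUBMIT_ARGS
sqlContext = SQLContext(sc)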
  • Can you tell me how I can use them with pyspark on Windows? I am new to pyspark, btw. – Ishan Aug 06 '16 at 06:36
  • It's not really much different on Windows. The arguments to pyspark are still the same; you'll just have a slightly different way of setting the suggested environment variable. Possibly check this question for more, or post a separate question about running pyspark under Windows. – Brian Cline Aug 07 '16 at 23:02
  • I am using iPython with Spark. Do I have to create an environment variable PYSPARK_SUBMIT_ARGS? And whenever I start pyspark using the command pyspark --packages com.databricks:spark-csv_2.11:1.4.0 and then use 'sc', it shows spark is not defined. But normally when I start pyspark, it does not show any error regarding 'sc'. – Ishan Aug 08 '16 at 04:26