Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

129 questions
14
votes
4 answers

Import csv file contents into pyspark dataframes

How can I import a .csv file into PySpark DataFrames? I even tried reading the CSV file in pandas and then converting it to a Spark DataFrame using createDataFrame, but it still shows an error. Can someone guide me through this? Also, please tell me…
neha
  • 141
  • 1
  • 1
  • 4
7
votes
3 answers

Pyspark converting timestamps from UTC to many timezones

This is using Python with Spark 1.6.1 and DataFrames. I have timestamps in UTC that I want to convert to local time, but a given row could be in any of several timezones. I have an 'offset' value (or alternately, the local timezone abbreviation. I…
Eric Hilton
  • 71
  • 1
  • 1
  • 2
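
As a rough illustration of the per-row conversion asked about in the question above, here is a minimal plain-Python sketch. It assumes the 'offset' value is a number of hours relative to UTC, which the excerpt does not confirm, and the function name `utc_to_local` is hypothetical. It shows only the conversion logic for a single row; in Spark, logic like this could be wrapped in a UDF, though that is not shown here.

```python
from datetime import datetime, timedelta, timezone


def utc_to_local(utc_ts: datetime, offset_hours: float) -> datetime:
    """Shift a UTC timestamp to local time using a per-row offset in hours."""
    # Assumption: naive timestamps are already in UTC.
    if utc_ts.tzinfo is None:
        utc_ts = utc_ts.replace(tzinfo=timezone.utc)
    # Build a fixed-offset timezone from the row's offset value.
    local_tz = timezone(timedelta(hours=offset_hours))
    return utc_ts.astimezone(local_tz)


# Hypothetical sample rows: (UTC timestamp, offset in hours).
rows = [
    (datetime(2016, 6, 1, 12, 0), -5.0),
    (datetime(2016, 6, 1, 12, 0), 5.5),
]
local_times = [utc_to_local(ts, off) for ts, off in rows]
for lt in local_times:
    print(lt.isoformat())
```

A fixed offset does not account for daylight saving time, so this sketch fits only data where the stored offset already reflects the local rules for each row.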
6
votes
2 answers

Why is there a difference of "ML" vs "MLLIB" in Apache Spark's documentation?

I am trying to figure out which pyspark library to use with Word2Vec and I'm presented with two options according to the pyspark…
Gabriel Fair
  • 257
  • 3
  • 8
4
votes
1 answer

Spark development on local machine with PyCharm

I am wondering what best practices other devs are using for their Python Spark jobs. I am building a dev environment in which I am looking to write code in PyCharm with SparkContext pointing to a standalone cluster and being able to run my…
TheWiz
  • 41
  • 2
2
votes
1 answer

Converting an RDD to a Spark DataFrame in Python and then accessing particular column values

I want to access the values of a particular column from a dataset that I've read from a CSV file. The dataset is stored in a PySpark RDD, which I want to convert into a DataFrame. I am using the code below: from pyspark.sql import…
Ishan
  • 163
  • 1
  • 2
  • 6
1
vote
0 answers

Proper way to store a dict as a column value in parquet format using pyspark

I have a requirement to store a nested list of JSON objects in a column by doing a JOIN between two datasets related by a one-to-many relation. Example: Stack Overflow posts (each question can have one or many answers), answers should be populated…
vangap
  • 111
  • 2