
How can I convert a column in a Spark DataFrame from string to array? I need this because the FPGrowth algorithm needs an array column to build a model.

df = spark.read.csv('kheiro/Stage/Model/itemsets.csv')
df.show()

Data:

The type of the column:

And here is the error:

Omar Einea
Anis Amh
    I mean the built-in `split` function for DataFrames. If you need more help, please post your input data and the code you've tried in text format, not as an image. – Ramesh Maharjan Mar 25 '18 at 08:32
    Please don't post your code as screenshots. People might want to copy and paste it (also, search engines will have a hard time finding your post). – Neuron - Freedom for Ukraine Mar 26 '18 at 21:26

1 Answer


In PySpark, the ML library needs all the features combined into a single feature vector. You can do this with a `VectorAssembler`: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=inputColumnsList, outputCol='features')
df = assembler.transform(df)  # transform returns a new DataFrame; capture the result

where `inputColumnsList` is a list containing the single column you want to convert, or the multiple columns to be combined.

pratiklodha
  • That is not a correct answer. Unlike many other `pyspark.ml` `Estimators`, `pyspark.ml.fpm.FPGrowth` doesn't take `VectorUDT` input. – Alper t. Turker May 20 '18 at 20:15