I have a PySpark DataFrame that contains supervised data. The label attribute can be at any position in the DataFrame, and I want to move it to the last position. For example, suppose the columns of my DataFrame are ['age','gender','defaulter','salary','occupation'], where 'defaulter' is the label attribute. I want to move this column to the end so that the DataFrame has its columns in the order ['age','gender','salary','occupation','defaulter']. I want to do this because, to apply ML algorithms such as logistic regression to this data, I have to convert it to an RDD and extract the last value (or first value) as the label of a labeled point (https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/logistic_regression.py).
1 Answer
If you run ML algorithms on DataFrames, consider using VectorAssembler to build the features vector, so that the position of the label column no longer matters. For example:
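from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.regression import LabeledPoint

# Pack the (numeric) feature columns into one vector column; 'defaulter' stays separate.
assembler = VectorAssembler(
    inputCols=['age', 'gender', 'salary', 'occupation'],
    outputCol="features")
# Go through .rdd (DataFrame.map is gone in Spark 2.x); toArray() passes LabeledPoint a NumPy array.
input_rdd = assembler.transform(dataframe) \
    .rdd.map(lambda row: LabeledPoint(row.defaulter, row.features.toArray()))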
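If the columns really do need to be reordered, as the question literally asks, one possible approach is to compute the new column order in plain Python and pass it to `select`. The sketch below is only an illustration: `label_last` is a hypothetical helper name, not part of PySpark, and the `df.select(...)` line is left as a comment because it assumes an existing DataFrame called `df`.

```python
def label_last(columns, label):
    """Return the column names with `label` moved to the end (hypothetical helper)."""
    if label not in columns:
        raise ValueError("label column %r not found" % label)
    # Keep the other columns in their original order, then append the label.
    return [c for c in columns if c != label] + [label]


cols = ['age', 'gender', 'defaulter', 'salary', 'occupation']
reordered = label_last(cols, 'defaulter')
print(reordered)

# With a real DataFrame `df` (assumed to exist), the reordering would be:
# df = df.select(*label_last(df.columns, 'defaulter'))
```

Because `select` returns a new DataFrame, the original is left unchanged; the label can then be read as the last value of each row after converting to an RDD.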
Mariusz