I have a PySpark DataFrame that contains supervised data. The label attribute can be at any position in the DataFrame, and I want to move it to the last position. For example, suppose the columns are ['age','gender','defaulter','salary','occupation'], where 'defaulter' is the label attribute. I want to move this column to the end so that the column order becomes ['age','gender','salary','occupation','defaulter']. I want to do this because, to apply ML algorithms such as logistic regression to this data, I have to convert it into an RDD and extract the last (or first) value as the label of a labeled point (https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/logistic_regression.py).

neha
  • Possible duplicate of [Python Pandas - Re-ordering columns in a dataframe based on column name](https://stackoverflow.com/questions/11067027/python-pandas-re-ordering-columns-in-a-dataframe-based-on-column-name) – charlesreid1 Sep 21 '17 at 09:43

1 Answer

If you run ML algorithms on DataFrames, consider using VectorAssembler to build the features vector. The label is then read by column name, so its position in the DataFrame no longer matters. For example:

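from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# inputCols must be numeric; encode string columns (e.g. with StringIndexer) first
assembler = VectorAssembler(
    inputCols=['age', 'gender', 'salary', 'occupation'],
    outputCol="features")

# Spark 2+ DataFrames have no .map, so go through .rdd; fromML converts the ml vector for mllib
input_rdd = assembler.transform(dataframe) \
    .rdd.map(lambda row: LabeledPoint(row.defaulter, Vectors.fromML(row.features)))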
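
If you still want the label physically moved to the last column, as the question asks, one option is to compute the new column order and pass it to `select`. This is only a minimal sketch: `label_last` is a hypothetical helper name, not part of PySpark, and it assumes the label column name is known in advance.

```python
def label_last(columns, label):
    """Return the column names with `label` moved to the end."""
    if label not in columns:
        raise ValueError("column %r not found" % label)
    # Keep every other column in its original order, then append the label.
    return [c for c in columns if c != label] + [label]


columns = ['age', 'gender', 'defaulter', 'salary', 'occupation']
reordered = label_last(columns, 'defaulter')
print(reordered)
# ['age', 'gender', 'salary', 'occupation', 'defaulter']

# With a PySpark DataFrame named `dataframe`, the same list can be passed to select:
# dataframe = dataframe.select(*label_last(dataframe.columns, 'defaulter'))
```

After the `select`, the last value of each row is the label, which matches what the linked mllib example expects.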
Mariusz