How to change the order of columns in pyspark dataframe?

Question

I have a PySpark DataFrame that contains supervised data. The label attribute can be at any position, and I want to move it to the end of the DataFrame. For example, suppose the columns are ['age','gender','defaulter','salary','occupation'], where 'defaulter' is the label attribute. I want to move it last so that the column order becomes ['age','gender','salary','occupation','defaulter']. I need this because, to apply ML algorithms such as logistic regression to this data, I have to convert it into an RDD and extract the last value (or first value) as the label of a labeled point (https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/logistic_regression.py).

Possible duplicate of [Python Pandas - Re-ordering columns in a dataframe based on column name](https://stackoverflow.com/questions/11067027/python-pandas-re-ordering-columns-in-a-dataframe-based-on-column-name) — charlesreid1, Sep 21 '17 at 09:43

score 0 · Answer 1 · answered Sep 21 '17 at 11:11

If you run ML algorithms on DataFrames, consider using VectorAssembler to build the features vector. The label is then read by column name, so its position in the DataFrame no longer matters. Like this:

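from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Combine the feature columns (they must be numeric) into one vector column.
assembler = VectorAssembler(
    inputCols=['age', 'gender', 'salary', 'occupation'],
    outputCol="features")

# In Spark 2.x+ DataFrame has no .map, so go through .rdd; LabeledPoint needs an mllib vector.
input_rdd = assembler.transform(dataframe) \
    .rdd.map(lambda row: LabeledPoint(row.defaulter, Vectors.fromML(row.features)))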
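
If you still want the label physically last, as the question asks, one option is to rebuild the column list and pass it to `select`. This is a minimal sketch, not the only way to do it; the helper names `move_column_last` and `reorder_label_last` are made up for illustration, and `df` is assumed to be an existing PySpark DataFrame:

```python
from typing import List


def move_column_last(columns: List[str], label: str) -> List[str]:
    """Return the column names with `label` moved to the end.

    The remaining columns keep their original relative order.
    """
    if label not in columns:
        raise ValueError(f"column {label!r} not found")
    return [c for c in columns if c != label] + [label]


def reorder_label_last(df, label: str):
    """Apply the new order to a PySpark DataFrame (hypothetical helper).

    df.columns is a plain list of names and df.select accepts names,
    so no data is moved until an action runs.
    """
    return df.select(*move_column_last(df.columns, label))


cols = ['age', 'gender', 'defaulter', 'salary', 'occupation']
print(move_column_last(cols, 'defaulter'))
# ['age', 'gender', 'salary', 'occupation', 'defaulter']
```

Usage would then be `dataframe = reorder_label_last(dataframe, 'defaulter')`, after which the last value of each row is the label.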
