
I have a DataFrame and I want to add a new column, but not based on an existing column. What should I do?

This is my DataFrame:

+----+
|time|
+----+
|   1|
|   4|
|   3|
|   2|
|   5|
|   7|
|   3|
|   5|
+----+

This is my expected result:

+----+-----+  
|time|index|  
+----+-----+  
|   1|    1|  
|   4|    2|  
|   3|    3|  
|   2|    4|  
|   5|    5|  
|   7|    6|  
|   3|    7|  
|   5|    8|  
+----+-----+  
zero323
mentongwu

2 Answers


Using the RDD's zipWithIndex may be what you want:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

val newRdd = yourDF.rdd.zipWithIndex.map { case (r: Row, id: Long) => Row.fromSeq(r.toSeq :+ id) }
val schema = StructType(Array(
  StructField("time", IntegerType, nullable = true),
  StructField("index", LongType, nullable = true)))
val newDF = spark.createDataFrame(newRdd, schema)
newDF.show
+----+-----+                                                                    
|time|index|
+----+-----+
|   1|    0|
|   4|    1|
|   3|    2|
|   2|    3|
|   5|    4|
|   7|    5|
|   3|    6|
|   5|    7|
+----+-----+

I assume your time column is IntegerType here.
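Note that zipWithIndex numbers rows starting from 0, while the expected result starts at 1; adding 1 to the id inside the map would match it. The pairing semantics are the same as Scala's collection zipWithIndex, sketched here on plain values (no Spark needed; the sequence below is the question's time column):

```scala
// zipWithIndex pairs each element with its 0-based position;
// adding 1 yields the 1-based index column from the question.
val times = Seq(1, 4, 3, 2, 5, 7, 3, 5)
val indexed = times.zipWithIndex.map { case (t, i) => (t, i + 1) }
// indexed.head == (1, 1), indexed.last == (5, 8)
```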

neilron

Rather than using a Window function or converting to an RDD and using zipWithIndex, both of which are slower, you can use the built-in function monotonically_increasing_id:

import org.apache.spark.sql.functions._
df.withColumn("index", monotonically_increasing_id())
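One caveat, per the Spark docs: monotonically_increasing_id guarantees unique, monotonically increasing values, but not consecutive ones, because the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. A plain-Scala sketch of that encoding (monoId is a hypothetical helper, just for illustration):

```scala
// Illustration of how monotonically_increasing_id composes its values:
// partition ID in the upper 31 bits, per-partition record number in
// the lower 33 bits. monoId is a hypothetical helper, not a Spark API.
def monoId(partitionId: Long, recordInPartition: Long): Long =
  (partitionId << 33) | recordInPartition

monoId(0, 0) // first row of partition 0 -> 0
monoId(1, 0) // first row of partition 1 -> 8589934592 (a jump of 2^33)
```

So with more than one partition the index column will have gaps; if strictly consecutive indexes are required, the zipWithIndex approach in the other answer is the safer choice.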

Hope this helps!

koiralo