
I'd like to enumerate grouped values just as in Pandas: enumerate each row within each group of a DataFrame.

What is a way to do this in Spark/Python?
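For example, in Pandas I would use something like groupby().cumcount() (the column names here are only illustrative):

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [10, 20, 30]})
# cumcount enumerates the rows within each group, starting at 0
df['rn'] = df.groupby('group').cumcount()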

Gerenuk

2 Answers


With the row_number window function:

from pyspark.sql.functions import row_number
from pyspark.sql import Window

# Number the rows within each group defined by some_column,
# ordered by some_other_column; numbering starts at 1
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
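For instance, on a small DataFrame this yields a per-group counter starting at 1 (a minimal sketch; the group/value column names are only illustrative):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("b", 30)], ["group", "value"])

w = Window.partitionBy("group").orderBy("value")
df.withColumn("rn", row_number().over(w)).show()
# +-----+-----+---+
# |group|value| rn|
# +-----+-----+---+
# |    a|   10|  1|
# |    a|   20|  2|
# |    b|   30|  1|
# +-----+-----+---+
# (the order of the groups in the output may vary)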
zero323

You can achieve this at the RDD level by doing:

rdd = sc.parallelize(['a', 'b', 'c'])
# zipWithIndex pairs each element with its global, consecutive index
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()

It will result in:

+---+---+
| _1| _2|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+

If you only need a unique ID, not a real consecutive index, you may also use zipWithUniqueId(), which is more efficient, since it is done locally on each partition.
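A minimal sketch of the zipWithUniqueId() variant (the explicit two-partition split is assumed for illustration); the IDs are unique but generally not consecutive, since an element at position i in partition k of n gets ID i*n + k:

rdd = sc.parallelize(['a', 'b', 'c'], 2)
# Unlike zipWithIndex, computing these IDs does not trigger a Spark job
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()
# Here 'a' ends up alone in partition 0 and gets ID 0, while 'b' and 'c'
# land in partition 1 and get IDs 1 and 3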

Elior Malul