
I'd like to enumerate grouped values just as in Pandas: enumerate each row within each group of a DataFrame.

What is a way to do this in Spark/Python?
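For example, in Pandas I would use something like groupby().cumcount() (the column names here are only illustrative):

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [10, 20, 30]})
# cumcount enumerates the rows within each group, starting at 0
df['rn'] = df.groupby('group').cumcount()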

Gerenuk

2 Answers


With the row_number window function:

from pyspark.sql.functions import row_number
from pyspark.sql import Window

# Number the rows within each group defined by some_column,
# ordered by some_other_column; numbering starts at 1
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
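For instance, on a small DataFrame this yields a per-group counter starting at 1 (a minimal sketch; the group/value column names are only illustrative):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("b", 30)], ["group", "value"])

w = Window.partitionBy("group").orderBy("value")
df.withColumn("rn", row_number().over(w)).show()
# +-----+-----+---+
# |group|value| rn|
# +-----+-----+---+
# |    a|   10|  1|
# |    a|   20|  2|
# |    b|   30|  1|
# +-----+-----+---+
# (the order of the groups in the output may vary)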
zero323

You can achieve this at the RDD level by doing:

rdd = sc.parallelize(['a', 'b', 'c'])
# zipWithIndex pairs each element with its global, consecutive index
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()

It will result in:

+---+---+
| _1| _2|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+

If you only need a unique ID, not a real consecutive index, you may also use zipWithUniqueId(), which is more efficient, since it is done locally on each partition.
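A minimal sketch of the zipWithUniqueId() variant (the explicit two-partition split is assumed for illustration); the IDs are unique but generally not consecutive, since an element at position i in partition k of n gets ID i*n + k:

rdd = sc.parallelize(['a', 'b', 'c'], 2)
# Unlike zipWithIndex, computing these IDs does not trigger a Spark job
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()
# Here 'a' ends up alone in partition 0 and gets ID 0, while 'b' and 'c'
# land in partition 1 and get IDs 1 and 3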

Elior Malul