
I would like to create a column of sequential numbers in a PySpark dataframe, starting from a specified number. For instance, I want to add a column A to my dataframe df that starts at 5 and increments by one up to the length of the dataframe: 5, 6, 7, ..., length(df).

Is there a simple solution using PySpark methods?

max04
  • Easiest way is probably `df = df.rdd.zipWithIndex().map(lambda r: r[0] + (r[1],)).toDF(cols + ["index"]).withColumn("index", f.col("index") + 5)` where `cols = df.columns` and `f` refers to `pyspark.sql.functions` (the `map` flattens each `(Row, index)` pair into a single row). But you should ask yourself why you're doing this, because almost surely there's a better way. DataFrames are inherently unordered, so this operation is not efficient. – pault Jul 06 '18 at 02:07
  • Thank you! At the end I want to add the final results to a Hive table. I have to take max(id) from this table and add new records with id starting from max(id) + 1. – max04 Jul 06 '18 at 09:44
  • I do not think it is possible to get a serial id column in Hive like that. Hive/Spark is intended for parallel processing. Even though the code in my comment works for you and you may be able to come up with a way to achieve your desired result, this is not really a good use case for spark or hive. – pault Jul 06 '18 at 13:34
  • I handled it by adding a new column to my df like this: `max(id) + spark_func.row_number().over(Window.orderBy(unique_field_in_my_df))` (a runnable sketch of this pattern follows below). – max04 Jul 11 '18 at 09:47
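
For reference, a minimal runnable sketch of that pattern; the table name hive_table, the id column id, and the ordering column some_unique_col are placeholders, not names from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Current maximum id in the target table (fall back to 0 if the table is empty).
max_id = spark.table("hive_table").agg(F.max("id")).first()[0] or 0

# row_number() starts at 1, so the new ids begin at max_id + 1.
# Note: Window.orderBy without partitionBy pulls every row into a single
# partition, which is the efficiency concern pault raises above.
df = df.withColumn(
    "id",
    F.lit(max_id) + F.row_number().over(Window.orderBy("some_unique_col")),
)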

3 Answers


You can do this using `spark.range`:

df_len = 100
freq = 1
ref = spark.range(
    5, df_len, freq
).toDF("id")
ref.show(10)

+---+
| id|
+---+
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
+---+

only showing top 10 rows
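
Note that the upper bound of `spark.range` is exclusive, and `df_len` above is a hard-coded stand-in. A sketch that sizes the range from the dataframe itself, so it ends at length(df) as the question asks:

# The upper bound is exclusive, so add 1 to include df.count() itself.
ref = spark.range(5, df.count() + 1).toDF("A")

This only builds a standalone column of ids; attaching it to an existing dataframe still needs a join key or a zipWithIndex-style pairing, as discussed in the comments above.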

Prathik Kini
niraj kumar

Three simple steps:

from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

df = df.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
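
As written, `row_idx` starts at 1. To start at 5 as in the question, you can add an offset; a sketch building on the snippet above (the column name "A" is taken from the question):

# row_number() yields 1, 2, 3, ...; adding 4 shifts the sequence to 5, 6, 7, ...
df = df.withColumn("A", row_number().over(Window.orderBy(monotonically_increasing_id())) + 4)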

younus

This worked for me; it creates a sequential value in the column:

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

seed = 23
df = df.withColumn('label', seed + dense_rank().over(Window.orderBy('column')))
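
One caveat: `dense_rank()` assigns the same number to rows that tie on 'column', so the result is strictly sequential only when 'column' is unique. A sketch of the same idea with `row_number()`, which numbers every row distinctly (reusing `seed` and the imports above):

from pyspark.sql.functions import row_number

# row_number() gives each row a distinct number even when 'column' has ties.
df = df.withColumn('label', seed + row_number().over(Window.orderBy('column')))
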
gdupont