
How to transpose dynamic columns with rows of different size in PySpark?

I'm parsing JSON into a Spark DataFrame with the data structure below:

  • Static names: data, label, unit (can be hard-coded)
  • Dynamic names: data_1_a, data_1_b, name_1_a, name_1_b, name_10000_xyz (cannot be hard-coded, as there can be up to 10,000 names / data subcategories)

I tried the solutions in the useful links section below; however, they deal with transformations where all columns have rows of the same size.

Input df1:

root
 |-- id: long (nullable = true)
 |-- data_1_a: struct (nullable = true)
 |    |-- name_1_a: struct (nullable = true)
 |    |    |-- label: string (nullable = true)
 |    |    |-- unit: string (nullable = true)
 |-- data_1_b: struct (nullable = true)
 |    |-- name_10000_xyz: struct (nullable = true)
 |    |    |-- label: string (nullable = true)
 |    |-- name_1_b: struct (nullable = true)
 |    |    |-- label: string (nullable = true)
    
    
+---+-----------------------+--------------------------+
|id |data_1_a               |data_1_b                  |
+---+-----------------------+--------------------------+
|1  |{{label_1_a, unit_1_a}}|{{10000_xyz}, {label_1_b}}|
+---+-----------------------+--------------------------+

Required output df:

+---+--------+--------------+---------+--------+
|id |data    |name          |label    |unit    |
+---+--------+--------------+---------+--------+
|1  |data_1_a|name_1_a      |label_1_a|unit_1_a|
|1  |data_1_b|name_1_b      |label_1_b|null    |
|1  |data_1_b|name_10000_xyz|10000_xyz|null    |
+---+--------+--------------+---------+--------+

To reproduce the input df1 (assumes a pyspark shell where spark and sc are already defined):

json1 = """{"id":1,"data":{"data_1_a":{"name_1_a":{"label":"label_1_a","unit":"unit_1_a"}},"data_1_b":{"name_1_b":{"label":"label_1_b"},"name_10000_xyz":{"label":"10000_xyz"}}}}"""
df = spark.read.json(sc.parallelize([json1]))
df1 = df.select('id', 'data.*')  # flatten the top-level data struct
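
What I have so far: since the dynamic names do exist in df1's schema at runtime, a schema-driven sketch (rough and untested, assuming every inner struct has a label and at most an optional unit) could walk the struct fields and union one projection per name:

from functools import reduce
from pyspark.sql import functions as F

selects = []
for data_field in df1.schema.fields:                  # data_1_a, data_1_b, ...
    if data_field.name == 'id':
        continue
    for name_field in data_field.dataType.fields:     # name_1_a, name_10000_xyz, ...
        path = f'{data_field.name}.{name_field.name}'
        # pad a missing unit field with null so the union lines up
        unit = (F.col(f'{path}.unit')
                if 'unit' in name_field.dataType.fieldNames()
                else F.lit(None).cast('string'))
        selects.append(df1.select(
            'id',
            F.lit(data_field.name).alias('data'),
            F.lit(name_field.name).alias('name'),
            F.col(f'{path}.label').alias('label'),
            unit.alias('unit'),
        ))

result = reduce(lambda a, b: a.unionByName(b), selects)
result.show(truncate=False)

My concern is that with up to 10,000 substructures this builds a very long union chain, which may be slow to plan. Building a single stack() expression from the same schema walk might scale better, but I haven't found how to make it handle the differing row sizes cleanly.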

Useful links:
