How to transpose dynamic columns with rows of different sizes in PySpark?
I'm parsing JSON documents into a Spark DataFrame with the data structure below:
- Static names: data, label, unit (can be hard-coded)
- Dynamic names: data_1_a, data_1_b, name_1_a, name_1_b, name_10000_xyz (cannot be hard-coded, as there can be up to 10000 names / data sub-categories)
I tried the solutions in the useful links section below; however, they deal with transformations where the columns/rows all have the same size.
Input df1:
root
|-- id: long (nullable = true)
|-- data_1_a: struct (nullable = true)
| |-- name_1_a: struct (nullable = true)
| | |-- label: string (nullable = true)
| | |-- unit: string (nullable = true)
|-- data_1_b: struct (nullable = true)
| |-- name_10000_xyz: struct (nullable = true)
| | |-- label: string (nullable = true)
| |-- name_1_b: struct (nullable = true)
| | |-- label: string (nullable = true)
+---+-----------------------+--------------------------+
|id |data_1_a |data_1_b |
+---+-----------------------+--------------------------+
|1 |{{label_1_a, unit_1_a}}|{{10000_xyz}, {label_1_b}}|
+---+-----------------------+--------------------------+
Required output df:
+---+---------+----------------+-----------+----------+
|id |data | name | label | unit |
+---+---------+----------------+-----------+----------+
|1 |data_1_a | name_1_a | label_1_a | unit_1_a |
|1 |data_1_b | name_1_b | label_1_b | null |
|1 |data_1_b | name_10000_xyz | 10000_xyz | null |
+---+---------+----------------+-----------+----------+
To reproduce input df:
json1 = """{"id":1,"data":{"data_1_a":{"name_1_a":{"label":"label_1_a","unit":"unit_1_a"}},"data_1_b":{"name_1_b":{"label":"label_1_b"},"name_10000_xyz":{"label":"10000_xyz"}}}}"""
df = spark.read.json(sc.parallelize([json1]))
df1 = df.select('id','data.*')
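One possible approach (a sketch, not tested against the full 10000-column case, assuming every non-id column of df1 is a struct whose children are structs holding "label" and optionally "unit", as in the schema above): walk df1's schema to collect every data/name pair, build one struct literal per pair, and explode the resulting array into rows.
from pyspark.sql import functions as F

# Build one struct per (data, name) pair found in the schema,
# then explode them into rows. No column names are hard-coded.
rows = []
for data_field in df1.schema.fields:
    if data_field.name == "id":
        continue
    for name_field in data_field.dataType.fields:
        child_fields = [f.name for f in name_field.dataType.fields]
        path = f"{data_field.name}.{name_field.name}"
        rows.append(
            F.struct(
                F.lit(data_field.name).alias("data"),
                F.lit(name_field.name).alias("name"),
                F.col(f"{path}.label").alias("label"),
                (F.col(f"{path}.unit") if "unit" in child_fields
                 else F.lit(None).cast("string")).alias("unit"),
            )
        )

result = (
    df1.select("id", F.explode(F.array(*rows)).alias("r"))
       .select("id", "r.*")
)
result.show(truncate=False)
Whether this scales to 10000 sub-categories in a single select is an open question, so treat it only as a starting point.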
Useful links: