
I have a pyspark dataframe which contains a column "student" as follows:

"student" : {
   "name" : "kaleem",
   "rollno" : "12"
}

Schema for this in dataframe is :

structType(List(
   name: String, 
   rollno: String))

I need to modify this column as

"student" : {
   "student_details" : {
         "name" : "kaleem",
         "rollno" : "12"
   }
}

Schema for this in dataframe must be :

structType(List(
  student_details: 
     structType(List(
         name: String, 
         rollno: String))
))

How to do this in spark?

mightyMouse

2 Answers


The spark-hats library extends the Spark DataFrame API with helpers for transforming fields inside nested structures and arrays at arbitrary levels of nesting. You can do a lot of these transformations with it.

scala> import za.co.absa.spark.hats.Extensions._

scala> df.printSchema
root
 |-- ID: string (nullable = true)

scala> val df2 = df.nestedMapColumn("ID", "ID", c => struct(c as "alfa"))

scala> df2.printSchema
root
 |-- ID: struct (nullable = false)
 |    |-- alfa: string (nullable = true)

scala> val df3 = df2.nestedMapColumn("ID.alfa", "ID.alfa", c => struct(c as "beta"))

scala> df3.printSchema
root
 |-- ID: struct (nullable = false)
 |    |-- alfa: struct (nullable = false)
 |    |    |-- beta: string (nullable = true)

Your query would be

df.nestedMapColumn("student", "student", c => struct(c as "student_details"))
  • this is scala. what about pyspark? – mightyMouse May 27 '20 at 19:24
  • 1
    Using scala libraries in pyspark is quite common. Look at this [SO post](https://stackoverflow.com/a/36024707/5594180) or other articles [1](https://diogoalexandrefranco.github.io/scala-code-in-pyspark/), [2](https://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html) – Saša Zejnilović May 27 '20 at 19:30

Use the named_struct function to achieve this:

1. Read the JSON as a column

val data =
  """
    | {
    |   "student": {
    |       "name": "kaleem",
    |       "rollno": "12"
    |   }
    |}
  """.stripMargin
val df = spark.read.json(Seq(data).toDS())
df.show(false)
println(df.schema("student"))

Output-

+------------+
|student     |
+------------+
|[kaleem, 12]|
+------------+

StructField(student,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)

2. Change the schema using named_struct

val processedDf = df.withColumn("student",
  expr("named_struct('student_details', student)")
)
processedDf.show(false)
println(processedDf.schema("student"))

Output-

+--------------+
|student       |
+--------------+
|[[kaleem, 12]]|
+--------------+

StructField(student,StructType(StructField(student_details,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)),false)

For Python, step 2 works as-is; just remove the val keywords.

Som