I have a DataFrame with a string column whose values are JSON documents. I want to convert this column into multiple columns based on the JSON structure. I can do this when I have the JSON schema, but in this case I don't have it.

Example:

Original Dataframe:

---------------------
|        json_string|
---------------------
|{"a":2,"b":"hello"}|
|   {"a":1,"b":"hi"}|
---------------------

After conversion/parsing:

--------------
|  a |     b |
--------------
|  2 |  hello|
|  1 |     hi|
--------------

I am using Apache Spark 2.1.1.

zero323
Clairton Menezes

1 Answer

If you do not have a predefined schema, another option is to convert the column to an RDD[String] or Dataset[String] and load it as JSON, letting Spark infer the schema from the data.

Here is how you can do it:

import spark.implicits._ // needed for toDS

// convert to RDD[String]
val rdd = originalDF.rdd.map(_.getString(0))

// convert to Dataset[String]
val ds = rdd.toDS

Now load it as JSON:

val df = spark.read.json(rdd) // or spark.read.json(ds)

df.show(false)

Prefer json(ds) on Spark 2.2.0 and later, where json(rdd) is deprecated (on 2.1.x, only json(rdd) is available):

@deprecated("Use json(Dataset[String]) instead.", "2.2.0")

Output:

+---+-----+
|a  |b    |
+---+-----+
|2  |hello|
|1  |hi   |
+---+-----+
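
If the original DataFrame has other columns that need to be kept, one possible variation is to use the approach above only to infer the schema, and then parse the column in place with from_json. The sketch below is an assumption-laden illustration, not part of the answer above: the object name JsonColumnToColumns, the helper flattenJsonColumn, the temporary column name _parsed, and the extra id column are all hypothetical. It uses json(Dataset[String]), so it assumes Spark 2.2.0 or later; on 2.1.x the schema would have to be inferred from an RDD[String] instead.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}

object JsonColumnToColumns {

  // Hypothetical helper: infers the schema of a JSON string column and
  // replaces that column with one top-level column per JSON field.
  def flattenJsonColumn(df: DataFrame, jsonCol: String): DataFrame = {
    // Step 1: read the strings as JSON only to let Spark infer the schema.
    // json(Dataset[String]) assumes Spark 2.2.0+.
    val jsonDS = df.select(col(jsonCol)).as(Encoders.STRING)
    val inferredSchema = df.sparkSession.read.json(jsonDS).schema

    // Step 2: parse the column with the inferred schema, expand the
    // resulting struct, then drop the raw and temporary columns.
    // "_parsed" is an arbitrary temporary name assumed not to clash.
    df.withColumn("_parsed", from_json(col(jsonCol), inferredSchema))
      .select(col("*"), col("_parsed.*"))
      .drop(jsonCol, "_parsed")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .appName("json-column-to-columns")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // needed for toDF

    // Sample data mirroring the question, plus a hypothetical id column.
    val originalDF = Seq(
      (10, """{"a":2,"b":"hello"}"""),
      (11, """{"a":1,"b":"hi"}""")
    ).toDF("id", "json_string")

    flattenJsonColumn(originalDF, "json_string").show(false)
    spark.stop()
  }
}
```

Note that inferring the schema this way triggers an extra pass over the data, which may matter for large inputs.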
koiralo