Pyspark: cast array with nested struct to string

Question

I have a PySpark dataframe with a column named Filters of type "array<struct<Op:string,Type:string,Val:string>>".

I want to save my dataframe as a CSV file; for that I need to cast the array to a string type.

I tried to cast it with DF.Filters.tostring() and DF.Filters.cast(StringType()), but instead of the array contents both give a value like the following for each row of the Filters column:

org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19

The code is as follows

from pyspark.sql.types import StringType

DF.printSchema()

|-- ClientNum: string (nullable = true)
|-- Filters: array (nullable = true)
    |-- element: struct (containsNull = true)
          |-- Op: string (nullable = true)
          |-- Type: string (nullable = true)
          |-- Val: string (nullable = true)

DF_cast = DF.select('ClientNum', DF.Filters.cast(StringType()))

DF_cast.printSchema()

|-- ClientNum: string (nullable = true)
|-- Filters: string (nullable = true)

DF_cast.show()

| ClientNum | Filters 
|  32103    | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d9e517ce
|  218056   | org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3c744494

Sample JSON data:

{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}

Thanks !!

Can you print the schema and show the data before the transformation? Also print the schema after the transformation. — Abhishek Bansal, Apr 11 '17 at 13:30
I'm not able to reproduce the issue. Can you show the data before the transformation? — Abhishek Bansal, Apr 11 '17 at 13:50

Garren S · Accepted Answer · 2017-10-05T16:05:19.703

I created a sample JSON dataset to match that schema:

{"ClientNum":"abc123","Filters":[{"Op":"foo","Type":"bar","Val":"baz"}]}

s.select(s.col("ClientNum"), s.col("Filters").cast(StringType)).show(false)

+---------+------------------------------------------------------------------+
|ClientNum|Filters                                                           |
+---------+------------------------------------------------------------------+
|abc123   |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@60fca57e|
+---------+------------------------------------------------------------------+

Your problem is best solved using the explode() function, which flattens an array, followed by the star expand notation:

s.selectExpr("explode(Filters) AS structCol").selectExpr("structCol.*").show()
+---+----+---+
| Op|Type|Val|
+---+----+---+
|foo| bar|baz|
+---+----+---+

To make it a single column string separated by commas:

s.selectExpr("explode(Filters) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("single_col")).show()
+-----------+
| single_col|
+-----------+
|foo,bar,baz|
+-----------+

Explode Array reference: Flattening Rows in Spark

Star expand reference for "struct" type: How to flatten a struct in a spark dataframe?
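
The explode-based queries above return one row per array element and no longer carry ClientNum. If one string per original row is wanted instead, a possible PySpark sketch combines the SQL higher-order function transform (available in Spark 2.4 and later) with concat_ws. The helper name filters_to_delimited and the '|' separator between array elements are illustrative assumptions, not part of the original answer:

```python
def filters_to_delimited(df, id_col="ClientNum", array_col="Filters",
                         field_sep=",", element_sep="|"):
    """Return df with the array<struct<Op,Type,Val>> column flattened to one string.

    df is expected to be a pyspark.sql.DataFrame. selectExpr takes SQL
    strings, so no pyspark imports are needed in this helper.
    """
    expr = (
        # Outer concat_ws joins the per-element strings of the array.
        f"concat_ws('{element_sep}', "
        # transform maps every struct x to "Op,Type,Val".
        f"transform({array_col}, x -> concat_ws('{field_sep}', x.Op, x.Type, x.Val))"
        f") AS {array_col}"
    )
    return df.selectExpr(id_col, expr)
```

Called as filters_to_delimited(DF).show(truncate=False), the sample row should come out as abc123 with Filters equal to foo,bar,baz. Keep in mind that concat_ws skips null values, so a missing Op, Type or Val is dropped rather than written as an empty field.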

This creates columns in the top structure rather than a single column with the contents of all the columns as a string — alfredox, Oct 04 '17 at 20:47

Vzzarr · Answer 2 · 2021-05-12T09:48:55.283

For me in Pyspark the function to_json() did the job.

As a plus compared to the simple casting to String, it keeps the "struct keys" as well (not only the "struct values"). So for the reported example I would have something like:

[{"Op":"foo","Type":"bar","Val":"baz"}]

This was much more useful to me since I had to write the results to a Postgres table. In this format I can easily use the JSON functions that Postgres supports.
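
A minimal sketch of this approach, assuming the column names from the question. The helper names filters_to_json and write_filters_csv, and the output path passed by the caller, are placeholders rather than anything from the original post:

```python
def filters_to_json(df, id_col="ClientNum", array_col="Filters"):
    """Return df with the array<struct> column replaced by its JSON text.

    df is expected to be a pyspark.sql.DataFrame. selectExpr takes SQL
    strings, so the SQL function to_json can be called by name here.
    """
    return df.selectExpr(id_col, f"to_json({array_col}) AS {array_col}")


def write_filters_csv(df, path):
    """Write the converted frame as CSV; path is chosen by the caller."""
    # Every column is a plain string at this point, which the CSV writer accepts.
    filters_to_json(df).write.csv(path, header=True, mode="overwrite")
```

With the question's DataFrame, filters_to_json(DF).printSchema() should report Filters as string, and write_filters_csv(DF, some_path) then writes it without hitting the unsupported array type.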

score -2 · Answer 3 · answered Apr 11 '17 at 13:26

You can try this:

DF = DF.withColumn('Filters', DF.Filters.cast("string"))

Abhishek Bansal

I tried, same result : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@3 – Omar14 Apr 11 '17 at 13:28
I'd say you have to use a UDF in which you apply some logic to convert the array to a string, and then select the new column – iurii_n Apr 11 '17 at 16:11
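
For illustration, the UDF idea from the comment above could look roughly like this. It is only a sketch: the function names are made up for this example, and JSON was picked as the output format as one reasonable choice among several. Inside a Python UDF, an array<struct> value arrives as a list of Row objects, which can be indexed by field name:

```python
import json


def filters_to_string(filters):
    """Convert a list of Op/Type/Val structs to a compact JSON string."""
    if filters is None:
        return None  # keep null arrays as null
    items = [
        # The schema allows null elements, so pass them through as JSON null.
        None if f is None else {"Op": f["Op"], "Type": f["Type"], "Val": f["Val"]}
        for f in filters
    ]
    return json.dumps(items, separators=(",", ":"))


def add_filters_string(df, array_col="Filters"):
    """Replace the array column of a pyspark.sql.DataFrame using the UDF."""
    # Imported here so the plain-Python helper above can be used and
    # tested on a machine without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_string = udf(filters_to_string, StringType())
    return df.withColumn(array_col, to_string(df[array_col]))
```

A Python UDF moves every row between the JVM and a Python worker, so where a built-in function covers the case it is usually the faster option; the UDF is mainly useful when the string format needs custom logic.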
