I'm trying to ingest some MongoDB collections into BigQuery using PySpark. The schema looks like this:
root
|-- groups: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- my_field: struct (nullable = true)
| | | |-- **{ mongo id }**: struct (nullable = true)
| | | | |-- A: timestamp (nullable = true)
| | | | |-- B: string (nullable = true)
| | | | |-- C: struct (nullable = true)
| | | | | |-- abc: boolean (nullable = true)
| | | | | |-- def: boolean (nullable = true)
| | | | | |-- ghi: boolean (nullable = true)
| | | | | |-- xyz: boolean (nullable = true)
The issue is that inside my_field we store the Mongo id as the key, and each group has its own id, so when I import everything into BigQuery I end up with a new column for every id. I want to convert my_field to a string and store all the nested fields as JSON, or something like that. But when I try to convert it, I get this error:
temp_df = temp_df.withColumn("groups.my_field", col("groups.my_field").cast('string'))
TypeError: Column is not iterable
What am I missing?
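For reference, this is roughly the transformation I'm aiming for, sketched with functions.transform and Column.withField (so it assumes Spark 3.1+; temp_df is the same DataFrame as in the snippet above). Is this the right direction?

```python
from pyspark.sql import functions as F

# Rebuild each element of the `groups` array, replacing the nested
# `my_field` struct with its JSON string representation, so BigQuery
# sees a single string column instead of one column per Mongo id.
temp_df = temp_df.withColumn(
    "groups",
    F.transform(
        "groups",
        lambda g: g.withField("my_field", F.to_json(g["my_field"])),
    ),
)
```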