
I have a requirement to store a nested list of JSON objects in a column by doing a JOIN between two datasets related by a one-to-many relation. Example: Stack Overflow posts (each question can have one or many answers); the answers should be stored against each question as a nested list of dicts.

I am able to achieve this perfectly using pandas: I can store the output as parquet and also load it back again using pandas. However, for performance reasons I am using PySpark.

But when I store the nested-list-of-objects column using PySpark, I am not able to load it back using pandas, which makes me wonder whether I can store it differently in such a way that I can load it back using pandas.

The error I get is this:

    pyarrow.lib.ArrowNotImplementedError: Not implemented type for Arrow list to pandas: map<string, string>

If I convert the PySpark dataframe to pandas and store it as a parquet file, I don't run into this error. So I believe parquet has the necessary support to store the data the way I want (so that I can load these list-of-dict columns using pandas).

Below is my PySpark code:

    answers_dicted = answers_df.withColumn(
        "dict", create_map(create_map_args(answers_df))
    ).select(["Id", "ParentId", "dict"])

    # group answers by questions
    print("Grouping answers by questions")
    answers_grouped = answers_dicted.groupby("ParentId").agg(
        collect_list("dict").alias("answers")
    )

    # populate questions with answers
    print("Adding answers to questions")
    questions_df = questions_df.join(
        answers_grouped, questions_df.Id == answers_grouped.ParentId, "left"
    ).select(questions_df["*"], answers_grouped["answers"])

