
I am trying to read this JSON (that I get from an API) with PySpark:

[{'DBName': 'db1', 'NameEvent': 'event1', 'status': 'NEVER', 'Date': None},
 {'DBName': 'db2', 'NameEvent': 'event2', 'status': 'ON TIME', 'Date': '2022-05-13T15:09:58.798'}]

To do so, here is my code:

import requests

r = requests.get('https://api_to_file.com')   # placeholder URL for the real API
rdd = sc.parallelize(r.json())                # r.json() is a list of Python dicts
table = spark.read.json(rdd)

The problem is, I get a "_corrupt_record" for every record with a None value for the 'Date' key.
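
For reference, the failure reproduces without the API call by parallelizing the sample records above directly (a minimal sketch; the SparkSession setup is only there to make it self-contained):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Same sample records as shown above, no API involved
data = [{'DBName': 'db1', 'NameEvent': 'event1', 'status': 'NEVER', 'Date': None},
        {'DBName': 'db2', 'NameEvent': 'event2', 'status': 'ON TIME',
         'Date': '2022-05-13T15:09:58.798'}]

rdd = sc.parallelize(data)
table = spark.read.json(rdd)
table.show(truncate=False)   # the row with 'Date': None comes back as _corrupt_record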

Do you know how I can deal with this?

I already tried adding the options that I found here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrameReader.json.html, and it still doesn't work.
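
For example, those options can be passed as keyword arguments to the reader (which exact options I tried is not shown above; these two are purely illustrative):

table = spark.read.json(rdd, mode='PERMISSIVE', allowSingleQuotes=True)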

PySpark version 2.4+

  • I found a solution based on https://stackoverflow.com/questions/41710262/python-replace-none-values-in-nested-json. So I started by loading the JSON, replacing the None values with "", and then reading it with PySpark. That works fine. Is it the best way to do it? – blabla May 16 '22 at 14:16
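
A minimal sketch of what that comment describes, for the flat records in this question (the linked post covers the nested case); the URL is the same placeholder as above:

import json
import requests

r = requests.get('https://api_to_file.com')   # placeholder URL from the question
records = r.json()

# Replace every None value with "" (flat dicts only; see the linked post for nested JSON)
cleaned = [{k: ('' if v is None else v) for k, v in rec.items()} for rec in records]

# Serialize each record to a proper JSON string before handing it to Spark
rdd = sc.parallelize([json.dumps(rec) for rec in cleaned])
table = spark.read.json(rdd)

Note that serializing with json.dumps already turns None into JSON null, which Spark parses without complaint, so the replacement step mainly matters if you prefer empty strings over nulls in the resulting DataFrame.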

0 Answers