Below is a sample of the JSON structure:
{
  "Docs": [
    {
      "RunOn": "2020-04-03T04:50:28.1064257Z",
      "Version": 1,
      "Client": "All",
      "DatabaseType": "Client",
      "IndexName": "DeclarationLogs/Search",
      "@metadata": {
        "Raven-Entity-Name": "IndexUpdates",
        "Raven-Clr-Type": "Cas.Common.Domain.DbModel.IndexUpdate.IndexUpdate, Cas.Root",
        "Ensure-Unique-Constraints": [],
        "@id": "IndexUpdates/0001",
        "Last-Modified": "2020-04-03T04:50:28.1072484Z",
        "Raven-Last-Modified": "2020-04-03T04:50:28.1072484",
        "@etag": "01000000-0000-0001-0000-000000000001",
        "Non-Authoritative-Information": false
      }
    }
  ]
}
This JSON is a RavenDB backup file and it is very large. It also contains duplicate keys: for example, the "Docs" key shown in the sample is repeated 2-3 times. I need to read the file, flatten it, and store the result in Azure Synapse, but at the moment I cannot even read or process it. I managed to rename the duplicate keys, but because the file is so large that step runs out of memory. I tried to process it with PySpark: if I read it as text it loads fine, but reading it as JSON fails. With a larger cluster the JSON read instead gives a duplicate-column error, which is hard to track down because the duplicated keys sit in nested elements 4-5 levels deep.
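This is roughly what I have been running (a minimal sketch; the storage path below is a placeholder, not my real one):

# Rough sketch of my current attempts; the path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "abfss://backups@myaccount.dfs.core.windows.net/ravendb/backup.json"  # placeholder

# This works: each line of the file becomes a row in a single "value" column.
text_df = spark.read.text(path)
text_df.show(5, truncate=False)

# This is what fails: the whole file is one multi-line JSON document, and the
# repeated keys (e.g. "Docs" appearing 2-3 times) surface as a duplicate-column
# error during schema inference, or the job runs out of memory.
json_df = spark.read.option("multiLine", "true").json(path)
json_df.printSchema()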
Please suggest a way to at least split it into smaller files using PySpark, Data Flow, or any other approach.
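The best splitting idea I can come up with on my own is the naive text-based sketch below (placeholder paths, arbitrary partition count), but each resulting part file just gets an arbitrary slice of lines and is not a valid JSON document on its own, so it does not really solve the problem:

# Naive split sketch: read as text, write back as smaller part files.
# Paths and the partition count are placeholders; no single output part
# is a self-contained, valid JSON document.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "abfss://backups@myaccount.dfs.core.windows.net/ravendb/backup.json"  # placeholder
dst = "abfss://backups@myaccount.dfs.core.windows.net/ravendb/split/"       # placeholder

spark.read.text(src).repartition(200).write.mode("overwrite").text(dst)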