
Below is a sample of the JSON structure:

{
   "Docs":[
      {
         "RunOn":"2020-04-03T04:50:28.1064257Z",
         "Version":1,
         "Client":"All",
         "DatabaseType":"Client",
         "IndexName":"DeclarationLogs/Search",
         "@metadata":{
            "Raven-Entity-Name":"IndexUpdates",
            "Raven-Clr-Type":"Cas.Common.Domain.DbModel.IndexUpdate.IndexUpdate, Cas.Root",
            "Ensure-Unique-Constraints":[
               
            ],
            "@id":"IndexUpdates/0001",
            "Last-Modified":"2020-04-03T04:50:28.1072484Z",
            "Raven-Last-Modified":"2020-04-03T04:50:28.1072484",
            "@etag":"01000000-0000-0001-0000-000000000001",
            "Non-Authoritative-Information":false
         }
      }
   ]
} 

This JSON is a RavenDB backup file and is very large. It also has duplicate columns: for example, `Docs`, shown in the sample above, is repeated 2-3 times. I need to read it, flatten it, and store it in Synapse, but currently I am not even able to read or process it. I managed to rename the duplicate columns, but because the file is so large I get an out-of-memory error. I tried to process it with PySpark: if I read it as text it works, but if I read it as JSON it fails. Reading it with a larger cluster gives a duplicate-column error that is difficult to track down, because those columns sit in nested elements 4-5 levels deep.
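Roughly what I tried in PySpark (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading as text works: each physical line becomes one row, no schema inference.
df_text = spark.read.text("/mnt/backup/ravendb-backup.json")

# Reading as JSON fails: the file is one huge multi-line document, so multiLine=True
# is needed, and schema inference then hits either the duplicate keys nested
# 4-5 levels deep (duplicate-column error) or runs out of memory.
df_json = spark.read.option("multiLine", True).json("/mnt/backup/ravendb-backup.json")
```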

Please suggest a way to at least split it into smaller files using PySpark, Data Flow, or any other approach.

  • Welcome to SO. Instead of loading the entire file, you could use a JSON stream-reading library, which gives you a chance to process each object as it starts. See a C# example in this [answer](https://stackoverflow.com/a/43747641/14973743); another alternative using Python is this [approach](https://stackoverflow.com/questions/31975345/python-how-to-stream-large-11-gb-json-file-to-be-broken-up). – Anand Sowmithiran Nov 26 '21 at 18:17
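
For reference, a rough sketch of the streaming-split idea from the comment, using the `ijson` library (the input path, chunk size, and output naming are placeholders, not anything from the question):

```python
import json
import ijson

CHUNK_SIZE = 10_000              # documents per output file (pick to taste)
SOURCE = "ravendb-backup.json"   # placeholder input path

def write_chunk(docs, index):
    # Each output file is a small, valid JSON document with the same "Docs" shape.
    with open(f"docs_part_{index:05d}.json", "w") as out:
        # ijson parses non-integer numbers as Decimal; default=float keeps them serializable.
        json.dump({"Docs": docs}, out, default=float)

buffer, part = [], 0
with open(SOURCE, "rb") as f:
    # ijson.items streams one element of the top-level "Docs" array at a time,
    # so the whole backup never has to fit in memory.
    for doc in ijson.items(f, "Docs.item"):
        buffer.append(doc)
        if len(buffer) >= CHUNK_SIZE:
            write_chunk(buffer, part)
            buffer, part = [], part + 1

if buffer:
    write_chunk(buffer, part)
```

Because ijson walks the token stream, repeated `Docs` keys are simply visited in order instead of colliding, and the resulting part files should be small enough for PySpark or a Synapse pipeline to flatten afterwards.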

0 Answers