
Below is a sample of the JSON structure:

{
   "Docs":[
      {
         "RunOn":"2020-04-03T04:50:28.1064257Z",
         "Version":1,
         "Client":"All",
         "DatabaseType":"Client",
         "IndexName":"DeclarationLogs/Search",
         "@metadata":{
            "Raven-Entity-Name":"IndexUpdates",
            "Raven-Clr-Type":"Cas.Common.Domain.DbModel.IndexUpdate.IndexUpdate, Cas.Root",
            "Ensure-Unique-Constraints":[
               
            ],
            "@id":"IndexUpdates/0001",
            "Last-Modified":"2020-04-03T04:50:28.1072484Z",
            "Raven-Last-Modified":"2020-04-03T04:50:28.1072484",
            "@etag":"01000000-0000-0001-0000-000000000001",
            "Non-Authoritative-Information":false
         }
      }
   ]
} 

This JSON is a RavenDB backup file and is very large. It also has duplicate columns: for example, `Docs`, shown in the sample above, is repeated 2-3 times. I need to read it, flatten it, and store it in Synapse, but currently I am not even able to read or process it. I managed to rename the duplicate columns, but because the file is so large I get an out-of-memory error. I tried to process it with PySpark: if I read it as text it works, but if I read it as JSON it fails. Reading it with a larger cluster gives a duplicate-column error that is difficult to track down, because those columns sit in nested elements 4-5 levels deep.
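Roughly what I tried in PySpark (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading as text works: each physical line becomes one row, no schema inference.
df_text = spark.read.text("/mnt/backup/ravendb-backup.json")

# Reading as JSON fails: the file is one huge multi-line document, so multiLine=True
# is needed, and schema inference then hits either the duplicate keys nested
# 4-5 levels deep (duplicate-column error) or runs out of memory.
df_json = spark.read.option("multiLine", True).json("/mnt/backup/ravendb-backup.json")
```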

Please suggest a way to at least split it into smaller files using PySpark, Data Flow, or any other approach.

  • Welcome to SO. Instead of loading the entire file, you could use a JSON stream-reading library, which gives you a chance to process each object as it starts. See a C# example in this [answer](https://stackoverflow.com/a/43747641/14973743); another alternative using Python is this [approach](https://stackoverflow.com/questions/31975345/python-how-to-stream-large-11-gb-json-file-to-be-broken-up). – Anand Sowmithiran Nov 26 '21 at 18:17
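
For reference, a rough sketch of the streaming-split idea from the comment, using the `ijson` library (the input path, chunk size, and output naming are placeholders, not anything from the question):

```python
import json
import ijson

CHUNK_SIZE = 10_000              # documents per output file (pick to taste)
SOURCE = "ravendb-backup.json"   # placeholder input path

def write_chunk(docs, index):
    # Each output file is a small, valid JSON document with the same "Docs" shape.
    with open(f"docs_part_{index:05d}.json", "w") as out:
        # ijson parses non-integer numbers as Decimal; default=float keeps them serializable.
        json.dump({"Docs": docs}, out, default=float)

buffer, part = [], 0
with open(SOURCE, "rb") as f:
    # ijson.items streams one element of the top-level "Docs" array at a time,
    # so the whole backup never has to fit in memory.
    for doc in ijson.items(f, "Docs.item"):
        buffer.append(doc)
        if len(buffer) >= CHUNK_SIZE:
            write_chunk(buffer, part)
            buffer, part = [], part + 1

if buffer:
    write_chunk(buffer, part)
```

Because ijson walks the token stream, repeated `Docs` keys are simply visited in order instead of colliding, and the resulting part files should be small enough for PySpark or a Synapse pipeline to flatten afterwards.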

0 Answers