12

Dear Stackoverflow community,

I have a 34 GB json file that has many data inside. I tried to import into my mongodb by using mongoimport --file file.json - but it failed ofcourse the file is too big and threw a memory system throw error you know it. Is it possible to use php code to iterate through the file with a cursor? I have zero experience on this, someone told me that would be possible. I want to know how the file is build, but I do not know how to view an example array of it. From the source I could get an example array:

{
     "_id": ObjectId("53b29644aafd413977b23b7e"),
     "summonerId": NumberLong(24570940),
     "region": "euw",
     "updatedAt": NumberLong(1404212804),
     "season": NumberLong(4),
     "stats": {
         "110": {
             "totalSessionsPlayed": NumberLong(3),
             "totalSessionsLost": NumberLong(2),
             "totalSessionsWon": NumberLong(1),
             "totalChampionKills": NumberLong(34),
             "totalDamageDealt": NumberLong(415051),
             "totalDamageTaken": NumberLong(63237),
             "mostChampionKillsPerSession": NumberLong(12),
             "totalMinionKills": NumberLong(538),
             "totalDoubleKills": NumberLong(5),
             "totalTripleKills": NumberLong(1),
             "totalDeathsPerSession": NumberLong(18),
             "totalGoldEarned": NumberLong(40977),
             "totalTurretsKilled": NumberLong(6),
             "totalPhysicalDamageDealt": NumberLong(381668),
             "totalMagicDamageDealt": NumberLong(31340),
             "totalAssists": NumberLong(25),
             "maxChampionsKilled": NumberLong(12),
             "maxNumDeaths": NumberLong(10)
         }
     }
 }

The field stats contains more arrays, 110 is just an example. How can I iterate through this big sized file or how can I import it into my mongodb? For example; I want to echo summonerid,championid (which is 110 in this case),totalSessionsPlayed. It has to reloop as much as it needs until theres no championid left for this particular summonerid.

Again... A summonerID has a list of champions that it has been playing in his playing career. Champions are referring to (in this example) 110. Every single summonerid can contain multiple champions and I want to have all champions, how many times the champion has been played (totalsessionplayed) by summonerid.

floppy
  • 303
  • 2
  • 13
  • Consider exactly how much memory your PHP would need to load and parse this file (at least double the size of the file), then ask yourself if you can make that much memory available to PHP – Mark Baker Oct 15 '14 at 22:37
  • the file is 34gb... php can never parse it at once i think – floppy Oct 15 '14 at 22:42
  • did you check the net?? check php-streaming-json-parser -- https://github.com/salsify/jsonstreamingparser – Tasos Oct 15 '14 at 23:08
  • 2
    I did check but thats chinese for me. Even after trying to evaluate the examples given, i have no clue how to use it :x – floppy Oct 15 '14 at 23:28
  • Little late but this is also a convenient solution: https://github.com/halaxa/json-machine – Filip Halaxa Nov 29 '18 at 14:41

1 Answers1

16

You'll want to use a streaming parser. These only pull small portions of your file into memory at a time.

They come in a couple different flavors: SAX-like push parsers, and pull parsers. XML reader models: SAX versus XML pull parser gives an overview of the difference.


Push Parser

This is a quick example using salsify/json-streaming-parser.

As it rolls through the file we'll keep track of the summonerId, championId, and state. It's all event-based - you don't get random access with a sequential parser so you have to keep track of things yourself. Every time a totalSessionsPlayed comes up it'll echo out the summonerId, championId, and totalSessionsPlayed.


data.json

This is a paired-down json file for demonstration purposes.

[
    {
        "_id": "53b29644aafd413977b23b7e",
        "summonerId": 24570940,
        "region": "euw",
        "stats": {
            "110": {
                "totalSessionsPlayed": 3,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            },
            "112": {
                "totalSessionsPlayed": 45,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            }
        }
    },
    {
        "_id": "asdfasdfasdf",
        "summonerId": 555555,
        "region": "euw",
        "stats": {
            "42": {
                "totalSessionsPlayed": 65,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            },
            "88": {
                "totalSessionsPlayed": 99,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            }
        }
    }
]

Example:

class ListMatchUps extends JsonStreamingParser\Listener\IdleListener
{

    private $key;
    private $summonerId;
    private $championId;
    private $inStats;

    public function start_document()
    {
        $this->key        = null;
        $this->summonerId = null;
        $this->championId = null;
        $this->inStats    = false;
    }

    public function start_object()
    {
        if ($this->key === 'stats') {
            $this->inStats = true;
        } else if ($this->inStats) {
            $this->championId = $this->key;
        }
    }

    public function end_object()
    {
        if ($this->championId !== null) {
            $this->championId = null;
        } else if ($this->inStats) {
            $this->inStats = false;
        } else {
            $this->summonerId = null;
        }
    }

    public function key($key)
    {
        $this->key = $key;
    }

    public function value($value)
    {
        switch ($this->key) {
            case 'summonerId':
                $this->summonerId = $value;
                break;
            case 'totalSessionsPlayed':
                echo "{$this->summonerId},{$this->championId},$value\n";
                break;
        }
    }
}

$stream = fopen('data.json', 'r');
$listener = new ListMatchUps();
try {
    $parser = new JsonStreamingParser_Parser($stream, $listener);
    $parser->parse();
} catch (Exception $e) {
    fclose($stream);
    throw $e;
}

Output:

24570940,110,3
24570940,112,45
555555,42,65
555555,88,99

Pull Parser

This is using a parser I recently wrote, pcrov/jsonreader (requires PHP 7.)

Same data.json as above.

Example:

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open("data.json");

while($reader->read("summonerId")) {
    $summonerId = $reader->value();
    $reader->next("stats");
    foreach($reader->value() as $championId => $stats) {
        echo "$summonerId, $championId, {$stats['totalSessionsPlayed']}\n";
    }
}
$reader->close();

Output:

24570940, 110, 3
24570940, 112, 45
555555, 42, 65
555555, 88, 99
user3942918
  • 24,679
  • 11
  • 53
  • 67
  • This is perfect. Do you have any idea how we can tell the parser to stop early when it's done with a certain array/object or a certain key name? The issue with the current implementation is that it seems to scan the entire json file before it stops. I tried calling end_document in the listener to stop, but the script just continues to parse. – Lok Yan Wong Jan 25 '16 at 10:01