XML processing - performance issue

Question

I have an XML document about 10 MB in size. It has a relatively simple structure but contains a lot of binary data. I need to extract data from it and save it in a database. I tried JAXB (Metro), but it works really slowly. I am currently trying JiBX for this, but unmarshalling a couple of XML documents uses all of the JVM memory: I get a heap space error and the DB gets corrupted. Maybe I should use something else for reading XML? Please give some advice.

Edit: My XML represents a sort of message, with information like "to", "from", etc., just strings, ints and dates. The biggest part is the attached files as byte[], each attachment in its own element. Maybe it's possible to load those one by one? I really don't know what I should do.

score 2 · Answer 1 · edited May 23 '17 at 12:07

Unludo is right that you need to use StAX to keep this process as efficient as possible. There are actually 5 different ways you can parse XML in Java; I outlined them all here along with pros/cons.

Anything that holds the entire content in RAM (DOM or XPath) is going to be too memory intensive. SAX is much better, but it still parses out elements as it hits them and pushes them to your handler implementation, while StAX won't parse anything out of the stream until you ask for it; it only emits events to let you know what it is looking at.

That being said, I created the SJXP parsing library, built on top of StAX, to provide StAX performance with XPath-like ease of use.

You literally define paths in the file you are interested in, like:

/message/data -- represents the <message><data>[STUFF HERE]</data></message> path

You then give all the paths (they are basically rules) to the parser, along with the file you want to parse, and it does all the dirty work for you, only calling your code when it finds exactly what you asked for.
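To make the "paths as rules" idea concrete, here is a rough sketch of path matching done by hand on the plain JDK StAX API. To be clear, this is not SJXP's API; the class name `PathRuleSketch`, the `match` method and the sample document are made up for illustration. It enables coalescing so each text node arrives as one event, which is convenient for small values but is not what you would want for large binary payloads.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reports text found at any of the requested element paths.
final class PathRuleSketch {

    static List<String> match(String xml, Set<String> paths) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent text events so a small value is delivered in one piece.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

        Deque<String> open = new ArrayDeque<>(); // names of currently open elements, root first
        List<String> hits = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    open.addLast(reader.getLocalName());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                    // Build the current path, e.g. "/message/data", and check it against the rules.
                    String path = "/" + String.join("/", open);
                    if (paths.contains(path)) {
                        hits.add(path + " -> " + reader.getText());
                    }
                }
            }
        } finally {
            reader.close();
        }
        return hits;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<message><to>alice</to><data>AAAA</data></message>";
        match(xml, Set.of("/message/data")).forEach(System.out::println);
    }
}
```

Text under /message/to is skipped because no rule asks for it; only the registered path triggers any work in your own code.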

The implementation is hyper-efficient (I am not kidding, I spent days profiling it to get the overhead of the implementation BELOW the base STAX classes so it adds no measurable overhead) and super-easy to use.

NOTE: You said that the byte[] that come with each message are "separate files". I am not sure what you mean here in the context of XML parsing; I think a few of us probably assumed your binary data was base64 encoded inside of your XML messages. If that isn't the case and you have auxiliary payloads of data with each message coming over the wire, then to keep memory usage low you'll want to stream that data (a chunk at a time) off the wire directly to your database.

If your database doesn't allow streaming inserts a segment at a time and needs the entire byte[] blob, then just get that byte[] off the wire and into the DB as soon as possible to keep memory usage low. If those are really 1 MB of raw data each, then that is likely what is blowing your heap, especially if there are a lot of simultaneous connections.
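For the case where the binary data is base64 text inside an element, a minimal sketch of the chunk-at-a-time idea might look like the following. It assumes the element contains only text, and the class and method names (`Base64ElementStreamer`, `copyDecoded`, `decodeFirst`) are hypothetical. The sink is just an `OutputStream`; in a real setup it could be whatever streaming handle your database driver offers (for example the stream returned by `java.sql.Blob.setBinaryStream`, if your driver supports it). How large each text event is depends on the StAX implementation, so treat this as an illustration rather than a tuned solution.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: decodes base64 element text piece by piece instead of as one big String.
final class Base64ElementStreamer {

    // Precondition: reader is positioned on the START_ELEMENT of a text-only element.
    static long copyDecoded(XMLStreamReader reader, OutputStream sink)
            throws XMLStreamException, IOException {
        Base64.Decoder decoder = Base64.getDecoder();
        StringBuilder pending = new StringBuilder(); // base64 chars not yet decoded
        long written = 0;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.END_ELEMENT) {
                break; // end of the attachment element
            }
            if (event != XMLStreamConstants.CHARACTERS && event != XMLStreamConstants.CDATA) {
                continue; // ignore comments, processing instructions, etc.
            }
            char[] text = reader.getTextCharacters();
            int start = reader.getTextStart();
            int end = start + reader.getTextLength();
            for (int i = start; i < end; i++) {
                if (!Character.isWhitespace(text[i])) {
                    pending.append(text[i]); // drop line breaks inside the base64 text
                }
            }
            // Base64 decodes in groups of 4 chars; keep any incomplete group for the next event.
            int usable = pending.length() - (pending.length() % 4);
            if (usable > 0) {
                byte[] decoded = decoder.decode(pending.substring(0, usable));
                sink.write(decoded);
                written += decoded.length;
                pending.delete(0, usable);
            }
        }
        if (pending.length() > 0) { // trailing group without padding
            byte[] decoded = decoder.decode(pending.toString());
            sink.write(decoded);
            written += decoded.length;
        }
        return written;
    }

    // Demo helper: finds the first element with the given name and decodes it into memory.
    static byte[] decodeFirst(String xml, String elementName)
            throws XMLStreamException, IOException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    ByteArrayOutputStream sink = new ByteArrayOutputStream();
                    copyDecoded(reader, sink);
                    return sink.toByteArray();
                }
            }
            return new byte[0];
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String payload = Base64.getEncoder()
                .encodeToString("hello attachment".getBytes(StandardCharsets.UTF_8));
        String xml = "<message><to>alice</to><attachment>" + payload + "</attachment></message>";
        System.out.println(new String(decodeFirst(xml, "attachment"), StandardCharsets.UTF_8));
    }
}
```

The demo collects the bytes in a `ByteArrayOutputStream` only so the result can be printed; the point of `copyDecoded` is that the decoded attachment never has to exist as a single array if the sink writes it onward.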

If you want to share more data about your impl I am sure we can help with suggestions.

Thanks for the answer, Riyad. I'm going to try your library ASAP, it looks perfect for my needs. As for the NOTE part: I actually made a separate question about blobs from XML and memory usage; if you have time, please check it out: http://stackoverflow.com/questions/9512646/stax-reading-base64-string-from-xml-into-db#comment12049388_9512646. – emmma1223, Mar 01 '12 at 16:45

score 1 · Answer 2 · answered Feb 25 '12 at 00:15

Converting your data from the XML model into the Java model just so you can convert it to the database model feels all wrong to me. Look for tools that support XML to database without going via Java objects - if your database doesn't have XML import, look for a third-party tool. Saxon's XSLT-SQL module probably isn't up to handling binary data, but there are probably tools that are.

Michael Kay

But I only need some of the data. I don't need to store the whole XML document. It goes like this: I receive the XML, I need to get some values from some elements, and I save them in the DB. – emmma1223 Feb 25 '12 at 09:57

score 1 · Answer 3 · answered Feb 25 '12 at 10:10

The simplest approach you could use would be DOM (plenty of examples in Google).

It preloads all the data to build an in-memory tree structure, so access is fast, and since 10 MB is not that big you could try it (of course the in-memory representation will be bigger than the file).

DOM is also perhaps the simplest/easiest API you could use.
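As a rough illustration of how little code the DOM route needs, here is a minimal sketch using the JDK's built-in parser. The class name `DomSketch`, the `firstText` helper and the sample document are made up for the example; keep in mind the whole tree, attachments included, sits in memory while you work with it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: loads the whole document as a tree and reads one value from it.
final class DomSketch {

    // Returns the text of the first element with the given tag name, or null if there is none.
    static String firstText(String xml, String tag) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml))); // entire tree in memory
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<message><to>alice</to><from>bob</from></message>";
        System.out.println(firstText(xml, "to"));
        System.out.println(firstText(xml, "from"));
    }
}
```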

Another library you could try is Simple XML. It is very light and the API looks like JAXB but it is more intuitive and simpler.

If after trying these you still feel you need something with lower memory demands, you could use a stream-based parser, e.g. StAX, but the API is quite different and IMO somewhat "harder" to use.

Cratylus

There really isn't much difference in terms of complexity of the Simple XML and JAXB (JSR-222) APIs (see http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-simple.html). Also since JAXB is the standard Web Service binding layer for both JAX-WS (SOAP) and JAX-RS (RESTful), JAXB is much easier to use in those environments. – bdoughan Feb 25 '12 at 12:01
Thanks for the suggestions! But doesn't JAXB already use StAX? What memory usage should I aim for? I profiled JAXB unmarshalling for my XML document, and it uses ~100-150 MB of memory; JiBX uses 25-50% less. Thanks again. – emmma1223 Feb 25 '12 at 12:35
StAX uses very little memory if you manage your instances correctly (i.e. releasing unneeded ones early). – unludo Feb 25 '12 at 15:33
@emmma1223: To be honest, I don't know which parser `JAXB` uses. My guess is that it uses whatever `JAXP` processor is available. – Cratylus Feb 25 '12 at 16:36
@Blaise Doughan: Thanks for the link. I found `Simple` easier to use, especially in cases where you want to transform your data to `String`. But broadly speaking you are correct. `JAXB` would be the standard approach. – Cratylus Feb 25 '12 at 16:38

unludo · Accepted Answer · 2012-02-25T16:04:44.777

You could use StAX; it's a good fit for quickly ingesting/generating XML. It's part of the JDK now and very simple to use. You will like it :-).

The point is that you explicitly handle each element and attribute as you read the file. You loop over the element events (start/end) and get easy access to their attributes. It gives you precise control over what you want to do. Also, not everything is loaded into memory, as it is with DOM.
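A minimal sketch of that loop, using the cursor-style `XMLStreamReader` that ships with the JDK, could look like this. The class name `StaxLoopSketch`, the `scan` method, and the element names ("to", "from") are assumptions picked to resemble the message described in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: walks the document one event at a time and records what it sees.
final class StaxLoopSketch {

    static List<String> scan(String xml) throws XMLStreamException {
        List<String> seen = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next(); // pull the next event only when we ask for it
                if (event != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                // Attributes are available while positioned on the start tag.
                for (int i = 0; i < reader.getAttributeCount(); i++) {
                    seen.add(name + "@" + reader.getAttributeLocalName(i)
                            + "=" + reader.getAttributeValue(i));
                }
                // For the simple fields we care about, read the text content directly.
                if (name.equals("to") || name.equals("from")) {
                    seen.add(name + ":" + reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<message id=\"42\"><to>alice</to><from>bob</from></message>";
        scan(xml).forEach(System.out::println);
    }
}
```

Only the values you choose to keep end up in your own objects; elements you don't ask about are simply passed over.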

There are a lot of tutorials online. Here is the first page about it that I found on the Oracle web site: http://docs.oracle.com/javaee/5/tutorial/doc/bnbem.html

edited Feb 25 '12 at 16:04

answered Feb 25 '12 at 10:25

unludo

Thanks for the suggestion! Should I try the Woodstox implementation or the default one? – emmma1223 Feb 25 '12 at 17:26
I think the default one will do. I don't know about the Woodstox one though. – unludo Feb 25 '12 at 18:43
@emmma1223 Glad you liked it. This lib helped me a lot ;-) – unludo Mar 01 '12 at 09:00
