[HEADS UP] - Splitting big XML files using XPath

Claus Ibsen-2
Hi

I recently had a look at improving the XSLT, XQuery and XPath
components in Camel.

For example, the first two of these components now support StAX as a Source,
and prefer StAX/SAX over DOM. For StAX you will need to enable it
using the allowStAX option (to be backwards compatible).

The latter (XPath) does not support this, because its javax API is limited.
Likewise the XPath engine in the JDK does not support streaming, so we
end up loading the content into a DOM in memory.

This means that when people try to split a big XML file with XPath
in Camel, they either hit an OutOfMemoryError (OOME) or end up with a
solution that eats up memory and slows the system down.
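
To make the memory problem concrete, here is a minimal sketch of XPath splitting with only the JDK API. The class name XPathSplitSketch and the /records/record expression are assumptions for illustration, not Camel code. The point is that evaluate(...) has to build the full DOM from the InputSource before it can return the first node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Camel.
final class XPathSplitSketch {

    // Returns the id attribute of every /records/record element.
    static List<String> recordIds(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // The whole input is parsed into an in-memory DOM before the
            // expression is evaluated; nothing is streamed.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/records/record",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getAttributes().getNamedItem("id").getNodeValue());
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

With a small payload this is harmless; with a big file the DOM is what eats the heap.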

The solution is to build a custom expression that will iterate the
file source in pieces and do the "XPath splitting" manually.
So I have enhanced the tokenizer language in Camel so it can do this for you.

See the sections:
- Stream based
- streaming big XML payloads using Tokenizer language
at http://camel.apache.org/splitter

The idea is that you provide a start and end token, and then the
tokenizer will chop the payload by grabbing the content between those
tokens.
All in a streamed fashion using the java.util.Scanner from the JDK.
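
A rough sketch of the idea, using only java.util.Scanner. The class TokenPairSketch and its splitAll helper are made up for this mail and are not the actual Camel implementation. It also assumes the input is well formed, so every start token is eventually followed by an end token.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not Camel's own class.
final class TokenPairSketch implements Iterator<String> {
    private final Scanner scanner;
    private final String startToken;
    private final String endToken;
    private String next;

    TokenPairSketch(Readable source, String startToken, String endToken) {
        this.startToken = startToken;
        this.endToken = endToken;
        // The end token is the Scanner delimiter, so the Scanner only needs
        // to hold roughly one record at a time, not the whole payload.
        this.scanner = new Scanner(source).useDelimiter(Pattern.quote(endToken));
        advance();
    }

    // Reads chunks until one contains the start token. Assumes every start
    // token in the input is eventually followed by an end token.
    private void advance() {
        next = null;
        while (next == null && scanner.hasNext()) {
            String chunk = scanner.next();
            int start = chunk.indexOf(startToken);
            if (start >= 0) {
                // Drop anything before the start token and re-append the end
                // token that the Scanner consumed as the delimiter.
                next = chunk.substring(start) + endToken;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        advance();
        return current;
    }

    // Convenience for small inputs; real use would iterate lazily.
    static List<String> splitAll(String input, String startToken, String endToken) {
        List<String> parts = new ArrayList<>();
        new TokenPairSketch(new StringReader(input), startToken, endToken)
                .forEachRemaining(parts::add);
        return parts;
    }
}
```

Note that in this sketch a start token of "<record " (with the trailing space) is needed when the root tag is <records>, otherwise the root tag would match as well.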

I added some unit tests to simulate big data and to output performance
in camel-core
- TokenPairIteratorSplitChoicePerformanceTest
- XPathSplitChoicePerformanceTest

camel-saxon has a unit test as well
- XPathSplitChoicePerformanceTest

I noticed Saxon is faster than the JDK XPath engine, but they both eat
up memory. I looked at Saxon and they are starting to support
streaming, but only in their EE version (which you need to buy a
license for), and the streaming seems to be XSLT specific at first
(not XPath).

I also added INFO logging to the XPathBuilder, so it logs once when
it initializes the XPathFactory. This lets you see which factory
is used:
INFO  XPathBuilder                   - Created default XPathFactory com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl@3749eb9f
For example, if you have Saxon on the classpath it may use that instead.
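
If you want to check this outside Camel, a small probe like the following shows which factory the JDK lookup picks. The class name XPathFactoryProbe is just an example name.

```java
import javax.xml.xpath.XPathFactory;

// Hypothetical probe for illustration only; not part of Camel.
final class XPathFactoryProbe {

    // Name of the XPathFactory implementation the JDK lookup resolves to.
    static String defaultFactoryClassName() {
        return XPathFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The printed class depends on what is on the classpath.
        System.out.println("Default XPathFactory: " + defaultFactoryClassName());
    }
}
```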

For example, splitting 40,000 elements using the JDK XPath engine:
- Processed file with 40000 elements in: 45.521 seconds   (uses about 98 MB)

And 40,000 elements with the tokenizer:
- Processed file with 40000 elements in: 47.291 seconds   (uses about 6 MB)

And 200,000 elements with the tokenizer:
- Processed file with 200000 elements in: 3 minutes   (uses about 14 MB)

I could not run the 200,000 elements with XPath as it hit OOME (unless
I bump up the JVM memory allocation a lot).

So it's not really about speed, but about memory usage. The tokenizer
uses very little memory, whereas XPath will just keep eating
memory.
If the XML data is very big, then only the tokenizer will be able
to split the file.

The tokenizer is of course not using a real XPath expression, so you
can only split by chopping a "record" out of your XML file.
But if you structure your XML data as follows, then the tokenizer can handle it:
<records>
  <record id="1">
  </record>
  <record id="2">
  </record>
  <record id="3">
  </record>
   ....
  <record id="N">
  </record>
</records>

The tokenizer also supports non-XML data, in case you have
special START/END tokens for your records.


What about other XPath libraries?
Yes, there are a few out there. Some are not so actively maintained (I
guess some of the XML hype is over now), and others have a GPL license
or another kind of license that prevents us from using them at Apache:
http://www.apache.org/legal/3party.html#define-thirdpartywork



--
Claus Ibsen
-----------------
FuseSource
Email: [hidden email]
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/

Re: [HEADS UP] - Splitting big XML files using XPath

Claus Ibsen-2
Romain has worked on a StAX expression iterator which also allows
splitting big XML files, but using the JAXB/StAX API.
https://issues.apache.org/jira/browse/CAMEL-3966

This requires end users to have model classes with JAXB annotations,
which you then use as the matcher in the iterator.
So you would have Records and Record classes with JAXB annotations.

This would also be a solution, but it is of course purely XML based and
requires model classes. However, I like this approach.
It could also be a base for the StAXBuilder that Christian Mueller has
proposed in https://issues.apache.org/jira/browse/CAMEL-3998
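
To give an idea of the streaming part, here is a small sketch that walks the records with the StAX cursor API only. It is not the code from CAMEL-3966, and the class name StaxRecordSketch is made up. JAXB is left out so the sketch runs on a plain JDK; in a JAXB variant you would hand the reader to an Unmarshaller at each matching start element instead of reading the id attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not the CAMEL-3966 iterator.
final class StaxRecordSketch {

    // Collects the id attribute of every element with the given local name.
    static List<String> recordIds(String xml, String elementName) {
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // The cursor moves forward event by event, so no DOM is built.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // A null namespace URI means the attribute namespace is not checked.
                    ids.add(reader.getAttributeValue(null, "id"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("StAX parsing failed", e);
        }
        return ids;
    }
}
```

Because the match is on the local name, the sketch also works when the root tag declares a default namespace.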



On Sun, Oct 30, 2011 at 10:50 AM, Claus Ibsen <[hidden email]> wrote:

> Hi
>
> I recently had a look at improving the XSLT, XQuery and XPath
> components in Camel.

Re: [HEADS UP] - Splitting big XML files using XPath

Christian Mueller
This is a very good improvement. Thank you Claus!

We should also have a solution for "enterprise" users, who often use
namespaces like this:
<records xmlns="http://foo" xmlns:bar="http://bar">
 <record id="1">
 </record>
 <record id="2">
 </record>
 <record id="3">
 </record>
  ....
 <record id="N">
 </record>
</records>

After splitting a large XML file into its individual parts, we should
have something like:
<record id="1" xmlns="http://foo" xmlns:bar="http://bar">
</record>

Best,
Christian

Re: [HEADS UP] - Splitting big XML files using XPath

Claus Ibsen-2
On Sun, Oct 30, 2011 at 9:38 PM, Christian Müller
<[hidden email]> wrote:

> After splitting the large XML files into its individual parts, we should
> have something like:
> <record id="1" xmlns="http://foo" xmlns:bar="http://bar">
> </record>
>

Yeah, that is something we need to look into as well.

The StAX / JAXB / DOM APIs in the JDK can support this as well,
but I have noticed a significant performance impact.

So with the tokenizer we could probably add support for the end user
to provide a root tag, from which the tokenizer grabs the namespace
declarations and injects them into the split messages.

That should support the use case outlined by Christian, which is also
the use case I have encountered by far the most:
setting namespaces once on the root tag.
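
Roughly what I have in mind, as a plain string-based sketch. The class NamespaceInheritSketch and its two methods are hypothetical and only illustrate the idea. It assumes double-quoted attribute values and no '>' character inside attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not Camel code.
final class NamespaceInheritSketch {

    // Matches xmlns="..." and xmlns:prefix="..." (double-quoted values only).
    private static final Pattern XMLNS = Pattern.compile("xmlns(:\\w+)?=\"[^\"]*\"");

    // Collects the namespace declarations of a root start tag,
    // each prefixed with a single space.
    static String namespacesOf(String rootStartTag) {
        StringBuilder sb = new StringBuilder();
        Matcher m = XMLNS.matcher(rootStartTag);
        while (m.find()) {
            sb.append(' ').append(m.group());
        }
        return sb.toString();
    }

    // Injects the declarations at the end of the record's start tag.
    // Assumes no '>' appears inside an attribute value.
    static String inject(String record, String namespaces) {
        int end = record.indexOf('>');
        if (end <= 0 || namespaces.isEmpty()) {
            return record;
        }
        // Keep a self-closing "/>" intact by inserting before the slash.
        int insertAt = record.charAt(end - 1) == '/' ? end - 1 : end;
        return record.substring(0, insertAt) + namespaces + record.substring(insertAt);
    }
}
```

The root tag only has to be read once, and each split record then gets the same declarations appended to its start tag.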







Re: [HEADS UP] - Splitting big XML files using XPath

Claus Ibsen-2
In reply to this post by Christian Mueller

This is now supported using tokenizeXML, for example in XML:

  <camelContext xmlns="http://camel.apache.org/schema/spring">
    <route>
      <from uri="file:target/pair"/>
      <split streaming="true">
        <!-- split the file using XML tokenizer, where we grab the record tag,
             and inherit the namespaces from the parent/root records tag
             the xml attribute must be set to true, to enable XML mode -->
        <tokenize token="record" inheritNamespaceTagName="records" xml="true"/>
        <to uri="mock:split"/>
      </split>
    </route>
  </camelContext>

In Java code you simply do:

  from("file:target/pair")
      .split().tokenizeXML("record", "records").streaming()
      .to("mock:split");

