|
Hi

I recently had a look at improving the XSLT, XQuery and XPath
components in Camel.

For example, the first two of these components now support StAX as a
Source, and they prefer StAX/SAX over DOM. For StAX you need to enable
it using the allowStAX option (to stay backwards compatible).

The latter (XPath) does not support this, because its javax API is
limited. Likewise the XPath engine in the JDK does not support
streaming, so we end up loading the content into a DOM in memory.

This means that when people try to split a big XML file with XPath in
Camel, they either hit an OOME or end up with a solution that eats
memory and slows the system down.

The solution is to build a custom expression that iterates the file
source in pieces and does the "XPath splitting" manually.
So I have enhanced the tokenizer language in Camel so it can do this
for you.

See the sections:
- Stream based
- Streaming big XML payloads using Tokenizer language
at http://camel.apache.org/splitter

The idea is that you provide a start and an end token, and the
tokenizer chops the payload by grabbing the content between those
tokens, all in a streamed fashion using the java.util.Scanner from the
JDK.

I added some unit tests to simulate big data and to output performance
in camel-core:
- TokenPairIteratorSplitChoicePerformanceTest
- XPathSplitChoicePerformanceTest

In camel-saxon we have a unit test as well:
- XPathSplitChoicePerformanceTest

I noticed Saxon is faster than the JDK XPath engine, but they both eat
up memory. I looked at Saxon and they are starting to support
streaming, but only in their EE version (which you need to buy a
license for), and the streaming seems to be XSLT-specific at first
(not XPath).

I also added INFO logging in the XPathBuilder, so it logs once when it
initializes the XPathFactory. This lets you know which factory is used:

    INFO XPathBuilder - Created default XPathFactory
    com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl@3749eb9f

For example, if you have Saxon on the classpath it may use that instead.

Splitting 40,000 elements using the JDK XPath engine:
- Processed file with 40000 elements in: 45.521 seconds (uses about 98 MB)

40,000 elements with the tokenizer:
- Processed file with 40000 elements in: 47.291 seconds (uses about 6 MB)

200,000 elements with the tokenizer:
- Processed file with 200000 elements in: 3 minutes (uses about 14 MB)

I could not run the 200,000 elements with XPath as it hit an OOME
(unless I bump up the JVM memory allocation a lot).

So it's not really about speed, but about memory usage. The tokenizer
uses very little memory, whereas XPath just keeps eating memory.
If the XML data were very big, only the tokenizer would be able to
split the file.

The tokenizer is of course not using a real XPath expression, so you
can only split by chopping out a "record" of your XML file.
But if you structure your XML data as follows, then the tokenizer can
handle it:

    <records>
      <record id="1">
      </record>
      <record id="2">
      </record>
      <record id="3">
      </record>
      ....
      <record id="N">
      </record>
    </records>

The tokenizer can also support non-XML data, in case you have special
START/END tokens for your records.


What about other XPath libraries?
Yes, there are a few out there. Some are not so actively maintained (I
guess some of the XML hype is over now) and others have a GPL license
or another kind of license that prevents us from using them at Apache:
http://www.apache.org/legal/3party.html#define-thirdpartywork


--
Claus Ibsen
-----------------
FuseSource
Email: [hidden email]
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/
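
To illustrate the start/end token idea described in the post, here is a minimal plain-JDK sketch. It is not Camel's actual tokenizer code; the class name TokenPairSketch and the method forEachRecord are made up for illustration. It assumes records are not nested and that the end token does not occur inside a record.

```java
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper, NOT the Camel implementation.
class TokenPairSketch {

    /** Hands each chunk between startToken and endToken (inclusive) to the handler. */
    static void forEachRecord(Readable source, String startToken, String endToken,
                              Consumer<String> handler) {
        // The Scanner reads the source incrementally. With the end token as
        // delimiter, only one chunk is buffered at a time.
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter(Pattern.quote(endToken));
            while (scanner.hasNext()) {
                String chunk = scanner.next();
                // The chunk stops right before the end token, so the LAST start
                // token in it opens the record (assumes no nested records).
                int start = chunk.lastIndexOf(startToken);
                if (start >= 0) {
                    // Drop anything before the start token and restore the end token.
                    handler.accept(chunk.substring(start) + endToken);
                }
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\">a</record>"
                + "<record id=\"2\">b</record></records>";
        forEachRecord(new StringReader(xml), "<record", "</record>", System.out::println);
    }
}
```

Because each record is handed to the handler as soon as it is read, memory use depends on the size of a single record rather than on the size of the whole file.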
|
Romain has worked on a StAX expression iterator which also allows
splitting big XML files, but using the JAXB/StAX API.
https://issues.apache.org/jira/browse/CAMEL-3966

This requires end users to have model classes with JAXB annotations,
which you then use as the matcher in the iterator. So you would have
Records and Record classes with JAXB annotations.

This would also be a solution, but it is of course purely XML based and
requires model classes. However, I like this approach, and it could be
a base for the StAXBuilder that Christian Mueller has proposed in
https://issues.apache.org/jira/browse/CAMEL-3998

On Sun, Oct 30, 2011 at 10:50 AM, Claus Ibsen <[hidden email]> wrote:
> Hi
>
> I recently had a look at improving the XSLT, XQuery and XPath
> components in Camel.
> [...]

--
Claus Ibsen
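
For readers unfamiliar with StAX, the following sketch shows the pull-parsing style that such an iterator builds on. It uses only the JDK's javax.xml.stream cursor API and is not the iterator from CAMEL-3966. The class name StaxRecordSketch and the method recordIds are made up for illustration, and the JAXB unmarshalling step is left out (the sketch just reads the id attribute of each record).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, NOT the CAMEL-3966 iterator.
class StaxRecordSketch {

    /** Returns the id attribute of every <record> element found in the source. */
    static List<String> recordIds(Reader source) {
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(source);
            while (reader.hasNext()) {
                // next() advances one parse event at a time, so the parser
                // never builds a DOM of the whole document.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // A null namespace URI means the attribute namespace is not checked.
                    ids.add(reader.getAttributeValue(null, "id"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\"/><record id=\"2\"/></records>";
        System.out.println(recordIds(new StringReader(xml)));
    }
}
```

A JAXB-based iterator would presumably unmarshal each record into a model class at the point where this sketch reads the attribute.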
|
This is a very good improvement. Thank you Claus!
We should also have a "solution for the enterprise" users, who often use
namespaces like this:

    <records xmlns="http://foo" xmlns:bar="http://bar">
      <record id="1">
      </record>
      <record id="2">
      </record>
      <record id="3">
      </record>
      ....
      <record id="N">
      </record>
    </records>

After splitting the large XML file into its individual parts, we should
have something like:

    <record id="1" xmlns="http://foo" xmlns:bar="http://bar">
    </record>

Best,
Christian
|
On Sun, Oct 30, 2011 at 9:38 PM, Christian Müller <[hidden email]> wrote:
> This is a very good improvement. Thank you Claus!
>
> We should also have a "solution for the enterprise" users which often use
> namespaces like this:
> <records xmlns="http://foo" xmlns:bar="http://bar">
>   <record id="1">
>   </record>
>   <record id="2">
>   </record>
>   <record id="3">
>   </record>
>   ....
>   <record id="N">
>   </record>
> </records>
>
> After splitting the large XML files into its individual parts, we should
> have something like:
> <record id="1" xmlns="http://foo" xmlns:bar="http://bar">
> </record>
>

Yeah, that is something we need to look into as well.

The StAX / JAXB / DOM APIs in the JDK can support this, but I have
noticed a significant performance impact with them.

So with the tokenizer we could probably add support for letting the end
user provide a root tag. The tokenizer would then grab the namespace
declarations from that tag and inject them into the split messages.

That should support the use case outlined by Christian, which is also
the use case I have encountered by far the most: namespaces set once on
the root tag.

> Best,
> Christian
>

--
Claus Ibsen
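
The namespace idea described in the reply can be sketched with plain string handling. This is only an illustration under simplifying assumptions (double-quoted attribute values, the record start tag is not self-closing), not the code that ended up in Camel. The class name NamespaceInheritSketch and both method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT the Camel implementation.
class NamespaceInheritSketch {

    // Matches xmlns="..." and xmlns:prefix="..." including the leading whitespace.
    private static final Pattern NAMESPACE =
            Pattern.compile("\\s+xmlns(:\\w+)?=\"[^\"]*\"");

    /** Collects all namespace declarations from the root start tag. */
    static String namespacesOf(String rootStartTag) {
        StringBuilder declarations = new StringBuilder();
        Matcher matcher = NAMESPACE.matcher(rootStartTag);
        while (matcher.find()) {
            declarations.append(matcher.group());
        }
        return declarations.toString();
    }

    /** Inserts the declarations just before the first '>' of the record. */
    static String inherit(String record, String declarations) {
        int end = record.indexOf('>');
        return record.substring(0, end) + declarations + record.substring(end);
    }

    public static void main(String[] args) {
        String root = "<records xmlns=\"http://foo\" xmlns:bar=\"http://bar\">";
        String record = "<record id=\"1\"></record>";
        System.out.println(inherit(record, namespacesOf(root)));
    }
}
```

The declarations only need to be extracted once per file, since they are set once on the root tag, and can then be reused for every record.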
|
On Sun, Oct 30, 2011 at 9:38 PM, Christian Müller <[hidden email]> wrote:
> [...]
> After splitting the large XML files into its individual parts, we should
> have something like:
> <record id="1" xmlns="http://foo" xmlns:bar="http://bar">
> </record>
>

This is now supported using tokenizeXML. For example, in XML:

    <camelContext xmlns="http://camel.apache.org/schema/spring">
      <route>
        <from uri="file:target/pair"/>
        <split streaming="true">
          <!-- split the file using the XML tokenizer, where we grab the
               record tag and inherit the namespaces from the parent/root
               records tag. The xml attribute must be set to true to
               enable XML mode -->
          <tokenize token="record" inheritNamespaceTagName="records" xml="true"/>
          <to uri="mock:split"/>
        </split>
      </route>
    </camelContext>

In Java code you simply do:

    // inside RouteBuilder.configure(); streaming() is set on the
    // splitter after the tokenizeXML expression
    from("file:target/pair")
        .split().tokenizeXML("record", "records").streaming()
        .to("mock:split");

> Best,
> Christian
>

--
Claus Ibsen
