Use-case question: Can Camel be used for dynamic ETL-tool?


Use-case question: Can Camel be used for dynamic ETL-tool?

britske
Hi,

I need to build a dynamic ETL tool, which basically must extract, transform, and translate information from various HTML/XML sources and persist it into an RDBMS. Information loaded into the database must conform to a specific target schema.

The rules for this extraction should be easy to create (eventually through a GUI which spits out an XML definition of some sort), as various sources may be added which need different translation rules to conform to the target schema.

Different lookup tables and other external data sources need to be incorporated to translate parts of the information, for instance for currency conversion, etc.

Is Camel suited for such a use-case?

Thanks in advance,
Geert-Jan

Re: Use-case question: Can Camel be used for dynamic ETL-tool?

jstrachan
On 7/24/07, Britske <[hidden email]> wrote:

>
> Hi,
>
> I have the need of creating a dynamic ETL tool, which basically must
> extract, transform, and translate information from various HTML/XML sources and
> persist it into an RDBMS. Information loaded into the database must conform to a
> specific target schema.
>
> The rules for this extraction should be easy to create (in the end using a
> GUI which spits out an xml-definition of some sort) as various sources may
> be added which need different translation rules to conform to the target
> schema.
>
> Different lookup tables and other external data sources need to be
> incorporated to translate parts of the information, for instance
> currency conversion, etc.
>
> Is Camel suited for such a use-case?

I'd have thought so. We've got most of the components and processors
you need, together with a nice DSL (Java or XML) for wiring stuff
together: file processing, XML validation, pluggable
transformers/processors, then writing to the database (currently via
JPA, but we could easily support pure JDBC or iBATIS etc.) along with
SEDA and pluggable filters etc.

We might have to tweak a few things along the way to smooth out some
rough edges to make things easier to use, but I think it should work -
it's more a question of how well; I'd love to make Camel as easy to
use as possible for a powerful ETL tool.

e.g. it's easy to start with a single route of the form:

from("file://some/directory").process(someProcessor).process(anotherProcessor).to("jpa:SomePojo");

to add some processors to transform the file into something else, then
store it in a database using JPA and some POJO.
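Spelled out as a full RouteBuilder (the processor and POJO names here are placeholders, not from a real project), that single route might look like:

```java
import org.apache.camel.builder.RouteBuilder;

// Sketch of the single-route version of the ETL flow.
// SomeProcessor, AnotherProcessor and com.acme.SomePojo are invented names.
public class EtlRoute extends RouteBuilder {
    public void configure() throws Exception {
        from("file://some/directory")         // poll files from a directory
            .process(new SomeProcessor())     // e.g. parse the raw file
            .process(new AnotherProcessor())  // e.g. map it onto an entity bean
            .to("jpa:com.acme.SomePojo");     // persist the body via the JPA component
    }
}
```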

You might want to split things up into a number of parallel
SEDA-based steps, e.g.

from("file://something").to("seda:toXml");
from("seda:toXml").process(someFileToXmlConverter).to("seda:validate");

...

and separate out the steps into parallel SEDA queues etc. Note that
the above Java code could be written using XML if you prefer (or
generated by some tool etc).

--
James
-------
http://macstrac.blogspot.com/

Re: Use-case question: Can Camel be used for dynamic ETL-tool?

britske
Thanks for the quick reply: sounds good.

Do you know of any example code that would link Camel and Hibernate together? Being able to change the target data model would be a requirement as well, while maintaining the target schema (Hibernate can be used as an abstraction for this; not sure about performance through Hibernate for bulk updates though..)

Another thing, just to be certain: leveraging the power of ActiveMQ would enable the ETL tool to easily scale over multiple servers / processors / threads, right? I haven't used this stack (ActiveMQ / Camel) before at all, but the message paradigm seems to be the perfect solution for this. With the big probability of stating the obvious ;-).

This would mean that pipelines operating in different threads or even different servers need to be able to handle shared queues with all the locking / concurrency stuff, and pipelines on different servers need to be able to ping each other to go to work. Is this possible as well?  Is this what you are referring to as parallel SEDA queues? I think I have to read up..

Last thing: the ETL tool can get various inputs, one of which is a web crawler which is scheduled periodically to fetch some HTML (based on patterns or whatever). Would/could such a multithreaded crawler, in your opinion, be an integral part of the ETL tool, i.e. one of the 'input Camel pipelines'?  Pros and cons would be highly appreciated! ;-)

this sounds exciting!
Thanks in advance,

Geert-Jan




Re: Use-case question: Can Camel be used for dynamic ETL-tool?

jstrachan
On 7/25/07, Britske <[hidden email]> wrote:
> Thanks for the quick reply: sounds good.
>
> Do you know of any example code that would link camel and hibernate
> together?

The use of the JPA endpoint is a good start...
http://cwiki.apache.org/CAMEL/jpa.html
(I've just updated the docs hence the strange URL)

basically this component takes any entity bean (a POJO with an @Entity
annotation) and stores it via a JPA provider like Hibernate, OpenJPA or
TopLink.

So if you parsed some file then transformed it into some entity bean
you could then persist it in hibernate by just routing it to a JPA
endpoint.

The JPA endpoint basically assumes that it's given an entity bean in
the message body and persists it; so the endpoint can deal with any
kind of JPA-enabled POJO. You could use multiple endpoints for
different persistence contexts (e.g. different DBs or schemas etc) if
you need to.
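As a concrete sketch (the class and field names are invented for illustration), the endpoint just needs an annotated POJO in the message body:

```java
import javax.persistence.Entity;
import javax.persistence.Id;

// A placeholder entity bean; the JPA endpoint persists whatever
// @Entity-annotated object it finds in the message body.
@Entity
public class CustomerRecord {
    @Id
    private Long id;
    private String name;
    // getters/setters omitted for brevity
}

// In a route, once a processor has turned the parsed file into a
// CustomerRecord instance, persisting it is just:
//   from("seda:validated")
//       .process(xmlToCustomerRecordConverter)  // body becomes a CustomerRecord
//       .to("jpa:CustomerRecord");
```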


> As being able to change the target data model would be a
> requirement as well, while maintaining the target schema (Hibernate can be
> used for the abstraction of this; not sure about performance through
> Hibernate for bulk updates though..)

Using batch transactions should really help performance. Nothing ever
comes close to the performance of the raw DB dump tools that the database
vendors provide; but using bulk updates with large transaction batches
should be pretty fast.


> Another thing, just to be certain, leveraging the power of ActiveMQ would
> enable the ETL-tool to easily scale over multiple servers / processors /
> threads right?

Definitely. You can use the ActiveMQ component to load balance across
consumers on an ActiveMQ queue...
http://activemq.apache.org/camel/activemq.html

or you could use another JMS provider of your choice (though why use
any other provider when ActiveMQ is so good? :)
http://activemq.apache.org/camel/jms.html

finally if you want you could use in-JVM load balancing across a
thread pool (which is fine until you get CPU bound on a box)
http://activemq.apache.org/camel/seda.html


> I haven't used this stack (ActiveMQ / Camel) before at all,

If you're new to ActiveMQ I'd recommend starting with SEDA to get to
grips with asynchronous SEDA based processing;
http://activemq.apache.org/camel/seda.html

then move on to distributed SEDA (using JMS queues) later on when
you're feeling more confident.


> but the message-paradigm seems to be the perfect solution for this. With the
> big probability of stating the obvious ;-).

:)


> This would mean that pipelines operating in different threads or even
> different servers need to be able to handle shared queues with all the
> locking / concurrency stuff and pipelines in different servers being able to
> ping each other to go to work. Is this possible as well?  Is this what you
> are referring to as parallel SEDA queues? I think I have to read up..

Yeah. Whether using the SEDA or a JMS component, each thread is going
to process things concurrently. So you may want to consider ordering
and concurrency issues, and for some things you may need some kind of
concurrency lock. It's a whole large topic in and of itself ;-) - but
the quick 30,000 ft view is...

* for distributed locking, try using a database by default; they are
very good at it :)

* to preserve ordering across a JMS queue consider using Exclusive Consumers
http://activemq.apache.org/exclusive-consumer.html

or even better, Message Groups, which allow you to preserve ordering
across messages while still offering parallelisation, using the
JMSXGroupID header to determine what can be parallelized
http://activemq.apache.org/message-groups.html


A good rule of thumb to help reduce concurrency problems is to make
sure each single message can be processed as an atomic unit in parallel
(either without concurrency issues or using, say, database locking); or
if it can't, use a Message Group to relate together the messages which
need to be processed in order by a single thread.
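The invariant behind Message Groups can be pictured with a tiny pure-Java partitioning sketch: every message carrying the same JMSXGroupID lands on one consumer (so a group is processed in order by a single thread), while different groups still fan out in parallel. Note this hashing is only an illustration of the invariant, not ActiveMQ's actual assignment algorithm (the broker pins a group to the first consumer that receives one of its messages).

```java
// Illustration only: shows the "same group -> same consumer" invariant
// that Message Groups guarantee; ActiveMQ's real assignment is stateful.
public class GroupPartitioner {
    private final int consumers;

    public GroupPartitioner(int consumers) {
        this.consumers = consumers;
    }

    /** Every message with the same JMSXGroupID maps to the same consumer. */
    public int consumerFor(String jmsxGroupId) {
        return Math.floorMod(jmsxGroupId.hashCode(), consumers);
    }
}
```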


> Last thing: the ETL tool can get various inputs.  One of which is a
> webcrawler which is scheduled periodically to get some html (based on
> patterns or whatever). Would/could such a multithreaded crawler in your
> opinion be an integral part of the ETL tool, i.e. one of the
> 'input-camel-pipelines'?

Definitely! We should certainly do a web crawler/spider component.
We've got a file crawler so far, but not a web one yet.


> pros and cons would be highly appreciated! ;-)

You could certainly use any off-the-shelf spider to create files
that Camel could then process today. Or you could plug into some Java
spider, and when a new page is hit you could send a message into a
Camel endpoint using a CamelTemplate.
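The CamelTemplate hook-up could be as small as the sketch below; the endpoint URI, the header name and the spider's PageListener callback interface are all invented for the example.

```java
import org.apache.camel.CamelContext;
import org.apache.camel.CamelTemplate;

// Sketch: bridge a third-party Java spider into a Camel pipeline.
// PageListener is an assumed callback interface on the spider.
public class SpiderBridge implements PageListener {
    private final CamelTemplate template;

    public SpiderBridge(CamelContext context) {
        this.template = new CamelTemplate(context);
    }

    public void onPageFetched(String url, String html) {
        // hand the raw HTML to the ETL pipeline; a header carries the origin
        template.sendBodyAndHeader("seda:crawledPages", html, "sourceUrl", url);
    }
}
```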

Ideally though we'd create a web spider component so it'd be really
easy to set up EIP routes using a web spider as input - then we can use
the full power of the Enterprise Integration Patterns and Camel within
the web spider.


> this sounds exciting!

Agreed! :)


> Thanks in advance,

You're most welcome!


--
James
-------
http://macstrac.blogspot.com/

Re: Use-case question: Can Camel be used for dynamic ETL-tool?

britske
I've been wondering what mechanisms can be used to help automate the transformation from XML to entity beans (POJOs). I know of (that is to say, I've heard about) libraries such as JiBX which should help with this conversion by providing binding configurations.

The specific requirements that I have are that:
- POJOs are fixed (known a priori / at design time)
- formats of the source files (XMLs) are known at design time.

So the mechanism that I would like to use should make it possible to 'hot-deploy' binding configurations at run time.

I see 2 options for this:

1. Simply use a library such as JiBX that supports this flexible / runtime binding. (I don't know if JiBX does this or not.)

2.
- For each entity bean, make one target binding at runtime. As the POJO is known at runtime, this binding can be made at runtime as well.
- Let this target binding be defined as an XML schema (which library lets me do this?)
- For each newly discovered source file (which follows a certain pattern), define an XSLT mapping from source to target.
- An additional benefit is that the file can be validated against the XML schema after the XSLT transformation.

I prefer option 2, since:
- several XSLT processors are available with good performance characteristics;
- XSLT can (I bet) be perfectly integrated in a Camel pipeline;
- the problem of translating from XML to POJO is concentrated in only one mapping file per POJO, which sounds good from a maintenance standpoint.

I'm interested to hear your view on this. Does Camel provide certain providers to help tackle this? Do you have any experience with XML-to-POJO conversion under these requirements?  Which option do you prefer, or do you see a different solution altogether?

I think your comment on using a database for distributed locking got me asking this question, but isn't it an integral part of the Camel-on-ActiveMQ stack to take care of correct locking on a JMS queue, etc.? Why should a database be needed for this? I must be missing something obvious here..

Message Groups sound good by the way!

definitely going to test out some scenarios with Camel soon!

Geert-Jan



Re: Use-case question: Can Camel be used for dynamic ETL-tool?

jstrachan
On 7/26/07, Britske <[hidden email]> wrote:
> I've been wondering what mechanisms can be used to help in automating the
> transformation fom xml to entitybeans (pojo's) . I know (that is to say i've
> heard about) libraries such as Jibx which should help with this conversion
> by providing binding-configurations.

Yeah. You could try just using JAXB2 if the mapping between the XML
and your database is fairly simple: a single POJO with both JPA and
JAXB2 annotations. Or if you'd rather keep them separate, you could have
POJOs for the XML and then entity beans, with a little converter between
them. FWIW the ETL example I've hacked up uses this approach, so that
the XML can change independently from the database etc.
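For the single-POJO variant, the idea (with invented class and field names) is one class carrying both sets of annotations:

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlRootElement;

// Sketch of the "one POJO for both worlds" approach: JPA persists it,
// JAXB2 marshals it to/from XML. Names are illustrative only.
@Entity
@XmlRootElement(name = "customer")
public class Customer {
    @Id
    @XmlAttribute
    private Long id;       // <customer id="..."> and the primary key column
    private String name;   // a <name> element and a NAME column by default
    // getters/setters omitted for brevity
}
```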

Using a single POJO for both JPA and JAXB2 breaks down if you ever
have to support multiple XML formats for the same entity model (unless
you fork your code or something - maybe have derived classes to add
different annotations, but that can get kinda messy).


> The specific requirements that I have are that:
> - POJOs are fixed (known a priori / at design time)
> - formats of source files (XMLs) are known at design time.
>
> so the mechanism that I would like to use should have the characteristic
> that it's possible to 'hot-deploy' binding-configurations at run time.

Yeah - hot deployment can be done via an ESB like ServiceMix, via OSGi
or using WARs etc.


> I see 2 options for this:
>
> 1. simply use a library such as JiBX that supports this flexible / runtime
> binding. (I don't know if JiBX does this or not)

If the XML mapping is reasonably straightforward - or you are in
control of the XML format - then I'd definitely recommend JAXB2.  With
JAXB2 it's easy to change namespaces, element/attribute names and move
things from being elements to attributes and so forth - also adding
wrapper elements when required etc.

However if the XML can change in wacky ways, a more flexible mapping
library like JiBX might be a good idea.


> 2.
> - for each entitybean make 1 target binding at runtime. As the pojo is known
> at runtime, this binding can be made at runtime as well.
> - let this target-binding be defined as a xml-schema (which library lets me
> do this?)

JAXB2 is my recommendation; you can start XSD first if you like - or
POJO first with annotations and generate an XSD.

Incidentally, the XML configuration of Camel now uses the latter
approach: annotating POJOs and generating an XSD for the XML/Spring
configuration.

(The generation of the XSD takes place in the camel-spring module if
you're interested in the maven ninja for that).


> - for each newly discovered source file (which follows a certain pattern)
> define an XSLT mapping between source and target.
> - additional benefit is that the file can be validated against the
> XML schema after the XSLT transformation.
>
> I prefer option 2, since:
> - several XSLT processors are available with good
> performance characteristics.
> - XSLT can (I bet) be perfectly integrated in a Camel pipeline,

Definitely; as can XQuery too
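For instance, inside a RouteBuilder's configure() method the XSLT step slots straight in as an endpoint. The stylesheet and schema names are placeholders, and this assumes the xslt: and validator: components are available on the classpath:

```java
// Option 2 as a Camel route sketch: per-source stylesheet, then
// validation against the target schema, then hand-off for binding.
from("file://incoming")
    .to("xslt:source-to-target.xsl")   // XSLT transformation step
    .to("validator:target.xsd")        // validate the transformed XML
    .to("seda:bindToPojo");            // downstream XML-to-POJO binding
```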


> - the problem of translating from xml to pojo is concentrated in only 1
> mapping file per pojo, which sounds good from a maintenance standpoint.
>
> I'm interested to hear your view on this. Does Camel provide certain
> providers to help tackle this?

Sure - it's more a case of choosing the right approach for what you want to do.

> Do you have any experience in using xml to
> pojo conversion with these requirements?  What option do you prefer, or do
> you see a different solution altogether?

I've done my fair share of XSLT over the years and I tend to try to
avoid it whenever I can :)

I've used heaps of XML tools over the years (from writing dom4j &
jaxen way back when), to the other DOMs, to XMLBeans etc. JAXB2 really
does rock at solving the Object-XML mapping problem. It might not
quite fit every single use case, but it certainly fits most of them
IMHO.

e.g. take any XML document you ever get; run it through Trang to get a
schema, then let JAXB2 create the POJOs to read and write that XML
document for you and you're done - you can stay in the nice,
IDE-friendly world of Java :)

http://www.thaiopensource.com/relaxng/trang-manual.html

I think the one area JAXB2 falls down is if you want to support
multiple XML formats for the same POJO model; in JAXB2 you'd
essentially have different POJOs for each XML model (though you can
use base classes and so forth to help minimise issues with this
approach).


> I think your comment on using a database for distributed locking got me
> asking this question, but isn't it an integral part of Camel on ActiveMQ
> stack to take care of correct locking on a JMS queue, etc.?

Sorry, I should have been clearer. There are no locking issues with
JMS or ActiveMQ at all. JMS by design removes the need for locking, as
the message broker takes care of all of that for you.

By distributed locking I was thinking more of things like: if you
wanted to process a TransferMoney message that moves money from one
account to another, and you are processing these messages
concurrently, then using a database to ensure that the accounts are
updated atomically is a good idea.
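A toy in-JVM version of that TransferMoney concern (account names and amounts invented): in the distributed case, the database's row locks and transactions play the role of the lock here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch: updates to a pair of accounts must be atomic even when
// messages are processed concurrently. A database transaction does this
// across JVMs; here a single in-process lock stands in for it.
public class Accounts {
    private final Map<String, Long> balances = new ConcurrentHashMap<>();
    private final Object lock = new Object();

    public void open(String account, long initial) {
        balances.put(account, initial);
    }

    public void transfer(String from, String to, long amount) {
        synchronized (lock) {  // the "row lock" of the sketch
            balances.merge(from, -amount, Long::sum);
            balances.merge(to, amount, Long::sum);
        }
    }

    public long balance(String account) {
        return balances.get(account);
    }
}
```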


> Why should
> a database be needed for this? I must be missing something obvious here..
>
> Message Groups sound good by the way!

Yeah, they rock!

BTW I've added some documentation to the cookbook based on our conversations...
http://cwiki.apache.org/CAMEL/cookbook.html

in particular
http://cwiki.apache.org/CAMEL/etl.html

as well as
http://cwiki.apache.org/CAMEL/parallel-processing-and-ordering.html

and there's an early example here...
http://cwiki.apache.org/CAMEL/etl-example.html

it needs a bit of work, but it's getting close...


> definitely going to test out some scenarios with Camel somewhere soon!

Great! :)

--
James
-------
http://macstrac.blogspot.com/