Camel + Drill + Parquet

4 messages

Camel + Drill + Parquet

Ron Cecchini
Hi, all.  I'm just looking for quick guidance or confirmation that I'm going in the right direction here:

- There's a small Kotlin service that uses Camel to read from Kafka and write to Mongo.
- I need to replace Mongo with Apache Drill and write Parquet files to the file system.
  (I know nothing about Parquet but I know a little bit about Drill.)

- This service isn't used for any queries; it's just for persisting data.
  So, given that, and the fact that Drill is just a query engine, I really can't use the "Drill" component for anything.

- But there is that "HDFS" component that I think I can use?
  Or maybe the "File" component is better here?

So my thinking is that I just need to:

1. write a Processor to transform the JSON data into Parquet
   (and keep in mind that I know nothing about Parquet...)

2. use the HDFS (or File) component to write it to a file
   (I think there's some Parquet setup to do (?) outside the scope of this service, but that's another matter...)

Seems pretty straightforward.  Does that sound reasonable?
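For reference, a minimal sketch of such a route might look like the following. The topic, broker, output directory, and the JsonToParquetProcessor are all hypothetical placeholders, and it assumes camel-kafka and camel-file are on the classpath:

```java
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;

public class ParquetRoute extends RouteBuilder {

    // Hypothetical processor: the JSON-to-Parquet conversion would live here.
    static class JsonToParquetProcessor implements Processor {
        @Override
        public void process(Exchange exchange) {
            String json = exchange.getIn().getBody(String.class);
            // TODO: convert the JSON payload to Parquet bytes
            // (e.g. via parquet-avro) and set them back on the body.
            exchange.getIn().setBody(json.getBytes());
        }
    }

    @Override
    public void configure() {
        from("kafka:my-topic?brokers=localhost:9092")   // placeholder endpoint
            .process(new JsonToParquetProcessor())
            // The file component writes the body to disk; fileName uses the
            // simple language to generate a unique name per message.
            .to("file:/data/parquet?fileName=${date:now:yyyyMMddHHmmssSSS}.parquet");
    }
}
```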

Are there any Camel examples I can look at?  The Google machine seems to not find anything related to Camel and Parquet...

Thank you so much!

Ron

Re: Camel + Drill + Parquet

Omar Al-Safi
Hi Ron,

From reading some introductory material on Apache Drill, I'd say the File
component would be more suitable for writing Parquet files.
As for Parquet and Camel, we don't have an example for it, but the way I
see it, you are heading in the right direction by creating a processor to
convert the data to Parquet format.
However, we do have an open feature request
<https://issues.apache.org/jira/browse/CAMEL-13573> to add a Parquet data
format; we would love to see some contributions to add this to Camel :)
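For the conversion step itself, the usual approach outside of Camel is the parquet-avro library. A rough sketch, assuming parquet-avro and the Hadoop client libraries are on the classpath, and with a made-up schema for illustration:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetSketch {
    public static void main(String[] args) throws Exception {
        // Made-up schema; a real one would mirror the JSON payload's fields.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"msg\",\"type\":\"string\"}]}");

        GenericData.Record record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("msg", "hello");

        // Each writer produces one immutable file; Parquet files cannot be
        // appended to, so streaming use needs batching or file rolling.
        try (ParquetWriter<GenericData.Record> writer =
                 AvroParquetWriter.<GenericData.Record>builder(new Path("/tmp/events.parquet"))
                     .withSchema(schema)
                     .build()) {
            writer.write(record);
        }
    }
}
```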

Regards,
Omar


On Tue, Feb 11, 2020 at 11:37 PM Ron Cecchini <[hidden email]>
wrote:

> Hi, all.  I'm just looking for quick guidance or confirmation that I'm
> going in the right direction here:
>
> - There's a small Kotlin service that uses Camel to read from Kafka and
> write to Mongo.
> - I need to replace Mongo with Apache Drill and write Parquet files to the
> file system.
>   (I know nothing about Parquet but I know a little bit about Drill.)
>
> - This service isn't used to do any queries, it's just for persisting data.
>   So, given that, and the fact that Drill is just a query engine, I really
> can't use the "Drill" component for anything.
>
> - But there is that "HDFS" component that I think I can use?
>   Or maybe the "File" component is better here?
>
> So my thinking is that I just need to:
>
> 1. write a Processor to transform the JSON data into Parquet
>    (and keep in mind that I know nothing about Parquet...)
>
> 2. use the HDFS (or File) component to write it to a file
>    (I think there's some Parquet set up to do (?) outside the scope of
> this service, but that's another matter...)
>
> Seems pretty straight-forward.  Does that sound reasonable?
>
> Are there any Camel examples I can look at?  The Google machine seems to
> not find anything related to Camel and Parquet...
>
> Thank you so much!
>
> Ron
>

Re: Camel + Drill + Parquet

Ron Cecchini
Thanks, Omar.

As it turns out, Parquet is not the way to go: it looks like it is geared more toward data warehousing, whereas I need to persist streaming data, and from what I can gather I would need the overhead of Spark or Hive to accomplish that with Parquet (appending to a growing Parquet file).

*However*, it looks like Apache Kudu is exactly what we need.  And not only does Camel already provide a Kudu component, but as coincidence would have it, it looks like you co-authored it.  Awesome!

Moreover, the Kudu component takes just a Map as input, not an Avro-formatted message or the like, as Parquet would require.  So migrating this Kafka->Mongo route to Kafka->Kudu is almost trivial.
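Since the Kudu producer just takes a Map as the message body, the per-record mapping step is tiny. A plain-Java sketch, with column names made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KuduRowMapper {

    // Hypothetical mapping step: fields parsed from the Kafka JSON payload
    // become the flat column-name -> value map the Kudu producer inserts.
    static Map<String, Object> toKuduRow(long id, String name, long createdAtMs) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", id);
        row.put("name", name);
        row.put("created_at", createdAtMs);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = toKuduRow(42L, "sensor-a", 1581500000000L);
        // In the route, this map would be set as the exchange body before
        // sending to a kudu endpoint (URI details assumed, e.g.
        // "kudu:host:7051/my_table?operation=insert").
        System.out.println(row);
    }
}
```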

Anyway, time to bump up my Camel version to 3.0.1 and give Kudu a whirl...

Thanks again.

> On February 12, 2020 at 4:33 AM Omar Al-Safi <[hidden email]> wrote:

Re: Camel + Drill + Parquet

Omar Al-Safi
No worries Ron! And enjoy the ride :)

Regards,
Omar

On Thu, Feb 13, 2020 at 6:39 AM Ron Cecchini <[hidden email]>
wrote:
