Apache Tika Component

Bob Paulin
Hi,

I'd like to propose an Apache Tika[1] connector for Apache Camel.  I see
Camel already uses some of the libraries Tika relies on, like PDFBox,
but it could be interesting to have Tika's full assortment of file
parsers available to convert files to text.

The basic configuration would allow MIME type detection and parsing
files to text.

tika:detect

File/InputStream -> camel-tika -> MIME Type

tika:parse

File/InputStream -> camel-tika -> OutputStream in text
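
The two operations above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the proposed
implementation: it uses nothing but the JDK, the class and method names
(TikaEndpointSketch, detect, parse) are hypothetical, detection is a
stand-in based on the JDK's URLConnection.guessContentTypeFromStream
(which knows only a handful of formats), and "parse" simply copies the
bytes through. A real component would presumably delegate to Tika
instead, for example via the Tika facade's detect(InputStream) and
parseToString(InputStream).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical, JDK-only model of the two proposed operations.
public class TikaEndpointSketch {

    // Shape of "tika:detect": InputStream in, MIME type out.
    // Stand-in detection via the JDK, NOT Tika's detector.
    static String detect(InputStream in) throws IOException {
        // guessContentTypeFromStream needs mark/reset support.
        InputStream markable = in.markSupported() ? in : new BufferedInputStream(in);
        String guessed = URLConnection.guessContentTypeFromStream(markable);
        // Fall back to a generic binary type when nothing matches.
        return guessed != null ? guessed : "application/octet-stream";
    }

    // Shape of "tika:parse": InputStream in, text written to an OutputStream.
    // Placeholder: copies bytes unchanged; a real parser would extract text
    // based on the detected type.
    static void parse(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // A GIF file starts with the ASCII signature "GIF89a".
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(new ByteArrayInputStream(gif)));

        ByteArrayOutputStream text = new ByteArrayOutputStream();
        parse(new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)), text);
        System.out.println(text.toString("UTF-8"));
    }
}
```

The point of the sketch is only the contract of each operation: detect
returns a MIME type string and parse writes text to a caller-supplied
stream.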

I have a basic implementation that I'd be happy to send in a PR but I
wanted to see if this was something the community was interested in.  I
think it might be interesting to combine a project that integrates
everything with the project that parses everything.  I also think having
a camel-tika component might help achieve some of Tika's 2.0 goals[2].


- Bob Paulin


[1] https://tika.apache.org/

[2] https://wiki.apache.org/tika/Tika2_0RoadMap



Re: Apache Tika Component

Chris Mattmann
Great job, Bob! ☺



On 1/22/17, 8:17 PM, "Bob Paulin" <[hidden email]> wrote:




Re: Apache Tika Component

jbonofre
In reply to this post by Bob Paulin
Hi

It sounds like a good idea.

Regards
JB

On Jan 23, 2017, at 05:18, Bob Paulin <[hidden email]> wrote:


Re: Apache Tika Component

Sergey Beryozkin
In reply to this post by Bob Paulin
Hi Bob, +1

Cheers. Sergey
On 23/01/17 04:17, Bob Paulin wrote:



--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Apache Tika Component

Claus Ibsen-2
In reply to this post by Bob Paulin
Hi Bob

Sounds great. We love contributions. So just keep hacking on this and
let us know when you have something working.

We love contributions
http://camel.apache.org/contributing.html


There is a new component guide here
http://camel.apache.org/add-new-component-guide.html

On Mon, Jan 23, 2017 at 5:17 AM, Bob Paulin <[hidden email]> wrote:




--
Claus Ibsen
-----------------
http://davsclaus.com @davsclaus
Camel in Action 2: https://www.manning.com/ibsen2

Re: Apache Tika Component

Bob Paulin-2
In reply to this post by Bob Paulin
Thanks for the link to the component guide!  Looks like I have most of
this.  Let me check all the boxes, then I'll send over the PR for
feedback.

- Bob

On 2017-01-23 06:30 (-0600), Claus Ibsen <[hidden email]> wrote:
