Re: Bindy plus Unicode

Re: Bindy plus Unicode

Alex Dettinger
Hi Michael,

  I was just looking at this component for another purpose, and it looks
to me like fixed-length tokenization occurs here:

https://github.com/apache/camel/blob/master/components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/BindyFixedLengthFactory.java#L212..L216
  So it counts in Java chars, not code points. You could experiment with
injecting a custom BindyFixedLengthFactory via
dataFormat.setModelFactory(..).
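
To illustrate the point about chars versus code points, here is a minimal
sketch that uses only JDK string methods. It is not Bindy's code: the class
`CodePointFields` and its `slice` helper are hypothetical names made up for
this example. It shows how a char-based `substring` can cut a surrogate pair
in half, and how `offsetByCodePoints` can select a field by code points
instead.

```java
final class CodePointFields {

    /**
     * Returns the field that starts at code point index {@code start} and is
     * {@code length} code points long (shorter if the line ends early).
     */
    static String slice(String line, int start, int length) {
        int total = line.codePointCount(0, line.length());
        if (start >= total) {
            return "";
        }
        int end = Math.min(start + length, total);
        // Translate code point positions into UTF-16 char indices.
        int from = line.offsetByCodePoints(0, start);
        int to = line.offsetByCodePoints(from, end - start);
        return line.substring(from, to);
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) needs a surrogate pair in UTF-16.
        String line = "A\uD834\uDD1EBC";

        System.out.println(line.length());                         // 5 chars
        System.out.println(line.codePointCount(0, line.length())); // 4 code points

        // Char-based selection: a 2-char field ends between the two surrogates.
        String broken = line.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true

        // Code-point-based selection keeps the pair intact.
        System.out.println(slice(line, 0, 2).equals("A\uD834\uDD1E"));   // true
    }
}
```

A custom factory would presumably need to apply the same index translation
wherever the default one calls `substring`; how exactly that fits into the
Bindy classes is not shown here.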

  If you feel that an extension point for customizing how chars, code points
or graphemes are counted and selected would be valuable to the community,
feel free to raise a JIRA ticket.

Alex


On Fri, Jan 24, 2020 at 9:52 AM Michael Greulich <[hidden email]>
wrote:

> Hi,
>
> I'm having problems with the Bindy component and wonder whether I missed
> something. Maybe someone can help me address it. I cannot believe that
> I'm the first to hit this problem.
>
> I need to port an EAI application built using Bindy that reads a
> fixed-length file (*), converts it and sends the data somewhere else.
> Currently this file is in Latin 1 encoding, but we need to take it to
> Unicode – effectively UTF-8. We have an ugly but effectively unavoidable
> legacy application that creates the file.
>
> Unicode is a bit tricky when it comes to counting the length of a string,
> especially since Java uses UTF-16 internally, which means 1 or 2 (Java)
> chars per code point. Bindy seems to use substring internally for
> selection and counts chars the way Java does. This means the length of a
> string is the number of chars, i.e. UTF-16 code units, and not the number
> of code points, which is the common denominator (see e.g. the definition
> of string length in XML Schema). And when one takes combining chars into
> account (one "base char" plus 0 to n combining chars are perceived as one
> "char" by users), it becomes even more of a problem.
>
> Is there a way to tell Bindy how to count chars and select the tokens in
> a given line? Any suggestions? Is there a related bug or an upcoming
> change that addresses this problem?
>
> -- Mik
>
> (*) This means that certain data start at certain positions (columns, if
> you will).
>
>
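
The three ways of measuring length described in the quoted message can be
compared with a small sketch. `LengthUnits` and its methods are hypothetical
names chosen for this example, and the grapheme count relies on the JDK's
`java.text.BreakIterator`, whose results for newer sequences (some emoji, for
instance) depend on the JDK version.

```java
import java.text.BreakIterator;

final class LengthUnits {

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int chars(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of user-perceived characters as seen by the JDK's BreakIterator. */
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String accented = "e\u0301";   // 'e' followed by COMBINING ACUTE ACCENT
        String clef = "\uD834\uDD1E";  // U+1D11E, one code point, two chars

        System.out.println(chars(accented) + " " + codePoints(accented)
                + " " + graphemes(accented)); // 2 2 1
        System.out.println(chars(clef) + " " + codePoints(clef)
                + " " + graphemes(clef));     // 2 1 1
    }
}
```

Which of the three counts a fixed-length file actually uses depends on the
application that writes it, so the reader has to match the writer.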

RE: Re: Bindy plus Unicode

Michael Greulich-2

Hi Alex,

well, your comment was already very helpful. I created a custom DataFormat and ModelFactory based on the default ones for FixedLength. Of course I obeyed the terms of the Apache license ;-) For some aspects of recognizing chars I used the ICU4J lib, because the Java runtime's support for some things (e.g. emojis) is not up to date. The ICU license is quite permissive, too. I have no idea whether this is a problem for an Apache project...

I think I'm not the only one who has this use case, so I think this can be useful for the community, too. I'm currently under pressure, but I will create a JIRA ticket once things have calmed down. If the community is interested, I can provide the code of my solution, and I would be glad if it goes upstream (i.e. into the Camel distro) some day.

Currently we (the company I work for) are using Camel 2.2, and I guess this will be the case for some time. If this feature or bug fix (I'm not sure which it actually is; I will leave that decision to the community) gets implemented, in which version will it be included? Only Camel 3.x, or will it be backported to 2.2?

-- Mik
  
--------------------------------------------------------------------------
Sent: Friday, 24 January 2020, 11:43
From: "Alex Dettinger" <[hidden email]>
To: [hidden email]
Subject: Re: Bindy plus Unicode

Re: Re: Bindy plus Unicode

Alex Dettinger
Hi Michael,

  Good to know that you sorted it out :) The compatibility between the
ICU4J license and the Apache License is not straightforward; we would need
to look closer.
  Still, creating a quick ticket and sharing a GitHub project would make it
possible to save your work, and it may be of interest to the community
later on.
  If someone provides a PR against 3.x, chances are that it could be
backported to 2.x. Please keep the time frame in mind, as 2.x may close at
the end of this year.

Alex

On Fri, Jan 24, 2020 at 5:20 PM Michael Greulich <[hidden email]>
wrote:


RE: Re: Re: Bindy plus Unicode

Michael Greulich-2

Hi Alex,
 
well, which would then be the appropriate branch? Master or 3.x?
I guess that if I create a ticket, I will be informed by e-mail about what happens to it, right?
I think there could be a ticket plus PR within the next two weeks.

A word on ICU4J. Of course I understand that an Apache project has to be careful, but there
are tasks, like splitting strings into graphemes, that need features which the old logic in the JDK
doesn't support. The lib is very common (e.g. LibreOffice uses it) and AFAIK it is the de facto standard
for working with elaborate Unicode.
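
As a rough sketch of what selecting fixed-length fields by grapheme could look
like, the following uses the JDK's `java.text.BreakIterator` so that it runs
without extra dependencies. `GraphemeFields` and `slice` are hypothetical
names, and this is not the code from the custom ModelFactory mentioned above.
ICU4J ships a `com.ibm.icu.text.BreakIterator` with the same general shape
(including `getCharacterInstance()`), so swapping the import is one possible
way to get more current segmentation rules, assuming ICU4J is on the
classpath.

```java
import java.text.BreakIterator;

final class GraphemeFields {

    /**
     * Returns the field that starts at grapheme index {@code start} and is
     * {@code length} graphemes long (shorter if the line ends early).
     */
    static String slice(String line, int start, int length) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(line);

        int from = -1;              // char offset where the field starts
        int index = 0;              // grapheme index of the current boundary
        int boundary = it.first();  // char offset of the current boundary

        while (boundary != BreakIterator.DONE) {
            if (index == start) {
                from = boundary;
            }
            if (index == start + length) {
                return line.substring(from, boundary);
            }
            boundary = it.next();
            index++;
        }
        // The line ended before the field was complete (or before it started).
        return from < 0 ? "" : line.substring(from);
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT counts as one column, then "ab".
        String line = "e\u0301ab";
        System.out.println(slice(line, 0, 1).equals("e\u0301")); // true
        System.out.println(slice(line, 1, 2));                   // ab
    }
}
```

The loop walks boundaries once per call; a real factory that extracts many
fields from one line would more likely compute all boundaries up front and
reuse them.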

-- Mik
 
----
Sent: Friday, 24 January 2020, 19:15
From: "Alex Dettinger" <[hidden email]>
To: [hidden email]
Subject: Re: Re: Bindy plus Unicode

Re: Re: Re: Bindy plus Unicode

Alex Dettinger
Hi Michael,

  You would need to open a PR against master.
  You can find some helpful information about contributing here:
https://camel.apache.org/manual/latest/contributing.html

  I'm sure ICU4J is functionally great. However, license compatibility is a
legal matter, so we don't really have a choice.
  Could you please point me to the ICU4J license you've been using? I could
try to check the compatibility.

Alex

On Sat, Jan 25, 2020 at 5:42 PM <[hidden email]> wrote:
