File Encoding falls back to default encoding while grouping after split using tokenize


Karthick K R
I am quite new to Apache Camel, but after using it for a month I really feel it is a great integration framework that solves various enterprise problems effectively and with minimal effort.

Coming to the issue: I have been splitting a huge CSV file using the splitter with tokenize and grouping of N lines, and ran into encoding issues with the grouped content.
A similar issue has been raised on Stack Overflow: Camel: UTF-8 Encoding is lost after using Group

I also commented on that question with my use case and observations. Including the same text here:

Sample csv file (with delimiter '|'):

CandidateNumber|CandidateLastName|CandidateFirstName|EducationLevel

CAND123C001|Wells|Jimmy|Bachelor's Degree (±16 years)

CAND123C002|Wells|Tom|Bachelor's Degree (±16 years)

CAND123C003|Wells|James|Bachelor's Degree (±16 years)

CAND123C004|Wells|Tim|Bachelor's Degree (±16 years)

The ± character is corrupted after tokenize with grouping. I initially assumed the problem was that the proper file encoding was not being set for the split, but the exchange has the right value for the property CamelCharsetName=ISO-8859-1.

from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n",2,true)).streaming()
.log("body: ${body}");

The same route works fine when grouping is not used.

from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n")).streaming()
.log("body: ${body}");

Looking at GroupTokenIterator in the Camel code base, the problem seems to be the way the TypeConverter is used to convert a String to an InputStream:

// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, data);
...
Note: mandatoryConvertTo() has an overload that takes the exchange:

<T> T mandatoryConvertTo(Class<T> type, Exchange exchange, Object value)

As the exchange is not passed as an argument, the conversion always falls back to the default charset, which is set using the system property "org.apache.camel.default.charset".
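
To illustrate why a charset fallback matters, here is a minimal plain-Java sketch (no Camel involved). It assumes the text is encoded to bytes with one charset and decoded back with another, which is one way the ± corruption could arise; the exact charsets in play depend on the JVM and Camel defaults. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: encodes the text with one charset and decodes the
    // resulting bytes with another, mimicking a String -> bytes -> String round trip.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // \u00B1 is the plus-minus sign from the sample CSV.
        String field = "Bachelor's Degree (\u00B116 years)";

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(field,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));

        // Assumed mismatch: encoded as UTF-8 (two bytes for the sign),
        // decoded as ISO-8859-1 (each byte becomes its own character).
        System.out.println(roundTrip(field,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

In the mismatched case the single ± becomes the two characters Â±, because UTF-8 encodes U+00B1 as the bytes 0xC2 0xB1 and ISO-8859-1 maps each byte to a separate character.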

Potential Fix:

// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, exchange, data);
...
As this fix would have to be made in camel-core, another option is to split without grouping and then regroup the lines using an AggregationStrategy with completionSize() and completionTimeout().

Although it would be great to get this fixed in camel-core.
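
For the aggregation workaround, the regrouping itself can stay entirely in String form, so no byte conversion (and hence no charset) is involved. The following is only a plain-Java sketch of that grouping logic, not Camel code: LineGrouper and group() are hypothetical names, and in a real route the same joining would live inside an aggregation strategy, with completionSize() closing full groups and completionTimeout() flushing the last partial one.

```java
import java.util.ArrayList;
import java.util.List;

public class LineGrouper {

    // Hypothetical helper: joins every `size` consecutive lines with "\n".
    // The lines are never converted to bytes, so their characters are preserved.
    static List<String> group(List<String> lines, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String line : lines) {
            if (count > 0) {
                current.append('\n'); // separator between lines of one group
            }
            current.append(line);
            count++;
            if (count == size) {      // group is full: emit it and start over
                groups.add(current.toString());
                current.setLength(0);
                count = 0;
            }
        }
        if (count > 0) {              // trailing partial group
            groups.add(current.toString());
        }
        return groups;
    }
}
```

For example, grouping three lines with a size of 2 yields one group of two lines and one group holding the remaining line.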

Kindly let me know your thoughts, and whether this can be handled in a different way.