File encoding auto-detection for ZK Fileupload component


Proper handling of file encoding can be a royal headache. Recently I spent an unreasonable amount of time trying to figure out why ZK’s Fileupload component was messing with the contents of my CSV file:

The data should be: código redome;;código hemocentro;nome;nome da mãe;

The file was generated with Excel on Windows, using the ISO-8859-2 encoding (a common encoding on Windows). After some investigation I found out that, by default, Fileupload treats every file with a text/... content type as UTF-8! Ouch!!
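To see the failure mode in isolation, here is a minimal sketch that uses only the JDK. It does not touch ZK at all; the class name MisdecodeDemo and its roundTrip method are made up for illustration. It encodes a word as ISO-8859-2 and then decodes the bytes as UTF-8, which is roughly what happens to an upload that is wrongly assumed to be UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Encode text with one charset and decode the resulting bytes with another.
    // (Hypothetical helper, not part of ZK.)
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        String original = "c\u00F3digo"; // "código", escaped so the source file encoding does not matter

        // Decoding with the wrong charset: the accented letter is lost.
        String wrong = roundTrip(original, latin2, StandardCharsets.UTF_8);
        // Decoding with the charset the bytes were written in: text survives.
        String right = roundTrip(original, latin2, latin2);

        System.out.println("wrong charset preserved text: " + wrong.equals(original));
        System.out.println("right charset preserved text: " + right.equals(original));
    }
}
```

The single byte that ISO-8859-2 uses for the accented letter is not a valid UTF-8 sequence on its own, so the decoder substitutes a replacement character and the original letter cannot be recovered afterwards.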

If all your files will be generated using the same file encoding, this can be fixed with the following configuration:

<zk>
  <system-config>
    <upload-charset>YOUR ENCODING</upload-charset> <!-- ISO-8859-2 in my case -->
    ...
  </system-config>
  ...
</zk>

The upload-charset tag did the trick! But what if my user decides to move to a different (better?) platform in the future, and generates the file with UTF-8? Or any other encoding?

Then the proper solution is to use the tag upload-charset-finder-class:

<upload-charset-finder-class>a_class_name</upload-charset-finder-class>

The documentation says that this class has to implement the interface CharsetFinder and its sole method String getCharset(String contentType, InputStream content):

When a text file is uploaded, the getCharset method is called and it can determine the encoding based on the content type and/or the content of the uploaded file.

Which leads us to the main reason for this post: how do you detect the file encoding when ZK itself does not provide a default implementation of this interface?

More research pointed me to some solutions, but the one I ended up implementing uses Apache Any23. It includes TikaEncodingDetector, which can auto-detect the encoding of a stream. The final code for the CharsetFinder implementation is the following:

package util;

import org.apache.any23.encoding.TikaEncodingDetector;
import org.zkoss.zk.ui.util.CharsetFinder;

import java.io.IOException;
import java.io.InputStream;

// Lets ZK ask Tika (via Any23) for the charset of each uploaded text file.
public class TikaCharsetFinder implements CharsetFinder {
    @Override
    public String getCharset(String contentType, InputStream content) throws IOException {
        // contentType is ignored; the guess is based on the stream's bytes only.
        return new TikaEncodingDetector().guessEncoding(content);
    }
}

Yep, it is that simple. The final ZK configuration to use this class is:

<zk>
  <system-config>
    <upload-charset-finder-class>util.TikaCharsetFinder</upload-charset-finder-class>
    ...
  </system-config>
  ...
</zk>
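If pulling in Any23 feels too heavy and you only expect two encodings (UTF-8 and one legacy charset), a cruder JDK-only check may be enough. The sketch below is an assumption-laden alternative, not something from the ZK or Any23 documentation: the class Utf8OrFallback and its guess method are hypothetical names, and it does not implement CharsetFinder so that it compiles without ZK on the classpath. It reads the whole stream, tries a strict UTF-8 decode, and returns the fallback name if that fails.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8OrFallback {
    // Hypothetical helper: returns "UTF-8" if every byte of the stream forms
    // valid UTF-8, otherwise the caller-supplied fallback charset name.
    static String guess(InputStream content, String fallback) throws IOException {
        // Buffer the full stream so the decoder sees the complete input.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = content.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        try {
            // REPORT makes the decoder throw instead of silently substituting
            // replacement characters for invalid byte sequences.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buffer.toByteArray()));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }
}
```

A getCharset implementation could delegate to something like Utf8OrFallback.guess(content, "ISO-8859-2"). Two caveats: pure ASCII files are reported as UTF-8 (which is harmless), and this version consumes the stream, so check how ZK uses the stream after getCharset returns before relying on it.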
