Handling a byte order mark in Java

Occasionally you'll need to read text files that begin with a byte order mark (BOM). The BOM signals which charset should be used to convert the bytes to characters.
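
To make the byte patterns concrete: a BOM is the character U+FEFF encoded in the file's charset, so you can print the expected leading bytes yourself. The sketch below is only an illustration; the class name `BomBytes` and the method `bomHex` are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a library class): shows the bytes a BOM
// occupies in a given charset.
class BomBytes {
  /** Encodes U+FEFF in the given charset and formats the bytes as hex. */
  static String bomHex(Charset charset) {
    StringBuilder hex = new StringBuilder();
    for (byte b : "\uFEFF".getBytes(charset)) {
      if (hex.length() > 0) {
        hex.append(' ');
      }
      // Mask to an unsigned value so 0xEF prints as "EF", not "FFFFFFEF".
      hex.append(String.format("%02X", b & 0xFF));
    }
    return hex.toString();
  }
}
```

Calling `BomBytes.bomHex(StandardCharsets.UTF_8)` gives `EF BB BF`, `UTF_16BE` gives `FE FF`, and `UTF_16LE` gives `FF FE`.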

Unfortunately, the standard library doesn't include a facility that detects the BOM and picks the charset automatically. Instead, copy and paste this code to convert any raw InputStream into a Reader that uses the appropriate charset.
  public Reader inputStreamToReader(InputStream in) throws IOException {
    // Requires in.markSupported(); otherwise reset() below may fail.
    in.mark(3);
    int byte1 = in.read();
    int byte2 = in.read();
    if (byte1 == 0xFF && byte2 == 0xFE) {
      // FF FE: UTF-16, little-endian
      return new InputStreamReader(in, "UTF-16LE");
    } else if (byte1 == 0xFE && byte2 == 0xFF) {
      // FE FF: UTF-16, big-endian
      return new InputStreamReader(in, "UTF-16BE");
    } else {
      int byte3 = in.read();
      if (byte1 == 0xEF && byte2 == 0xBB && byte3 == 0xBF) {
        // EF BB BF: UTF-8
        return new InputStreamReader(in, "UTF-8");
      } else {
        // No BOM: rewind and fall back to the platform default charset.
        in.reset();
        return new InputStreamReader(in);
      }
    }
  }
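
If you'd rather have a self-contained variant, here is a rough sketch that also copes with streams that don't support mark/reset by wrapping them in a BufferedInputStream. The names `BomReaders` and `open` are made up for this example, and the fallback to the platform default charset simply mirrors the method above; treat it as a starting point rather than a finished utility.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
class BomReaders {
  static Reader open(InputStream in) throws IOException {
    // Wrap streams that cannot rewind so that reset() is always available.
    InputStream s = in.markSupported() ? in : new BufferedInputStream(in);

    // Peek at up to three bytes; read() returns -1 at end of stream.
    s.mark(3);
    int b1 = s.read();
    int b2 = s.read();
    int b3 = s.read();

    Charset charset;
    int bomLength;
    if (b1 == 0xEF && b2 == 0xBB && b3 == 0xBF) {
      charset = StandardCharsets.UTF_8;
      bomLength = 3;
    } else if (b1 == 0xFE && b2 == 0xFF) {
      charset = StandardCharsets.UTF_16BE;
      bomLength = 2;
    } else if (b1 == 0xFF && b2 == 0xFE) {
      charset = StandardCharsets.UTF_16LE;
      bomLength = 2;
    } else {
      // No BOM found: same fallback as the original method.
      charset = Charset.defaultCharset();
      bomLength = 0;
    }

    // Rewind, then consume only the BOM bytes so the Reader starts at the text.
    s.reset();
    for (int i = 0; i < bomLength; i++) {
      s.read();
    }
    return new InputStreamReader(s, charset);
  }
}
```

Always rewinding and then skipping exactly the BOM length keeps the branches uniform: no branch has to remember how many bytes it has already consumed.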

4 comments:

schlosna said...

Are there any plans to include this in Guava?

It might be worthwhile to ensure that if InputStream.markSupported() returns false, you'd return something like a SequenceInputStream that prepends the bytes read, and then the remaining InputStream.

Tomi said...

The UTF-16LE BOM is FF FE, not FF FF.

Unknown said...

Thanks for the useful example!

I notice that the UTF-16BE case is wrong though: it checks both bytes for 0xFF; byte1 should be checked for 0xFE.

If the stream doesn't support mark (I had this with a GZIPInputStream) you could wrap the InputStream in a BufferedInputStream:

  if (!in.markSupported()) {
    in = new BufferedInputStream(in);
  }
  in.mark(3);

Jeremy said...

Thanks, this is great, but there's a mistake in UTF-16BE. It should be FE FF.

See http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_boms for more.