May 07, 2003

Bad chars

In the past we have sometimes had problems with bad characters when converting from Unicode to Shift JIS in Java. It seems that a few characters don't make the round trip correctly. The most troubling one is Shift JIS character "817C", which maps to Unicode "ff0d" (the fullwidth hyphen-minus). This is a dash that is commonly used in Japanese addresses, but it gets converted to a question mark somewhere in the process. Our workaround has been to replace this character with a similar dash before converting to Shift JIS.
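A minimal sketch of that workaround follows. The post does not say which dash was used as the substitute, so the choice of Unicode "2212" (MINUS SIGN) here is an assumption, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

class DashWorkaround {
    // U+FF0D FULLWIDTH HYPHEN-MINUS, the dash that causes the trouble.
    static final char FULLWIDTH_HYPHEN_MINUS = '\uFF0D';
    // U+2212 MINUS SIGN, assumed here to be the "similar dash" substitute.
    static final char MINUS_SIGN = '\u2212';

    // Swap the problem dash for the substitute, then encode as "SJIS".
    static byte[] encodeForSjis(String text) {
        String safe = text.replace(FULLWIDTH_HYPHEN_MINUS, MINUS_SIGN);
        return safe.getBytes(Charset.forName("SJIS"));
    }

    public static void main(String[] args) {
        // "1-2" with a fullwidth dash, as it might appear in an address.
        byte[] bytes = encodeForSjis("1\uFF0D2");
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The catch is that the substitution is lossy: reading the bytes back does not give you the original "ff0d", so it only suits data that is on its way out to a Shift JIS system.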

This problem also seems to be described in an Oracle technical note, "Conversion Mapping Inconsistency for SJIS".

I'm writing this in the hope that someone who understands the problem more can explain where the problem lies.

Later:

The problem is that Unicode "ff0d" does not round trip when you use the encoding "SJIS" but does round trip when you use the Microsoft version of Shift JIS, "MS932".
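A quick way to check this yourself is to encode the character and decode it again with the same charset. This is only a sketch: the class and method names are made up, and it assumes your JDK ships both the "SJIS" and "MS932" charsets.

```java
import java.nio.charset.Charset;

class RoundTripCheck {
    // True if text survives encoding to bytes and decoding back unchanged.
    static boolean roundTrips(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] encoded = text.getBytes(cs);
        String decoded = new String(encoded, cs);
        return text.equals(decoded);
    }

    public static void main(String[] args) {
        String dash = "\uFF0D"; // the troublesome fullwidth dash
        System.out.println("SJIS:  " + roundTrips(dash, "SJIS"));
        System.out.println("MS932: " + roundTrips(dash, "MS932"));
    }
}
```

If the behaviour described here holds on your JDK, the "SJIS" line should report false and the "MS932" line true.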

Even Later:

Here's the deal. Unicode contains some characters that are only present to support MS932. The dash that I mentioned is really easy to type using the IME on Windows, so it is often found in Japanese addresses.

When Java converts this character to the "SJIS" form of Shift JIS it converts it to a question mark, signifying that it is unknown. It could have converted it to a similar character like the double byte minus sign, but then the character would not have round tripped back to exactly the same Unicode character when it was read in again. Unfortunately, converting it to a question mark means that the character is useless outside of MS932 or Unicode and has to be stripped out or converted if it is passed to another system which doesn't use those encodings.

Another option would have been to throw an "UnconvertibleCharacter" exception, but I guess this data trashing happens too often to make that practical.
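For what it's worth, you can opt in to that stricter behaviour yourself. The java.nio.charset package has a CharsetEncoder that can be told to report unmappable characters instead of replacing them, in which case it throws UnmappableCharacterException. A rough sketch, with a made-up class and method name:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class StrictEncode {
    // Encode text, failing loudly instead of substituting a question mark.
    static byte[] encodeStrict(String text, String charsetName)
            throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName(charsetName)
                .newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        try {
            encodeStrict("\uFF0D", "SJIS");
            System.out.println("encoded without complaint");
        } catch (CharacterCodingException e) {
            // UnmappableCharacterException is a subclass of this.
            System.out.println("could not encode: " + e);
        }
    }
}
```

This at least turns silent data trashing into an error that you can catch and handle, for example by falling back to a replacement dash.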

Posted by stuartcw at May 7, 2003 04:40 PM