Train of Sort #3 - Unicode version inconsistency


I forgot to announce that I had written an introduction to XML Schema Inference (it was nearly a month ago). I tried to describe what XML Schema Inference can do. As a technically neutral person, I was also happy to describe what it cannot do.

I think this is also applicable to other XML structure inference. Actually, I applied it to my RelaxngInference, which is not complete yet. RELAX NG is rather flexible and you can add choice alternatives almost anywhere, but the resulting grammar becomes sort of garbage ;-) Thus, a structure with simple rules still makes sense.

OK, here is the third chapter of this sort-of-crap Train of Sort.

Train of Sort #3 - Unicode version inconsistency

While surfing Japanese web pages, I came across a forum where a Japanese developer was in trouble because String.IndexOf() ignores U+3007 ("ideographic number zero" in "CJK Symbols and Punctuation"), and thus his application went into an infinite loop.

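    string s = "-\u3007-";
    // The culture-sensitive IndexOf() ignores U+3007 and so "finds" "--",
    // but Replace() is ordinal and never changes s: the loop never ends.
    while (s.IndexOf ("--") >= 0)
        s = s.Replace ("--", "-");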

That code should be changed to use CompareInfo.IndexOf (s, "--", CompareOptions.Ordinal), but it seemed too difficult for the people posting there to understand why such an infinite loop happens. U+3007 is the number "zero", yet it is ignored in comparison, so the strings "10000" and "1" are regarded as equivalent if those zeros are actually U+3007. I wonder if that is acceptable for Japanese accounting software.
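The fix mentioned above can be sketched roughly as follows. This is only a minimal illustration: the class and method names (OrdinalCollapse, CollapseHyphens) are made up for this example, and the invariant culture's CompareInfo is an arbitrary choice, since an ordinal search does not depend on the culture.

```csharp
using System;
using System.Globalization;

public static class OrdinalCollapse
{
    // Hypothetical helper: collapses runs of '-' into a single '-'.
    // The search is ordinal (code unit by code unit), so an ignorable
    // character such as U+3007 can never produce a phantom "--" match.
    public static string CollapseHyphens(string s)
    {
        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
        while (ci.IndexOf(s, "--", CompareOptions.Ordinal) >= 0)
            s = s.Replace("--", "-"); // String.Replace(string, string) is ordinal too
        return s;
    }

    public static void Main()
    {
        Console.WriteLine(CollapseHyphens("a---b"));           // a-b
        Console.WriteLine(CollapseHyphens("-\u3007-").Length); // 3 (string left untouched)
    }
}
```

Since both the search and the replacement are now ordinal, they always agree on whether "--" is present, and the loop terminates.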

Since there is no design contract for Windows collation, I have no idea whether this is a bug in Windows or just Microsoft's "culture" (string collation is a culture-dependent matter, and there is no definition of the expected behavior).

When I was digging into CompareOptions, I was really bewildered. At first I thought that IgnoreSymbols just means omitting characters whose Unicode category is BlahBlahSymbol, and that IgnoreNonSpace means omitting NonSpacingMark characters. That was a totally wrong assumption. Many more characters were ignorable than I had expected. For example, Char.GetUnicodeCategory() returns "LetterNumber" for U+3007 above.
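To see the mismatch for yourself, something like the following sketch prints the Unicode category of U+3007 next to the comparison results under a few CompareOptions values. The class and method names are made up for this example, and the invariant culture is just an assumption to keep the output culture-independent. Only the category and the Ordinal result are the same everywhere; the other results depend on the collation implementation underneath, so no expected values are given for them.

```csharp
using System;
using System.Globalization;

public static class CategoryVersusCollation
{
    // Thin wrapper so the category lookup is easy to call from elsewhere.
    public static UnicodeCategory CategoryOf(char c)
    {
        return char.GetUnicodeCategory(c);
    }

    public static void Main()
    {
        char zero = '\u3007';
        Console.WriteLine(CategoryOf(zero)); // LetterNumber

        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
        CompareOptions[] options = {
            CompareOptions.None,
            CompareOptions.IgnoreSymbols,
            CompareOptions.IgnoreNonSpace,
            CompareOptions.Ordinal
        };
        foreach (CompareOptions opt in options)
        {
            // 0 means "1" followed by U+3007 is treated as equal to plain "1".
            int result = ci.Compare("1" + zero, "1", opt);
            Console.WriteLine(opt + ": " + result);
        }
    }
}
```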

Unfortunately, that was before I got to know that CompareInfo is based on LCMapString, whose behavior comes from Windows 95. Since Unicode 3.1 did not exist back then, it is unlikely to match what Char.GetUnicodeCategory() returns for each character. However, there are non-ignorable characters which were added at the Unicode 2.0 and 3.0 stages (later than Windows 95). They are still in the dark age.

Thus, there are several groups of characters that are "ignored" in CompareInfo.Compare(): Sinhala, Tibetan, Myanmar, Ethiopic, Cherokee, Ogham, Runic, Tagalog, Hanunoo, Philippine, Buhid, Tagbanwa, Khmer and Mongolian characters, Yi syllables, and more, such as Braille patterns.
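If you want to check this on your own machine, a rough probe could look like the sketch below. The sample code points (one per script, picked by me) and the names IgnorableProbe and IsIgnored are assumptions made for this example; they are not taken from any table in Windows. The results depend on the collation implementation and its version, so the sketch only prints what it finds.

```csharp
using System;
using System.Globalization;

public static class IgnorableProbe
{
    // Hypothetical helper: a character counts as "ignored" when inserting it
    // between "a" and "b" still compares equal to "ab".
    public static bool IsIgnored(CompareInfo ci, string ch)
    {
        return ci.Compare("a" + ch + "b", "ab", CompareOptions.None) == 0;
    }

    public static void Main()
    {
        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
        // One arbitrarily chosen character from some of the scripts listed above.
        string[,] samples = {
            { "Sinhala",   "\u0D85" },
            { "Tibetan",   "\u0F40" },
            { "Myanmar",   "\u1000" },
            { "Ethiopic",  "\u1200" },
            { "Cherokee",  "\u13A0" },
            { "Ogham",     "\u1681" },
            { "Runic",     "\u16A0" },
            { "Khmer",     "\u1780" },
            { "Mongolian", "\u1820" },
            { "Yi",        "\uA000" },
            { "Braille",   "\u2801" },
        };
        for (int i = 0; i < samples.GetLength(0); i++)
        {
            string ch = samples[i, 1];
            Console.WriteLine("{0} U+{1:X4}: ignored={2}",
                samples[i, 0], (int) ch[0], IsIgnored(ci, ch));
        }
    }
}
```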

There should be no reason for collation not to be consistent with the other Unicode-related stuff such as System.Text.UnicodeEncoding, System.Char.GetUnicodeCategory() and especially String.Normalize().

So, the ideal solution here would be to use the Unicode 3.1 table (an even more ideal solution would be to use the latest Unicode 4.1 for all the related stuff). So I would really like to push Microsoft to abandon the existing CompareInfo (leaving it as is) and add collation support in an alternative class (or inside CharUnicodeInfo or wherever). Sadly, it is impossible to just upgrade the Unicode version used in LCMapString while keeping it compatible. This is because of some inconsistent character categories (well, .NET 2.0 upgraded the source of UnicodeCategory anyway, so GetUnicodeCategory() already returns inconsistent values for some characters) and a design failure in the Windows sortkey table (I'll describe more about it, maybe next time).
