I don't like opt-out messages

Recently I often look back on the last productive year. After the Mono 1.0 release I mostly worked on System.Xml 2.0 bits (editable XPathNavigator, XmlSchemaInference, XmlSchemaValidator, XmlReader.Create() and a half-baked XQuery engine), though none of them rock. I remember I once started to improve the DataSet bits (DataView and related things), but I no longer work on them. Maybe something happened this year, and I have done nothing wrt xml bits (well, except for NvdlValidatingReader). Maybe I need something interesting there.

...Anyways, here is the final chapter of Train of ~~Rant~~Sort.

Train of Sort #5 - Is UCD (not UCA) used? -

I have wondered whether the Windows collator uses the Unicode Character Database (UCD), the core of the Unicode specification.

Windows treats U+00E6 (Latin small letter AE; Latin small ligature AE) as equivalent to "ae". There is a character U+01E3 (http://www.fileformat.info/info/unicode/char/01E3/index.htm), aka Latin small letter AE with Macron, whose canonical decomposition (NFD) is U+00E6 U+0304, where U+0304 is Combining Macron.

So it looks like we can expect Compare("\u01E3", "\u00E6\u0304") to return 0. However, the two are treated as different. Windows treats U+01E3 as equivalent to the sequence 'a' U+0304 'e' U+0304.
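As a quick check of what the UCD itself says (not of what Windows does), here is a minimal sketch using Python's standard `unicodedata` module. The helper `codepoints` is just something I made up for display. It only shows the canonical decomposition; it says nothing about any collator.

```python
import unicodedata

def codepoints(s):
    """Format a string as space-separated U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

ae_macron = "\u01E3"                 # LATIN SMALL LETTER AE WITH MACRON
expected_nfd = "\u00E6\u0304"        # ae + COMBINING MACRON
windows_like = "a\u0304e\u0304"      # the sequence Windows reportedly equates it with

nfd = unicodedata.normalize("NFD", ae_macron)
print(codepoints(nfd))               # U+00E6 U+0304
print(nfd == expected_nfd)           # True

# The 'a' + macron + 'e' + macron sequence is not canonically equivalent to U+01E3.
print(unicodedata.normalize("NFD", windows_like) == nfd)   # False
```

So, as far as the UCD is concerned, U+01E3 and U+00E6 U+0304 are canonically equivalent, while the a-macron e-macron sequence is a different string.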

Similarly, circled CJK compatibility characters are not always related to their non-circled counterparts ("related" here means nothing more than having a close primary weight). For example, U+32A2 (whose compatibility decomposition is a circled U+5199) is sorted next to U+5BEB (traditional Chinese) instead of U+5199 (Japanese and possibly simplified Chinese).
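The compatibility mapping for the circled character can be looked up the same way. Again this is only a sketch against Python's `unicodedata` (whatever Unicode version your Python ships with), not a statement about Windows sort keys.

```python
import unicodedata

circled_copy = "\u32A2"   # CIRCLED IDEOGRAPH COPY
japanese = "\u5199"       # the form used in Japanese
traditional = "\u5BEB"    # the traditional Chinese form

# Raw decomposition field from the UCD: a <circle> tag plus the base code point.
mapping = unicodedata.decomposition(circled_copy)
print(mapping)                                   # <circle> 5199

# NFKD drops the <circle> tag and leaves the plain ideograph.
nfkd = unicodedata.normalize("NFKD", circled_copy)
print(nfkd == japanese)                          # True
print(nfkd == traditional)                       # False
```

The UCD ties U+32A2 to U+5199 only, so the closeness to U+5BEB on Windows does not come from this mapping.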

These differences do not look like a consequence of Windows working differently from UCA. And since this NFD mapping dates from Unicode 1.1, which was released in 1993 (earlier than Windows 95), they do not look like the result of a time lag either. I wonder why those differences happen.

In fact, I recently wrote about this CJK stuff in a comment on one of Michael Kaplan's blog entries, and thanks to his courtesy I got some answers. Sadly they seem irrelevant to the problem above, but he made an enlightening point - NFKD often breaks significant bits. I guess he is right (since some NFKD mappings contain hyphens, which are treated specially), but I also think it depends on the implementation (I am not clear about such cases at this stage).

It reminded me of some kinds of contractions - when I read some materials on Hungarian sorting, I tried comparing "Z", "zs", "zzs" and so on. On looking at the sort keys, I eventually found that "zszs" and "zzs" are regarded as equivalent on .NET. But I still don't know whether that is incorrect or not (even after seeing that J2SE 1.5 does not regard them as equivalent). I doubt they are a source of destructive operations, but I am not sure.

... that's all for now (I feel sorry that there is no "conclusion" here). Windows collation is still a black box, and for any serious development we certainly need an explicit design that lets us predict what will happen with collations.

Ok, time to hibernate. My writing battery is going to run out.
