Train of Sort #4 - sortkey table optimization


With regard to the sortkey table and the logical surface of collation, I think UCA handles things much better than Windows. Well, I didn't like the "logical order exceptions" in UCA, but they disappeared in the latest specification.

  • For "shifted" characters such as hyphens and apostrophes, GetSortKey() considers those characters only at the identical weight level. In that case, only the primary weight is computed as the actual sortkey value (for example, 01 01 01 01 80 07 06 80 for the ASCII apostrophe, where only 06 80 represents that character). BTW there are pairs of characters that differ only in tertiary weight (where width sensitivity lies), such as U+0027 and U+FF07 (fullwidth apostrophe). Their sortkeys are regarded as equivalent when StringSort is not specified in the CompareOptions flags (this does not happen when you use Compare()).

  • Windows has no concept of (non-)blocking diacritical marks. Windows just adds the diacritical weight of each nonspacing character to the weight of the preceding primary character. Since that weight cannot hold a value above 255, it often overflows and wraps around to 0 (BTW the value 0 indicates the end of the sortkey in the Windows API). Moreover, this means that there can be pairs of diacritical mark sequences that are incorrectly treated as equivalent. I recently mentioned one such case on mono-devel-list.

  • The Windows collation table is not internally optimizable. For example, (1) Hangul syllables are sorted mixed in with Jamo. In UCA this is designed to be optimizable, but on Windows the L/V/T forms are mixed depending on character equivalence, with no blank values between them, so the large Hangul Syllables block cannot be computed and must simply be filled in as a bulky table. UCA and CLDR have an "optimize" option for this. Another example is (2) CJK compatibility characters. Those "compatibility" characters can never be made equivalent under any CompareOptions flag. Instead they are sorted next to the character they are "compatible" with. Without that design, the sortkey table could be optimized impressively, since the large chunk of CJK ideographs could be computed from their codepoints. I think they could be differentiated only in the third weight area (where width and case insensitivity coexist) or the fourth weight area (as with Japanese Kana, since those characters differ at the primary level anyway).

  • CJK sortkeys and CJK extension sortkeys are immediately adjacent. On Windows, the CJK ideographs end at U+9FA5. In the latest Unicode the last CJK ideograph is U+9FBB. But there is no room for U+9FA6-U+9FBB in the Windows sortkey table (the sortkey for U+9FA5 is F0 B4 01 01 01 01 and U+F900 has F0 B5 01 01 01 01). This means it is impossible for CompareInfo/LCMapString to support higher Unicode versions with the existing Windows sortkey design. That is one of the reasons Windows can never support sorting for Unicode standards higher than version 1.0, which makes sense as long as they keep sortkey values compatible.
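
To make the diacritical overflow in the second item concrete, here is a minimal Python sketch of the "just add the weights into one byte" behaviour. The weight values and the function name additive_diacritic_weight are made up for illustration; they are not taken from the real Windows table. Only the additive, one-byte behaviour comes from the description above.

```python
# Hypothetical diacritical weights (assumed values, NOT the real Windows table).
DIACRITIC_WEIGHT = {
    "\u0301": 0x0E,  # COMBINING ACUTE ACCENT
    "\u0300": 0x0F,  # COMBINING GRAVE ACCENT
    "\u0302": 0x12,  # COMBINING CIRCUMFLEX ACCENT
    "\u0308": 0x13,  # COMBINING DIAERESIS
}


def additive_diacritic_weight(base_weight, marks):
    """Add every mark's weight to the base character's diacritical weight,
    keeping only the low byte, as the description above suggests."""
    total = base_weight
    for mark in marks:
        total += DIACRITIC_WEIGHT[mark]
    return total & 0xFF  # anything above 255 wraps around


if __name__ == "__main__":
    base = 2  # assumed "no diacritic" weight of the base letter

    # Two different mark sequences whose weights happen to have the same sum.
    a = additive_diacritic_weight(base, ["\u0301", "\u0308"])
    b = additive_diacritic_weight(base, ["\u0300", "\u0302"])
    print(hex(a), hex(b), a == b)

    # A long run of marks overflows the single byte and wraps to a small value.
    print(additive_diacritic_weight(base, ["\u0308"] * 14))
```

With these assumed weights, acute + diaeresis and grave + circumflex end up with the same byte, which is the kind of false equivalence described above. Any additive scheme squeezed into one byte has collisions of this sort; the specific pairs depend on the real table.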

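For contrast with the Hangul point in the third item, this is roughly what "computable" means: the Unicode Standard lays out the Hangul Syllables block so that every syllable can be split into its L/V/T jamo by plain arithmetic, with no per-syllable table. The sketch below only does that decomposition. How the resulting jamo would map to collation weights is left out, and the function name decompose_hangul is mine.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_hangul(codepoint):
    """Return the L, V and optional T jamo codepoints of a Hangul syllable."""
    s_index = codepoint - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable: U+%04X" % codepoint)
    jamo = [
        L_BASE + s_index // N_COUNT,              # leading consonant
        V_BASE + (s_index % N_COUNT) // T_COUNT,  # vowel
    ]
    t_index = s_index % T_COUNT
    if t_index:                                   # 0 means "no trailing consonant"
        jamo.append(T_BASE + t_index)
    return jamo


if __name__ == "__main__":
    print(["U+%04X" % cp for cp in decompose_hangul(0xD55C)])
```

A collation table built on this layout needs entries for the jamo only, and can derive a sortkey for any of the 11172 syllables on the fly. That is not possible when, as described above, the syllables are interleaved with jamo without any gaps between them.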