June 2005 Archives

I don't like opt-out messages


Recently I have often looked back on the last productive year. Since the Mono 1.0 release I have mostly done System.Xml 2.0 bits (editable XPathNavigator, XmlSchemaInference, XmlSchemaValidator, XmlReader.Create() and a half-baked XQuery engine) ... none of them rocks, though. I remember I once started to improve the DataSet bits (DataView and related things), but I don't work on them now. Somehow this year I have done nothing with the xml bits (well, except for NvdlValidatingReader). Maybe I need something interesting there.

...Anyways, here is the final chapter of Train of RantSort.

Train of Sort #5 - Is UCD (not UCA) used?

I have wondered whether the Windows collator uses the Unicode Character Database (UCD), the core of the Unicode specification.

Windows treats U+00E6 (Latin small letter AE, the small 'ae' ligature) as equivalent to "ae". There is a character U+01E3 (http://www.fileformat.info/info/unicode/char/01E3/index.htm), aka Latin small letter AE with macron, whose canonical decomposition (NFD) is U+00E6 U+0304, where U+0304 is Combining Macron.

So it looks like we could expect Compare("\u01E3", "\u00E6\u0304") to return 0. However, they are treated as different: Windows treats U+01E3 as equivalent to the sequence 'a' U+0304 'e' U+0304 instead.
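Here is a minimal sketch of that experiment (the expected outputs in the comments reflect the Windows/.NET behavior I just described, not anything guaranteed by a specification):

    using System;
    using System.Globalization;

    class AeMacronTest
    {
        static void Main ()
        {
            CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
            // NFD says U+01E3 == U+00E6 U+0304, but the collator disagrees:
            Console.WriteLine (ci.Compare ("\u01E3", "\u00E6\u0304")); // non-zero
            // it instead matches 'a' + macron + 'e' + macron:
            Console.WriteLine (ci.Compare ("\u01E3", "a\u0304e\u0304")); // 0
        }
    }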

Similarly, circled CJK compatibility characters are not always related to their non-circled counterparts ("related" here actually means nothing more than a close primary weight, though). For example U+32A2 (whose NFKD value is U+5199, i.e. it is circled U+5199) is sorted next to U+5BEB (the traditional Chinese form) instead of U+5199 (the form used in Japanese and possibly simplified Chinese).

Those facts do not look like mere consequences of Windows working differently from UCA. And since this NFD mapping comes from Unicode 1.1, which was released in 1993 (earlier than Windows 95), they do not look like a result of time lag either. I wonder why those differences happen.

In fact I recently wrote about that CJK stuff in a comment on Michael Kaplan's blog, and thanks to his courtesy I got some answers. Sadly they seem irrelevant to the problem above, but he had an enlightening point - NFKD often breaks significant bits. I guess he is right (since some NFKD mappings contain hyphens, which are treated specially), but I also think it depends on the implementation (I am not clear about such cases at this stage).

It reminded me of some kind of contractions - when I read some materials on Hungarian sorting, I tried comparing "Z", "zs", "zzs" and so on. And on seeing the sortkeys, I eventually found that "zszs" and "zzs" are regarded as equivalent on .NET. But I still don't know whether that is incorrect or not (even after seeing that J2SE 1.5 does not regard them as equivalent). I suspect they could be a source of destructive operations, but I am not sure.
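For the record, this is roughly how I checked it (a sketch; the 0 result is the .NET behavior described above):

    using System;
    using System.Globalization;

    class HungarianContractions
    {
        static void Main ()
        {
            // hu-HU has contractions like "zs"; on .NET the two strings
            // below produce equivalent sortkeys
            CompareInfo ci = new CultureInfo ("hu-HU").CompareInfo;
            Console.WriteLine (ci.Compare ("zszs", "zzs")); // 0 on .NET
        }
    }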

... that's all for now (I feel sorry that there is no "conclusion" here). Windows collation is still a black box, and for serious development we certainly need an explicit design contract to predict what would happen with collation.

Ok, time to hibernate. My writing battery is going to run out.

Train of Sort #4 - sortkey table optimization


With regard to the sortkey table and the logical surface of collation, I think UCA handles things much better than Windows. Well, I didn't like the "logical order exceptions" in UCA, but they disappeared in the latest specification.

  • For "shifted" characters such as hyphens and apostrophes, GetSortKey() considers those characters only at the identical level. In that case only a primary weight is computed as the actual sortkey value (like 01 01 01 01 80 07 06 80 for the ASCII apostrophe, where only 06 80 represents that character). BTW there are pairs of characters that differ only in the tertiary weight (width sensitivity lies there), such as U+0027 and U+FF07 (the full-width apostrophe). Those sortkeys are regarded as equivalent when StringSort is not specified in the CompareOptions flags (it does not happen when you use Compare()); see the sketch after this list.

  • Windows has no concept of (non-)blocking diacritical marks. Windows just adds the diacritical weight of a nonspacing character to the weight of the preceding base character. And since that weight cannot hold a value greater than 255, it often overflows and wraps around to 0 (BTW the value 0 indicates the end of the sortkey in the Windows API). Moreover, this means that there could be several pairs of diacritical mark sequences that are incorrectly treated as equivalent. I recently mentioned one such case on mono-devel-list.

  • The Windows collation table is not internally optimizable. For example, (1) Hangul syllables are sorted mixed together with Jamo. In UCA this is implemented so that it can be computed, but on Windows the L/V/T forms are interleaved according to character equivalence, with no blank values between them, so the large Hangul Syllables block cannot be computed and must simply be filled into a bulky table. In UCA and CLDR there is an "optimize" option to take. Another example is (2) CJK compatibility characters. Those "compatibility" characters can, however, never be made compatible by any of the CompareOptions flags; instead they are sorted next to the "compatible" character. Without such a design, the sortkey table could be impressively optimized, since the large chunk of CJK ideographs could be computed from their codepoints. I think they could be differentiated only in the tertiary weight area (where width and case insensitivity coexist) or in the quaternary weight area (like Japanese Kana, since those characters differ at the primary level anyway).

  • CJK sortkeys and CJK extension sortkeys are immediately adjacent. On Windows, the CJK ideograph block ends at U+9FA5. In the latest Unicode the last CJK ideograph is U+9FBB, but there is no room for U+9FA6-U+9FBB in the Windows sortkey table (the sortkey for U+9FA5 is F0 B4 01 01 01 01 and U+F900 has F0 B5 01 01 01 01). This tells us that it is impossible for CompareInfo/LCMapString to support newer Unicode versions with the existing Windows sortkey design. That is one of the reasons that Windows can never support sorting in Unicode standards higher than version 1.0 - which makes sense, as long as they keep sortkey values compatible.
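Here is the sketch promised above, dumping the raw sortkeys for the two apostrophes. The equal/unequal outcomes noted in the comments are my reading of the behavior described in the first item, not guaranteed output:

    using System;
    using System.Globalization;

    class ShiftedApostrophes
    {
        static void Dump (string label, SortKey sk)
        {
            Console.WriteLine ("{0}: {1}", label, BitConverter.ToString (sk.KeyData));
        }

        static void Main ()
        {
            CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
            // without StringSort, both apostrophes are "shifted" and their
            // sortkeys come out equivalent:
            Dump ("U+0027", ci.GetSortKey ("'"));
            Dump ("U+FF07", ci.GetSortKey ("\uFF07"));
            // with StringSort they are kept distinct:
            Dump ("U+0027/S", ci.GetSortKey ("'", CompareOptions.StringSort));
            Dump ("U+FF07/S", ci.GetSortKey ("\uFF07", CompareOptions.StringSort));
        }
    }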

Train of Sort #3 - Unicode version inconsistency


I forgot to announce that I wrote an introduction to XML Schema inference nearly a month ago. I tried to describe what XML Schema inference can do. As a technically-neutral person, I am also happy to describe what it cannot do.

I think this is also applicable to other xml structure inference. Actually I applied it to my RelaxngInference, which is however not complete. RELAX NG is rather flexible and you can add choice alternatives almost anywhere, but the resulting grammar will become a sort of garbage ;-) Thus, a simply-ruled structure still makes sense.

Ok, here is the third chapter of this sort-of-crap train of sort.

Train of Sort #3 - Unicode version inconsistency

While surfing Japanese web pages, I came across a forum where a Japanese developer was in trouble because String.IndexOf() ignores U+3007 ("ideographic number zero" in "CJK Symbols and Punctuation") and thus his application went into an infinite loop.

    string s = "-\u3007-";
    while (s.IndexOf ("--") >= 0)
        s = s.Replace ("--", "-");

That code should be changed to use CompareInfo.IndexOf (s, "--", CompareOptions.Ordinal), but it seemed too difficult for the people who post there to understand why such an infinite loop happens: culture-sensitive IndexOf() finds "--" because U+3007 is simply ignored, while Replace() works ordinally and never removes anything. U+3007 is a number "zero", so when comparing strings, "10000" and "1" are regarded as equivalent if those zeros are actually U+3007. I wonder whether that is acceptable for Japanese accounting software.
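For reference, here is a minimal sketch of the fixed loop, plus the "10000" vs "1" surprise (the 0 result is the behavior described above):

    using System;
    using System.Globalization;

    class IdeographicZero
    {
        static void Main ()
        {
            CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;

            // ordinal search does not ignore U+3007, so the loop terminates
            string s = "-\u3007-";
            while (ci.IndexOf (s, "--", CompareOptions.Ordinal) >= 0)
                s = s.Replace ("--", "-");

            // and the surprise: "1" followed by four U+3007 zeros vs plain "1"
            Console.WriteLine (ci.Compare ("1\u3007\u3007\u3007\u3007", "1")); // 0
        }
    }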

Since there is no design contract for Windows collation, I have no idea whether this is a bug in Windows or just Microsoft's culture (string collation is a culture-dependent matter and there is no single definition).

When I was digging into CompareOptions, I was really puzzled. At first I thought that IgnoreSymbols just means omitting characters whose Unicode category is SomeKindOfSymbol, and that IgnoreNonSpace means omitting NonSpacingMark characters. That was a totally wrong assumption. Many more characters are ignorable than I had expected. For example, Char.GetUnicodeCategory() returns LetterNumber for U+3007 above.
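You can check that category yourself; a tiny sketch:

    using System;

    class CategoryCheck
    {
        static void Main ()
        {
            // prints LetterNumber - not an ignorable symbol, not a nonspacing mark
            Console.WriteLine (char.GetUnicodeCategory ('\u3007'));
        }
    }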

That was unfortunately before I got to know that CompareInfo is based on LCMapString, whose behavior comes from Windows 95. Since there was no Unicode 3.1 back then, it is not likely to match what Char.GetUnicodeCategory() returns for each character. And there are characters that should not be ignorable which were added in the Unicode 2.0 and 3.0 stages (later than Windows 95). They are still in the dark ages.

Thus, there are several scripts whose characters are "ignored" in CompareInfo.Compare(): Sinhala, Tibetan, Myanmar, Ethiopic, Cherokee, Ogham, Runic, the Philippine scripts (Tagalog, Hanunoo, Buhid, Tagbanwa), Khmer and Mongolian characters, Yi syllables and more - such as Braille patterns.
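A sketch of the kind of test that exposes this; U+1780 is the Khmer letter KA, and the 0 result reflects the "ignored" behavior described above:

    using System;
    using System.Globalization;

    class IgnoredScripts
    {
        static void Main ()
        {
            CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
            // the Khmer letter is invisible to culture-sensitive comparison
            Console.WriteLine (ci.Compare ("a\u1780b", "ab")); // 0
        }
    }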

There should be no reason not to be consistent with the other Unicode-related stuff such as System.Text.UnicodeEncoding, System.Char.GetUnicodeCategory() and especially String.Normalize().

So the ideal solution here would be to use the Unicode 3.1 table (and a further ideal solution would be to use the latest Unicode 4.1 for all the related stuff). I would really like to push Microsoft to deprecate the existing CompareInfo (leaving it as is) and add collation support in an alternative class (or inside CharUnicodeInfo or whatever). Sadly it is impossible to just upgrade the Unicode version used in LCMapString while keeping it compatible, because of some inconsistent character categories (well, .NET 2.0 upgraded the source of UnicodeCategory anyway, so GetUnicodeCategory() returns inconsistent values for some characters) and because of a design failure in the Windows sortkey table (I'll describe more about that, maybe next time).

Train of Sort #2 - decent, native oriented


Since I actually wrote nothing yesterday, this is the first real body of this Train of Sort series.

I think the Windows collation table is designed rather more carefully than the UCA DUCET. On looking into the Block Elements characters (U+2580 to U+259F) I noticed that many of them have an identical primary weight (though I guess U+25EF is incorrect). Thus some of them are regarded as equivalent when you pass CompareOptions.IgnoreNonSpace to the collation methods in CompareInfo. As another example, non-JIS CJK squared characters are sorted by their alphabetical names (though I doubt U+33C3).
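This is the kind of dump I used to spot the shared primary weights (a sketch; read the leading bytes of each key as the primary level):

    using System;
    using System.Globalization;

    class BlockElementsDump
    {
        static void Main ()
        {
            CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
            // dump raw sortkeys for the whole Block Elements range
            for (char c = '\u2580'; c <= '\u259F'; c++)
                Console.WriteLine ("U+{0:X4}: {1}", (int) c,
                    BitConverter.ToString (ci.GetSortKey (c.ToString ()).KeyData));
        }
    }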

Similarly, Windows collation is rather more native-oriented than UCA. I agree with Michael Kaplan on that statement (it is written in the entry of his that I linked yesterday).

However, it is not always successful. Windows is not always well-designed for native speakers.

What I can immediately tell is about Japanese: Windows treats the Japanese vertical repetition marks (U+3031 and U+3032) as repeating just the previous single character. That is totally wrong, and no native Japanese reader would interpret those marks that way. They are used to repeat two or more characters, and how many depends on context. For example, "U+304B U+308F U+308B U+3032" reads "kawaru-gawaru" and should be equivalent to U+304B U+308F U+308B U+304C U+308F U+308B in Hiragana. It should not be U+304B U+308F U+308B U+308B U+3099. Since the behavior of those marks is unpredictable (we cannot determine how many characters are repeated), it is impossible to support U+3031 and U+3032 as repetition marks.

Similarly, some Japanese people use U+3005 to repeat more than one character (I think this might be a somewhat old or incorrect usage, but that repeater is anyway the older one, and it is notable that it is the Japanese who use this character in such ways).

Those marks are not part of UCA, and all the remaining essence of Japanese sorting (like dash mark interpretation) is common to both. Windows is too "aggressive" in supporting native languages, compared to the more modest Unicode standard.

Well, Windows is not always worse than UCA. In Unicode NFKD, U+309B (the full-width voiced sound mark) is decomposed to U+0020 U+3099. Since every Kana character with a voiced sound mark is decomposed to the non-voiced Kana plus U+3099 (the combining voiced sound mark), U+309B and U+3099 can never be regarded as equivalent (there is the extraneous U+0020). Since almost every Japanese IM inputs U+309B rather than U+3099 for the full-width voiced sound mark, this NFKD mapping totally does not make sense with regard to collation. It's just a mapping failure.
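On .NET 2.0 you can see the stray space directly (a sketch; String.Normalize() is the 2.0 API):

    using System;
    using System.Text;

    class VoicedMarkNfkd
    {
        static void Main ()
        {
            // NFKD of U+309B yields U+0020 U+3099 - note the leading space
            string nfkd = "\u309B".Normalize (NormalizationForm.FormKD);
            foreach (char c in nfkd)
                Console.Write ("U+{0:X4} ", (int) c);
            Console.WriteLine ();
        }
    }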

System.Xml code is not mine

I came across M. David Peterson thinking that I had written all of our xml code. That is quite incorrect ;-)

  • System.Xml - the XmlReader and XmlDocument related stuff is originally from Jason Diamond. The XmlWriter related stuff is from Kral Ferch.
  • System.Xml.Schema - Dwivedi, Ajay kumar wrote the non-compilation part of it.
  • System.Xml.Serialization - Lluis Sanchez rules.
  • System.Xml.XPath - Piers Haken implemented most of our XPath engine. After that, Ben Maurer restructured the design for XSLT. XPathNavigator is Jason Diamond's work.
  • System.Xml.Xsl - It is Ben Maurer who wrote most of the code.

Well, yes, I wrote most of the parts other than those listed above (what remains? ;-) I am just a grave keeper. And recently Andrew Skiba from Mainsoft has become another code hero in the sys.xml area, contributing a decent test framework and bugfixes to the XML parsers and XSLT.

Train of Sort


The best approach to interoperability is to focus on getting widespread, conformant implementation of the XSLT 2.0 specification.

I just quoted that from Microsoft's statement on XML Schema. Mhm, I am so dazzled that I might have mistyped.

Actually it is quite wrong. There are some bugs in the XML Schema specification, e.g. an undefined but significant order that arises from a combination of complex type extensions and substitution groups. No XML Schema implementation can be consistent with a broken specification.

Train of Sort #1 - introduction

I have been working on our own text string collation engine that works like the Windows one. In other words, I am working on our own System.Globalization.CompareInfo. Yesterday I posted the first working patch, which is however highly unstable.

We used to use ICU for our collation engine, but currently it is disabled by default. Sadly there were some gaps between our needs and ICU. Basically, ICU implements the Unicode Collation Algorithm (UCA). Windows collation history started earlier than that of UCA, so Windows has its own way of doing string collation.

Since Windows collation is older, it has several problems that UCA does not have. Michael Kaplan, the leading Windows I18N developer (if you are interested in I18N, his blog is kind of a must-read), once implied that Windows collation does better than UCA (he did not assert that Windows is better, so the statement is not incorrect). When I read that, I thought it sounded true. UCA - precisely, the Default Unicode Collation Element Table (DUCET) - looks less related to native text collation (especially where it just depends on codepoint order in Unicode). However, on digging into Windows collation, I realized that it is not entirely true. So over the next few days I will describe the other side of Windows collation.

OpenDocument validator


Someone in the OpenDocument Fellowship read my blog entry on RelaxngValidatingReader and OpenDocument, and then asked me if I could create example code. Sure :-) Here is my tiny "odfvalidate" that validates an .odt file or its content xml files (any of content.xml, styles.xml, meta.xml, or settings.xml should be fine) using the RELAX NG schema.
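The core of such a tool is small; here is a minimal sketch of the idea (not the actual odfvalidate source - it only checks content.xml and skips error handling):

    using System;
    using System.Xml;
    using Commons.Xml.Relaxng;
    using ICSharpCode.SharpZipLib.Zip;

    class OdfValidateSketch
    {
        static void Main (string [] args)
        {
            // load the OpenDocument RELAX NG schema
            RelaxngPattern schema = RelaxngPattern.Read (
                new XmlTextReader ("OpenDocument-schema-v1.0-os.rng"));

            // pull content.xml out of the .odt (an .odt is just a zip archive)
            ZipFile zip = new ZipFile (args [0]);
            XmlReader content = new XmlTextReader (
                zip.GetInputStream (zip.GetEntry ("content.xml")));

            // RelaxngValidatingReader throws RelaxngException on an error
            XmlReader v = new RelaxngValidatingReader (content, schema);
            while (v.Read ())
                ;
            Console.WriteLine ("content.xml is valid.");
            zip.Close ();
        }
    }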

I put up an example file to validate - actually just the OpenDocument specification document converted by OO.o 2.0. When you validate that document, it will report a validation error, which is really an error in its content.xml.

Mono users can just get that simple .cs file and the OpenDocument rng schema, and then compile it with "mcs -r:Commons.Xml.Relaxng -r:ICSharpCode.SharpZipLib". Microsoft .NET users can still try it by downloading the dll files listed there.

Beyond my tiny tool, I think AODL would be able to use it. In the above directory I put a tiny diff file for the AODL code. Add Commons.Xml.Relaxng.dll as a reference, add OpenDocument-schema-v1.0-os.rng as a resource, and then build. I haven't actually run it, so it might not work though.

If you have any further questions, please feel free to ask me.