August 2005 Archives

Default encoding for mcs

Today I checked in a patch for mcs that changes the default encoding of source files to System.Text.Encoding.Default instead of Latin1. Surprisingly, we had been using Latin1 as the default encoding regardless of the actual environment. For example, Microsoft csc uses Shift_JIS on my Japanese box, not Latin1 (Latin1 is not even listed as a candidate when we - Japanese - save files in text editors).
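
For illustration, here is a minimal sketch (not part of the patch) showing the two encodings in question: System.Text.Encoding.Default reflects the locale/platform, while code page 28591 is the old hard-coded Latin1 default.

    using System;
    using System.Text;

    class DefaultEncodingCheck
    {
        static void Main()
        {
            // Locale/platform-dependent default: e.g. shift_jis on a Japanese
            // Windows box, utf-8 on a typical modern Linux desktop.
            Console.WriteLine(Encoding.Default.WebName);

            // The previous hard-coded default: ISO-8859-1 (Latin1), code page 28591.
            Console.WriteLine(Encoding.GetEncoding(28591).WebName);
        }
    }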

This change brought some confusion, since we have several Latin1-dependent source files in our own class libraries, and on a modern Linux environment the default encoding is likely to be UTF-8 regardless of region and language (while I was making this change on a Windows machine). Thus I ended up adding /codepage:28591 to some classlib Makefiles at first, then reverting that change and starting a discussion.

I felt kind of depressed, but on the other hand I found it kind of interesting. I am making a change that should make sense, yet it is likely to bring confusion to European and American people. That is exactly what we (non-Europeans/Americans) had been experiencing: we could not compile our sources and used to wonder why mcs couldn't, until we tried e.g. /codepage:932. When I visited a Japanese Novell customer to help deploy an ASP.NET application on a SuSE server, I had to ask them to convert the sources to UTF-8 (now I think I could have asked them to set fileEncoding in web.config instead, though I am not sure whether the CP932 encoding would have worked fine).

If mcs no longer presumes Latin1, your Latin1-dependent ASP.NET applications won't run correctly in future versions of Mono on Linux. And even though mcs now compiles sources with the platform-dependent default encoding instead of Latin1, that does not mean that ASP.NET applications written in your "native" encoding (on Windows) will run on Linux without modification. As I wrote above, the default encoding there is likely to be UTF-8, so you will have to explicitly specify "fileEncoding" in your web.config file.
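
To see why the declared encoding matters, here is a small illustrative example (not from the mcs patch) of how the same Latin1 bytes come out garbled when decoded as UTF-8, which is exactly what happens when a Latin1 source or page is read under a UTF-8 default:

    using System;
    using System.Text;

    class EncodingMismatch
    {
        static void Main()
        {
            // "café" encoded as Latin1: the é becomes the single byte 0xE9.
            byte[] latin1Bytes = Encoding.GetEncoding(28591).GetBytes("caf\u00E9");

            // Decoded with the right encoding, the text round-trips...
            Console.WriteLine(Encoding.GetEncoding(28591).GetString(latin1Bytes));

            // ...but decoded as UTF-8 the 0xE9 byte is invalid and gets replaced,
            // which is the kind of breakage "fileEncoding" is meant to avoid.
            Console.WriteLine(Encoding.UTF8.GetString(latin1Bytes));
        }
    }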

Sounds pretty bad? Yeah, it might be for Latin1 people. However, the same thing has been happening in non-Latin1 environments, so almost no one is likely to be saved either way. Still, going back to "Latin1 by default" does not sound like a good idea. At the very least it is not a culture-neutral decision.

There is a similar but totally different problem with DllImportAttribute.CharSet when the value is CharSet.Ansi. It means the locale-dependent encoding on Microsoft .NET, but in Mono it is always UTF-8. The situation is similar, but here we use the platform-neutral UTF-8 (well, I know that UTF-8 itself is still not culture-neutral, as it definitely prefers Latin over CJK: it lays Latin letters out in the one- and two-byte areas, while most CJK letters are in the three-byte area, including the mere 200 or so Hiragana/Katakana).
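
As a hypothetical illustration (the native library and function names here are made up), this is the kind of declaration affected: with CharSet.Ansi, Microsoft .NET marshals the string argument in the locale's ANSI code page, while Mono marshals it as UTF-8.

    using System.Runtime.InteropServices;

    class AnsiMarshalingExample
    {
        // Hypothetical native entry point, for illustration only.
        // On MS.NET the string is converted to the locale-dependent ANSI code page;
        // on Mono it is converted to UTF-8.
        [DllImport("mynativelib", CharSet = CharSet.Ansi)]
        static extern void native_takes_text(string text);
    }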

Managed collation support in Mono

Finally I checked in the managed collation (CompareInfo) implementation in Mono. It was about a four-month task, and I mostly spent my time just figuring out how the Windows collation table is composed. If Windows were conformant to the Unicode standards (UTS #10, the Unicode Collation Algorithm, and more importantly the Unicode Character Database), it would have been much shorter (maybe two months or so).

Even though I spent those extra months, there was a small beneficial side effect: we got the facts about Windows collation, which had been said to be good but in fact turned out not to be that good.

The worst concern I now have about the .NET API is that CompareInfo is used EVERYWHERE in corlib APIs such as String.IndexOf(), String.StartsWith(), ArrayList.Sort() and so on. They cause unexpected behavior in application code that does not always need to be culture sensitive; we want culture-sensitive comparison only in certain situations.
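
A quick sketch of the kind of surprise I mean, using CompareInfo directly; the well-known Turkish casing rules make even a case-insensitive comparison culture dependent:

    using System;
    using System.Globalization;

    class CultureSensitiveSurprise
    {
        static void Main()
        {
            CompareInfo invariant = CultureInfo.InvariantCulture.CompareInfo;
            CompareInfo turkish = new CultureInfo("tr-TR").CompareInfo;

            // The invariant culture says "i" and "I" are equal when case is ignored...
            Console.WriteLine(invariant.Compare("i", "I", CompareOptions.IgnoreCase)); // 0

            // ...but under Turkish collation they are distinct letters, so this is non-zero.
            Console.WriteLine(turkish.Compare("i", "I", CompareOptions.IgnoreCase));
        }
    }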

Anyway, managed collation is now checked in, though right now it is not activated unless you explicitly set an environment variable. The implementation is quite close to Windows, I believe. I actually started from the SortKey binary data structure (described in the book "Developing International Software, Second Edition"), sorting out how the invariant GetSortKey() computes sort keys for each character, in order to find out how characters are categorized and how they can be derived from the Unicode Character Database (UCD). Now I know most of the composition of the sort key table, which also revealed that Windows sorting is not always based on the UCD.
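
For the curious, this is roughly the kind of probing involved; a small sketch that dumps the sort key bytes the invariant CompareInfo computes for a few characters:

    using System;
    using System.Globalization;

    class SortKeyDump
    {
        static void Main()
        {
            CompareInfo invariant = CultureInfo.InvariantCulture.CompareInfo;
            string[] samples = { "a", "A", "\u3042", "\u30A2" }; // 'a', 'A', HIRAGANA A, KATAKANA A
            foreach (string s in samples)
            {
                byte[] key = invariant.GetSortKey(s).KeyData;
                Console.WriteLine("{0} => {1}", s, BitConverter.ToString(key));
            }
        }
    }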

From my Unicode Normalization (UTR #15) experience (actually the first step of my collation work was to implement UTR #15, since I believed that Windows collation must be conformant to the Unicode standard), I added a quick code-point comparison optimization to Compare(), which made Compare() for equivalent strings 125x faster (in other words, it used to be that slow). It became closer to the MS.NET CompareInfo but is still far behind ICU4C 3.4 (which is about 2x quicker). I'm pretty sure it could be faster if I spent more memory, but this is corlib, where we should mind memory usage.
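
Here is a hedged sketch of that optimization (not the actual Mono code): if the two strings are identical code point by code point, the answer is 0 and the collation tables never need to be consulted.

    using System;
    using System.Globalization;

    class QuickCompareSketch
    {
        // Sketch only: short-circuit code-point-equal strings, otherwise fall
        // back to the slow culture-sensitive comparison.
        static int Compare(string s1, string s2, CompareInfo ci, CompareOptions options)
        {
            if (s1.Length == s2.Length)
            {
                int i = 0;
                while (i < s1.Length && s1[i] == s2[i])
                    i++;
                if (i == s1.Length)
                    return 0; // code-point equal: the common fast case
            }
            return ci.Compare(s1, s2, options); // full collation path
        }

        static void Main()
        {
            CompareInfo inv = CultureInfo.InvariantCulture.CompareInfo;
            Console.WriteLine(Compare("hello", "hello", inv, CompareOptions.None));      // fast path
            Console.WriteLine(Compare("hello", "h\u00E9llo", inv, CompareOptions.None)); // full comparison
        }
    }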

I tried some performance testing. Note that these are just a few cases, and collation performance heavily depends on the target strings. Anyway, the results were as below, and the code is here (the numbers are seconds):

CORRECTION: I removed the "Ordinal" comparison, which actually had nothing to do with ICU; it is Mono's internal (pretty straightforward) comparison icall.

Options          w/ ICU4C 2.8   w/ ICU4C 3.4   Managed
Ordinal          1.502          1.522          0.761
None             0.671          0.391          0.741
StringSort       0.671          0.390          0.731
IgnoreCase       0.681          0.391          0.731
IgnoreSymbols    0.671          0.410          0.741
IgnoreKanaType   0.671          0.381          0.741

MS.NET is about 1.2x faster than ICU 2.8 (i.e. it is also much slower than ICU 3.4). Compared to ICU 3.4 the managed implementation is much slower, but Compare() does not look so bad.

On the other hand, below are the index-search related methods (I used IgnoreNonSpace here; the numbers are seconds again):

UPDATED: after some hacking, I was able to reduce the execution time by about 3/4; the Managed column below shows the original → updated figures where they changed.

Operation         w/ ICU4C 2.8   w/ ICU4C 3.4   Managed
IsPrefix          1.171          0.230          0.231
IsSuffix          0              0              0
IndexOf (1)       0.08           0.08           5.708 → 1.092
IndexOf (2)       0.08           0.08           0.270 → 0.230
LastIndexOf (1)   0.992          0.991          4.536 → 1.162
LastIndexOf (2)   0.170          0.161          4.730 → 1.492

Sadly, the index search methods look pretty slow for now. Well, actually it depends on the compared strings (for example, changing the strings in IsSuffix() made the managed collation the fastest; that example code is practically not runnable under MS.NET, which is extremely slow there). One remaining optimization that immediately comes to mind is IndexOf(), where none of the known algorithms such as Boyer-Moore (BM) or Knuth-Morris-Pratt (KMP) are used... they might not be so straightforward for culture-sensitive search. It would be interesting to find out how (or whether) optimizations could be done here.
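
For reference, the calls being measured look like this (an illustrative sketch, not the benchmark code linked above), here with CompareOptions.IgnoreNonSpace so that diacritics are ignored:

    using System;
    using System.Globalization;

    class IndexSearchExample
    {
        static void Main()
        {
            CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
            string s = "r\u00E9sum\u00E9"; // "résumé"

            // With IgnoreNonSpace the accents are treated as ignorable,
            // so all of these find culture-sensitive matches.
            Console.WriteLine(ci.IsPrefix(s, "re", CompareOptions.IgnoreNonSpace));
            Console.WriteLine(ci.IsSuffix(s, "me", CompareOptions.IgnoreNonSpace));
            Console.WriteLine(ci.IndexOf(s, "sume", CompareOptions.IgnoreNonSpace));
            Console.WriteLine(ci.LastIndexOf(s, "e", CompareOptions.IgnoreNonSpace));
        }
    }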

the "true" Checklists: XML Performance

While I knew that Microsoft publishes some documents called "Patterns and Practices", I didn't know there is a section on XML Performance. Sadly, I was scarcely satisfied by that document. The checklist is quite insufficient, some items miss the point, and some are even bad for performance. So here I put the "true" checklists, to save misinformed .NET XML developers.

  • Design Considerations
    • Avoid XML as long as possible.
    • Avoid processing large documents.
    • Avoid validation. XmlValidatingReader is 2-3x slower than XmlTextReader.
    • Avoid DTD, especially IDs and entity references.
    • Use streaming interfaces such as XmlReader or SAXdotnet.
    • Consider hard-coded processing, including validation.
    • Shorten node name length.
    • Consider sharing a NameTable, but only when the names are likely to be common across documents. With more and more irrelevant names, it becomes slower and slower.
  • Parsing XML
    • Use XmlTextReader and avoid validating readers.
    • When only a single node is required, consider using XmlDocument.ReadNode() rather than loading the entire document with Load().
    • Set the XmlResolver property to null on XmlReaders that support it, to avoid access to external resources.
    • Make full use of MoveToContent() and Skip(); they avoid extraneous name creation (see the sketch after this list). However, the benefit becomes almost nothing when you use XmlValidatingReader.
    • Avoid accessing Value for Text/CDATA nodes as long as possible.
  • Validating XML
    • Avoid extraneous validation.
    • Consider caching schemas.
    • Avoid identity constraint usage. Not only because it stores key/fields for the entire document, but also because the keys are boxed.
    • Avoid extraneous strong typing. It results in XmlSchemaDatatype.ParseValue() calls. Avoiding it can also let you avoid access to the Value string.
  • Writing XML
    • Write output directly as long as possible.
    • To save documents, an XmlTextWriter without indentation is better than TextWriter/Stream/file output (which are all indented), except when the output is meant for human reading.
  • DOM Processing
    • Avoid InnerXml. It internally creates XmlTextReader/XmlTextWriter. InnerText is fine.
    • Avoid PreviousSibling. XmlDocument is very inefficient for backward traverse.
    • Append nodes as soon as possible. Adding a big subtree results in a longer extraneous pass to check ID attributes.
    • Prefer FirstChild/NextSibling and avoid accessing ChildNodes; it creates an XmlNodeList that is otherwise never instantiated.
  • XPath Processing
    • Consider using XPathDocument, but only when you need the entire document. With XmlDocument you can use ReadNode(), but there is no equivalent for XPathDocument.
    • Avoid preceding-sibling and preceding axis queries, especially over XmlDocument. They result in sorting, and for XmlDocument they require access to PreviousSibling.
    • Avoid // (descendant). Most of the returned nodes are likely to be irrelevant.
    • Avoid position(), last() and positional predicates (especially things like foo[last()-1]).
    • Compile XPath string to XPathExpression and reuse it for frequent query.
    • Don't run XPath queries frequently. They are costly, since they always have to Clone() XPathNavigators.
  • XSLT Processing
    • Reuse (cache) XslTransform objects.
    • Avoid key() in XSLT. It can return all kinds of nodes, which prevents node-type based optimization.
    • Avoid document() especially with nonstatic argument.
    • Pull style (e.g. xsl:for-each) is usually better than template match.
    • Minimize output size. More importantly, minimize input.
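
To make a few of the parsing items above concrete, here is a minimal sketch (the file name and element names are made up) that combines a plain XmlTextReader, a null XmlResolver, and MoveToContent()/Skip() to touch as few nodes as possible:

    using System;
    using System.Xml;

    class FastRead
    {
        static void Main()
        {
            // Hypothetical input: items.xml containing <items><item id="...">...</item>...</items>
            XmlTextReader reader = new XmlTextReader("items.xml");
            reader.XmlResolver = null;  // never fetch DTDs or other external resources

            reader.MoveToContent();     // skip the prolog, land on the document element
            reader.Read();              // step inside it
            while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                    Console.WriteLine(reader.GetAttribute("id")); // attributes are cheap: no Value string needed
                reader.Skip();          // jump past the current subtree without touching its contents
            }
            reader.Close();
        }
    }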

What I found funny was that they say users should use XmlValidatingReader. With that, they can never claim that XmlReader is better than SAX on performance, since it creates value strings for every node (so their performance tips are mutually exclusive). It is not avoidable even if you call Skip() or MoveToContent(), since skipping validation of skipped nodes is not allowed (actually, in .NET 1.0 Microsoft's developers did exactly that, and it caused bugs in XmlValidatingReader).

DTLL 0.4

Jeni Tennison told me (well, several of us including me) that she had put up a refreshed version of the Datatype Library Language (DTLL), which is expected to become ISO DSDL Part 5.

Since Mono already has validators for RELAX NG (with custom datatype support) and NVDL, it should be no wonder that I have been trying to implement DTLL support.

The real history goes back earlier, but it was last spring that I practically started writing code for DTLL version 0.3. After hacking on the grammar object model, I asked Jeni Tennison for some examples, and she was kind enough to give me a decent one (the latest specification now includes even nicer examples). With that I finished the grammar object model.

The next step was, of course, compilation so that the grammar would be ready for validation. But at that stage I was rather ambitious and wanted to support automatic mapping from DTLL datatypes to CLI types. I thought it was possible, but I noticed that the DTLL model was kind of non-deterministic... that is, with that version of DTLL I would have had to implement some kind of backtracking over the already-mapped part of the data whenever the input raw string did not match a certain choice branch. Along with some questions (such as why "enumeration" needs indirect designation via XPath), I asked her about the possibility of limiting the flexibility of the item-pattern syntax. In DTLL version 0.3, the definition of "list" was:

    # Lists
    div {
      \list = element list { attribute separator { regular-expression }?, extension-attribute*, item-pattern+ }
    } # End Lists

    # Item Patterns
    div {
      item-pattern |= group
      item-pattern |= choice
      item-pattern |= optional
      item-pattern |= zeroOrMore
      item-pattern |= oneOrMore
      item-pattern |= \list
      item-pattern |= item-datatype
      item-pattern |= item-value

      group = element group { extension-attribute*, item-pattern+ }
      choice = element choice { extension-attribute*, item-pattern+ }
      optional = element optional { extension-attribute*, item-pattern+ }
      zeroOrMore = element zeroOrMore { extension-attribute*, item-pattern+ }
      oneOrMore = element oneOrMore { extension-attribute*, item-pattern+ }

      item-datatype |= element data { extension-attribute*, type, param* }
      item-datatype |= anonymous-datatype

      item-value = element value { extension-attribute*, type?, text }
    } # End Item Patterns

... and today, while reading the latest DTLL 0.4, I noticed this:

    # Lists
    div {
      \list = element list { attribute separator { regular-expression }?, extension-attribute* }
    } # End Lists

... so the entire item-pattern went away (!). I was surprised at her decision, but I'd say it is nice. A list became just a list of strings... I wonder whether it really cannot be a typed pattern (I might be shooting myself in the foot now). The specification also removed "enumeration", so the pattern syntax became pretty simple. This change means that almost all patterns are treated as regular expressions. I wonder if I am the one to blame ;-)

Now I don't think it is possible to automatically generate "DTLL datatype to native CLI type" conversion code; instead, I think such conversion code should be provided by DTLL users. So it seems to be time for me to restart hacking on it...
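
Just to make that idea concrete, here is a purely hypothetical sketch of what user-provided conversion code might look like; none of these types exist in Mono, it is only an illustration of the direction I have in mind:

    using System;
    using System.Collections;

    // Hypothetical: a converter the DTLL user supplies for one datatype.
    public interface IDtllDatatypeConverter
    {
        object Parse(string rawValue);
    }

    // Example user-supplied converter for a hypothetical unsigned-integer datatype.
    public class UIntConverter : IDtllDatatypeConverter
    {
        public object Parse(string rawValue)
        {
            return UInt32.Parse(rawValue.Trim());
        }
    }

    // Hypothetical registry the validator would consult after a value has
    // matched its DTLL pattern; no automatic type mapping is attempted.
    public class DtllConverterRegistry
    {
        Hashtable converters = new Hashtable(); // datatype name -> converter

        public void Register(string datatypeName, IDtllDatatypeConverter converter)
        {
            converters[datatypeName] = converter;
        }

        public object Convert(string datatypeName, string rawValue)
        {
            IDtllDatatypeConverter c = (IDtllDatatypeConverter) converters[datatypeName];
            return c == null ? (object) rawValue : c.Parse(rawValue);
        }
    }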