Default encoding for mcs

| No Comments | No TrackBacks

Today I checked in a patch for mcs that changes the default encoding of the source file, to System.Text.Encoding.Default instead of Latin1. Surprisingly we have been using it as the default encoding, regardless of the actual environment. For example, Microsoft csc uses Shift_JIS on my Japanese box, not Latin1 (Latin1 is even not listed as candidate when we - Japanese - save files in text editors).

This change brought confusion, since we have several Latin1-dependent source files in our own class libraries, and on modern Linux environment the default encoding is likely to be UTF8, regardless of the region and the language (while I was making this change on Windows machine). Thus I ended up to add /codepage:28591 to some classlib Makefiles at first, and then to revert that change and started a discussion.

I felt kinda depressed, on the other hand I thought it's kinda interesting. I must be making such a change that makes sense, but it is likely to bring confusion to EuroAmerican people. Yes, that is what we (non-EuroAmericans) had experienced - we could not compile sources and used to wonder why mcs can't, until we tried to use e.g. /codepage:932. When I went to Japanese Novell customer to help deploying ASP.NET application on SuSE server, I had to ask them to convert the source to UTF-8 (well now I think I could ask to set fileEncoding in web.config instead, but anyways not sure whether CP932 encoding worked fine or not).

If mcs does not premise Latin1, your Latin1-dependent ASP.NET applications won't run fine in the future version of mono on Linux. Even though mcs compiled sources with the platform-dependent default encoding instead of Latin1, it does not mean that your ASP.NET applications written in your "native" encoding (on Windows) will run fine on Linux without modifications. As I wrote above, the default encoding is likely to be UTF8 and thus you will have to explicitly specify "fileEncoding" in your web.config file.

Sounds pretty bad? Yeah, might be for Latin1 people. However, it has been happening on non-Latin1 environment. Thus almost no one is likely to be saved. But going back to "Latin1 by default" does not sound a good idea. At least it is not a culture-neutral decision.

There is a similar but totally different problem on DllImportAttribute.CharSet when the value is CharSet.Ansi. It is locale dependent encoding on Microsoft.NET, but in mono it is always UTF-8. The situation is similar, but here we use platform-neutral UTF-8 (well, I know that UTF-8 itself is still not culture-neutral, as it definitely prefer Latin over CJK. It lays Latin letters out in 2 byte areas, while most of CJK letters are in 3 byte areas, including only about 200 letters of Hiragana/Katakanas).

No TrackBacks

TrackBack URL: http://veritas-vos-liberabit.com/monogatari/mt-tb.cgi/47

Leave a comment