Friday, January 15, 2010

Standard xml:lang Language Codes

Note: Something about errors, or perhaps a comparison between the different programming languages. Here we see some examples and the official standards page.








Program Manager (XMetaL)
Administrator
Member

Posts: 490



« on: December 11, 2009, 02:33:20 PM »


Product(s)
XMetaL Author Enterprise 6.0
(should also be supported in XMetaL Author Essential 6.0 and XMAX 6.0 when those products are released)

Oops
Instead of providing default settings in the xmetal60.ini file we decided to leave it up to clients to decide on these values (that's actually a good thing); however, we apparently also missed documenting this.

Beware: this is going to be a long post ...

Request: I've spent a few hours writing this up and I think it covers most things at this point, but feedback is very welcome. 2009/12/15: I've made some changes directly to the original post after getting feedback from Richard Ishida (it should be easier to read all in one place rather than jumping back and forth between comments).

Background / Legacy Code
The values that XMetaL Author recognizes for spell checking default to legacy values that the product uses internally. These values were invented before xml:lang existed (actually before XML existed due to the fact that it originally came from another product). In most cases they do not match any of the RFC values most people would wish to use with xml:lang. Some of the more common ones happen to match (like EN) but this is just by chance and quite a few others do not.

Standard xml:lang Language Codes
The W3C XML Recommendation defines basic rules for xml:lang (how it must be declared in your DTD or Schema). Related to this are the standards ISO-639-1, ISO-639-2, RFC4646, RFC4647 and RFC5646 (the last one actually makes 4646 obsolete). Also related is BCP47, which is the reference preferred by the W3C. BCP47 is a concatenation of several RFCs and, though long, basically puts everything in one place.

Basically, ISO-639-1 consists of two letter language codes (that many people may recognize) and ISO-639-2 uses three letter codes. The RFCs describe how the full code should be constructed, and codes may include language, region, script, 'variants' and other things, including rules on letter casing and separator characters like "-". If you need to read one document please read BCP47.

We've tried to design our spell checking support for xml:lang to be as flexible as possible. This means you may opt to specify any "standard" value or you may use other values (perhaps from an industry or other standard you may wish to follow), and you may specify multiple values in the INI file for a particular spell checking language (keeping in mind that the value for the xml:lang attribute in the XML source itself can only have one value and will therefore either match one INI setting or none).

This means you should first decide which values you will use based on all of your requirements (external tools, XSLT transforms, specifications, etc). Then configure XMetaL Author's spell checker to understand the values you are working with. This is the approach I would recommend: let your requirements drive the values you use, but whenever possible stick to the most current W3C and associated standards.

Table of INI Variables Supported by the Spell Checker for xml:lang
Following is the complete list of currently supported spell checking languages. It includes the INI variable name (prefixed with "WT") that controls the values you wish to have recognized for the xml:lang attribute, the English name for the language, and the corresponding ISO-639-1 and ISO-639-2 value(s) that I think would most commonly be used for that language by most people working with xml:lang.

The values listed here for ISO-639-1 and ISO-639-2 are suggestions only, though they were taken directly from those specs. Be sure you consult with other people in your organization before deciding on exact values, as other tools and processes may have specific requirements.

INI Variable Name         English Name for the Language ISO-639-1  ISO-639-2
WT_AFRIKAANS              Afrikaans                     af         afr
WT_CATALAN                Catalan                       ca         cat
WT_CZECH                  Czech                         cs         ces, cze
    Note: Both codes are considered synonyms.
WT_DANISH                 Danish                        da         dan
WT_DUTCH                  Dutch                         nl         dut, nld
    Note: Both codes are considered synonyms.
WT_ENGLISH                English                       en         eng
WT_FRENCH                 French                        fr         fra, fre
    Note: Both codes are considered synonyms.
WT_GALICIAN               Galician                      gl         glg
WT_GERMAN                 German                        de         ger, deu
    Note: Both codes are considered synonyms.
WT_GREEK                  Greek                         el         gre, ell
    Note(1): Both codes are considered synonyms.
    Note(2): Ancient Greek (before the year 1454) is "grc" and is not supported by the spell checker.
WT_ISLANDIC               Islandic (Icelandic)          is         ice, isl
    Note: Both codes are considered synonyms.
WT_ITALIAN                Italian                       it         ita
WT_NORWEGIAN              Norwegian                     no         nor
WT_PORTUGUESE             Portuguese                    pt         por
WT_RUSSIAN                Russian                       ru         rus
WT_SLOVAK                 Slovak                        sk         slo, slk
    Note: Both codes are considered synonyms.
WT_SESOTHO                Sesotho (Sotho, South Sotho)  st         sot
WT_SPANISH                Spanish                       es         spa
WT_SWEDISH                Swedish                       sv         swe
WT_SETSWANA               Setswana (Tswana)             tn         tsn
WT_TURKISH                Turkish                       tr         tur
WT_XHOSA                  Xhosa                         xh         xho
WT_ZULU                   Zulu                          zu         zul
WT_ENGLISH_AUSTRALIAN     Australian English            en-au      eng-AU
WT_ENGLISH_CANADIAN       Canadian English              en-ca      eng-CA
WT_ENGLISH_BRITISH        British English               en-gb      eng-GB
WT_ENGLISH_US             United States English         en-us      eng-US
WT_FRENCH_CANADIAN        Canadian French               fr-ca      fra-CA, fre-CA
WT_GERMAN_SWISS           Swiss German                  de-ch      deu-CH, ger-CH
WT_PORTUGUESE_BRASIL      Brazilian Portuguese          pt-br      por-BR
WT_SPANISH_AMERICAN       American Spanish              es-us      spa-US
WT_NO_LINGUISTIC_CONTENT  Do Not Spell Check                       zxx
    Note: Content is treated as a non-spellcheckable language.

INI Settings Examples
The values listed below for ISO-639-1 (two letter codes) and ISO-639-2 (three letter codes) are suggestions only, though they were taken directly from those specs. Be sure you consult with other people in your organization before deciding on exact values, as other tools and processes may have specific requirements.

If the xml:lang code (the value portion of the INI variable) does not include the particular value you need just replace the existing one, or append your additional value to the end after adding a semicolon.

In the dialects section, two letter country codes are appended to the language code to make up "dialects" which are specific regional variances in languages, however (again) these values are here as examples only and it is up to you to decide what is correct for your organization's purposes.

#SPELL CHECKER LANGUAGES FOR xml:lang ATTRIBUTE VALUES
WT_AFRIKAANS=af;afr
WT_CATALAN=ca;cat
WT_CZECH=cs;ces;cze
WT_DANISH=da;dan
WT_DUTCH=nl;dut;nld
WT_ENGLISH=en;eng
WT_FRENCH=fr;fra;fre
WT_GALICIAN=gl;glg
WT_GERMAN=de;deu;ger
WT_GREEK=el;ell;gre
WT_ISLANDIC=is;ice;isl
WT_ITALIAN=it;ita
WT_NORWEGIAN=no;nor
WT_PORTUGUESE=pt;por
WT_RUSSIAN=ru;rus
WT_SLOVAK=sk;slk;slo
WT_SESOTHO=st;sot
WT_SPANISH=es;spa
WT_SWEDISH=sv;swe
WT_SETSWANA=tn;tsn
WT_TURKISH=tr;tur
WT_XHOSA=xh;xho
WT_ZULU=zu;zul
WT_NO_LINGUISTIC_CONTENT=zxx

#SPELL CHECKER DIALECTS FOR xml:lang ATTRIBUTE VALUES
WT_ENGLISH_AUSTRALIAN=en-AU;eng-AU
WT_ENGLISH_CANADIAN=en-CA;eng-CA
WT_ENGLISH_BRITISH=en-GB;eng-GB
WT_ENGLISH_US=en-US;eng-US
WT_FRENCH_CANADIAN=fr-CA;fra-CA;fre-CA
WT_GERMAN_SWISS=de-CH;deu-CH;ger-CH
WT_PORTUGUESE_BR=pt-BR;por-BR
WT_SPANISH_AMERICAN=es-US;spa-US


Note(1): If your xml:lang value's language code is not listed in the INI file then the fallback functionality of the spell checker is to use the default language as selected in the spell checker's Options dialog (set from within the main spell checker dialog, launched via F7).

Note(2): Letter casing (uppercase vs lowercase) is ignored with regard to xml:lang (ie: "EN-US", "en-us" and "en-US" are considered equivalent).

Note(3): zxx has been recommended to represent text that should not be interpreted as a standard human language. When it is used as set above XMetaL will skip over any element with xml:lang set to this value and not spell check it at all. This is useful for sections of programming code or perhaps other uses. As with all the other values here you may configure WT_NO_LINGUISTIC_CONTENT to whatever you like if "zxx" does not meet your needs (provided the value meets the xml:lang attribute value rules in the W3C XML Recommendation).

Note(4): Regardless of any settings in the INI file, when an xml:lang attribute value is set to be an empty string value, such as xml:lang="" that element will be skipped and not spell checked. This behavior is essentially equivalent to #3 above from the point of view of the spell checker (though it does have a distinct difference in meaning which is actually "no language" as opposed to "non human language"). However, XMetaL Author purposely makes it difficult for users to set an attribute value to be an empty string using the Attribute Inspector, so to do this you must either have implemented special code in your XMetaL Author customization to allow users to accomplish this, or you must set the value using PlainText view.

Note(5): The values you use in the INI file should be unique to each setting: if you specify the same value in more than one INI variable, unexpected behavior will occur. Please don't ask what the behavior might be, just avoid doing this.

Note(6): Do not specify the same INI variable multiple times. This should not be an issue as far as XMetaL is concerned, but you may not see the results you expect in this case. Again, please don't ask what the behavior might be, just avoid doing this.
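
The matching rules in the notes above can be sketched in a few lines of Python. This is only an illustration of the idea, not XMetaL's implementation: the function name build_lang_lookup, the treatment of "#" lines as comments, and the decision to raise an error on a duplicated value are all assumptions made for the sketch (the notes only say that duplicates cause unexpected behavior).

```python
def build_lang_lookup(ini_text):
    """Map each xml:lang value (lowercased) to the WT_ variable that lists it."""
    lookup = {}
    for raw_line in ini_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comment lines and anything that is not NAME=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, values = line.partition("=")
        name = name.strip()
        for value in values.split(";"):
            key = value.strip().lower()  # Note(2): letter casing is ignored
            if not key:
                continue
            # Note(5): a value should belong to one variable only. Raising an
            # error is this sketch's choice, not documented XMetaL behavior.
            if lookup.get(key, name) != name:
                raise ValueError(
                    f"{key!r} is listed under both {lookup[key]} and {name}"
                )
            lookup[key] = name
    return lookup


# A small subset of the settings shown above.
SAMPLE_INI = """\
#SPELL CHECKER LANGUAGES FOR xml:lang ATTRIBUTE VALUES
WT_ENGLISH=en;eng
WT_FRENCH=fr;fra;fre
WT_ENGLISH_US=en-US;eng-US
WT_NO_LINGUISTIC_CONTENT=zxx
"""

LOOKUP = build_lang_lookup(SAMPLE_INI)
print(LOOKUP.get("EN-us".lower()))  # WT_ENGLISH_US
print(LOOKUP.get("de"))             # None, so the default language would apply
```

A check like this, run over your own INI values before deploying them, is one way to catch the duplicate-value situation described in Note(5).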

The Shipped xmetal60.ini File
The following setting is included with the xmetal60.ini file:
WT_ENGLISH_BRITISH=EN-UK;EN-GB
This can be safely removed if desired. It should be removed if you will be specifying your own WT_ENGLISH_BRITISH settings elsewhere in the INI file to be sure there are no conflicts. Note however, that the default internal (legacy) code of "EN-UK" will be recognized if this variable is not present and set to another value.

How the Auto-Switching Works
Whether you use the spell checking dialog (F7) or the new 6.0 release's "check spelling while typing" option (see Tools > Options), aka "red squiggles", XMetaL Author switches to the language specified in the xml:lang attribute when entering an element containing PCDATA (text).

If that element in turn has a child element with a different xml:lang value the spell checker changes to that corresponding child element's value. When no xml:lang value is set for an element it inherits the value of the parent element or nearest ancestor (standard xml:lang rules).

If such an element has no ancestors with an xml:lang value set then the default value for spell checking (as set in the spell checker's Options dialog) is used.

So, assuming you have all the settings above in your INI file and your default language is set to "English-US" in the spell checker's Options dialog, when entering a given element with one of the following xml:lang values the spell checker should do the following:
  • xml:lang is not set --> XMetaL begins walking up the document tree checking for parent elements with an xml:lang value set (and uses the nearest). If it fails to find any then the value as set in the spell checker Options dialog (in this case "English-US") is used.
  • xml:lang="en" --> All English spellings (US, UK, CA and AU) are considered correct (both "colour" and "color" are considered correct).
  • xml:lang = "en-US" --> English-US is used (ie: "color" is correct, "colour" is incorrect)
  • xml:lang = "en-CA" --> English-CA is used (ie: "colour" is correct, "color" is incorrect)
  • xml:lang="" --> no spell checking is performed (element is skipped)
  • xml:lang="zxx" --> no spell checking is performed (element is skipped)
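
The lookup described in the list above can be sketched as follows. This is a rough model written against Python's standard xml.etree.ElementTree, not the code XMetaL runs; the names LOOKUP, DEFAULT_LANGUAGE and spell_language are hypothetical, the table is a small subset of the INI settings, and treating an inherited xml:lang="" the same as one set directly on the element is an assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical subset of the INI settings, keyed by lowercased xml:lang value.
LOOKUP = {
    "en": "WT_ENGLISH",
    "en-us": "WT_ENGLISH_US",
    "en-ca": "WT_ENGLISH_CANADIAN",
    "zxx": "WT_NO_LINGUISTIC_CONTENT",
}
DEFAULT_LANGUAGE = "WT_ENGLISH_US"  # stands in for the Options dialog setting


def effective_lang(elem, parents):
    """Return the nearest xml:lang value, walking up the tree; None if unset."""
    node = elem
    while node is not None:
        if XML_LANG in node.attrib:
            return node.attrib[XML_LANG]
        node = parents.get(node)
    return None


def spell_language(elem, parents):
    """Return the WT_ variable to spell check with, or None to skip the element."""
    lang = effective_lang(elem, parents)
    if lang is None:
        return DEFAULT_LANGUAGE  # no xml:lang anywhere up the tree
    if lang == "":
        return None  # xml:lang="" is always skipped
    name = LOOKUP.get(lang.lower(), DEFAULT_LANGUAGE)  # unlisted -> default
    return None if name == "WT_NO_LINGUISTIC_CONTENT" else name


root = ET.fromstring(
    '<doc xml:lang="en-CA"><p>colour</p>'
    '<code xml:lang="zxx">x = 1</code>'
    '<note xml:lang="">n/a</note>'
    '<q xml:lang="de">Fluss</q></doc>'
)
# ElementTree has no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}

for elem in root.iter():
    print(elem.tag, "->", spell_language(elem, parents))
```

In this sample the p element inherits en-CA from its parent, the code and note elements are skipped, and q falls back to the default because "de" is not in the subset table.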

External References

« Last Edit: January 05, 2010, 06:55:40 PM by Derek Read » Logged

Derek Read, Program Manager (XMetaL)
dcramer
Member

Posts: 68



« Reply #1 on: December 11, 2009, 02:49:12 PM »


This sounds very cool. Is there a way (ideally without adding a phony xml:lang on the element) to specify a list of elements that should, by default, be skipped in spell checking? That way you could configure it not to spell check code listings and code-like things. This isn't as urgent since it now has the red squiggly style spell checking which is less obtrusive, but still would be nice.

Thanks,
David

« Last Edit: December 11, 2009, 03:06:17 PM by dcramer » Logged

David Cramer
Technical Writer
Motive, an Alcatel-Lucent Company
Derek Read
Program Manager (XMetaL)
Administrator
Member

Posts: 490



« Reply #2 on: December 11, 2009, 03:08:54 PM »


dcramer:

There are new APIs in 6.0 that should allow you to do this. When I have some time to properly describe that I will create a new forum post that covers this, unless we release XMetaL Developer 6.0 and associated docs before that.


Logged

Derek Read, Program Manager (XMetaL)
rishida
Member

Posts: 1


« Reply #3 on: December 15, 2009, 09:39:12 AM »


First, this is a great step forward, and kudos to Justsystems for implementing it. I have just a few points I'd like to raise:

1. RFC 5646 obsoletes RFC4646 whether you choose to follow it or not ;-). On the other hand, it *doesn't* obsolete RFC4647 - that's still current. Actually, at the W3C we prefer to refer to these specs using the label BCP 47 (http://www.rfc-editor.org/rfc/bcp/bcp47.txt). That covers both the language tag syntax spec (RFC 5646) and the matching spec (RFC 4647), and always refers to the most up-to-date version of each.

2. It doesn't seem quite strongly enough stated for my taste that, if you aren't dealing with legacy situations, you should use language tags as defined in BCP47 as it says in the XML spec. By implication, this means that you should use the IANA Language Subtag Registry to look up subtags, not use ISO code lists. This is important, because the IANA registry provides only one subtag per language, whereas the ISO codes sometimes offer two or three possibilities. The IANA registry also goes *way* beyond the list of codes offered by ISO 639-1/2, due to the inclusion of around 7,000 ISO 639-3 codes. Using the codes as defined in BCP47 and the IANA registry increases the interoperability of the data.

3. I would suggest that you label the two rightmost columns in the table of languages above as BCP47 and Legacy, respectively. (Note that the region codes are not actually part of ISO 639.)

4. I assume that WT_German accepts spellings for either Swiss German or (National) German (eg. it fails to recognise incorrect omission of es-zet characters). It may be worth adding a note to that effect to the WT_German line, since otherwise people may assume that de is sufficient for spell-checking normal German, when actually it isn't.

5. It may also be better to clarify the intended usage of zxx, which is *not* actually the same as xml:lang="", although the effect for spell checking is the same (ie. skip the text). See http://www.w3.org/International/questions/qa-no-language (for further clarification, see http://rishida.net/utils/subtags/index.php?lookup=zxx+und&submit=Look+up).

Hope that helps,
RI


Logged
Derek Read
Program Manager (XMetaL)
Administrator
Member

Posts: 490



« Reply #4 on: December 15, 2009, 02:44:02 PM »


@rishida

Thanks for the great feedback.

Because our software is very often only one piece of a larger installation (which often includes CMS systems, work flow management systems, translation memory and management systems, post processing systems, and systems that perform transformations to various file formats, and perhaps other things) the goal here was to show how to enable the XMetaL spell checker to take advantage of xml:lang values to support spell checking.

I'll leave education in usage of xml:lang up to the experts, and there is no shortage of information on this topic, including your posts here, the W3C XML Recommendation and the specs it links to, and many books on XML including my favourite "Charles Goldfarb's XML Handbook" (Charles Goldfarb and Paul Prescod 4th Edition: ISBN 0-13-065198-2; 5th Edition: 0-13-049765-7).

Regarding specific points...

1. The main point here (which you understood) was that, for any given client using XMetaL, various factors come into play that may make it difficult to stick with the latest specs (and ultimately, the XML source is often for internal use only with files to be consumed externally being transformations based on this internal format). I agree though -- whenever possible it makes sense to follow current specs.

2/3. I'll have a look at these suggestions and make some changes.

4. 'German National' is a special case. The first time German National is used you are prompted to select from one of three options (all XMetaL versions 4.x up to and including 6.0):
  • New spelling (Fluss)
  • Old spelling (Fluß)
  • Allow both

Swiss German allows the new spelling method only.

5. This is a good point as well. Approaching it strictly from an xml:lang usage point of view there really should be a difference (when used correctly), but (as you say) from the point of view of our spell checker there isn't any difference. The net result for the spell checker will be to skip over these elements.

« Last Edit: December 15, 2009, 07:57:47 PM by Derek Read »
