[sc34wg3] Re: SAM issue: string-normalization

Patrick Durusau sc34wg3@isotopicmaps.org
Mon, 28 Apr 2003 10:01:45 -0400


Lars,

Apologies for dropping this thread!

Let me answer your last comment to see if anyone thinks this is an 
important issue:


>Patrick: | This may be of particular significance for systems using
>| Chinese/Japanese texts in non-web based topic maps.
>
>Lars: Why? As far as I know CJ characters don't decompose at all.
>
>Lars: Note that I'm not disputing that there is, theoretically, a class
>of systems that might prefer NFD over NFC for one reason or another.
>It's just that because of sorting we decided to go for NFC.

It is not a decomposition issue, but one of compatibility-equivalent 
characters. From Technical Report #15 from the Unicode Consortium 
(http://www.unicode.org/reports/tr15/):

> Normalization Form KC additionally levels the differences between 
> compatibility-equivalent characters which are inappropriately 
> distinguished in many circumstances. For example, the half-width and 
> full-width katakana characters will normalize to the same strings, as 
> will Roman Numerals and their letter equivalents. More complete 
> examples are provided in Annex 1: Examples and Charts 
> <http://www.unicode.org/reports/tr15/#Examples>.
>
> Normalization forms KC and KD must not be blindly applied to arbitrary 
> text. Since they erase many formatting distinctions, they will prevent 
> round-trip conversion to and from many legacy character sets, and 
> unless supplanted by formatting markup, may remove distinctions that 
> are important to the semantics of the text. The best way to think of 
> these normalization forms is like uppercase or lowercase mappings: 
> useful in certain contexts for identifying core meanings, but also 
> performing modifications to the text that may not always be 
> appropriate. They can be applied more freely to domains with 
> restricted character sets, such as in Annex 7: Programming Language 
> Identifiers 
> <http://www.unicode.org/reports/tr15/#Programming_Language_Identifiers>.
>
Note that allowing the use of normalization forms KC and KD may impose 
additional requirements on the software.
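
To make the compatibility-equivalence point concrete, here is a short 
Python sketch using the standard-library unicodedata module (the 
specific characters are just illustrative examples drawn from TR #15):

```python
import unicodedata

halfwidth = "\uFF76"   # HALFWIDTH KATAKANA LETTER KA
fullwidth = "\u30AB"   # KATAKANA LETTER KA

# Under NFC the half-width and full-width forms remain distinct strings...
assert unicodedata.normalize("NFC", halfwidth) != \
       unicodedata.normalize("NFC", fullwidth)

# ...but NFKC levels the compatibility distinction between them.
assert unicodedata.normalize("NFKC", halfwidth) == \
       unicodedata.normalize("NFKC", fullwidth)

# Likewise, ROMAN NUMERAL TWO (U+2161) becomes the letters "II" under
# NFKC -- a distinction that cannot then be round-tripped back.
assert unicodedata.normalize("NFKC", "\u2161") == "II"
assert unicodedata.normalize("NFC", "\u2161") == "\u2161"
```

This is exactly the "erasing of formatting distinctions" that the 
report warns about: useful for identifying core meanings, but lossy.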

Note that I checked with a professor at Kyoto University, who responded that:

>Unfortunately, the problems that do exist with
>C/J/K texts are not even touched upon by the Unicode normalization
>forms.  Given this, and the fact that XML 1.1 also requires NFC, I
>would think the Topic Map community is well on track by using it
>as well.
>
If there is a consensus by those who represent users in that domain on 
this position, I have no strong objections to keeping the prior 
decision. I noticed the possible problem when doing research on a TEI 
project and did not want the SAM to require a form of normalization that 
could have an adverse impact on some users.

Patrick

Lars Marius Garshol wrote:

>See <URL:
>http://www.ontopia.net/omnigator/models/topic_complete.jsp?tm=tm-standards.xtm&id=string-normalization
>> for background on this.
>
>* Patrick Durusau
>|
>| On a more technical issue, you might want to note that definition of
>| String in the SAM [...] While following the W3C for XML 1.1 (see
>| details at: http://www.w3.org/TR/charmod/) does exclude (unless this
>| is one of those optional things) other normalization forms that may
>| be required in non-Web based topic map contexts. 
>
>The SAM says one must use NFC because we decided in Baltimore that
>this was the right way to do it. You even wrote the minutes recording
>that decision. Versions of SAM prior to N0396 did not require NFC;
>they just required normalization.
>
>Personally I thought it was better to not require a particular
>normalization form, but to allow implementations to use whatever they
>wanted to internally. However, as you'll see from the resolution I put
>into the tm-standards.xtm topic map the reason we chose to do it this
>way was that we resolved the op-sorting issue to say that sorting
>happens by Unicode code point order. (You can see this, implicitly, in
>the minutes as well.)
>
>When sorting happens that way different implementations will sort the
>same sequence of strings in different ways because the strings will be
>different in those implementations.
>
>So we must either:
>
> a) leave the normalization form undefined, and then also leave the
>    sorting of sort names undefined (though probably we should in that
>    case add some guidance telling users what sort of processing they
>    should expect that implementations will apply when sorting), or
>
> b) leave string-normalization and op-sorting the way they are.
>
>I don't think leaving the normalization form undefined is a problem in
>and of itself so long as we say that normalization must be applied. So
>long as we require normalization to be performed before merging,
>differently encoded topic maps (whether they come from XTM, HyTM, or
>LTM) will merge correctly. Remember that the merging is not done on
>the XML directly, but on the structure built from the XML, whatever
>that may be. Also, NFD gives the same result as NFC; it just uses
>different code point sequences to represent the same text.
>
>| This may be of particular significance for systems using
>| Chinese/Japanese texts in non-web based topic maps.
>
>Why? As far as I know CJ characters don't decompose at all.
>
>Note that I'm not disputing that there is, theoretically, a class
>of systems that might prefer NFD over NFC for one reason or another.
>It's just that because of sorting we decided to go for NFC.
>
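
A small Python sketch (again using the standard unicodedata module, 
with an illustrative pair of strings) of the two points Lars makes 
above: NFC and NFD represent the same text with different code point 
sequences, and sorting by raw code point order therefore diverges 
between implementations that normalize differently:

```python
import unicodedata

s = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE, already in NFC
nfc = unicodedata.normalize("NFC", s)   # one code point: U+00E9
nfd = unicodedata.normalize("NFD", s)   # two: 'e' + U+0301 COMBINING ACUTE

# Different code point sequences...
assert nfc != nfd
assert len(nfc) == 1 and len(nfd) == 2

# ...but the same text: normalizing either form to NFC gives one string.
assert unicodedata.normalize("NFC", nfd) == nfc

# Code-point-order sorting diverges between the two forms: U+00E9 sorts
# after 'f' (U+0066), while the decomposed 'e' (U+0065) sorts before it.
words = [s, "f"]
nfc_sorted = sorted(unicodedata.normalize("NFC", w) for w in words)
nfd_sorted = sorted(unicodedata.normalize("NFD", w) for w in words)
assert nfc_sorted[0] == "f"
assert nfd_sorted[0] == "e\u0301"
```

This is why fixing a single normalization form and sorting by code 
point order had to be decided together, as option (b) above does.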

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu
Co-Editor, ISO Reference Model for Topic Maps