[sc34wg3] SAM issue: string-normalization - Editorial structure of N0396

Lars Marius Garshol sc34wg3@isotopicmaps.org
22 Apr 2003 17:04:38 +0200


See <URL:
http://www.ontopia.net/omnigator/models/topic_complete.jsp?tm=tm-standards.xtm&id=string-normalization
> for background on this.

* Patrick Durusau
|
| On a more technical issue, you might want to note that definition of
| String in the SAM [...] While following the W3C for XML 1.1 (see
| details at: http://www.w3.org/TR/charmod/) does exclude (unless this
| is one of those optional things) other normalization forms that may
| be required in non-Web based topic map contexts. 

The SAM says one must use NFC because we decided in Baltimore that
this was the right way to do it. You even wrote the minutes recording
that decision. Versions of SAM prior to N0396 did not require NFC;
they just required normalization.

Personally I thought it was better to not require a particular
normalization form, but to allow implementations to use whatever they
wanted to internally. However, as you'll see from the resolution I put
into the tm-standards.xtm topic map, the reason we chose to do it this
way was that we resolved the op-sorting issue to say that sorting
happens by Unicode code point order. (You can see this, implicitly, in
the minutes as well.)

When sorting happens that way, implementations that use different
normalization forms will sort the same sequence of strings in
different ways, because the code point sequences of the strings will
differ between those implementations.
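To make that concrete, here is a minimal Python sketch. The helper
name sort_by_code_point and the sample names are made up for
illustration and come from neither the SAM nor any implementation. It
only assumes that Python compares strings code point by code point,
which is what the op-sorting resolution asks for.

```python
import unicodedata

def sort_by_code_point(names, form):
    """Normalize each name to `form`, then sort by Unicode code point.

    Hypothetical helper for illustration only.
    """
    normalized = [unicodedata.normalize(form, name) for name in names]
    # Python's default str ordering compares code point by code point.
    return sorted(normalized)

names = ["zebra", "\u00e9mile"]  # second name starts with U+00E9 (e-acute)

nfc_order = sort_by_code_point(names, "NFC")
nfd_order = sort_by_code_point(names, "NFD")

# Recompose the NFD result so the two orderings can be compared directly.
nfd_order_recomposed = [unicodedata.normalize("NFC", s) for s in nfd_order]

print(nfc_order)             # under NFC, U+00E9 sorts after "z"
print(nfd_order_recomposed)  # under NFD, "e" + U+0301 sorts before "z"
```

The same two names come out in opposite orders, which is why sorting
by code point only gives interoperable results if the normalization
form is fixed.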

So we must either:

 a) leave the normalization form undefined, and then also leave the
    sorting of sort names undefined (though probably we should in that
    case add some guidance telling users what sort of processing they
    should expect that implementations will apply when sorting), or

 b) leave string-normalization and op-sorting the way they are.

I don't think leaving the normalization form undefined is a problem in
and of itself so long as we say that normalization must be applied. So
long as we require normalization to be performed before merging,
differently encoded topic maps (whether they come from XTM, HyTM, or
LTM) will merge correctly. Remember that the merging is not done on
the XML directly, but on the structure built from the XML, whatever
that may be. Also, NFD gives the same merging result as NFC; it just
uses different code point sequences to represent the same text.
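As a rough illustration of why the choice of form does not matter for
merging, here is a small Python sketch. The function name
equal_after_normalization and the sample strings are hypothetical; the
comparison merely stands in for whatever string equality test an
implementation applies when merging.

```python
import unicodedata

def equal_after_normalization(a, b, form):
    """Compare two strings after normalizing both to the same form.

    Hypothetical stand-in for the equality test used during merging.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "Gar\u00e7on"     # c-cedilla as the single code point U+00E7
decomposed = "Garc\u0327on"  # "c" followed by combining cedilla U+0327

# Without normalization the two code point sequences differ.
raw_equal = composed == decomposed

# With either form applied to both sides, they compare equal.
nfc_equal = equal_after_normalization(composed, decomposed, "NFC")
nfd_equal = equal_after_normalization(composed, decomposed, "NFD")

print(raw_equal, nfc_equal, nfd_equal)
```

What matters for merging is only that both strings go through the same
form before they are compared.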

| This may be of particular significance for systems using
| Chinese/Japanese texts in non-web based topic maps.

Why? As far as I know CJ characters don't decompose at all.

Note that I'm not disputing that there is, theoretically, a class of
systems that might prefer NFD over NFC for one reason or another.
It's just that because of sorting we decided to go for NFC.

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
GSM: +47 98 21 55 50                  <URL: http://www.garshol.priv.no >