[sc34wg3] Canonical XTM: implementation report

Kal Ahmed sc34wg3@isotopicmaps.org
Wed, 21 Jan 2004 20:33:54 +0000


On Wed, 2004-01-21 at 10:44, Lars Marius Garshol wrote:
> I've now written an implementation of the entire 2003-12-30 CXTM draft
> excluding only the representation of the [reified] property, and am
> happy to say that it took only a few hours. The biggest obstacle was
> handling type-instance associations, but even that wasn't really hard,
> though some minor trickery was required.
> 
Excellent! That makes you the winner of the first pint :)

> The resulting code is 583 lines and takes just a fraction of a second
> to canonicalize all of opera.xtm, so everything seems fine. The whole
> thing was really quite straightforward, though it is of course still
> possible that I've gotten something wrong somewhere.
> 
> My conclusion is that this is what we want. It really is implementable
> and it seems to me to work well and give a clear picture of model
> instances (at least once pretty-printed :). If we could get a draft
> that fixes all the known issues I think we should be ready to start
> creating test cases to verify that this really does work as it should.
> 
> Using Canonical XML as the basis for this specification appears to be
> fine, and although that document is hard to read that may not be much
> of a problem once we have some examples to work against. (Actually,
> the examples in the Canonical XML document should suffice.)
> 
> 
Yeah, the goal of CXTM should be string comparison, not
human-readability. It would be useful to have some more human-readable
formatting, but I think that it's a minor issue really. Most XML editors
can automatically format XML to your chosen indentation model, and it's
a really simple XSLT task to turn a CXML document into a pretty-printed
one (assuming your XSLT processor does a nice job of indenting).
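
As a rough illustration of how little work the pretty-printing step is,
here is a minimal sketch. It uses Python's standard xml.dom.minidom
rather than XSLT, and the function name pretty_print and the sample
element names are only assumptions made for the example, not anything
taken from the draft.

```python
from xml.dom import minidom


def pretty_print(xml_text, indent="  "):
    """Return an indented copy of a single-line XML document.

    The output is for human readers only; string comparison should
    still be done on the canonical form, not on this one.
    """
    return minidom.parseString(xml_text).toprettyxml(indent=indent)


if __name__ == "__main__":
    one_line = "<topicMap><topic><baseName>Tosca</baseName></topic></topicMap>"
    print(pretty_print(one_line))
```

Note that minidom adds an XML declaration at the top, which is one more
reason to treat the result as a reading aid only.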

> If anyone wants actual canonicalized documents with corresponding
> input I'll be happy to provide examples.
> 
That would be a Good Thing.

> 
> --- General comments
> 
>  - The RNC schema was very helpful in verifying that I'd gotten things
>    right. I strongly recommend that we include it in an annex so that
>    it becomes an official part of the standard.
> 

Agreed.

>  - All the empty XML Infoset properties being specified throughout the
>    document makes the useful stuff drown in the dross and really makes
>    the document hard to read. I think it would be much easier to
>    review and implement this standard if we cut that out, since then
>    the substance would be visible rather than hidden.
> 

Yeah, it's a balance between readability and completeness. In the end the
people round the table at Philadelphia voted for completeness. Let's see
what comes out of the CD stage, but I take your comment on board.

>  - The resulting documents are all on one line. For opera.xtm this
>    means 1.1MB of text on a single line, and it really gets awkward to
>    read. I think we should add some whitespace to the canonical
>    representation to make it more readable.
> 
>    The original "Canonical XTM" technical report had this in it:
> 
>    "The output document must be a canonical XML document. In addition,
>    a line feed (U+000A) must be inserted after every end tag and
>    likewise after every start tag of elements that have element
>    content or are empty. (This means <baseNameString>, <resourceData>,
>    <topicRef>, <instanceOf>, <resourceRef>, <subjectIndicatorRef>.)"
> 
>    Maybe we should do something like it?
> 

See my comments above.
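
For reference, the line feed rule quoted from the old technical report
could be sketched roughly as below. This is only an illustration under
stated assumptions: it is not a full Canonical XML serializer, it
ignores namespaces, comments and tail text, it treats whitespace-only
text as absent, and the names write and with_line_feeds are made up for
the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

LF = "\n"  # line feed, U+000A


def write(elem, out):
    # Attributes in sorted order, double-quoted, as Canonical XML does.
    attrs = "".join(
        ' %s="%s"' % (name, escape(value, {'"': "&quot;"}))
        for name, value in sorted(elem.attrib.items())
    )
    out.append("<%s%s>" % (elem.tag, attrs))
    children = list(elem)
    has_text = bool(elem.text and elem.text.strip())
    # Line feed after the start tag of elements that have element
    # content or are empty; elements holding text get none.
    if children or not has_text:
        out.append(LF)
    if has_text:
        out.append(escape(elem.text))
    for child in children:
        write(child, out)
    # Line feed after every end tag. Empty elements are written as a
    # start/end tag pair, never as <a/>.
    out.append("</%s>" % elem.tag + LF)


def with_line_feeds(xml_text):
    out = []
    write(ET.fromstring(xml_text), out)
    return "".join(out)


if __name__ == "__main__":
    print(with_line_feeds(
        '<topic><baseNameString>Tosca</baseNameString>'
        '<topicRef href="#t1"/></topic>'
    ))
```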

>  - My earlier comments about relativization of locators still apply,
>    of course.
> 

Yes, and that needs to be fixed.

>  - Is this a committee draft? Will it appear in the SC34 document
>    registry? Has it already? (Couldn't find it.)
> 

Ken and I have had problems with the HTML and PDF generated by the
stylesheets and I got really busy and haven't got back to him yet on
this. So there is no CD yet, but I'll try and get something back to Ken
this week. It might make sense for me to edit in the stuff about
relativizing addresses along with some of the points below before going
out to CD, so there is some silver lining to this particular cloud.

> 
> --- Comments on specific sections
> 
>  - 3.3: TMDM already requires strings to be in NFC, so there's no need
>    to repeat it here. (It could go in as a note if someone feels it's
>    a useful clarification.)
> 
OK, there is no sense in repeating TMDM - though I think the note would
be useful.

>  - 3.7: There's no need to compare on [variants] here. If the first 4
>    properties are equal the topic name items will have merged anyway.
> 

Good point. I'll take that out.

>  - 3.11: Association roles don't have scope, and the [parent] property
>    provides what is necessary for comparisons outside the context of
>    their parent association.
> 

Yep.

>  - 4.4: There is no [subject address] property any more, use [subject
>    locator]. (Hey, you were the one who pointed out that this should
>    be changed! :)
> 

I wasn't listening to myself while I said it, obviously :)

>  - 4.5: [scope] is never null, so this should say "the empty set"
>    instead.
> 
OK

>  - 4.7: The RNC schema contradicts the order given here. The schema
>    has scope first, then type, while the text has it the other way
>    around.
> 
The text is right; this order was changed for consistency with other
serialisation orders.

>  - 4.10: Here we need to give more guidance on how to serialize
>    locators. I think we should stress that they should be
>    externalized, meaning that in URIs difficult characters should be
>    escaped etc. Referencing some relevant W3C document specifying this
>    would be good. (RFC 2396 is less clear than it could be on
>    precisely which characters *must* be escaped.)
> 
Is it necessary to escape the characters? What is TMDM's position on
this? I would have thought that the string value of the address property
is a completely unescaped string. After all, here we are not concerned
with this address being usable as a URI or even with it conforming to
the relevant RFCs; we just need a canonical string representation of the
address. So I would have thought an unescaped URI string in NFC would do
the trick.
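
To make "an unescaped URI string in NFC" concrete, here is a minimal
sketch. It assumes that "unescaped" means decoding percent-escapes as
UTF-8, which is my reading rather than anything TMDM says, and the
function name canonical_address is hypothetical. One known catch with
blanket unescaping is that an escaped reserved character such as %2F
becomes indistinguishable from a literal "/".

```python
import unicodedata
from urllib.parse import unquote


def canonical_address(address):
    """Return a comparison-friendly form of a locator string.

    Assumption: percent-escapes are decoded as UTF-8 and the result is
    put into Unicode Normalization Form C. The result is meant for
    string comparison, not for use as a dereferenceable URI.
    """
    return unicodedata.normalize("NFC", unquote(address))


if __name__ == "__main__":
    escaped = "http://example.org/caf%C3%A9"
    decomposed = "http://example.org/cafe\u0301"
    print(canonical_address(escaped) == canonical_address(decomposed))
```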

>  - 4.10: Locators don't have [address]; it's [reference].
> 
Tell that to the TAG ;-). OK that needs to change.

>  - 4.12: This one is a lot of work to implement. You have to remember
>    the position of every object in the TM in case it could have been
>    reified. (Or test for it and remember it if it was reified, which
>    is even more work.) 
> 
>    Given that this property is in any case redundant (there will be a
>    pointer from [reifier] anyway) I think we should cut this. The only
>    consideration is what to do in query results, where the reified
>    might not be included. I think that the rules as given don't really
>    cover that case, and that we shouldn't try to. 
> 
>    (Unfortunately, the same problem applies to [reifier]. I'm tempted
>    to say that we should not try to handle it before we see that there
>    is a need for it for TMQL.)
> 
>    I recommend that we leave this out. After all, we don't do
>    topic.[roles]...
> 
I'm in two minds about this. On the one hand leaving it out does make
the canonicalisation process simpler. On the other hand it means that
there is no test that the [reified] property is set correctly. I think
CXTM needs to be complete even if it makes it more onerous to implement.

By the same logic, the topic.[roles] property must be canonicalised too.

>  - 4.13: Same two points as for 4.10. 
> 
Same responses then in that case :)

>  - 4.13: This is the one place where the canonicalization process
>    didn't feel entirely clean. We might want to make <resource> wrap
>    <locator> or even lose <resource> entirely and just have <locator>.
> 

Yeah, I guess that it's not really necessary to repeat 4.10 here - we
should instead canonicalise the locator item that is the value of the
[resource] property.

>  - 4.15: Should make it clear that the element is left out if [type]
>    is null. (I thought it was supposed to be empty before I saw that
>    <type> was consistently optional in the schema.)

I think that all the places that call out to canonicalising the [type]
property say that it should only be done if the property value is not
null. So I think we are covered.
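
In code terms the intent is simply "no element at all when [type] is
null", as opposed to an empty element. A small sketch of that pattern
follows; the element and attribute names (occurrence, type, value, ref)
and the function name are placeholders chosen for the example and are
not taken from the draft or the RNC schema.

```python
import xml.etree.ElementTree as ET


def occurrence_element(type_ref, value):
    """Build an element, leaving <type> out entirely when it is null."""
    elem = ET.Element("occurrence")
    if type_ref is not None:
        # Only emitted for a non-null [type]; never written as an
        # empty placeholder.
        ET.SubElement(elem, "type").set("ref", type_ref)
    ET.SubElement(elem, "value").text = value
    return elem


if __name__ == "__main__":
    print(ET.tostring(occurrence_element(None, "x"), encoding="unicode"))
    print(ET.tostring(occurrence_element("t1", "x"), encoding="unicode"))
```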


Thanks for your feedback, Lars Marius. It's nice to know that it can be
done!

Cheers,

Kal
-- 
Kal Ahmed <kal@techquila.com>
techquila