[sc34wg3] Just *doing* it!

Patrick Durusau sc34wg3@isotopicmaps.org
Wed, 29 Jan 2003 10:00:46 -0500


Greetings,

Since Lars suggests that we "just *do* it," I tender the following comments, which I started last week and which have morphed several times in response to the online discussion. Apologies for their late appearance.

I offer the following as a partial view of ISO 13250:

1. Principles of topic maps: A standard, separate from any particular model, of the underlying principles that comprise the topic map paradigm. We should learn from Jim Mason's report of the ODA experience and focus on making those principles explicit, not allowing notational rigor to become the actual goal. Such a "principles of topic maps" should provide a way to evaluate any particular model for its adherence to the principles that make topic maps, well, topic maps.

I see such "principles" as providing a means of distinguishing models for topic maps from models for other things, as well as providing a means to speak meaningfully about various aspects of topic map models.

2. A data model: Description of a topic map model, using a widely used modeling technique or formalism. Such a data model could and should be evaluated against the "principles of topic maps" suggested as #1.

I would note at this point that such a data model is a prerequisite to widespread use and success of topic maps. Having said that, however, I would note that no particular data model is universal in scope and hence, a data model without the "principles of topic maps" artificially binds topic maps to a particular view of the world.

I have two non-topic map examples of why limiting a standard to one particular data model is something to be avoided.

a. Unicode: As many of you know, the Unicode standard is based upon what is known as the "character/glyph" model. Essentially that model posits that a Gothic capital A, an Arial capital A, and a Times New Roman capital A (sorry, can't use the actual glyphs here) are all the same "character," although they have different glyphs. This led the Unicode standard to say that if several glyphs represent the same character, that character gets only one code point in the Unicode standard. All well and good, so long as you are talking about modern typography.

The further you recede from modern typography, the more problematic the character/glyph distinction becomes. Consider that in Akkadian (written in a cuneiform writing system used for thousands of years by the Babylonians, Assyrians and numerous others) the glyph used for a particular character will vary according to the genre of literature, time period and even location. Now, from a strictly Unicode perspective, I should have only one code point, but that means that I am losing the information that is represented by the glyph, such as the context of its usage, locale, etc.

You may be tempted to say, "well, but that is just an academic problem" but consider that a similar problem exists for Gangii characters among the millions of extant readers of CJK scripts, who quite reasonably object to a data model that does not fit their scripts. It is true that data can be flattened to fit any particular data model but that is not an acceptable option.
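As a minimal sketch of the loss described above: the record type, its field names, and the sample values below are hypothetical illustrations, not anything defined by Unicode. The code point used is only a placeholder standing in for a single encoded character. Two attestations of the same character that differ in period and genre become indistinguishable once only code points are kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """One occurrence of a sign in a source (hypothetical record type)."""
    code_point: int  # the only thing a character encoding stores
    period: str      # information carried by the glyph shape (assumed field)
    genre: str       # likewise assumed, for illustration only


def to_plain_text(attestations):
    """Flatten attestations to encoded text, keeping code points only."""
    return "".join(chr(a.code_point) for a in attestations)


# Placeholder code point; any single code point makes the same point.
SIGN = ord("A")

older = Attestation(SIGN, period="earlier period", genre="letter")
later = Attestation(SIGN, period="later period", genre="royal inscription")

# The two records differ, but their encoded text is identical.
print(older != later)                                  # distinct as data
print(to_plain_text([older]) == to_plain_text([later]))  # same once flattened
```

Nothing here argues that the extra information cannot be stored somewhere else; it only shows that it is not in the encoded text itself.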

b. XML: While it has been denied that XML has a data model, and I suppose that is true in the sense that there is no "official" statement that it follows a tree model, it is nevertheless true that any well-formed XML document you examine looks a lot like a tree. It and SGML look enough like trees to mislead Renear, DeRose and others into positing that ordered hierarchies of container objects were some underlying ontological model of texts. That position was defended in a variety of weakening forms but was certainly true, from a certain point of view.

The problem with the implied data model of XML is that it works, but only if one holds a particular view of the structure of texts. Step outside that view, and the model fails. This is also far from being a merely academic problem. What is versioning if it is not a particular form of overlapping hierarchies that cannot be represented in a tree model? Legislative documents, legal contracts, editing environments, office documents: an entire industry to service their need for real versioning of documents has been precluded by the data model of XML. If you can't represent it in the data model, then it is simply not possible and therefore not considered.
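The overlap problem can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any XML tool: spans are hypothetical half-open character offsets, and the function names are invented for the example. A set of spans fits a single tree only if every pair is either disjoint or nested; a revision that starts in one paragraph and ends in the next breaks that condition.

```python
from itertools import combinations


def overlaps_improperly(a, b):
    """True if half-open spans a and b overlap without one containing the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    disjoint = a_end <= b_start or b_end <= a_start
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return not (disjoint or a_in_b or b_in_a)


def fits_single_tree(spans):
    """True if every pair of spans is disjoint or nested, as a tree requires."""
    return not any(overlaps_improperly(a, b) for a, b in combinations(spans, 2))


# Hypothetical character offsets in a document.
paragraph_1 = (0, 40)
paragraph_2 = (40, 90)
revision = (30, 55)  # an edit that starts in paragraph 1 and ends in paragraph 2

print(fits_single_tree([paragraph_1, paragraph_2]))
print(fits_single_tree([paragraph_1, paragraph_2, revision]))
```

The first check passes and the second fails: the paragraph hierarchy and the revision hierarchy are each fine on their own, but they cannot both be containers in one tree.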

It should be noted that discarding of *irrelevant* markup is not a cause that only I and a few others find of interest. See "An Algorithm for Streaming XPath Processing with Forward and Backward Axes" (http://www.cs.nyu.edu/~deepak/publications/icde.pdf), a paper that uses an algorithm for XPath expressions that allows processing of 1 GB files. While the traditional XML model is quite useful, it is also the case that moving beyond it allows users to use their data as needed, rather than as allowed by a preset model.

It is the case that many useful topic maps can no doubt be constructed from any particular data model. The problem arises when new data appears, or new views of old data emerge (such as my versioning/overlapping hierarchies argument about XML), or new complexities are seen in data, that were not anticipated in a particular data model. It is insufficient to say that a data model can be fixed; perhaps it can, perhaps not. The point is that no data model ever has been, is, or will be the universal data model for all purposes and all data.

While a "principles of topic maps" would be interesting, without a data model based on a widely accepted view of data, such as XML, it would be little more than interesting. Note that the data model I suggest in #2 will probably represent to no small degree current views and practices for handling data. That is why I would term it "a" data model and not "the" data model for topic maps. Contrary to popular belief in webland, there are very large and important data sets and models that have little to do with a web-centric view of the world.

This obviously leaves out a lot of parts/issues but it is a start, 
rather than talking about a start.

Patrick

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu
Co-Editor, ISO Reference Model for Topic Maps