Daniel Boone meets the consistent Web

July 22nd, 2008

[22 July 2008]

My colleague Thomas Roessler writes:

[The monotonic semantics of RDF] guarantee that you won’t run into a world of inconsistency when you discover additional information, and they also guarantee that you can learn things about the world piece by piece.

My evil twin Enrique responds: So let us start with the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is identical to the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which I assume I can express using some predicate like the OWL sameAs.

And now let us discover additional information in another triple store, which contains the information that “The individual denoted by http://www.w3.org/People/cmsmcq/2008/ns1#joe is distinct from the individual named http://www.w3.org/People/cmsmcq/2008/ns2#Josephus”, which it expresses using some predicate like the OWL differentFrom.

I’m having trouble understanding (concludes Enrique) how we can do this without either running into a world of inconsistency (a small world, perhaps, bounded in a nutshell, but still a world big enough for joe and Josephus to be both the same and different), or else running into a world in which we find that “inconsistency” has been defined to have a highly technical meaning under which the two triples just described are not actually inconsistent in the technical sense (why do I expect someone to start lecturing me about Herbrand models any moment now?), even though any application relying on the usual notions of identity and difference may find itself at a loss as to what to make of seeing them both in the same graph.

I reminded Enrique of the American pioneer Daniel Boone, who proudly claimed that he had never been lost in his life. Never? Never. [Pause.] “But I was a mite bewildered once for three days.” [Rimshot.]
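
Enrique's two stores can be sketched in a few lines of Python. This is only an illustration: the triples are plain tuples rather than objects from any real RDF library, the prefixed names (`ns1:joe`, `ns2:Josephus`, `owl:sameAs`, `owl:differentFrom`) are shorthand for the full URIs, and `find_clashes` is a hypothetical helper, not part of any OWL reasoner. It looks only for directly asserted pairs and does no inference.

```python
# Triples as (subject, predicate, object) tuples; a store is a set of them.
SAME = "owl:sameAs"
DIFFERENT = "owl:differentFrom"

store_a = {("ns1:joe", SAME, "ns2:Josephus")}
store_b = {("ns1:joe", DIFFERENT, "ns2:Josephus")}

# Merging is set union: nothing is retracted, nothing is rejected.
merged = store_a | store_b


def find_clashes(triples):
    """Return pairs asserted to be both the same and different.

    Hypothetical helper: checks direct assertions only (in either
    direction) and does no reasoning over transitivity and the like.
    """
    same = {frozenset((s, o)) for s, p, o in triples if p == SAME}
    different = {frozenset((s, o)) for s, p, o in triples if p == DIFFERENT}
    return same & different


clashes = find_clashes(merged)
print(len(merged), "triples;", len(clashes), "clash(es)")
```

The union itself goes through without complaint. Whether the resulting two-triple graph counts as "inconsistent" is exactly the question Enrique is asking.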

Descriptive markup and data integration

July 22nd, 2008

In his enlightening essay Si tacuisses, Enrique …, my colleague Thomas Roessler outlines some specific ways in which RDF’s provision of a strictly monotonic semantics makes some things possible for applications of RDF, and makes other things impossible. He concludes by saying

RDF semantics, therefore, is exposed to criticism from two angles: On the small scale, it imposes restrictions on those who model data … that can indeed bite badly. On the large scale, real life isn’t monotonic …, and RDF’s modeling can’t deal with that….

XML is “dumb” enough to not be subject to either of these criticisms. It is, however, not even trying to address the issues that large-scale data integration and aggregation will bring.

I think TR may both underestimate the degree to which XML (like SGML before it) contributes to making large-scale data integration possible, and overestimate the contribution that can be made to this task by monotonic semantics. To make large-scale data integration and aggregation possible, what must be done? I think that in a lot of situations, the first task is not “ensure that the application semantics are monotonic” but “try to record the data in an application-independent, reusable form”. If you cannot say what the data mean without reference to a single authoritative application, then you cannot reuse the data. If you have not defined an application-independent semantics for the data, then you will experience huge difficulties with any reuse of the data. Bear in mind that data integration and aggregation (whether large-scale or small-) are intrinsically, necessarily, kinds of data reuse. No data reuse, no data integration.

For that reason, I think TR’s final sentence shows an underdeveloped appreciation for the relevant technologies. Like the development of centralized databases designed to control redundancy and store common information in application-independent ways, the development of descriptive markup in SGML helped lay an essential foundation for any form of secondary data integration. Or is there a way to integrate data usefully without knowing anything at all about what it means? Having achieved the hard-won ability to own and control our own information, instead of having it be owned by software vendors, we can now turn to ways in which we can organize its semantics to minimize downstream complications. But there is no need to begin the effort by saying “well, the effort to wrest control of information from proprietary formats is all well and good, but it really isn’t trying to solve the problems of large-scale data integration that we are interested in.”

(Enrique whistled when he read that sentence. “You really want to dive down that rathole? Look, some people worked hard to achieve something; some other people didn’t think highly enough of the work the first people did, or didn’t talk about it with enough superlatives. Do you want to spend this post addressing your deep-seated feelings of inadequacy and your sense of being under-appreciated? Or do you want to talk about data integration? Sheesh. Dry up, wouldja?”)

Conversely, I think TR may overestimate the importance of the contribution RDF, or any similar technology, can make to useful data integration. Any data store that can be thought of as a conjunction of sentences can be merged through the simple process of set union; RDF’s restriction to atomic triples contributes nothing (as far as I can currently see) to that mergeability. (Are there ways in which RDF triple stores are mergeable that Topic Map graphs are not mergeable? Or relational data stores?)
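
A minimal sketch, in Python, of what merging by set union amounts to. The data and all the names are invented for illustration, and nothing here uses an actual RDF library or relational database; the stores are just sets of tuples. The same operator merges the three-place triples and the wider relational-style rows alike.

```python
# Two sources of triples: (subject, predicate, object).
triples_a = {("book1", "title", "Programming in Prolog")}
triples_b = {("book1", "author", "Clocksin"), ("book1", "author", "Mellish")}

# Two sources of relational-style rows: (id, title, year); values invented.
rows_a = {(1, "Report A", 2007)}
rows_b = {(1, "Report A", 2007), (2, "Report B", 2008)}

# Each store is a set of sentences, so merging either kind is a union;
# duplicate sentences collapse on their own.
merged_triples = triples_a | triples_b
merged_rows = rows_a | rows_b

print(len(merged_triples), len(merged_rows))
```

Nothing in the union depends on the tuples having exactly three members, which is the sense in which the restriction to triples seems to add nothing to mergeability.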

And it’s not clear to me that simple mechanical mergeability in itself contributes all that much to our ability to integrate data from different sources. Data integration, as I understand the term, involves putting together information from different sources to achieve some purpose or accomplish some task. But using information to achieve a purpose always involves understanding the information and seeing how it can be brought to bear on the problem. In my experience, finding or making a human brain with the required understanding is the hard part; once that’s available, the kinds of simple automatic mergers made possible by RDF or Topic Maps have seemed (in my experience, which may be woefully inadequate in this regard) a useful convenience, but not always an essential one. It might well be that the data from source A cannot be merged mechanically with that from source B, but an integrator who understands how to use the data from A and B to solve a problem will often experience no particular difficulty working around that impossibility.

I don’t mean to underestimate the utility of simple mechanical processing steps. They can reduce costs and increase reliability. (That’s why I’m interested in validation.) But by themselves they will never actually solve any very interesting problems, and the contribution of mechanical tools seems to me smaller than the contribution of the human understanding needed to deploy them usefully.

And finally, I think Thomas’s post raises an important and delicate question about the boundaries RDF sets to application semantics. An important prerequisite for useful data integration is, it would seem, that there be some useful data worth retaining and integrating. How thoroughly can we convince ourselves that in requiring monotonic semantics RDF has not excluded from its purview important classes of information most conveniently represented in other ways?

RDF, Topic Maps, predicate calculus, and the Queen of Romania

July 22nd, 2008

[22 July 2008]

Some colleagues and I spent time not long ago discussing the proposition that RDF has intrinsic semantics in a way that XML does not. My view, influenced by some long-ago thoughts about RDF, was that there is no serious difference between RDF and XML here: from interesting semantics we learn things about the real world, and neither the RDF spec nor the XML spec provides any particular set of semantic primitives for talking about the world. The maker of the vocabulary can (I oversimplify slightly, complexification below) make terms mean pretty much anything they want: this is critical both to XML and to RDF. The only way, looking at an RDF graph or the markup in an XML document, to know whether it is talking about the gross national product or the correct way to make adobe, is to look at the documentation. This analysis, of course, is based on interpreting the proposition we were discussing in a particular way, as claiming that in some way you know more about what an RDF graph is saying than you know about what an SGML or XML document is saying, without the need for human intervention. Such a claim seems patently false, but as far as I can tell it is what some of my colleagues have been trying to persuade me of for years.

(I should point out that if one understands the vocabulary used to define classes and subclasses in the RDF graph, of course, the chances of hitting upon useful documentation are somewhat increased. If you don’t know what vug means, but know that it is a subclass of cavity, which in turn is (let’s say) a subclass of the class of geological formations, then even if vug is otherwise inadequately documented you may have a chance of understanding, sort of, kind of, what’s going on in the part of the RDF graph that mentions vugs. I was about to say that this means one’s chances of finding useful documentation may be better with RDF than with naked XML, but my evil twin Enrique points out that the same point applies if you understand the notation used to define superclass/subclass relations [or, as they are more usually called, supertype/subtype relations] in XSD [the XML Schema Definition Language]. He’s right, so the ability to find documentation for sub- and superclasses doesn’t seem to distinguish RDF from XML.)

This particular group of colleagues, however, had (for the most part) a different reason for saying that RDF has more semantics than XML.

Thomas Roessler has recently posted a concise but still rather complex statement of the contract that producers of RDF enter into with the consumers of RDF, and the way in which it can be said to justify the proposition that RDF has more semantics built-in than XML.

My bumper-sticker summary, though, is simpler. When looking at an XML document, you know that the meaning of the document is given by an interaction of (1) the rules for interpreting the document shaped by the designer of the vocabulary and by the usage of the document creator with (2) the actual content of the document. The rules given by the vocabulary designer and document author, in turn, are limited only by human ingenuity. If someone wants to specify a vocabulary in which the correct interpretation of an element requires that you perform gematriya on the element’s generic identifier (element type name, as the XML spec calls it) and then feed the resulting number into a specific random number generator as a seed, then we can say that that’s probably not good design, but we can’t stop them. (Actually, I’m not sure that RDF can stop that particular case, either. Hmm. I keep trying to identify differences and finding similarities instead.)

(Enrique interrupted me here. “Gematriya?” “A hermeneutic tool beloved of some Jewish mystics. Each letter of the alphabet has a numeric value, and the numerical value for a concept may be derived from the numbers of the letters which spell the word for the concept. Arithmetic relations among the gematriya for different words signal conceptual relations among the ideas they denote.” “Where do you get this stuff? Reading Chaim Potok or something?” “Well, yeah, and Knuth for the random-number generator, but there are analogous numerological practices in other traditions, too. Should I add a note saying that the output of the random number generator is used to perform the sortes Vergilianae?” “No,” he said, “just shut up, would you?”)

In RDF, on the other hand, you do know some things.

  1. You know the “meaning” of the RDF graph can be paraphrased as the conjunction of a set of declarative sentences.
  2. You know that each of those declarative sentences is atomic and semantically independent of all others. (That is, RDF allows no compound structures other than conjunction; it differs in this way from programming languages and from predicate logic — indeed, from virtually all formally defined notations which require context-free grammars — which allow recursive structures whose meaning must be determined top-down, and whose meaning is not the same as the conjunction of their parts. The sentences P and Q are both part of the sentence “if P then Q”, but the meaning of that sentence is not the same as the conjunction of the parts P and Q.)
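
The two facts can be pictured with a small Python sketch. It is an illustration under stated assumptions, not a rendering of the RDF model theory: a "world" is simply the set of triples taken to be true, `graph_holds` and `implies` are hypothetical helpers, and the class names are borrowed from the vug example above.

```python
from itertools import product


def graph_holds(graph, world):
    """A graph is true in a world iff every one of its triples is."""
    return all(triple in world for triple in graph)


world = {
    ("vug", "subClassOf", "cavity"),
    ("cavity", "subClassOf", "formation"),
}
graph = set(world)

# Any subset of a true graph is also true, so each triple can be read
# (and believed) in isolation from the others.
subgraph = {("vug", "subClassOf", "cavity")}
assert graph_holds(graph, world) and graph_holds(subgraph, world)


def implies(p, q):
    """Material conditional: 'if P then Q'."""
    return (not p) or q


# The conditional contains P and Q as parts, but its truth table is not
# that of their conjunction. List the assignments where the two differ.
differing = [
    (p, q)
    for p, q in product([False, True], repeat=2)
    if implies(p, q) != (p and q)
]
print(differing)
```

The list of differing assignments is not empty, which is the point of the parenthesis in item 2: finding P inside "if P then Q" does not license asserting P.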

When my colleagues succeeded in making me understand that on the basis of these two facts one could plausibly claim that RDF has, intrinsically, more semantics than XML, I was at first incredulous. It seems a very thin claim. Knowing that the graph in front of me can be paraphrased as a set of short declarative sentences doesn’t seem to tell me what it means, any more than suspecting that the radio traffic between spies and spymasters consists of reports going one direction and instructions going the other tells us how to crack the code being used. But as Thomas points out, these two facts are fairly important as principles that allow RDF graphs to be merged without violence to their meaning, which is an important task in data integration. Similar principles (or perhaps at this level of abstraction they are the same principles) are important in allowing topic maps to be merged safely.

Of course, there is a flip side. If a notation restricts itself to a monotonic semantics of this kind (in which no well formed formula ever appears in an expression without licensing us to throw away the rest of the expression and assume that the formula we found in it has been asserted), then some important conveniences seem to be lost. I am told that for a given statement P, it’s not impossible to express the proposition “not P” in RDF, but I gather that it does not involve any construct that resembles the expression for P itself. And similarly, constructions familiar from sentential logic like “P or Q”, “P only if Q”, and “P if and only if Q” must all be translated into constructions which do not contain, as subexpressions, the expressions for P or Q themselves.

At the very least, this seems likely to be inconvenient and opaque.

Several questions come thronging to the fore whenever I get this far in my ruminations on this topic.

  • Do Topic Maps have a similarly restrictive monotonic semantics?
  • Could we get a less baroque representation of complex conditionals with something like Lars-Marius Garshol’s quads, in which the minimal atomic form of utterance has subject, verb, object, and who-said-so components, so that having a quad in your store does not commit you to belief in the proposition captured in its triple the way that having a triple in your triple-store does? Or do quads just lead to other problems?
  • If we accept as true my claim that XML can in theory express imperative, interrogative, exclamatory, or other non-declarative semantics (fans of Roman Jakobson’s 1960 essay on Linguistics and Poetics may now chant, in unison, “expressive, conative, meta-lingual, phatic, poetic”, thank you very much, no, don’t add “referential”, that’s the point, the ability to do referential semantics is not a distinguishing feature here), does that fact do anyone any good? The fundamental idea of descriptive markup has sometimes been analysed as consisting of (a) declarative (not imperative!) semantics and (b) logical rather than appearance-oriented markup of the document; if that analysis is sound (and I had always thought so), then presumably the use of XML for non-declarative semantics should be regarded as eccentric and probably not good practice, but unavoidable. In order to achieve declarative semantics, it was necessary to invent SGML (or something like it), but neither SGML nor XML enforces, or attempts to enforce, a declarative semantics. So is the ability to define XML vocabularies with non-declarative semantics anything other than an artifact of the system design? (I’m tempted to say “a spandrel”, but let’s not go into evolutionary biology.)
  • Is there a short, clear story about the relation between the kinds of things you can and cannot express in RDF, or Topic Maps, and the kinds of things expressible and inexpressible in other notations like first-order predicate calculus, sentential calculus, the relational model, and natural language? (Or even a long opaque story?) What I have in mind here is chapter 10 in Clocksin and Mellish’s Programming in Prolog, “The Relation of Prolog to Logic”, in which they clarify the relative expressive powers of first-order predicate calculus and Prolog by showing how to translate sentences from the first to the second, observing along the way exactly when and how expressive power or nuance gets lost. Can I translate arbitrary first-order predicate calculus expressions into RDF? How? Into Topic Maps? How? What gets lost on the way?
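
The question about quads can at least be pictured in code. The Python sketch below is one guess at the shape of the idea, not a rendering of Garshol's actual design: `Quad`, `believed`, the source labels `store-a` and `store-b`, and the trust policy are all hypothetical, and the prefixed names stand in for full URIs.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    verb: str
    obj: str
    source: str  # who said so


store = {
    Quad("ns1:joe", "owl:sameAs", "ns2:Josephus", "store-a"),
    Quad("ns1:joe", "owl:differentFrom", "ns2:Josephus", "store-b"),
}


def believed(quads, trusted):
    """Project out the triples asserted by sources we choose to trust."""
    return {(q.subject, q.verb, q.obj) for q in quads if q.source in trusted}


# Holding both quads commits us to neither claim until we pick our sources.
print(believed(store, trusted={"store-a"}))
```

In this picture the store records who asserted what, and belief is a separate step. Whether that step brings problems of its own is the open half of the question.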

It will not surprise me to learn that these are old well understood questions, and that all I really need to do is RTFM. (Actually, that would be good news: one of the scariest moments of my time at W3C came when not long after my arrival I was talking with some guys from the Semantic Web Activity and asked “So how, in RDF, do I distinguish when I’m talking about some non-network-accessible resources, for example you as a human being, and when I’m talking about the sequence of octets I get back when I dereference that URI?” When they said “That’s a really interesting and important question!” my immediate thought was “It shouldn’t be,” but I managed to bite my tongue hard enough to avoid saying that. Apparently, alas, it still is important.) (In another sense, of course, it would be horrible news to be told to RTFM. I’ve lost count of the times I’ve tried to read Resource Description Framework (RDF) Model and Syntax Specification and given up because I found it so hard to follow. But knowing that there is an FM to read would be comforting in its way, even if I never managed to read it. RDF isn’t really my day job, after all.)

How comfortable can we be in our formalization of the world, when for the sake of tractability our formalizations are weaker than predicate calculus, given that even predicate calculus is so poor at capturing even simple natural-language discourse? Don’t tell me we are expending all this effort to build a Semantic Web in which we won’t even be able to utter counterfactual conditionals?! What good is a formal notation for information which does not allow us to capture a sentence like the one with which Lou Burnard once dismissed a claim I had made:

“If that is the case, then I am the Queen of Romania.”

The OOXML debates (non-combatant’s perspective)

July 22nd, 2008

[21-22 July 2008]

So far, I have managed to avoid participating in the debates over standardizing OOXML, and I don’t plan for that to change. But my evil twin Enrique and I spent some wickedly enjoyable time this afternoon reading a lot of postings in that debate, from a variety of sources, when I should have been working on other things. (“Log it as ‘Professional - continuing education’,” suggested Enrique. I may do that.)

It’s interesting to be able to observe a hard-fought technical battle in which (other people’s) feelings run high but in which one does not have a large personal stake. So many rhetorical maneuvers are familiar; the deterioration of the quality of the argument brings back so many memories of other technical arguments in which (distracted by caring about the outcome) the observer may not have been able to appreciate the rhetorical ingenuity of some of the contributions.

What strikes both Enrique and me is how distinct the styles of argumentation on the various sides of the debate are. We counted three, not two, in this battle, but we could be undercounting.

On one side, there is a class of contributions carefully kept as thoroughly emotionless as possible, focusing exclusively on technical (or at least substantive) issues — even when the contribution was intended to persuade others of a course of action. This seems, at first, an unusual rhetorical choice: I think most advertisers tend to prefer enthusiasm to a studied lack of emotion in trying to sell things. Still, this class includes some of the people whose judgement I have the most reason to respect, and in an over-heated environment a strict objectivity can be immensely attractive.

There is a second class of contributions, which provide a complex mix of a more emotional, excitable, even passionate, style of argumentation, which is however almost always tethered to concrete, verifiable (or falsifiable) propositions about technical properties of OOXML (and ODF), about process issues, and so on. The contributions of this class are by no means always well reasoned or insightful, but they are all recognizably arguments which can be refuted.

And there is a third class, which contains some of the most inventive ad hominem attacks, imaginative name-calling, and insidious smears I have ever seen outside of recent U.S. national electoral politics.

What is striking and puzzling to me is how cleanly the three different rhetorical styles seem to me to map to different positions (let me call them left and right, without mapping left/right into pro/con) on OOXML. If you see a statement that could in principle be verified or falsified by an impartial third party, there is a much better than even chance that it’s from a contribution arguing, let us call it, the left-hand position. And if you see an infuriatingly smug piece which avoids addressing actual technical issues and confines itself to name-calling, slander, and innuendo, there is a very strong chance that it’s taking a right-hand position. (I’m speaking here mostly of bloggers and essayists, not of those who have commented on various blog posts — the blog comments are uniformly smug and infuriating regardless of handedness.)

I have tried not to say explicitly which position each of these styles is associated with, because if Enrique and I are right then all you have to do is (re)read some of the rhetorical barrages of the last year or two to see which is which. (Those of my readers who care about the outcome, or about the health and reputation of the institutions involved, may find this too painful to contemplate. I’m sorry; you don’t have to if you don’t want to.) And if we’re wrong (and we may be — we only had stomach for an afternoon’s worth of the stuff, not more), then there’s no fairness in pointing the finger of blame at just one side for the incivility that can be seen in the discussion of OOXML.

And in any case, as Enrique points out with a certain malicious glee, “Most people who don’t look into it will assume that the merits of the technical arguments must be with the first or second groups, because they don’t descend (or more correctly they descend less often) to slander and name-calling. But there is no rule that says that just because those on one side of an argument argue unfairly or irrelevantly, or act with infuriating disregard of basic rules of courteous technical discussion, then it’s safe to conclude that they have the wrong end of the technical stick, any more than it’s safe to conclude that an invalid argument has reached a false conclusion. Unfairness and low behavior don’t mean people aren’t right in the end.”

Enrique may be right. But watching the OOXML debates serves as a salutary reminder that when some in a technical discussion descend to name-calling and slander (and what better to spice up a blog with?), the animosities created during the process will hover over the result of the decision for a long time.


Memo to self: in future, try to be calmer and more fair in discussions.

(“Yeah,” I hear Enrique mutter. “Leave the dirty work to me.”)

Treating our information with the care it deserves

July 22nd, 2008

[21-22 July 2008]

I don’t make a habit of recording here all of the interesting, useful, or amusing things I read. But I am quite taken with Steve Pepper’s account of the situation in which many large organizations find themselves. In a blog post devoted to a different topic (the history of Norway’s vote on OOXML), he describes (his understanding of) one organization’s point of view and motivations:

They are a big MS Office user, they participated in TC45 (the Ecma committee responsible for OOXML) and they clearly feel that OOXML is important to them.

I can understand why. An enormous amount of their intellectual capital is tied up in proprietary formats – in particular Excel – that have been owned and controlled by a vendor for the last 20 or so years. StatoilHydro has literally had no way of getting at its own information, short of paying license fees to Microsoft. Recently the company has started to realize the enormity of the mistake it has made in not treating its information with the care and respect it deserves.

As he points out, they are of course not alone in having made this mistake, particularly if one includes other proprietary formats beyond Office, and other vendors than Microsoft.

Several points occur to me:

  • It’s easy for me to feel superior and to lack interest in the problems of converting legacy data: I stopped using proprietary formats about twenty years ago, not very long after I had acquired a personal computer and gained the opportunity to start using them in the first place, and so with very few exceptions pretty much every piece of information I have created over my career is still readable. (A prospective collaboration did collapse once when at the end of a full-day meeting, as we were deciding who would draft what piece of the grant proposal, I asked what DTD we would be using, and my soon-to-be-former prospective collaborators said they had planned to be working in a proprietary word processor.) But feeling superior is not really a useful analysis of the situation.

Enrique: Are you saying that there are no proprietary data formats on mainframes? Whom are you trying to kid?

Me: No, but all my mainframe usage was on university mainframes; we don’t seem to have been able to afford any seriously proprietary software, at least any that was interesting to me. I was mostly doing document preparation, and later on database work. And for a while I maintained the terminal translation tables.

Enrique: The what?

Me: Never mind. There used to be things called terminals, and … Sorry I brought it up.

Enrique: And your databases didn’t use proprietary formats?

Me: Internally, sure. But they could all dump the data in a reusable text file format. I think I translated the Spires dump of my bibliographic data to XML once. Or maybe that was just something that went on the Someday pile.

Enrique: You’re right. Feelings of superiority are not really an adequate analysis of a complex situation. Even if the feelings were justified, which in this case, Bucko, does not seem to be the case.

  • The right solution for these organizations is, perhaps, to move away from such closed systems once for all, and use semantically richer markup. Certainly that’s where my immediate sympathies lie. It’s not impossible: lots of organizations use surprisingly rich markup for data they care about.
  • But how are they to get there, starting from where they are now? Even if the long-term benefits are substantial (which is close to self-evident for me, but is likely to sound very unproven to any serious organizational IT person), you have to get through the short term in order to reach the long term. So the ideal migration path starts paying off very quickly, even before you’ve gone very far. (Paoli’s Law: if people put five cents of effort in, they want to see a nickel in return, and quickly.) Can there be such a migration path? Or is going cold turkey the only way to go?
  • The desire to get as much benefit for as little work as possible seems to make everyone with a legacy-data problem easy prey for snake-oil salesmen. I don’t see any prospect of this changing, though, ever.

Enrique: Nah. Snake oil, now there’s a growth stock.

Six-month retrospective and evaluation

July 16th, 2008

[16 July 2008]

This klog started about six months ago, as an experiment. In an early post, I wrote:

So I’m going to start a six-month experiment in keeping a work log. Think of it, dear reader, as my lab notebook. (I was going to do it starting a year ago, but, well, I didn’t. So I’m going to start now.)

My original plan was to make it accessible only to the W3C Team, so that I could talk about things that probably shouldn’t be discussed in public or in member space. Norm Walsh has blown a hole in that idea by pointing to this log [Hi, Norm!]. So public it is. (Ideally, I’d have a blog in which each item could be marked with an ACL, like resources in W3C date space: Team-only, Member-only, World-readable. Maybe later.)

Next year about June, if I remember, I will evaluate the experiment and decide whether it’s been useful for me or not.

So, as one of my teachers used to say at the beginning of a group evaluation of some student work: what works, what doesn’t work?

Things that don’t work as well as I would like:

  • As might have been predicted, the fact that Messages in a Bottle is public, not private, has encouraged me to be circumspect in ways that fight with the lab-notebook goal. I don’t want to be carelessly rude about colleagues or others in public, the way one can be in private conversations and to their faces. Across a dinner table, one can greet a claim made by a colleague with a straightforward cry of “But that’s bullcrap!” without impeding a useful discussion. (This depends in part on the conversational style cultivated by individuals and groups, of course. But as some readers of this post will know, this is not a speculation but a report.) It doesn’t feel quite right, however, to say in public of something proposed by someone acting in good faith that it’s just bullcrap. You have to spend some time thinking of another way to put it. Enrique comes in handy here, since he will say anything. It has not been proven, however, that Enrique will never piss anyone off.
  • For the same reason, I have not yet found a good way of recording issues and concerns I don’t have good answers for. In a lab notebook, or a private conversation, one can talk more forthrightly about things that are going wrong, or things that have gone wrong, and how to right them. But in public, members of a Working Group, and editors of a specification, do better to accept a sort of cabinet responsibility for the work product. You do the best you can to lead the group to what you believe is the right decision, and then you accept the decision and defend it in public. I have not yet found a way to combine the acceptance of that joint responsibility, and the concomitant need to avoid bad-mouthing decisions one is responsible for defending, on the one hand, with forthright analysis of errors on the other. Sometimes careful phrasing can do the job, but any need for care in phrasing constitutes a tax on the writing of posts about tricky subject matter.
  • So try as I might to keep pushing these posts toward being a work log, the genre keeps pushing back and trying to make them into something like a first-person newspaper column. That’s a fine and worthy thing, and I can’t say I don’t enjoy that genre, but it’s not quite what I was aiming for when I started. As a result, one cannot read back through the archives and get the kind of record one wants in a lab notebook, and I’m not sure Messages in a Bottle is working optimally as a means for me to communicate with myself, or with those I work with most closely.

And on the other side, some things do seem to work.

  • At one level of abstraction, the primary goal of this worklog is to improve communication between me and those I work with. There is some evidence, both in the comments here and in other channels, that some of those I work with do read these postings and find them useful, or at least diverting. I have never bothered to try to check the server logs for hit or visitor counts — my guess, based on my Spam Karma 2 reports, is that humans are strongly outnumbered by spambots among my readers, and I’d just as soon not have that demonstrated in quantitative detail — but it’s clear that more people read these posts at least sporadically than I would ever dream of pestering by sending them email meditations on these topics. If they read these posts and derive any insight from the reading, then this klog would appear to have improved communication at least somewhat.
  • It’s probably not actually a bad thing that I think of this as a public space. It makes me a bit more likely to try to write coherently, to supply relevant context, and to do the other things that help ensure that a communication can be read with understanding by readers distant in time, space, sentiment, or context from the author. If I occasionally indulge in a private joke or two, I hope you will bear with me.
  • It’s easier for me to find records of points of view and analyses that have gone into posts here than to find records kept only in files on my hard disk or on paper shoved into the shelves behind me.
  • So far, no one has complained even about the really boring technical discussions about regular grammars, even though it’s clear some of my readers would rather be reading about Enrique.

In sum, I think I believe the experiment can be adjudged modestly successful, and I will continue it for another six months.

Enrique on what RDF gets us

July 14th, 2008

[14 July 2008]

As reported in my previous post, I’ve been thinking about RDF a bit lately. So I’ve decided to dust off some meditations on the subject that originated several years ago.

I was feeding the dogs one evening when Enrique dropped by and complained bitterly about the shortcomings of various colleagues’ attempts to persuade people (including me) of the value of RDF: the overstatement, the misrepresentations of other technologies (both XML and relational databases), the overselling of RDF’s virtues. “True, they would make anyone with any marketing sense tear their hair out,” I said. “But it’s not rational to infer that there are no arguments for RDF, just because its advocates make such a poor show of arguing for it. If you want to understand what RDF does, without overstatement and without mischaracterization of other technologies, why don’t you try constructing a dispassionate account of what RDF does and doesn’t get us?”

Enrique’s response was something like what follows.

It should be noted that Enrique focuses here on RDF itself, not RDF + OWL. OWL was still very new at the time, and Enrique was reacting to years of rhetoric about how RDF, by itself, was semantically richer than XML. I have also corrected a slip or two in Enrique’s original effort; he couldn’t remember the term phatic, for example.

I wonder (Enrique said) if RDF can be summarized in three points:

  • It proposes a way to think about information: there are things, they have properties.
  • It proposes that we use a single universe of names for all individuals: URIs.
  • It provides a single model of property attribution, namely the binary predicate, and thus gives us three well known roles (subject, verb, object, or relation-name, first-argument, second-argument) for participants in relations.

These may be worth some commentary.

It proposes a way to think about information: there are things, they have properties.

There’s no proof that all information, or all knowledge, or all propositions, can be thought of as being about things with properties. In fact, there are many very bright philosophers who deny it outright. But those who deny it don’t provide anything of similar convenience for machine processing.

Formal logic as usually taught today similarly tells us how to talk about things with properties. It’s quite plausible that there are things we can’t express conveniently or at all in formal logic — just look at the mess formal logicians are in trying to justify the truth table for material implication — but just as formal logic can be useful even if there are things it cannot do, so also for any way of talking about things and properties.

Things and properties, as usually considered, don’t capture very well the expressive, conative, metalingual, or phatic aspects of language, as Jakobson calls them (let alone the poetic), just the representational. Again, like logic.

It proposes that we use a single universe of names for all individuals: URIs.

URIs are interesting in part because they are simultaneously a unified set of names and a distributed system. Using them, we can eliminate ambiguity (if URIs are correctly used), though not synonymy.

Contrast naming disciplines in SQL, DTDs, programming languages, first-order predicate calculus, or natural language, with uncoordinated naming.

Contrast also naming disciplines involving central authority (‘use-mine-or-nothing’). If I remember correctly, there are central authorities who control Linnaean nomenclature, and names for specific geological formations, and the names of compounds given in official pharmacopeias.

It provides a single model of property attribution, namely the binary predicate, and thus gives us three well known roles (subject, verb, object, or relation-name, first-argument, second-argument) for participants in relations.

This simplicity, together with the lack of ambiguity in URIs when properly used, means that merger of arbitrary sets of triples is safe and easy. When predicates of arbitrary arity are allowed, merger can be more complex, or less effective, because when two sets of normalized relations are merged in a straightforward way, the result is not necessarily normalized. When things are resolved to triples, they are always in normal form. So the primary reasons for sub-optimal results after merging sets of triples are failures to merge owing to undetected synonymy, entailment relations other than synonymy (variation in specificity), variation in methods of currying n-ary predicates, and orbis-tertius variation.

“Orbis-tertius variation? What on earth are you talking about?” “Nothing on earth! It’s my short-hand way of talking about the radically underdetermined nature of our ad hoc ontologies. It’s a reference to Borges’s story Tlön, Uqbar, Orbis Tertius, my favorite treatise on ontology. Think of it as …” “… kind of an hommage. Right,” I said.

Ontological variation, by contrast, which shows up in variable-arity systems as difference of opinion about just what domains should be regarded as involved in an n-ary relation, does not cause problems for triples.

After he left, I realized I have no idea what Enrique meant by this.
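
To make the point about merging concrete, here is a minimal sketch in Python; it is an illustration only, not anything Enrique or the RDF specifications prescribe. It assumes triples are plain (subject, predicate, object) tuples of strings with no blank nodes, and the example.org URIs and the helper names merge and properties_of are made up for the purpose. Under those assumptions, merging is just set union, and the one failure visible in the data is undetected synonymy.

```python
# Hypothetical data: every URI and property name below is illustrative only.
Triple = tuple[str, str, str]  # (subject, predicate, object)


def merge(*graphs: set[Triple]) -> set[Triple]:
    """Merge triple sets by plain set union; no schema alignment is needed."""
    merged: set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def properties_of(graph: set[Triple], subject: str) -> dict[str, set[str]]:
    """Collect every predicate and its objects recorded for one subject."""
    result: dict[str, set[str]] = {}
    for s, p, o in graph:
        if s == subject:
            result.setdefault(p, set()).add(o)
    return result


JOE = "http://example.org/a#joe"
JOSEPHUS = "http://example.org/b#Josephus"

store_a = {
    (JOE, "http://example.org/v#name", "Joe"),
    (JOE, "http://example.org/v#age", "42"),
}
store_b = {
    (JOE, "http://example.org/v#name", "Joe"),  # same triple as in store_a
    (JOE, "http://example.org/v#email", "joe@example.org"),
    (JOSEPHUS, "http://example.org/v#age", "42"),  # a synonym nobody has declared
}

merged = merge(store_a, store_b)
joe_facts = properties_of(merged, JOE)
josephus_facts = properties_of(merged, JOSEPHUS)
```

Here the merged set holds four distinct triples: the duplicated name triple collapses into one, while the facts about #Josephus stay apart from those about #joe, because nothing in either store says the two names are synonyms.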

Like the element/attribute distinction and child/parent, sibling/next-sib relations in SGML, this is a very thin standardization layer; it means almost nothing (which is why there can be so many sources of semantic variation). And again like the element/attribute model of SGML, that little turns out to be quite a lot, merely because that thin layer of standardization provides hooks that allow software to provide meaningful and useful operations defined in terms of those three roles. These operations can be performed without the software having the least idea of the meaning of the data (which is one reason it is so bizarre that Semantic Web enthusiasts insist so fervently on the implausible claim that the semantics of RDF data are overt in ways the semantics of other formats are not; I suspect the problem is that those particular enthusiasts think the distinction between circles and arrows counts as ‘semantics’).

The semantic advantage of both SGML and RDF over some of their more obvious alternatives is this: precisely because they don’t define a prescribed semantics, the user can model whatever the user is interested in, using the primitive objects and relations built into the system to model whatever they wish to take as the primitive objects and relations of the system they are interested in representing. When this works well, operations on the primitive objects and relations can be used to model operations in the application domain, and the user has the feeling of being able to work ‘directly’ with the concepts of the application domain, with reduced need to pay attention to details of the representation. The ‘universality’ achieved by such a semantically thin layer of primitive notions is exactly parallel to the universality of s-expressions and relations and it is not surprising that the advantages we feel to accrue from XML and/or RDF are very similar to the advantages claimed for s-expressions by Lisp enthusiasts and for the relational model by Codd, Date, and the relational warriors of the 1970s and 1980s.

Because the primitive notions of RDF (things and properties) are explicitly tied to ideas of modeling, they feel (at least to believers) more nearly ‘semantic’ than the notions of other systems (e.g. XML or s-expressions). The thinness of the triple layer can be an advantage, not only in simplifying the universe of possible primitive operations, but also in reducing threshold anxiety. (More elaborate modeling systems invariably require something like a leap of faith; RDF’s tenets are thin enough and bland enough to make its required leap of faith somewhat smaller and less frightening.) And thin as it is, the subject/verb/object model does allow an infrastructure that knows nothing of the semantics of the information to do a lot of useful things, just as the semantics of the relational model allow RDBMS which understand nothing of application semantics to do a lot of useful things.

Some things it does not do (although prominent exponents of the Semantic Web sometimes speak as if it did):

  • provide ‘self-describing data’ (if such a thing exists at all)
  • ensure that ‘the semantics’ of data are always explicit or always understood
  • guarantee that data from different sources are usefully mergeable
  • tell us how to understand, model, formalize our data
  • tell us how to validate our data
  • tell us how to express complex relations clearly (this problem is not only not addressed by RDF; RDF does as much as any notation or model can to render it insoluble)
    This may have been true when Enrique first wrote it, but the situation has perhaps changed with the Note Defining N-ary Relations on the Semantic Web, published in 2006, which recommends that tuples be reified with a gay abandon that might cause even avowed Platonists to pause and wonder whether all of those things really should be treated as individuals by our logic. Determined nominalists may be horrified by the recommendation, but it’s no longer true to say that the RDF community doesn’t say how to handle the problem.
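
One way to picture the kind of reification the Note recommends is the following Python sketch. It is only an illustration under assumptions: the ex: names, the reify helper, and the blank-node labels are hypothetical, and the Note itself describes several patterns rather than this exact code. The idea is that each instance of an n-ary relation becomes a new individual, with one triple per participant.

```python
from itertools import count

Triple = tuple[str, str, str]  # (subject, predicate, object)

# Counter used to mint fresh, hypothetical blank-node labels such as "_:r1".
_fresh_ids = count(1)


def reify(relation: str, roles: dict[str, str]) -> set[Triple]:
    """Turn one n-ary relation instance into triples about a fresh node.

    `relation` names the class of the relation instance; `roles` maps a
    role property to the participant that fills it.
    """
    node = f"_:r{next(_fresh_ids)}"
    triples: set[Triple] = {(node, "rdf:type", relation)}
    for role, participant in roles.items():
        triples.add((node, role, participant))
    return triples


# A three-place relation: who bought what, and for how much.
purchase = reify(
    "ex:Purchase",
    {"ex:buyer": "ex:john", "ex:item": "ex:book1", "ex:price": "15"},
)
```

The three-place purchase becomes four triples that share one subject: the new individual that stands for the purchase itself, which is exactly the sort of thing a nominalist may hesitate to count as an individual.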

Some fans of RDF will perhaps feel that Enrique has shortchanged RDF here, but I have to say that Enrique’s arguments have gone a long way toward making me think RDF could be useful, even if I am still not as committed as I suspect some of my friends in the W3C’s Semantic Web Activity would like.

If this message in a bottle is ever read by anyone, I will be interested to hear back from you on whether you find Enrique’s analysis persuasive.

Eleemosynary RDF

July 14th, 2008

[14 July 2008]

I spent last week in meetings with (among others) a number of enthusiastic proponents of RDF. The meetings were for the most part quite useful and constructive: we spent a lot of time trying to come to grips with the fact that W3C is investing a lot of time and effort in what look like parallel and competing stacks for RDF and XML, and trying to find our way to a simple story about how the two relate. And as one colleague said: no one was trying to “win”, everyone was just trying to understand and solve the problem.

My evil twin Enrique elbowed me in the ribs when he heard this, and suggested that this charitable generalization had some exceptions. You have to make some allowances for Enrique: re-encountering the rhetoric of RDF advocates at close range had put him in a bad mood at what he regards as the tone-deaf style of some arguments for using RDF. A full catalogue would take a long time (and would only lead to bad feeling), but during a lull in the meeting, Enrique whispered to me “Listen! If you listen carefully to the rhetoric, what you hear is that none of these RDF people believe in their hearts that using RDF is useful for the person or institution who actually creates and maintains the data! It’s all about making things easy for other people, about you eating the vegetables so I can eat dessert, about taking one for the team. I bet the records of Nurmengard were kept in RDF!” “Hush!” I said. “People will hear you.” But I have to admit, he had a point.

You should use RDF, the argument frequently goes, because if you do, then we can reuse your data much more conveniently.

When one of the meetings was considering a possible list of speaking points, I suggested (to keep Enrique quiet) that the point about using RDF so that other people could reuse the data more easily might perhaps be recast to suggest that using RDF could help the creators of the data better achieve their own goals. Sometimes the primary goal of data collection is to make the data available for others to use. But often, in the real world, those who collect data do so primarily for their own purposes, and asking them to incur a cost in order that others may benefit seems to require a higher level of altruism than commercial, educational, or governmental institutions always exhibit.

No, a colleague replied emphatically, the point of the semantic web is that you incur costs now so that others can benefit later.

After several days, I’m still uncertain whether he was indulging in sarcasm, irony, or persiflage by parodying my paraphrase of the draft speaking point, or whether he was stone cold serious. At the break, Enrique went out and painted “For the greater good” over the entrance to the building where the meeting was being held, and wrote “Welcome to Nurmengard” on the whiteboard, but thankfully someone erased it before the meeting resumed.

I was reminded of something I once heard Jean Paoli say about persuading people to try a new technology. Using an unfamiliar technology requires an investment of effort, and the user you are trying to persuade needs to see that investment paid back very quickly. If someone puts five cents of effort in, Jean said, they want to see a nickel paid back in return, and preferably right away.

Note: Enrique reminds me that not everyone who reads English can keep the colloquial terms for American coins straight. A “nickel” is a coin worth five cents. (So Jean Paoli was saying people want to break even right away on their effort, not that they want to show a profit.) Oh, and Nurmengard is the prison built by the dark wizard Grindelwald to house his opponents; it had “For the greater good” carved over its entrance. (Enrique says that that is overkill: there isn’t any adult left who hasn’t read the Harry Potter books, so that gloss is unnecessary.)

One of the reasons I found Tom Passin’s talk about his use of RDF persuasive and interesting, last August in Montreal, was that he suggested plausibly that in the situation he described, using RDF might have short-term benefits, not just pie in the sky by and by. I think I’m as interested in long-term benefits as the next person. But a technology seems likely to achieve better uptake if using it brings some benefit to those who use it, independent of the network effect. Why do so many proponents of RDF behave as though they can’t actually think of any benefit of RDF, except the network effect?

Digital Humanities 2008

June 29th, 2008

After two days of drizzle and gray skies, the sun came out on Saturday to make the last day of Digital Humanities 2008 memorable and ensure that the participants all remember Finland and Oulu as beautiful and not (only) as gray and wet and chilly. Watching the sun set over the water, a few minutes before midnight, by gliding very slowly sideways beneath the horizon, gave me great satisfaction.

The Digital Humanities conference is the successor to the annual joint conference of the Association for Computers and the Humanities (ACH) and the Association for Literary and Linguistic Computing (ALLC), now organized by the umbrella organization they have founded, which in a bit of nomenclature worthy of Garrison Keillor is called the Association of Digital Humanities Organizations.

There were a lot of good papers this year, and I don’t have time to go through them all here, since I’m supposed to be getting ready to catch the airport bus. So I hope to do a sort of fragmented trip report in the form of followup posts on a number of projects and topics that caught my eye. A full-text XML search engine I had never heard of before (TauRo, from the Scuola Normale Superiore in Pisa), bibliographic software from Brown, and a whole long series of digital editions and databases are what jump to my mind now, in my haste. The attendance was better than I had expected, and confirmed what some have long suspected: Kings College London has become the 800-pound gorilla of humanities computing. Ten percent of the attendees had Kings affiliations, there was an endless series of reports on intelligently conceived and deftly executed projects from Kings, and Kings delegates seemed to play a disproportionately large role in the posing of incisive questions and in the interesting parts of discussions. There were plenty of good projects done elsewhere, too, but what Harold Short and his colleagues have done at Kings is really remarkable — someone interested in how institutions are built up to eminence (whether as a study in organizational management or because they want to build up some organization) should really do a study of how they have gone about it.

As local organizer, Lisa-Lena Opas-Hänninen has done an amazing job, and Espen Ore’s program committee deserves credit for a memorable program. Next year’s organizers at the University of Maryland in College Park have a tough act to follow.

XSD 1.1 is in Last Call

June 21st, 2008

Yesterday the World Wide Web Consortium published new drafts of its XML Schema Definition Language (XSD) 1.1, as ‘last-call’ drafts.

The idiom has an obscure history, but is clearly related to the last call for orders in pubs which must close by a certain hour. The working group responsible for a specification labels it ‘last call’, as in ‘last call for comments’, to indicate that the working group believes the spec is finished and ready to move forward. If other working groups or external readers have been waiting to review the document, thinking “there’s no point reviewing it now because they are still changing things”, the last call is a signal that the responsible working group has stopped changing things, so if you want to review it, it’s now or never.

The effect, of course, can be to evoke a lot of comments that require significant rework of the spec, so that in fact it would be foolish for a working group to believe they are essentially done when they reach last call. (Not that it matters what the WG thinks: a working group that believes last call is the end of the real work will soon be taught better.)

In the case of XSD 1.1, this is the second last call publication both for the Datatypes spec and for the Structures spec (published previously as last-call working drafts in February 2006 and in August 2007, respectively). Each elicited scores of comments: by my count there are 126 Bugzilla issues on Datatypes opened since 17 February 2006, and 96 issues opened against Structures since 31 August 2007. We have closed all of the substantive comments, most by fixing the problem and a few (sigh) by discovering either that we could not reach consensus on what to do about the problem (or in some cases could not reach consensus about whether there was really a problem before us) or that we could not make the requested change without more delay than seemed warrantable. There are still a number of ‘editorial’ issues open, which are expected not to affect the conformance requirements for the spec or to change the results of anyone’s review of the spec, and which we therefore hope to be able to close after going to last call.

XSD 1.1 is, I think, somewhat improved over XSD 1.0 in a number of ways, ranging from the very small but symbolically very significant to much larger changes. On the small but significant side: the spec has a name now (XSD) that is distinct from the generic noun phrase used to describe the subject matter of the spec (XML schemas), which should make it easier for people to talk about XML schema languages other than XSD without confusing some listeners. On the larger side:

  • XSD 1.1 supports XPath 2.0 assertions on complex and simple types. The subset of XPath 2.0 defined for assertions in earlier drafts of XSD 1.1 has been dropped; processors are expected to support all of XPath 2.0 for assertions. (There is, however, a subset defined for conditional type assignment, although here too schema authors are allowed to use, and processors are allowed to support, full XPath.)
  • ‘Negative’ wildcards are allowed, that is wildcards which match all names except some specified set. The excluded names can be listed explicitly, or can be “all the elements defined in the schema” or “all the elements present in the content model”.
  • The xs:redefine element has been deprecated, and a new xs:override element has been defined which has clearer semantics and is easier to use.

Some changes vis-a-vis 1.0 were already visible in earlier drafts of 1.1:

  • The rules requiring deterministic content models have been relaxed to allow wildcards to compete with elements (although the determinism rule has not been eliminated completely, as some would prefer).
  • XSD 1.1 supports both XML 1.0 and XML 1.1.
  • A conditional inclusion mechanism is defined for schema documents, which allows schema authors to write schema documents that will work with multiple versions of XSD. (This conditional inclusion mechanism is not part of XSD 1.0, and cannot be added to it by an erratum, but there is no reason a conforming XSD 1.0 processor cannot support it, and I encourage makers of 1.0 processors to add support for it.)
  • Schema authors can specify various kinds of ‘open content’ for content models; this can make it easier to produce new versions of a vocabulary with the property that any document valid against the new vocabulary will also be valid against the old.
  • The Datatypes spec includes a precisionDecimal datatype intended to support the IEEE 754R floating-point decimal specification recently approved by IEEE.
  • Processors are allowed to support primitive datatypes, and datatype facets, additional to those defined in the specification.
  • We have revised many, many passages in the spec to try to make them clearer. It has not been easy to rewrite for clarity while retaining the kind of close correspondence to 1.0 that allows the working group and implementors to be confident that the rewrite has not inadvertently changed the conformance criteria. Some readers will doubtless wish that the working group had done more in this regard. But I venture to hope that many readers will be glad for the improvements in wording. The spec is still complex and some parts of it still make for hard going, but I think the changes are trending in the right direction.

If you have any interest in XSD, or in XML schema languages in general, I hope you will take the time to read and comment on XSD 1.1. The comment period runs through 12 September 2008. The specs may be found on the W3C Technical Reports index page.