Don’t call me DOM

7 July 2004

A Semantic Web protocol for updates to a knowledge base

For quite some time now (from our Team-only archives, I find a reference back in September 2002) I have been thinking of deploying the following protocol to update RDF knowledge bases; it takes advantage of RDF mergeability and of the concept of filtering in cwm (the command-line tool for Semantic Web operations developed by the SWAD Team at W3C).

The general idea is to allow one to HTTP POST a chunk of RDF to an RDF file on a Web server and have the RDF chunk added to the file, but with the possibility of filtering who gets to add what type of data. Let’s get a little more into the details…

cwm filtering

When invoked with the --filter=file.n3 command-line option, cwm applies the rules defined in file.n3 to its current RDF store (obtained from the previous command-line arguments) and outputs only the RDF statements matched by those rules. For instance, let’s say that cwm’s RDF store contains the following statements:


<http://www.w3.org/TR/2004/NOTE-grddl-20040413/>        a :NOTE;
         dc:date "2004-04-13";
         dc:title "Gleaning Resource Descriptions from Dialects of Languages (GRDDL)";
         doc:versionOf <http://www.w3.org/TR/grddl/>;
         :cites ;
         :editor  [
             contact:fullName "Dominique Haza\u00ebl-Massieux" ],
                 [
             contact:fullName "Dan Connolly" ] .

Now, if I want to keep only the title and the date of the given document, I would use the following N3 rules as a filter to this store:


{
 ?DOC dc:date ?DATE; dc:title ?TITLE.
} log:implies {
 ?DOC dc:date ?DATE; dc:title ?TITLE.
}.

Of course, the rule above is pretty dumb; seen from a logic point of view, it asserts A ⇒ A, which is useless; but used in a filtering context, it will match only 2 of the statements above:


     <http://www.w3.org/TR/2004/NOTE-grddl-20040413/>
         dc:date "2004-04-13";
         dc:title "Gleaning Resource Descriptions from Dialects of Languages (GRDDL)";

cwm --filter would only output these 2 statements. This can be thought of as a query mechanism for cwm (see how filtering relates to N3QL).
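
To make the filtering idea concrete without cwm itself, here is a minimal pure-Python analogy, a sketch only: triples are tuples, and the A ⇒ A rule becomes "keep the date and title statements of any document that carries both". The predicate strings are shorthand, not real namespace handling.

```python
DC_DATE = "dc:date"
DC_TITLE = "dc:title"

# The store from the example above, reduced to (subject, predicate, object) tuples.
store = [
    ("http://www.w3.org/TR/2004/NOTE-grddl-20040413/", DC_DATE, "2004-04-13"),
    ("http://www.w3.org/TR/2004/NOTE-grddl-20040413/", DC_TITLE,
     "Gleaning Resource Descriptions from Dialects of Languages (GRDDL)"),
    ("http://www.w3.org/TR/2004/NOTE-grddl-20040413/", "ex:editor", "_:b1"),
]

def filter_store(triples):
    """Like the rule above used as a filter: keep only the dc:date and
    dc:title statements of documents that have both."""
    dated = {s for (s, p, _) in triples if p == DC_DATE}
    titled = {s for (s, p, _) in triples if p == DC_TITLE}
    docs = dated & titled
    return [t for t in triples if t[0] in docs and t[1] in (DC_DATE, DC_TITLE)]

filtered = filter_store(store)
# only the date and title statements survive; the editor statement is dropped
```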

Selective write access to an RDF file

So, how is this filtering mechanism possibly useful for an update protocol for RDF resources?

Well, let’s say you have an RDF knowledge base that you want to allow as many people as possible to participate in building; still, most of the time you’ll need to set some restrictions on this access:

  1. you don’t want your knowledge base to become the Great Encyclopedia of Everything – you want to keep it focused on a given topic
  2. nor would you want it to become inconsistent, i.e. to assert that A is both true and false
  3. you could also decide that some types of users should be allowed to update a given type of RDF statement while others should be restricted to another type
  4. some statements in this knowledge base may need to stay whatever input is proposed; others would need to be replaced (e.g. when 2 statements are not compatible, the newer one should update the older one)

While some of these restrictions can be controlled either through well-defined ACLs, or through a user interface enforcing constraints, the former lacks the granularity often needed (as in case 3), and the latter is usually long and cumbersome to develop.

The approach I’m proposing is to filter the proposed input to the RDF file through one or more cwm filters; each of the cases above can be described as N3 rules, applied either directly to the proposed statements (cases 1, 2 and 4), to the metadata that could be attached to the input (e.g. who submitted the new statements, for case 3), or even to the schema/ontology attached to the vocabulary used in the knowledge base (which could be particularly useful for case 4).

For instance, if only people who have been identified as part of the W3C Web Team should be allowed to say that a W3C Technical Report obsoletes a previous one, I could encode this with the following rules:


{
 ?INPUT prot:postedBy [ web:username ?LOGIN ].
 <http://www.example.org/w3c_database/groupMemberships> log:semantics
      [ log:includes { [ web:username ?LOGIN] org:belongs <http://www.w3.org/Systems/db/webId?group=w3t_webteam> } ].
 ?INPUT log:semantics [ log:includes { ?NEWTR doc:obsoletes ?OLDTR } ].
} log:implies {
 ?NEWTR doc:obsoletes ?OLDTR.
}.

(Note that to enable this, we would need to make sure that the code supporting the protocol adds the relevant metadata to the input)
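
To illustrate what that rule does, here is a rough plain-Python sketch of the same check; the group database is reduced to a hypothetical in-memory dict, and all names (group_memberships, filter_obsoletes, the usernames) are invented for illustration, not part of the actual protocol.

```python
WEBTEAM = "http://www.w3.org/Systems/db/webId?group=w3t_webteam"
OBSOLETES = "http://www.w3.org/2000/10/swap/pim/doc#obsoletes"

# Hypothetical stand-in for the groupMemberships document referenced in the rule.
group_memberships = {
    "dom": {WEBTEAM},
    "alice": set(),
}

def filter_obsoletes(input_triples, posted_by):
    """Let doc:obsoletes statements through only when the submitter
    (metadata added by the protocol code) belongs to the webteam group."""
    allowed = WEBTEAM in group_memberships.get(posted_by, set())
    return [t for t in input_triples if t[1] == OBSOLETES and allowed]

proposed = [("http://www.w3.org/TR/new/", OBSOLETES, "http://www.w3.org/TR/old/")]
accepted = filter_obsoletes(proposed, "dom")    # passes: dom is in the webteam
rejected = filter_obsoletes(proposed, "alice")  # empty: alice is not
```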

Linking rules and data

Once these rules are defined, how should they be taken into account in our update protocol?

The obvious idea on the Web is to link them to the data they should be applied to; although there are probably many other ways to link them, I like the idea of having a link directly in the data file, à la:


<> dc:title "My RDF Data";
    prot:updateRules <http://www.example.org/my-update-rules>.

The advantages to this approach are:

  • the person who has full write access to the RDF data only needs to add this link to their file to open it up as needed
  • it allows adding as many rules as needed (although maybe using another property accepting lists would allow rules to be chained)
  • it allows importing rules defined elsewhere and re-using them as needed

Since RDF makes it easy to put both data and metadata in the same context, it looks like a simple enough solution; but there may be better ways to accomplish this.

HTTP-based protocol

Based on the ideas developed above, we can define an HTTP-based protocol to update RDF resources (the beginnings of an implementation of this protocol in Python are provided thereafter):

  • to update a resource R, a client C sends an HTTP POST request with a body B containing RDF to R

    
    POST R HTTP/1.0
    Content-Type: application/rdf+xml
    Content-Length: ...

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rec="http://www.w3.org/2001/02pd/rec54#"
             xmlns:doc="http://www.w3.org/2000/10/swap/pim/doc#">
      <rec:NOTE rdf:about="http://www.w3.org/TR/2004/NOTE-grddl-20040413/">
        <doc:versionOf rdf:resource="http://www.w3.org/TR/grddl/"/>
      </rec:NOTE>
    </rdf:RDF>
  • the web server S responding to C runs B (possibly augmented with metadata from the request, as mentioned above – this would need to be better defined) through one or more filters, referenced from the body of R
  • it merges the output of these filters into the current content of R and saves the result at R (if anything indeed came through the filters)
  • it returns R with an HTTP status code of 200; if the update request is going to be handled asynchronously, it should return a 202 Accepted instead
  • if nothing actually came out of the filter, it may be interesting to send a 403 Forbidden to denote that the update was refused by the server under the given conditions

Of course, in addition to the ACLs you may set thanks to the filter (as proposed above), you can still use the ACL capabilities of your web server to restrict POST access to the given resource.
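
The server-side steps above can be sketched as follows; this is only an outline, with RDF stores reduced to sets of triples, the filters to plain callables, and the merge to set union (which is what an RDF merge amounts to in this toy model). None of this is the actual prototype code.

```python
def handle_update(resource_store, posted_triples, filters):
    """Run the posted statements through each filter, merge whatever
    survives into the resource, and return the protocol's status code."""
    surviving = set()
    for f in filters:
        surviving |= set(f(posted_triples))
    if not surviving:
        return 403, resource_store          # nothing passed: update refused
    return 200, resource_store | surviving  # merge is just set union here

store = {("doc", "dc:title", "My RDF Data")}
only_dates = lambda triples: [t for t in triples if t[1] == "dc:date"]

status, new_store = handle_update(store, [("doc", "dc:date", "2004-07-07")],
                                  [only_dates])
# the date statement passes the filter and is merged, with a 200 response
status2, _ = handle_update(store, [("doc", "dc:creator", "me")], [only_dates])
# nothing came through the filter, so the update is refused with a 403
```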

A prototype implementation in Python

Since I had several occasions where I thought that having such a protocol would be useful for my day-to-day work, I ended up coding a prototype in Python, both because it’s probably my preferred scripting language and because it makes it really easy to integrate cwm code.

I have not made the complete code publicly available, since it has some W3C-specific parts that I think are better not publicized (and are not very interesting to the rest of the world anyway). But I’m including herein a few code snippets that show the gist of it.

The general idea is that HTTP POST requests on a given URI R are proxied to this CGI script, with a uri parameter set to R; then the CGI script handles the protocol as described above.

Modules imported

Besides the classical Python modules, I’m importing some of the modules defined in cwm’s code base, and thus adding them to the module path (note that this code was written in November 2003; cwm’s source code and organization have likely changed a lot since then, so the code is likely broken):


import sys
# to get access to cwm modules
sys.path.insert(0, "/usr/local/lib/python2.2/site-packages/swap")
import os
import string
import urlparse
import cgi
import myStore
import llyn
import notation3
import toXML
from RDFSink import FORMULA
Parsing RDF and N3

This is basically a copy-and-paste from cwm.py, if I remember correctly, with a few simplifications for the purpose of this script; I’m not sure why I copied rather than imported it…


def getParser(format, inputURI, formulaURI, flags=None):
    """Return something which can load from a URI in the given format, while
    writing to the given store.
    """
    global store2
    if format == "rdf":
        from rdflib2rdf import RDFXMLParser
        return RDFXMLParser(store2, inputURI, formulaURI=formulaURI,
                            flags="")
    elif format == "n3":
        return notation3.SinkParser(store2, inputURI, formulaURI=formulaURI)
HTTP Interface

The most important one is the HTTP POST interface, but I also coded a GET one so that the script can be used independently of the redirect.

The (very simple) code below only takes care of loading the uri parameter; the real script should also take care of loading and parsing the incoming RDF in the body of the POST request, while keeping the uri parameter as a GET parameter. When I wrote this code, the RDF input was not yet supposed to be provided in the body of the POST…


fields = cgi.FieldStorage()
if os.environ.has_key("REQUEST_METHOD"):
    if os.environ["REQUEST_METHOD"] == 'GET' and fields.has_key("uri"):
        uri = fields["uri"].value
    elif os.environ["REQUEST_METHOD"] == 'POST' and os.environ.has_key("QUERY_STRING"):
        args = cgi.parse_qs(os.environ["QUERY_STRING"])
        if args.has_key("uri"):
            uri = args["uri"][0]
        else:
            print_output(GET_form)
    else:
        print_output(GET_form)
else:
    print_output(GET_form)
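
In today’s Python, the fuller behaviour described above (the uri kept as a query-string parameter, the RDF payload read from the POST body) could be sketched like this; parse_update_request is a name invented for this sketch, not part of the original script.

```python
from urllib.parse import parse_qs

def parse_update_request(environ, body):
    """Return (uri, rdf_payload) for a POST request, or (uri, None) for a
    GET; the uri comes from the query string in both cases."""
    args = parse_qs(environ.get("QUERY_STRING", ""))
    uri = args["uri"][0] if "uri" in args else None
    if environ.get("REQUEST_METHOD") == "POST":
        return uri, body
    return uri, None

# Example: a POST carrying an (abridged) RDF/XML payload.
uri, payload = parse_update_request(
    {"REQUEST_METHOD": "POST", "QUERY_STRING": "uri=http://example.org/data"},
    b"<rdf:RDF/>")
```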
Loading the filters

So, now that we have a uri available, we load and parse it to find whether there are filters linked from it (note the fictitious RDF property invented for this implementation, http://www.example.org/2003/10/rdfUpdateProtocol#updateRules, which would need to be properly defined):


errors = []
warnings = []
messages = []
# Parsing the input RDF to look up for filters
store = llyn.RDFStore()
myStore.setStore(store)
try:
    f = store.load(uri)
except:
    print "Status: 500 Internal Error"
    print Page % ("""<p class='error'>The <a href="%s">submitted RDF document</a> could not be loaded; make sure it is a valid RDF/XML document.</p>""" % uri)
    sys.exit(1)
messages.append("Parsing <code>%s</code>" % uri)
checkpoint()
this_doc = f.newSymbol(uri)
filterProp = f.newSymbol("http://www.example.org/2003/10/rdfUpdateProtocol#updateRules")
filters = f.each(subj=this_doc, pred=filterProp)
f.close()

This registers all the filters found in the input document in the filters list. We now load and parse them into a different RDF store:


for filter in filters:
    p = getParser("n3", filter.uriref() + "#_formula", uri + "#_formula")
    try:
        p.load(filter.uriref())
    except Exception, inst:
        errors.append("Exception raised when loading filter <code>%s</code>: <code>%s</code>" % (filter.uriref(), inst))
    messages.append("Loading <code>%s</code>" % filter.uriref())
    del(p)

And finally, we apply them to the received RDF Store:


# Thinking about all this
think(workingContext)
checkpoint()
workingContext.close()
workingContext = _newContext

and merge the results to the resource on which the filter was applied:


# Finally, merging the initial data
workingContext.reopen()
p = getParser("rdf", uri + "#_formula3", uri+"#_formula2")
try:
    p.load(uri)
except:
    errors.append("""<a href="%s">Submitted RDF document</a> could not be loaded.""" % uri)
messages.append("Loading <code>%s</code>" % uri)
del(p)
workingContext = workingContext.close()
checkpoint()

The rest of the script mainly took care of publishing the new RDF store in lieu of the original resource.

Open issues

Well, even though my implementation was kind of working, I ended up not using it; one reason was lack of time to invest in it, but the main one was the difficulty of creating a user interface that would rely on this mechanism for updating data.

Indeed, even the most modern browsers can’t POST RDF to a URI through an HTML form as of today, which means it’s pretty hard to interface this system with any reasonable HTML form; the original script had a POST interface that would parse some well-defined URL-encoded parameters as N3 statements, but it appeared pretty quickly that this would be way too cumbersome to include in any real HTML form, so I didn’t pursue it.

Still, I remain convinced this protocol can be pretty useful in various cases:

  • XForms-enabled browsers, where posting RDF/XML is trivial, and can be nicely incorporated in a decent interface
  • command line tools and Web Services-based applications, where using an HTTP interface is efficient
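
For that last case, a command-line client only needs to build the POST request shown earlier. As a sketch, here is the request construction in Python (the path is the GRDDL Note used as an example above; no server is contacted here):

```python
def build_update_request(path, rdf_payload):
    """Build the raw HTTP POST request described by the protocol: an
    application/rdf+xml body targeted at the resource to update."""
    body = rdf_payload.encode("utf-8")
    head = ("POST %s HTTP/1.0\r\n"
            "Content-Type: application/rdf+xml\r\n"
            "Content-Length: %d\r\n"
            "\r\n" % (path, len(body)))
    return head.encode("ascii") + body

req = build_update_request(
    "/TR/2004/NOTE-grddl-20040413/",
    "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>")
```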

Related work

One Response to “A Semantic Web protocol for updates to a knowledge base”

  1. Charles McCathieNevile Says:

    Nice…

    I wonder if it makes sense to look more closely at the terms in the W3C Access Control schema to define these. They define ways to describe access to things of type resource – at the moment there isn’t a standard query language that lets you specify “things with properties XYZ”, so you still end up defining cwm rules in n3, I think.

Dominique Hazaël-Massieux (dom@w3.org) is part of the World Wide Web Consortium (W3C) Staff; his interests cover a number of Web technologies, as well as the usage of open source software in a distributed work environment.