XML Watch: Tracking provenance of RDF data
Contents:
Storing RDF
Practical implementation
Conclusion
Resources
About the author
RDF tools are beginning to come of age

Level: Intermediate

Edd Dumbill (edd@xml.com)
Editor and publisher, xmlhack.com
21 Jul 2003

When you start aggregating data from around the Web, keeping track of where it came from is vital. In this article, Edd Dumbill looks into the contexts feature of the Redland Resource Description Framework (RDF) application framework and creates an RSS 1.0 aggregator as a demonstration.

A year ago, I wrote a couple of articles for developerWorks about the Friend-of-a-Friend (FOAF) project. FOAF is an XML/RDF vocabulary used to describe -- in computer-readable form -- the sort of personal information that you might normally put on a home Web page, such as your name, instant messenger nicknames, place of work, and so on.

In Listing 6 of my second article on FOAF (see Resources), I demonstrated FOAFbot, a community support agent I wrote that aggregates people's FOAF files and answers questions about them. FOAFbot has the ability to record who said what about whom. When asked what my name was, FOAFbot responded:


edd@xml.com's name is 'Edd Dumbill', 
according to Dave Beckett, Edd Dumbill, 
Jo Walsh, Kip Hampton, Matt Biddulph, 
Dan Brickley, and anonymous source Anon47

The idea behind FOAFbot is that if you can verify that a fact is recorded by several different people (whom you trust), you are more likely to believe it to be true.

Here's another use for tracking provenance of such metadata. One of the major abuses of search engines early on in their history was meta tag spamming. Web sites would put false metadata into their pages to boost their search engine ranking. For this reason, search engines stopped paying attention to meta tags because they were most likely lies. Instead, search engines such as Google found other more sophisticated metrics to rank page relevance.

Looking toward the future of the Web, it will become vital to avoid abuses such as meta tag spamming. Tim Berners-Lee's vision for a Semantic Web (see Resources) aims for a Web where most data is machine-readable, in order to automate much of the information processing currently done by humans.

The potential difficulties of metadata abuse are even larger on the Semantic Web: A Web site would no longer be restricted to making claims only about itself. It could also make claims about other sites. It would be possible, for instance, for one bookstore to make false claims about the prices offered by a competitor.

I won't go into detail on the various security and trust mechanisms that will prevent this sort of semantic vandalism, but I will focus on the foundation that will make them possible: tracking provenance.

Storing RDF
Applications that process and then store RDF do so using triple stores. The RDF/XML input documents are decomposed into a list of (subject, predicate, object) triples, and subsequent processing then takes the form of manipulating and querying the triples in the store. The XML syntax of RDF is used only at points of interchange. Because the data is decomposed directly into triples, XML tools like XPath or XQuery aren't much use. Some people have written RDF processing tools that manipulate the XML syntax directly using XQuery, but I consider this a bit of a red herring for general-purpose RDF processing.
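
To make that concrete, here is a minimal, hypothetical sketch in plain Python -- not any real RDF library's API -- with statements held as 3-tuples and querying done by pattern matching against the list. The find function and sample data are invented for illustration (the data anticipates Listing 2 below):

RSS = "http://purl.org/rss/1.0/"

# Hypothetical triple store: a plain Python list of
# (subject, predicate, object) tuples.
triples = [
    ("http://usefulinc.com/edd/blog",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     RSS + "channel"),
    ("http://usefulinc.com/edd/blog",
     RSS + "title",
     "Edd Dumbill's Weblog: Behind the Times"),
]

def find(store, s=None, p=None, o=None):
    # Return every triple matching the non-None parts of the pattern;
    # this kind of matching, not XPath over the source XML, is how
    # RDF queries work.
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(find(triples, p=RSS + "title"))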

To demonstrate, I'll show you how to use a simple RSS 1.0 document as test data. Recently I set up a weblog site where I force my opinions on the unsuspecting public. To syndicate metadata about my scribblings, I produce an RSS 1.0 file (see Resources for a link). The beginning of the file looks like Listing 1 and should be pretty familiar to anyone who's ever seen RSS.

Listing 1. Excerpt from the beginning of an RSS 1.0 file
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns="http://purl.org/rss/1.0/"
>
  <channel rdf:about="http://usefulinc.com/edd/blog">
    <title>Edd Dumbill's Weblog: Behind the Times</title>
 
    <description>
      Thoughts and comment from Edd Dumbill, technology writer 
      and free software hacker.
    </description>
    <link>http://usefulinc.com/edd/blog</link>

When this is parsed into triples by an RDF processor, I obtain the data shown in Listing 2.

Listing 2. RDF triples corresponding to the beginning of Listing 1
[http://usefulinc.com/edd/blog,
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type,
    http://purl.org/rss/1.0/channel]
[http://usefulinc.com/edd/blog,
    http://purl.org/rss/1.0/title,
    "Edd Dumbill's Weblog: Behind the Times"]
[http://usefulinc.com/edd/blog,
    http://purl.org/rss/1.0/description,
    "Thoughts and comment from Edd Dumbill,
     technology writer and free software hacker."]
[http://usefulinc.com/edd/blog,
    http://purl.org/rss/1.0/link,
    "http://usefulinc.com/edd/blog"]

It is the list of triples shown in Listing 2 that is then acted on by RDF applications. So far, so good. However, when storing this data you can lose track of some important information, namely the place where the data came from and other associated data such as when I took a snapshot of it. To record this information, I need some way to associate that information with the RDF statements I've found.

First, I need to mock up a description of what I might want to say about the data in Listing 2. Listing 3 contains such an example description, using an example namespace invented for this purpose.

Listing 3. Example description of retrieving the data in Listing 1 and Listing 2
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:me="http://example.org/ns/mymeta#">

<rdf:Description rdf:about="http://example.org/retrieval/52365">
    <me:origin rdf:resource="http://usefulinc.com/edd/blog/rss" />
    <me:retrieved>2003-06-24T08:59:55.00Z</me:retrieved>
</rdf:Description>
</rdf:RDF>

You can see from Listing 3 that I've (arbitrarily) invented a URI to represent the 52,365th retrieval of a file. This seems as reasonable a way as any to name each poll of a remote resource. The origin of this resource is the URI of the RSS 1.0 file, and a timestamp denotes when it was found.

Now, all that's left is to store quads rather than triples. Listing 4 shows how I can revise Listing 2 to show all the data I now add into my store.

Listing 4. Triples augmented with a context URI and context metadata
[http://usefulinc.com/edd/blog,
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type,
    http://purl.org/rss/1.0/channel,
    http://example.org/retrieval/52365]
[http://usefulinc.com/edd/blog,
    http://purl.org/rss/1.0/title,
    "Edd Dumbill's Weblog: Behind the Times",
    http://example.org/retrieval/52365]
[http://usefulinc.com/edd/blog,
    http://purl.org/rss/1.0/description,
    "Thoughts and comment from Edd Dumbill,
     technology writer and free software hacker.",
    http://example.org/retrieval/52365]
[http://usefulinc.com/edd/blog,
    http://purl.org/rss/1.0/link,
    "http://usefulinc.com/edd/blog",
    http://example.org/retrieval/52365]
[http://example.org/retrieval/52365,
    http://example.org/ns/mymeta#origin,
    http://usefulinc.com/edd/blog/rss,
    <NULL>]
[http://example.org/retrieval/52365,
    http://example.org/ns/mymeta#retrieved,
    "2003-06-24T08:59:55.00Z",
    <NULL>]

The idea illustrated in Listing 4 is that each RDF statement is augmented with a URI that links it to the metadata I want to store about it. This simple mechanism gives me a lot of power. Besides enabling me to retrieve metadata about the RSS file, it provides a handy way to remove that information from the store.
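
To see the mechanism at work, consider this small hypothetical sketch in plain Python (the class and its methods are invented for illustration, not Redland's implementation). It shows how the context URI makes retracting one retrieval's statements a single operation:

# Hypothetical quad store: each statement carries a context URI.
# (None stands in for the <NULL> context of Listing 4.)
class QuadStore:
    def __init__(self):
        self.quads = []   # list of (subject, predicate, object, context)

    def add(self, s, p, o, context=None):
        self.quads.append((s, p, o, context))

    def remove_context(self, context):
        # Dropping one retrieval's statements is a single filter --
        # the payoff of tracking provenance per statement.
        self.quads = [q for q in self.quads if q[3] != context]

store = QuadStore()
ctx = "http://example.org/retrieval/52365"
store.add("http://usefulinc.com/edd/blog",
          "http://purl.org/rss/1.0/title",
          "Edd Dumbill's Weblog: Behind the Times", ctx)
store.remove_context(ctx)   # retract everything learned in that fetch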

Practical implementation
In working with this idea, I've called the fourth element of the quad above a context. This is by no means a canonical term, but it works well for me. Furthermore, the same term is used in the Redland RDF application framework, which I'm going to use to demonstrate an application of this provenance tracking.

(Incidentally, I owe a debt of gratitude to Dave Beckett, the creator of Redland. When I was writing FOAFbot last year, Redland did not have support for contexts, so I ended up implementing them in a very roundabout fashion. In response to my requests, Dave added support for contexts into his toolkit.)

Redland is a C-based toolkit with many language bindings, including Python, Perl, and Java. It comprises an RDF parser, Raptor, and a data store. The store currently uses Berkeley DB files, but support for underlying SQL stores is underway. For my example, I'll use the Python bindings to Redland.
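
Getting a model up and running with the Python bindings looks roughly like the sketch below. The storage name and options string follow Redland's Berkeley DB conventions as I understand them; treat the details as assumptions and check them against your installed version's documentation:

import RDF  # the Redland Python binding

# Create a Berkeley DB-backed store and a model over it. The
# storage_name and options_string values are assumed from Redland's
# documented conventions and may differ between versions.
storage = RDF.Storage(storage_name="hashes", name="rss",
                      options_string="new='yes',hash-type='bdb',dir='.'")
model = RDF.Model(storage)

# Raptor does the parsing; read a document straight into the model.
parser = RDF.Parser()
parser.parse_into_model(model, "http://usefulinc.com/edd/blog/rss")

print(model.size())   # number of statements now in the store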

The objective here is to create a simple RSS 1.0 aggregator. The purpose of an aggregator is to take repeated snapshots of multiple RSS feeds and allow you to combine them in interesting ways. RSS files change over time as new content items are added -- I want to avoid multiple redundant entries, yet still retain historical items. Later in this article, I'll develop some of the functionality required.

You'll find a link to the Python code for this project (fraggle.tar.gz) in Resources. There is also a link to the Redland toolkit, which you'll need to install as well.

The interesting work of fetching and storing the RDF data happens in the Aggregator class. Listing 5 shows an excerpt from its load_uri method, starting at line 189 of aggregate.py.

Listing 5. Tracking context of the retrieved RDF
stream = self._parser.parse_as_stream(
        RDF.Uri(string="file:./%s" % fname),
        base_uri=urinode.uri)
if stream:
    channel = None
    context = self.context_uri_node()
    timestamp = RDF.Node(literal=
        time.strftime("%Y-%m-%dT%H:%M:%S.00Z"))
    while not stream.end():
        statement = stream.current()
        # add the statement to the model, with context
        self._model.add_statement(statement, context)

        # if it's a <rss:channel> remember the URI
        if ( statement.predicate == _rdfType and
            statement.object == _rssChannel ):
            channel = RDF.Node(node=statement.subject)

        # move on
        stream.next()

    # now to add the context information
    # first, the source URI
    self._model.add_statement(RDF.Statement(
        subject=context, predicate=_fraggieSource,
        object=urinode), _globalContext)
    # second, the channel URI
    self._model.add_statement(RDF.Statement(
        subject=context, predicate=_fraggieChannel,
        object=channel), _globalContext)
    # third, the timestamp
    self._model.add_statement(RDF.Statement(
        subject=context, predicate=_fraggieTimestamp,
        object=timestamp), _globalContext)
    # fourth, the checksum
    self._model.add_statement(RDF.Statement(
        subject=context, predicate=_fraggieChecksum,
        object=RDF.Node(literal=checksum)), _globalContext)
    self.register_fetch(urinode, context)

The key lines in the listing are those that track context. First, the URI for the context is generated using context_uri_node(), which returns a URI of the form http://usefulinc.com/fraggie/global/1. This context is then attached to every statement found in the retrieved RSS. Once the RSS data has been stored, I add data about the context URI itself into the store. In this case, I store away the URI that the RSS file came from, the URI of the RSS channel itself (the value of rdf:about in the rss:channel element), the time I fetched the file, and the MD5 checksum of the file (to determine later whether an RSS file has changed over time).
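
Fraggle's actual checksum code isn't shown in the excerpt; a plausible sketch of the change test, using Python's standard hashlib module (the helper name is invented), might be:

import hashlib

def feed_changed(data, last_checksum):
    # Hypothetical helper: hash the freshly fetched feed bytes and
    # compare with the checksum recorded against the previous context.
    # If they match, the feed is unchanged and no new context is needed.
    checksum = hashlib.md5(data).hexdigest()
    return checksum != last_checksum, checksum

changed, checksum = feed_changed(b"<rdf:RDF>...</rdf:RDF>", None)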

As you will see if you download the source, the rest of the Aggregator class fulfills two functions: it provides the housekeeping methods needed for the RSS spidering, and it provides query methods for those interrogating the aggregator. Note that all housekeeping variables, such as the fetch count, are expressed in RDF and kept in the RDF store. Listing 6 shows the expression of the count variable as an RDF/XML statement.

Listing 6. The internal counter expressed in RDF
<rdf:Description 
    rdf:about="http://usefulinc.com/fraggie/counter"    
    rdf:value="0" />

I have found that it makes sense to persist any variable with a global scope in the RDF store. One immediate advantage of this approach is that it allows state to be preserved over multiple invocations.
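
In quad terms, reading and bumping that counter might look like the following sketch, which reuses the hypothetical QuadStore from earlier rather than fraggle's actual Redland calls:

_counterUri = "http://usefulinc.com/fraggie/counter"
_rdfValue = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"

def next_count(store):
    # Read the persisted counter, remove the old statement, and write
    # the incremented value back -- so the count survives restarts.
    matches = [q for q in store.quads
               if q[0] == _counterUri and q[1] == _rdfValue]
    count = int(matches[0][2]) if matches else 0
    store.quads = [q for q in store.quads if q not in matches]
    store.add(_counterUri, _rdfValue, str(count + 1))
    return count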

The demonstration archive includes a couple of example RSS files to run it against. Invoke the demonstration with python fraggle.py. The demo first makes the aggregator load the example RSS files, both of which reference a recent article by Mark Pilgrim on XML.com. The aim of the exercise is to find out who said what about this article. Running the demo gives the output shown in Listing 7 (some of the output lines have been reformatted for readability).

Listing 7. Output from fraggle.py
Links to http://www.xml.com/pub/a/2003/07/02/dive.html
 
From : Meerkat: An Open Wire Service: XML.com 
       <http://meerkat.oreillynet.com/>
Time : 2003-07-05T15:39:26.00Z
Title: The Vanishing Image: XHTML 2 Migration Issues
Desc :  In Mark Pilgrim's latest Dive Into XML column,
       Pilgrim examines XHTML 2.0 <tt>object</tt>
       element, which is a replacement for the more
       familiar and widely supported <tt>img</tt>.
 
From : paranoidfish.org/links
       <http://www.paranoidfish.org/links/>
Time : 2003-07-05T15:39:25.00Z
Title: XML.com: The Vanishing Image: XHTML 2 Migration
       Issues [Jul. 02, 2003]
Desc : using <object> as a replacement for <img>
       is not a safe bet right now

The first excerpt from XML.com shows the official description. The second excerpt shows the summary provided by the owner of the paranoidfish.org Web site.

If you examine the source in fraggle.py, you will see that all the querying is centered around contexts. First, the aggregator is asked which contexts mention the article's URI. Then article metadata is requested, indexed by context. Note that a context corresponds to a snapshot of a feed in time. If my example were to obtain snapshots of more RSS files over time, you might see multiple entries from the same source -- perhaps with slightly changing descriptions (people often fix spelling mistakes on the fly). One obvious improvement to the query in fraggle.py would be to group by source, rather than pursuing just a time-ordered display.
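
Expressed against the hypothetical QuadStore sketched earlier (fraggle itself works through Redland model calls, and store is assumed to be already populated), the two-step query looks something like this:

def contexts_mentioning(store, uri):
    # Contexts holding at least one statement whose object is the URI.
    return set(q[3] for q in store.quads if q[2] == uri and q[3])

def metadata_for(store, context):
    # Statements made *about* the context node: its provenance record.
    return [q for q in store.quads if q[0] == context]

article = "http://www.xml.com/pub/a/2003/07/02/dive.html"
for ctx in contexts_mentioning(store, article):
    for s, p, o, c in metadata_for(store, ctx):
        print(p, o)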

Though RSS feeds of weblogs and other Internet sites are interesting from a browse-around, ego-surfing perspective, I believe the real value of a project like this lies within the enterprise. Organizations are excellent at generating vast flows of time-sequenced data. To take a simple example, if URIs are allotted to things like customers or projects, RSS feeds of activity could be generated and aggregated.

Such aggregated data could then be easily sliced and diced for whoever was interested. For instance, administrators might wish to find out what each worker has been doing, project managers might want the last three status updates, higher-level management might want a snapshot view of the entire department, and so on. It is not hard to imagine how customer relationship management (CRM) might prove to be an area where tools of this sort would yield great benefits.

Conclusion
The simple example demonstrated in this article only scratches the surface of provenance tracking with RDF. On the Web, where information comes from is just as important as the information itself. Provenance-tracking RDF tools are just beginning to emerge, and as they become more widely used they will no doubt become more sophisticated in their abilities.

The Redland RDF application framework is a toolkit that's definitely worth further investigation. It has interfaces to your favorite scripting language; it runs on UNIX, Windows, and Mac OS X; and it is an open-source project, so any improvements you make could benefit the entire community.

Resources

About the author
Edd Dumbill is managing editor of XML.com and the editor and publisher of the XML developer news site XMLhack. He is co-author of O'Reilly's Programming Web Services with XML-RPC, and co-founder and adviser to the Pharmalicensing life sciences intellectual property exchange. Edd is also program chair of the annual XML Europe conference. You can contact Edd at edd@xml.com.

