URL +1, LSID -1, or why I don't care about the semantic web

There has been an ongoing debate in the semantic web community, for many years now, about whether identifiers for resources should be URLs or URNs. This debate has recently flared up on the HCLS mailing list in the form of URLs+1 LSIDs -1, and since some of you not in the know may be wondering what all the fuss is about, in this post I will try to explain why identifiers must be URLs and why the semantic web is broken by design.

As an example, take the Uniprot record for SERPINC1:

http://www.expasy.org/uniprot/P01008

From the browser's perspective the first significant portion of this string is everything before the first colon: http. HTTP is of course the protocol the browser will use to fetch the HTML text file, or representation of the thing being identified, from the server. It can only do this once it knows the location of the server, so taking the URL and then finding the physical address of the web server is actually step one. DNS is the protocol used for this: in short, a series of name servers arranged in a hierarchy is used to turn the name into a location, in this case the address 192.33.215.47. The browser then sends an HTTP GET request for the page /uniprot/P01008, and what you see in your browser is a webpage detailing information about the protein SERPINC1. This should not be news to anyone.
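The steps above can be sketched in a few lines of Python. This is only an illustration: the helper name describe_fetch is made up for this example, and the DNS lookup is mentioned in a comment rather than performed, so nothing here touches the network.

```python
from urllib.parse import urlsplit


def describe_fetch(url):
    """Split a URL into the pieces a browser needs and build the GET request text.

    Hypothetical helper for illustration; a real browser sends many more headers.
    """
    parts = urlsplit(url)
    request = (
        f"GET {parts.path or '/'} HTTP/1.1\r\n"
        f"Host: {parts.hostname}\r\n"
        "\r\n"
    )
    return parts.scheme, parts.hostname, request


scheme, host, request = describe_fetch("http://www.expasy.org/uniprot/P01008")
print(scheme)   # the protocol to speak
print(host)     # the name handed to DNS, e.g. via socket.gethostbyname(host)
print(request)  # the text sent to the server once its address is known
```

Note that nothing in these pieces says what kind of thing the URL identifies, which is the point of the rest of this post.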

Shifting the example slightly, imagine you are a semantic web agent, and not a human being. The above URL is meaningless, or opaque as the W3C specifications describe it. Suppose further that your agent capabilities include understanding HTTP and DNS. No problem, resolve the name, fetch the resulting document and then figure out what you have just fetched. As an agent you do not know what the document is describing, you can't read like a human [1]. Text mining to the rescue you say, but unfortunately we are dealing with the semantic web. People who use the semantic web would very much like to use URLs to identify things other than web documents, for example biological entities like the actual physical protein SERPINC1 and not an HTML page describing data about SERPINC1. For example

http://www.example.com/protein/SERPINC1 is a Protein

Here the URL is identifying the protein SERPINC1, not a webpage about the protein. So as a semantic web agent what do I do with this identifier ? Superficially it is no different to the previous one, but if I try and resolve it what would I expect to get back ? In theory the W3C's family of specifications describing the function of the semantic web should outline a clear solution to this problem. But they don't. They remain vague on this point, and as a consequence people are just making it up as they go along.

One way to avoid the problem of trying to resolve a URL that identifies a concept rather than a webpage would be to ask for agent (or machine) readable metadata about the resource the URL identifies, before trying to resolve it. Is it a webpage ? is it an image ? or is it a concept that has no physical representation on the web. Logically you would also expect the W3C standards to say something about this, but again they don't. These two problems are the fundamental deal breakers for the semantic web. Unless these are solved (specified) clearly, it will never take off, period.

Besides these problems, people have other issues with URLs as identifiers. First among these is that people consider URLs to be locations, not identifiers, and second, they're just not permanent enough. So since we don't want to just locate concepts on the semantic web, we want to name them, and we certainly don't want our names to change, people have gone to the trouble of inventing another identifier system: URNs or uniform resource names. It is at this point we bring LSIDs into the picture, because LSIDs are a URN naming scheme, coupled with a standard way for agents to ask the question: what does an LSID identify, a concept or some representation on the web ?

Thus, LSID solves all of the problems mentioned above: it allows you to use a name to identify a concept or a document on the web. You can use the name with public LSID resolvers to find the location of the document that the LSID identifies or, if it is a concept, you can retrieve machine readable metadata describing the concept. Furthermore, the name and the location are decoupled, meaning that the names are location independent. And finally, if you use the LSID specification, the identifiers you mint and the documents/concepts they describe must be permanent and unchanging. The fatal flaw in this carefully laid plan is the need for widespread adoption of LSIDs. This has not occurred in the many years since they have existed, and it is not for want of trying: they were backed by IBM :)

So if it is a reasonable technical solution to the perceived failings of the semantic web, why weren't they adopted ? Because apart from the two deal breakers I mentioned before, the rest are just what I said: perceived failures, not real ones. For example take names vs. locations, remembering the L in URL stands for locator, not location. URLs have the helpful property that they are names that also have a widely adopted infrastructure (DNS) in place that enables agents to resolve them to locations and fetch (HTTP) their representations. In other words a URL helps you locate the thing that it is naming, it doesn't mean that it is the location. Permanence is likewise only a perceived failing; no amount of mandating permanence in a specification will make it so. Put simply it is a human thing. Organizations must be very stable and consistent in how they mint identifiers; usage of a specification cannot make an organization stable, only responsible people working in that organization can.

So what is the way forward ? The W3C must fix the first two problems, mainly by specifying a way to enable agents to get metadata about a URL identifier before they try to resolve it. They can do this by making metadata a first class citizen of the HTTP web. In other words HTTP must have an extra verb to get metadata about a URL and not just get the representation. Such a method, MGET, is elegantly specified in URIQA by Patrick Stickler. The URIQA proposal has been around for quite some time now and, like LSIDs, has not had widespread adoption. The argument against it is that millions of webservers around the world would need to be modified in order to support this new method, making the whole thing a non-starter. This is why the W3C must get behind a solution, because at the end of the day the HCLS mailing list should be about discussing ways to use semantic web technology to improve the data integration situation, not arguing about the fundamental architecture of the semantic web itself. My opinion matters little, since I don't participate in any of these forums. Nonetheless, I conclude that the semantic web is broken, because the W3C has not made metadata a core part of the web and this is why I have lost interest in the semantic web.
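To make the "extra verb" idea concrete, here is a toy sketch using only the Python standard library. It is not an implementation of URIQA: the METADATA table, the plain-text description format and the mget helper are all invented for this example (URIQA itself describes resources in RDF). It only shows that a server can answer a metadata request for a URL without ever serving a representation of it.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metadata store: path -> a description of what the URL names.
METADATA = {
    "/protein/SERPINC1": "type=Protein; has_web_representation=no",
    "/uniprot/P01008": "type=Document; has_web_representation=yes",
}


class MetadataAwareHandler(BaseHTTPRequestHandler):
    def _send(self, status, body):
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    # BaseHTTPRequestHandler dispatches a request with method X to do_X,
    # so defining do_MGET is enough to accept the extra verb.
    def do_MGET(self):
        description = METADATA.get(self.path)
        if description is None:
            self._send(404, "no metadata")
        else:
            self._send(200, description)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def mget(host, port, path):
    """Ask for metadata about a path instead of its representation."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("MGET", path)
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()


# Run the toy server on a free local port, ask one question, shut down.
server = HTTPServer(("127.0.0.1", 0), MetadataAwareHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

status, description = mget("127.0.0.1", port, "/protein/SERPINC1")
print(status, description)

server.shutdown()
server.server_close()
```

An agent that receives has_web_representation=no can move on without issuing a GET at all. The catch is the adoption argument made above: every server would need a handler like do_MGET.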

[1] It could be an image too.
[2] There has recently been a resolution on this issue by the W3C, but the solution is still contentious.


Comments

Being able to resolve

Being able to resolve identifiers may be important for some applications (e.g. a Semantic Web crawler), but for others (e.g. a data integration system) I think it's often irrelevant. What is bothering for this kind of application, however, is when people generate their own URIs for third-party resources, even if they already have a perfectly good, official URI (unfortunately opinions differ on what is "perfectly good", it appears). Would be less of an issue if mappings (e.g. in the form of owl:sameAs statements) were provided. In fact, the ability to do so is what can make the Semantic Web work even when people can't agree on much (to be expected).

Note that in principle there already is a "standard" way to link to different representations: Just return a Web page by default, and include some link rel="alternate" type="..." elements in the header. And then there is also content negotiation... Something more easily and efficiently machine readable would be welcome, but shouldn't hold things back!
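As a rough sketch of the link rel="alternate" convention, an agent could pull the alternate links out of a page it has already fetched. The sample page and the href /uniprot/P01008.rdf are made up for this example, and the class name is arbitrary.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "alternate" in rel_values and attributes.get("href"):
            self.alternates.append((attributes.get("type"), attributes["href"]))


# Hypothetical page: the href below is an assumption, not a real Uniprot URL.
PAGE = """<html><head>
<title>P01008</title>
<link rel="alternate" type="application/rdf+xml" href="/uniprot/P01008.rdf">
</head><body>...</body></html>"""

finder = AlternateLinkFinder()
finder.feed(PAGE)
print(finder.alternates)
```

The agent only learns about the alternate after downloading and parsing the HTML page itself.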


What is bothering for this

What is bothering for this kind of application, however, is when people generate their own URIs for third-party resources, even if they already have a perfectly good, official URI (unfortunately opinions differ on what is "perfectly good", it appears).

I can't see a distributed system not having this property i.e. two people identifying the same resource (concept) with two different URLs. Inverse functional properties and owl:sameAs go some way to solving this, but the problem won't go away and is something the semantic web needs to live with. In communities that have better organization (i.e. the life sciences community) there may be some hope of reaching agreement on them, but on the web at large it will be hard to stop, I agree.
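One way a data integration system might live with this is to fold sameAs statements into equivalence classes and pick one URI per class. The sketch below does this with a small union-find. The three URIs are hypothetical, and choosing the lexicographically smallest URI as the representative is an arbitrary assumption made only to keep the result deterministic.

```python
def canonicalize(same_as_pairs):
    """Map every URI to one representative of its owl:sameAs equivalence class."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Keep the smaller string as the root (arbitrary but deterministic).
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep

    return {uri: find(uri) for uri in parent}


# Hypothetical URIs minted by three different parties for the same protein.
A = "http://www.example.com/protein/SERPINC1"
B = "http://example.org/proteins/P01008"
C = "http://example.net/id/antithrombin"

mapping = canonicalize([(A, B), (B, C)])
for uri, representative in sorted(mapping.items()):
    print(uri, "->", representative)
```

This only helps when somebody actually publishes the sameAs statements, which is the point made in the comment above.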

The idea of link rel="alternate" doesn't quite work for me, and neither does content negotiation. I take both to mean getting an alternate representation of the resource being identified. What you need is a mechanism to get metadata about a resource before you download the resource itself. Otherwise you have a solution that just does not scale. If a semantic web crawler has to download every resource it is interested in and then figure out what it is, a lot of redundant fetches will be made. If there is a mechanism in place for getting metadata about a resource, then resolving non-informational resources will never occur, as the semantic web agent will first request metadata about the resource, learn that it doesn't have a representation on the web, and move on.


The semantic web: will it all end in tiers?

You'll notice most semantic web software stacks (like the one below) have URI in them (not URL or URN)

[Image: semantic web stack diagram, "Will It All End In Tiers?"]

The likes of LSID and URIQA don't feature in these stacks at the moment, so we've got two interpretations of the semantic web: one where every URI is an HTTP gettable URL, another where URIs are URNs. Which is the "right" one?


Don't know about which is

Don't know about which is the "right one", but "gettable" URLs seem to be the more simple and practical solution, for now.


HTTP gettable (or mgettable

HTTP gettable (or mgettable rather) is what they should be. But HTTP does not feature in that stack, and this is what annoys me about the whole issue. The semantic web mandarins have been intentionally vague about the implementation details. I get the feeling that they have gone for an abstract description of the stack on purpose, hoping that the correct implementation would follow i.e. URLs vs. URNs. Even if URLs win, which they seem to be doing, there is still the issue of accessing metadata describing the resource being identified by the URL. This is an implementation detail left out of the stack. So maybe you're right, it will end in tiers, just more of them :)


To GET or not to GET. That's NOT the question...

If I've understood this correctly, the Dubya3C have deliberately used the more vague URI rather than URL in their specifications, because the web uses both URLs and URNs. There are times when you'll want to HTTP GET them and times when you don't.

If you just want to uniquely identify something, URNs are fine. As a colleague of mine once put it, "it's just a f**king string".

For example the Uniform Resource Name URN:ISBN:0805080430 uniquely identifies a book by David Weinberger without any need for HTTP at all thank you very much.

If we want to identify a thing describing David's book then we can transform the URN into a URI http://www.amazon.co.uk/exec/obidos/ASIN/0805080430 or in a million other ways. Why should a scheme for uniquely identifying things be so tightly coupled to HTTP (or anything else for that matter)?
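A minimal sketch of that transformation, assuming a hand-maintained table of URL templates per URN namespace. The RESOLVERS table and function names are invented for this example, and the second template points at a made-up host.

```python
def parse_urn(urn):
    """Split 'urn:<namespace>:<specific part>' into (namespace, specific part)."""
    scheme, namespace, specific = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    return namespace.lower(), specific


# Hypothetical resolver table: one namespace can map to many locations.
RESOLVERS = {
    "isbn": [
        "http://www.amazon.co.uk/exec/obidos/ASIN/{id}",
        "http://books.example.org/isbn/{id}",  # made-up second resolver
    ],
}


def candidate_urls(urn):
    """Return every URL the table can derive from the URN (possibly none)."""
    namespace, specific = parse_urn(urn)
    return [template.format(id=specific) for template in RESOLVERS.get(namespace, [])]


for url in candidate_urls("URN:ISBN:0805080430"):
    print(url)
```

The name itself stays independent of HTTP. The coupling lives entirely in the table, which every agent would have to obtain or maintain for each namespace it meets.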

Where your URIs are URLs that you can GET, I agree, it'd be handy to have a standard getMetadata method. But adding extra methods to HTTP seems like adding extra letters to the alphabet, it might have its uses, but it's going to be "challenging" to implement successfully :)


There are times when you'll

There are times when you'll want to HTTP GET them and times when you don't.

Unless we are talking about something other than the semantic web then no, you always want to use HTTP GET, because if you're not using HTTP GET then we are not talking about the web. And we don't want to just GET the resources, we want to get metadata about the resources. A semantic web crawler cannot function without this mechanism in place. Now Eric Jain is correct, if you're only using URNs in the context of a custom data integration system independent of the web then go-for-your-life, invent as many naming and resolving schemes as you like. But in my opinion, the semantic web, if it ever does exist, will be part of the web. That is, the web as we all naturally understand it: the HTTP gettable web. And therefore we need a mechanism baked into the web for getting metadata about a resource.

The problem with URNs having their own schemes for turning them into a location is that, as you suggest, there are a million ways to do it. I am not interested in coding a semantic web crawler that needs to know specific conventions for turning every URN it encounters into a location and then figuring out a separate mechanism to find out what exactly is being identified.

The semantic web is all about the web. Thus the web must have metadata as a first class citizen, end of story. Yes, you are completely correct, implementing a standard getMetadata method across the web will be "challenging", but this is the mission the W3C is there to pursue.

I mean, if it is not the web, it is just not cricket !


You can't always GET what you want

Let's use another example besides ISBN. The pizza ontology [1] uses several URIs, some of them you can GET, like the one which has some description and versioning information at http://www.co-ode.org/ontologies/pizza/2007/02/12/ and another identifying the OWL serialisation of the ontology at http://www.co-ode.org/ontologies/pizza/2007/02/12/pizza.owl

But it also uses these URIs which you can't GET http://www.co-ode.org/ontologies/pizza/pizza.owl, http://www.co-ode.org/ontologies/pizza/pizza.owl#SpicyPizza and http://www.co-ode.org/ontologies/pizza/pizza.owl#hasSpicines

Why should all these URIs be http gettable? Is this ontology NOT part of the web because we can't GET all the URIs it uses? I don't think so, because if you try sometimes you might find you GET all the URIs you need [2]. As for getting the metadata associated with these URIs, that's a separate issue, I agree with you Greg. It'd be nice to have, but it might take a long time.

  1. Alan Rector, Nick Drummond, Matthew Horridge, Jeremy Rogers, Holger Knublauch, Robert Stevens, Hai Wang and Chris Wroe (2004) OWL Pizzas: Practical Experience of Teaching OWL-DL: Common Errors and Common Patterns EKAW
  2. Mick Jagger and Keith Richards (1968) You can't always (HTTP) GET what you want

Is this ontology NOT part of

Is this ontology NOT part of the web because we can't GET all the URIs it uses?

As you suggest, there are two issues here: the resources themselves being gettable and metadata about the resources being gettable. In the case of resources that have no logical representation on the web (i.e. the Danube river, or p53), then no, they don't need to be gettable (but the metadata describing them does). In the case of an ontology description in OWL, yes they do. The side debate of hash vs. slash is important but not the main issue here.
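On the hash side of that debate, it may help to see what an HTTP client would actually request for the pizza URIs quoted earlier in this thread. The fragment (everything after #) is not sent to the server, so all three URIs lead to a request for the same document. A small illustration using the standard library:

```python
from urllib.parse import urldefrag

uris = [
    "http://www.co-ode.org/ontologies/pizza/pizza.owl",
    "http://www.co-ode.org/ontologies/pizza/pizza.owl#SpicyPizza",
    "http://www.co-ode.org/ontologies/pizza/pizza.owl#hasSpicines",
]

for uri in uris:
    # urldefrag separates the part a server sees from the client-side fragment.
    document, fragment = urldefrag(uri)
    print(f"request: {document}   fragment kept by the client: {fragment!r}")
```

So whether a hash URI is "gettable" comes down to whether the document in front of the # is.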

A semantic web crawler needs to have a mechanism available to it, as a fundamental part of the web infrastructure, to get metadata about a resource, whether it be informational or non-informational (i.e. a concept with no logical representation or an HTML page). Since this does not exist, there is very little hope that the vision of the semantic web as being distributed data integration on the web will ever be realized.


I just can't GET enough

OK, let's recap. If I've understood your arguments correctly, they are something like this:

  1. On the semantic web, you think all URIs should be URLs that you can HTTP GET. Names, like URNs and ISBNs, in your opinion, are NOT part of the semantic web because you can't HTTP GET them.
  2. Without an extra GetMetadata method added to HTTP, this whole idea of the semantic web is a complete non-starter.

The second proposition is interesting, which is why I just can't get enough, long after everyone else has lost interest in this thread :) Why can't the semantic web work with HTTP as it is? The web works OK with HTTP as it is so why can't the semantic web do the same? The current web isn't perfect, but it does work. People can still build and share ontologies on the web without changing HTTP. Reasoners (or "agents" if you prefer), are a key part of the semantic web, and they work perfectly well with HTTP as it is. So why do you think this shortcoming of HTTP is a show-stopper for the whole semantic web? Sorry if I'm banging on and on, but I find this discussion intriguing! I can see why GetMetadata is desirable, but I can't see why you think it's an absolute pre-requisite.


"... are not part of the

"... are not part of the semantic web because you cannot HTTP GET them." Not quite, because you can't HTTP GET metadata about them, or URLs for that matter. As I said, I'm thinking from the perspective of a semantic web crawler/agent. I mean, what is the semantic web, if not the web as we know it but for machines ? In other words, imagine what an autonomous agent would need to navigate and browse the web rather than a human... let's keep going then shall we.

My requirements are:

  1. A distinction between data and metadata.
  2. A universally accepted, scalable protocol for agents to access both data and metadata.
  3. Data and metadata must be identifiable.

Two is probably inclusive of one, but we can refine these as we go.

At the moment the world wide web provides the identifier system (HTTP URIs, which includes naming, resolution and locating), and the protocol for accessing resources identified by URIs: HTTP. What we don't have is a universally accepted, scalable solution to accessing metadata about resources on the web. HTTP does not address this problem, as far as I know.

Of course you can layer any number of solutions on top of HTTP that will achieve roughly that goal (i.e. metadata about a resource). For example, standardize on an HTML linking scheme to point agents to metadata about the resource, transform HTML to RDF (GRDDL), or embed RDF/XML directly in the HTML. However, these solutions do not satisfy the scalability requirement. An agent will need to download each resource it wants to know about *before* it can know about it and it will have to know about each of these mechanisms. Second it does not satisfy the universally accepted requirement, it is likely that different organizations will adopt different standards for linking and/or embedding metadata in a webpage.

Alternatively you can just invent a whole new system for identification, location and resolution that includes a getMetaData facility. This would be LSIDs. Again, LSIDs fail the universally accepted requirement. It is now that I can say that I don't have any real problems with LSIDs themselves or any of the solutions I mentioned previously, it is just that I think the likelihood of adoption is low. This is also true of my preferred solution MGET (or URIQA). Without the W3C getting behind at least one of them and that solution also integrating easily into the existing web infrastructure, then yes, the whole proposition is a non-starter. Sure there will be isolated pockets of activity, but it will never take off in the same way the HTTP web did.

At this point content negotiation, or embedding the metadata URL in a HEAD request, are often touted as solutions. Content negotiation doesn't deal with the metadata/data issue, only different representations of the data, and HEAD requests are not scalable. For example, using content negotiation I may be able to have my agent accept application/rdf+xml, but it is doing this blindly. There is no way for it to know what it will get, other than that its format will be RDF/XML. And even if both the server and the agent implement content negotiation, we are still back to figuring out what the resource is based on downloading the resource, rather than downloading authoritative metadata about the resource. Furthermore, some resources just don't have a logical RDF representation e.g. images, concepts etc. However metadata about those concepts is logically represented in RDF using descriptive ontologies.
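For reference, this is roughly all an agent can say with content negotiation or a HEAD request. The sketch only builds the requests and does not send them, and the helper name negotiated_request is made up for this example.

```python
from urllib.request import Request


def negotiated_request(url, media_type="application/rdf+xml", method="GET"):
    """Build (but do not send) a request that states a preferred format."""
    return Request(url, headers={"Accept": media_type}, method=method)


url = "http://www.example.com/protein/SERPINC1"
get_req = negotiated_request(url)
head_req = negotiated_request(url, method="HEAD")

# The Accept header names a format, not what the resource is.
print(get_req.get_method(), get_req.full_url, get_req.get_header("Accept"))
# A HEAD request asks for the headers of the same resource, one URL at a time.
print(head_req.get_method(), head_req.full_url)
```

Neither request has anywhere to ask "what does this URL identify?", which is the gap described above.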

To get around the "how do you know about a resource before you download it" problem, you can posit centralized google-like resources that can be queried (using SPARQL I guess) by agents for information about resources. It might work, but again, scalability, universally accepted. Also, how does a centralized database help me if I am the authority (owner of the domain identifying that resource) ? I want to provide my own metadata, not what google thinks it is about.

Thus, while HTTP is not the only game in town (think gopher, or ftp) when it comes to building a web of interlinked resources, it is the one most universally accepted. HTTP URIs are also not the only game in town for naming and locating resources (think LSIDs), but they are universally accepted. So HTTP is the web, HTTP URLs are the web, and if this is true, then HTTP needs to be updated to facilitate access to metadata about resources by agents if it is to also be the semantic web.


Brief aside...

Duncan deserves some sort of punning prize, I feel.

Either that or a slap. Haven't decided. :)


Slap me if you've heard this one before

My prize should probably be a slap, for bastardising great works of English literature and music :)