Tumbled Logic

A ragtag blog filled with random technical nuggets, rants, raves, occasional pretty pictures, and links to things.

Jun 22

Hyperdata

Because, even now, some people seem to have difficulty with “Linked Data” meaning “using URIs to identify things which can be dereferenced to retrieve machine-readable information without prior knowledge of URI structure”, I’m hereby going to use Hyperdata (a parallel to hypertext) to refer to just that. Essentially, data with hyperlinks.

(The term “Hyperdata” was coined by Nova Spivack in 2007. Update: Kingsley Idehen points me at an even earlier reference which notes the term’s age.)

Hypertext is a bit simpler than hyperdata, because hypertext is purely about “links between documents”. In contrast, hyperdata is about “references to other things, and a shared convention to publish the information about those things at the referenced locations”.

This is best illustrated with an example.

Many people know that all BBC programmes are assigned a programme identifier, or PID. These PIDs are alphanumeric, and begin with a letter, followed by a number of letters and digits. b0120z2z is the PID for the episode “Luther, Series 2, Episode 1”.

If I give you some data, and it has a field containing b0120z2z, you can’t do anything useful at all with that information without knowing in advance that it’s a BBC PID. I might tell you that it’s a BBC PID, perhaps by calling the field bbc_pid, but you still need to tell your software what that means and what to do with it.

However, BBC programmes have a second, and arguably more important, identifier (at least, outside of the corporation and its partners): they have a URI, too.

To make life a little easier for humans, BBC programmes’ URIs contain the PID in a predictable and well-defined way, and so I can tell you without looking that the URI for b0120z2z is in fact http://www.bbc.co.uk/programmes/b0120z2z#programme.
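As a rough sketch of that convention in Python (the function name programme_uri is mine, not anything the BBC publishes, and the pattern is only the one described above):

```python
def programme_uri(pid: str) -> str:
    """Build the URI that names a BBC programme, given its PID.

    Assumes the pattern described in the text:
    http://www.bbc.co.uk/programmes/<pid>#programme
    """
    return f"http://www.bbc.co.uk/programmes/{pid}#programme"


print(programme_uri("b0120z2z"))
# http://www.bbc.co.uk/programmes/b0120z2z#programme
```

This is a convenience for humans. Software consuming hyperdata shouldn't need to know the pattern at all, which is rather the point.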

To be clear, this isn’t the identifier of a bit of data; it’s the identifier used to refer to the episode itself. It is the machine-interpretable name of the episode.

Now, if I give you some data, and instead of containing b0120z2z that field contains http://www.bbc.co.uk/programmes/b0120z2z#programme, you can do something useful with that. Your software can look at that URI and see that it’s within a scheme which it can dereference: http. Your software can fetch http://www.bbc.co.uk/programmes/b0120z2z#programme and, if all is well, retrieve more data, this time the data about that episode.
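The “can my software dereference this?” check might look something like the following. It's a minimal sketch: the set of schemes and the name can_dereference are assumptions for illustration, and real software may well support more schemes.

```python
from urllib.parse import urlparse

# Schemes this hypothetical client knows how to fetch.
DEREFERENCEABLE_SCHEMES = {"http", "https"}


def can_dereference(value: str) -> bool:
    """Return True if the field value is a URI in a scheme we can fetch."""
    return urlparse(value).scheme in DEREFERENCEABLE_SCHEMES


print(can_dereference("b0120z2z"))
# False: a bare PID carries no instructions for what to do with it
print(can_dereference("http://www.bbc.co.uk/programmes/b0120z2z#programme"))
# True
```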

Here’s how this works. When an HTTP URI such as http://www.bbc.co.uk/programmes/b0120z2z#programme is dereferenced (usually by way of a library such as Curl), this is turned into a request for /programmes/b0120z2z at the www.bbc.co.uk web server, with the Accept request header set to include the MIME types of the data formats you can deal with. Note that the fragment identifier (the #programme) isn’t part of that request, because HTTP is all about plain old documents, rather than things (in our case, episodes).
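To make the split between “thing” and “document” concrete, here's a small sketch using Python's standard library rather than Curl. It only builds the request and doesn't send it; the helper name build_request and the particular Accept value are my own choices for illustration.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def build_request(uri, accept="application/rdf+xml, application/json;q=0.9"):
    """Prepare (but do not send) the HTTP request used to dereference a URI.

    Returns the request and the fragment that was set aside. The fragment
    never goes over the wire.
    """
    document_url, fragment = urldefrag(uri)
    request = Request(document_url, headers={"Accept": accept})
    return request, fragment


request, fragment = build_request(
    "http://www.bbc.co.uk/programmes/b0120z2z#programme"
)
print(request.host)      # www.bbc.co.uk
print(request.selector)  # /programmes/b0120z2z
print(fragment)          # programme
```

Passing the request to urllib.request.urlopen would actually perform the fetch; I've left that out so the sketch doesn't depend on the network.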

In response to this request, /programmes — if it is able — returns a machine-readable document in some format (for example, JSON, or RDF/XML) which describes the episode in question — the one named http://www.bbc.co.uk/programmes/b0120z2z#programme.

Indeed, if you send the appropriate Accept header to retrieve an RDF document, what you’d get back would be a document with a particular section labelled as being “about” /programmes/b0120z2z#programme (which, if you expand to a fully-qualified URI, leads you to the http://www.bbc.co.uk/programmes/b0120z2z#programme that you wanted).

This is what it boils down to: the document at /programmes/b0120z2z contains a section which describes the episode named /programmes/b0120z2z#programme.
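Finding that section could be sketched like this. The sample RDF/XML is invented for illustration and is not the BBC's actual output; the function name find_description is also mine, and the sketch ignores xml:base and only looks at top-level nodes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample document, standing in for what the server might return.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="/programmes/b0120z2z#programme">
    <dc:title>Luther, Series 2, Episode 1</dc:title>
  </rdf:Description>
</rdf:RDF>"""


def find_description(rdf_xml, document_url, thing_uri):
    """Return the top-level node whose rdf:about resolves to thing_uri."""
    root = ET.fromstring(rdf_xml)
    for node in root:
        about = node.get(f"{{{RDF_NS}}}about")
        # Expand the (possibly relative) reference against the document URL.
        if about is not None and urljoin(document_url, about) == thing_uri:
            return node
    return None


node = find_description(
    SAMPLE,
    "http://www.bbc.co.uk/programmes/b0120z2z",
    "http://www.bbc.co.uk/programmes/b0120z2z#programme",
)
print(node.find(f"{{{DC_NS}}}title").text)
```

A proper RDF library would handle far more than this, but the principle is the same: fetch the document, then look inside it for the thing you asked about.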

Of course, there’s nothing stopping you having a single document describing lots of different things, all with their own distinct identifiers — all you need to do is keep the “path” portion of the URI the same and create different fragments for each.

In the Luther example, http://www.bbc.co.uk/programmes/b0120z2x#programme is the URI for one of the versions of this episode of Luther (don’t worry too much about what a “version” really means, it’s just an aspect of how things are prepared and arranged for broadcast), while the first broadcast of that version has the URI http://www.bbc.co.uk/programmes/b0120z2x#p00hdssq. If you attempt to dereference either, you would get exactly the same document back (because the request to the web server would be identical — just /programmes/b0120z2x), but different sections within it contain the information about the version as opposed to the broadcast.
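You can see the “same document, different things” effect without touching the network at all. In this sketch (group_by_document is a made-up helper), the two URIs from the Luther example collapse to one document URL with two fragments:

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_document(uris):
    """Map each document URL to the fragments naming things described in it."""
    documents = defaultdict(list)
    for uri in uris:
        document_url, fragment = urldefrag(uri)
        documents[document_url].append(fragment)
    return dict(documents)


groups = group_by_document([
    "http://www.bbc.co.uk/programmes/b0120z2x#programme",  # the version
    "http://www.bbc.co.uk/programmes/b0120z2x#p00hdssq",   # its first broadcast
])
print(groups)
# {'http://www.bbc.co.uk/programmes/b0120z2x': ['programme', 'p00hdssq']}
```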

And so, this is Hyperdata. Data with links. You can call it “Linked Data”. You can call it “the Semantic Web”.

But, as I keep seeing things which don’t actually have hyperlinks being called “Linked Data”, and lots of ultratheoretical noise about what does and doesn’t constitute “the Semantic Web”, I grow weary of the ambiguity. “Hyperdata”, as far as I know, isn’t ambiguous.

It’s also got “hyper” in the name, so it must be cool.


  1. nevali posted this