Tumbled Logic

A ragtag blog filled with random technical nuggets, rants, raves, occasional pretty pictures, and links to things.

Oct 22

What weaves the Web?

In Ye Olde Days, before the World Wide Web burst onto the scene in a blaze of CERN-y glory, the Internet was doing quite well for itself.

Sure, it wasn’t the preserve of the average consumer, but it was useful, and it did a lot of the things we use the Internet for today: e-mail, instant messaging (well, okay, talk), finding out about people (finger), and sharing files (ftp). There was lots of other stuff too, of course.

The arrival of the Web brought with it a big change in how the Internet was used. Before the Web, by and large, the Internet was just an extension of the networks which already existed. This was pretty important (and by no means a bad thing), but it did mean that the sense of “out there on the Internet” — the idea of the Internet as almost a place itself — wasn’t really prevalent.

Then there was the Web.

The Web didn’t change things overnight, but it did help frame the power and flexibility of the Internet in terms ordinary people understood. People gradually stopped caring about connecting to computer systems (which could then access the Internet) and instead “connected to the Internet”, and the World Wide Web was instrumental in driving that change. “I can dial into my University and send e-mails and download files and stuff” was useful, but it didn’t have a “wow” factor for most people. “I can connect to the Internet and explore a universe of information”, on the other hand, was something people could understand.

Explore.

The Web removed the friction in travelling between documents and servers — it made exploration easy, which gave way to exploration being fun.

Those of you with long memories may recall that much of what makes the Web work pre-dated it: the dedicated client application, the seamless access to remote resources, the hyperlinked documents, even more-or-less standard mark-up systems that anybody with a text editor could use to produce documents. The Web brought all of these things together, and added something into the mix: the Universal Resource Locator.

The URL (which later paved the way for the Uniform Resource Identifier, or URI) is the bedrock of the World Wide Web. That word “Universal” there isn’t just for show. URLs made it possible to specify a link to a document — or even part of a document — using the same syntax irrespective of the protocol you needed to use to access it, which server hosted it, and whereabouts on that server it lived. Nowadays, most people don’t pay any attention to URLs, but they’re the piece of the puzzle which changed everything.
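You can see those three ingredients — protocol, server, location on the server — directly in the structure of any URL. A quick sketch using Python’s standard library (the URL itself is invented for illustration):

```python
from urllib.parse import urlparse

# A made-up URL, picked purely for illustration
url = "ftp://ftp.example.com/pub/kb/kb90247.zip"
parts = urlparse(url)

print(parts.scheme)   # the protocol to use: "ftp"
print(parts.netloc)   # the server hosting it: "ftp.example.com"
print(parts.path)     # whereabouts on that server it lives: "/pub/kb/kb90247.zip"
```

The same parser handles `http:`, `mailto:`, or anything else — which is rather the point: one syntax, any protocol.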

Before the URL, we’d say something like:

FTP to ftp.microsoft.com. Use ‘anonymous’ as the username and enter your e-mail address as the password, then change to the /pub/kb directory and get kb90247.zip.

or

You can access the files using NFS if you mount phobos:/export/documents

or

You can get it via rcp — sol:/u/u127994/files.tar.gz

or

If you scroll down to Section 4, this is explained in detail.
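For comparison, here’s roughly how those instructions collapse into URLs. The first is a bog-standard FTP URL; `nfs:` URLs were specified in RFC 2224, though support was never widespread; rcp never got a URL scheme at all; and the last becomes a mere fragment identifier (the host names beyond those in the examples above are invented):

```
ftp://ftp.microsoft.com/pub/kb/kb90247.zip
nfs://phobos/export/documents
(no standard URL scheme exists for rcp)
http://example.com/the-document#section4
```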

In all of these cases, the URL encodes all of that information in a single, machine-understandable identifier. With the URL, people don’t need to think or have any special prior knowledge before jumping from A to B. You don’t need to know how a link works — your computer does that for you. With the URL, the Web made the Internet (albeit with some rough edges) something whose power and value ordinary people could grasp, rather than a great big computer network built by a disparate collection of scientists and academics.

Not long after the inception of the Web, people smarter than me started looking at ways to take that further. We have these documents (Web pages) designed to be presented to humans, and we put links in them which people can follow: what if you could publish documents designed to be read by computers, containing links which those computers could follow in turn?

This isn’t as easy as it might sound: when people click on links, they’re making (most of the time!) a conscious decision based upon a whole load of contextual information. To you and me, the meaning of a link in a section named “Categories:” is obvious; a computer following that same link has no such context unless we give it some.

And so there are ways to describe not just what the links are, but what the links are for — ways to describe relationships between documents, and the things discussed in those documents, and ways to tell computers which are reading documents what other documents out there on the Web are talking about the same thing, or similar things, or loosely related things.

We do this with the Resource Description Framework, or the somewhat snappier-sounding “RDF”.

One of the nicer aspects of the Web is that it’s built on HTTP, which stands for “Hypertext Transfer Protocol”. Fortunately (and I say “fortunately” because the modern Web would look very different if this wasn’t the case), HTTP is quite good at providing access to all kinds of resources, not just hypertext documents. Cleverly, it’s even able to serve different kinds of document to different people for the same URL, depending on what they ask for. That is, if an ordinary person types a URL into a web browser they’ll be given a Web page, but if a program designed to consume RDF requests the same resource, it’ll get RDF back instead.
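Under the bonnet, this is HTTP content negotiation: the client sends an `Accept` header listing the media types it understands, and the server picks a representation to match. A minimal sketch of the server-side decision — the media types are real, but the function, its name, and its preference order are my own illustration (and a real implementation would also honour the q-values that browsers send):

```python
def pick_representation(accept_header: str) -> str:
    """Return the media type to serve, based on the client's Accept header.

    Naive sketch: ignores q-values and wildcards, which real servers honour.
    """
    # Media types our hypothetical server can produce, most specific first
    offered = ["application/rdf+xml", "text/turtle", "text/html"]
    # Strip any ";q=..." parameters and whitespace from each requested type
    requested = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in requested:
        if media_type in offered:
            return media_type
    return "text/html"  # sensible default: an ordinary web page

# A browser asks for HTML and gets a web page...
print(pick_representation("text/html,application/xhtml+xml"))  # → text/html
# ...while an RDF-aware client gets RDF back from the very same URL
print(pick_representation("application/rdf+xml"))              # → application/rdf+xml
```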

This is even more useful than it might first appear: it means there’s no need for special “interfaces” to things — getting a machine-readable document happens by the same process as getting a human-readable one. A brilliant example of this is the BBC Programmes site, which has pages for every show, series, episode and clip the BBC has broadcast (or otherwise published) in the last few years. It’s very nice — for humans — and if you’re in the UK, at least, you’ll have probably come across it at some point.

But there’s a deeper picture here. If you take one of those programmes pages and request RDF instead, you’ll get RDF back. And that RDF will contain links to other things, which contain links to other things, and so on and so forth. The RDF you get out of the Programmes site contains all of the information shown on the web pages (actually, in some cases, there’s more!), but it’s structured in a way which can be understood by software. Even software which doesn’t know what an Episode is, for example, can understand enough of the information to know that it’s a thing of some sort which is related to a particular subject.
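To give a flavour of what that structured data looks like, here’s a small RDF fragment in Turtle syntax. The identifiers and details are invented for illustration — the real BBC data uses its own Programmes Ontology — but the shape is the point: typed things, with links to other things:

```
@prefix po:  <http://purl.org/ontology/po/> .
@prefix dc:  <http://purl.org/dc/elements/1.1/> .

<http://example.org/programmes/b0000001#programme>
    a po:Episode ;
    dc:title "An Example Episode" ;
    dc:subject <http://example.org/subjects/astronomy> ;
    po:broadcast <http://example.org/broadcasts/20101022> .
```

Software which has never heard of `po:Episode` can still see a thing with a title, linked to a subject and to further resources it can go and fetch.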

RDF isn’t a magic bullet, of course. There are lots of things that I’d much rather have as a straightforward CSV, for example, than RDF. Even so, RDF brings the promise of something tantalisingly close — the ability for software to make connections between things without needing specialist knowledge, and that paves the way for something even bigger: allowing software to take advantage of the power and flexibility of the Web in the same way that we do today.

