Communications Data

The Internet is made of… layers. You may have thought it was tubes, or perhaps cats, but it is in fact layers. Like a cake. A lovely sprawling cake made of binary digits travelling from one place to another in milliseconds. Mmm. Cake.

And do you know what? This is grand. This layering is a big part of why innovative new applications for the Internet stack have been developed so rapidly by a hugely diverse set of people and organisations over the last few decades.

Where things come unstuck is where there’s an attempt to draw the line, in general terms, between “communications data” and “content”. The answer depends entirely upon which layer you happen to be examining, and because the layers are nested inside one another, you can’t get at the headers (“communications data”) of an inner layer without examining the payload (“content”) of the outer layer.

For example, let’s take SMTP — the protocol used to shift e-mail around the Internet. More specifically, let’s take SMTP, carried in a TCP session, over IPv4, on an Ethernet connection.

First up: Ethernet frames have their own header and payload structure. You can read all about it on Wikipedia. Is that a good enough header/payload split for you?
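To make that split concrete, here is a minimal Python sketch of pulling an Ethernet II frame apart. It assumes an untagged frame (no 802.1Q VLAN tag) with the trailing frame check sequence already stripped by the capture tool; the function name `split_ethernet` is made up for illustration.

```python
import struct

def split_ethernet(frame: bytes):
    """Split an untagged Ethernet II frame into (header fields, payload)."""
    if len(frame) < 14:
        raise ValueError("too short to be an Ethernet II frame")
    # 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType.
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    header = {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "ethertype": ethertype,  # 0x0800 means the payload is IPv4
    }
    # Everything after byte 14 is "content" as far as Ethernet is concerned.
    return header, frame[14:]
```

At this layer the "communications data" is a pair of hardware addresses and a type code, and the entire IP packet counts as "content".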

No? Okay, let’s try a little deeper.

The IPv4 packet structure has some pretty interesting information in there if you care deeply about who’s in communication with whom. Well, no. It has some pretty interesting information if you care deeply about which computers are communicating directly with which other computers. You don’t get any context or intent. You do get to find out the source and destination IPs and the IP protocol in use, though (was it TCP? UDP? ICMP? SCTP? something else?).
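As a rough illustration, the fields mentioned above can be read out of an IPv4 header like this. The sketch is Python, the name `split_ipv4` is hypothetical, and it deliberately ignores fragmentation, IP options and checksum validation.

```python
import socket
import struct

def split_ipv4(packet: bytes):
    """Split an IPv4 packet into (header fields, payload)."""
    if len(packet) < 20:
        raise ValueError("too short to be an IPv4 packet")
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_len < 20 or len(packet) < header_len:
        raise ValueError("not a well-formed IPv4 header")
    total_len = struct.unpack("!H", packet[2:4])[0]
    header = {
        "protocol": packet[9],  # 1 = ICMP, 6 = TCP, 17 = UDP, 132 = SCTP
        "src": socket.inet_ntoa(packet[12:16]),
        "dst": socket.inet_ntoa(packet[16:20]),
    }
    # The payload is whatever the next layer up (TCP, UDP, ...) sent.
    return header, packet[header_len:total_len]
```

Note that the input to this function is exactly what the Ethernet layer would call "content".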

Okay, so we have to dig into the “content” of both Ethernet and IPv4? Let’s look at TCP in that case. There are some other snippets of information which may be useful: source and destination ports, flags and checksums, sequencing information… though the sequencing is only useful if you’re going to reassemble the “content” and examine that.
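The TCP fields listed above can be sketched the same way. Again this is illustrative Python with a made-up function name (`split_tcp`), and it skips TCP options and checksum verification entirely.

```python
import struct

def split_tcp(segment: bytes):
    """Split a TCP segment into (header fields, payload)."""
    if len(segment) < 20:
        raise ValueError("too short to be a TCP segment")
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (offset_flags >> 12) * 4  # data offset is in 32-bit words
    if header_len < 20 or len(segment) < header_len:
        raise ValueError("not a well-formed TCP header")
    header = {
        "src_port": src_port,
        "dst_port": dst_port,  # 25 suggests SMTP, but it is only a convention
        "seq": seq,
        "flags": offset_flags & 0x01FF,
        "checksum": checksum,
    }
    return header, segment[header_len:]
```

The port number hints at the application protocol, but nothing here says who sent what to whom at the e-mail level.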

Oh? That’s not enough information?

This is a bit of a turning point: once you start digging into the higher-level protocols layered on top of TCP, you’re no longer just watching packets fly by and snarfing some useful bits of information (albeit at a volume which would make a storage vendor weep with joy).

Instead, you’re going to have to start buffering them, reassembling the payloads in the right order, and keeping hold of enough of the stream of packets to get the information you want.
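A toy version of that buffering looks something like the Python below. It is a sketch under heavy assumptions: one direction of one connection, a known initial sequence number, no sequence-number wraparound, and no overlapping segments. The class name `StreamReassembler` is invented for the example.

```python
class StreamReassembler:
    """Put TCP payloads back in order using their sequence numbers."""

    def __init__(self, initial_seq: int):
        self.next_seq = initial_seq  # sequence number of the next byte we expect
        self.pending = {}            # out-of-order payloads, keyed by sequence number
        self.data = bytearray()      # the contiguous stream recovered so far

    def add(self, seq: int, payload: bytes) -> bytes:
        # Hold on to anything new; retransmissions of delivered data are dropped.
        if payload and seq >= self.next_seq:
            self.pending[seq] = payload
        # Drain every buffered segment that is now contiguous with the stream.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            self.data += chunk
            self.next_seq += len(chunk)
        return bytes(self.data)
```

Every byte held in `pending` and `data` is payload, so from this point on the monitoring box is storing "content" whether it wants to or not.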

So, back to SMTP. One of many high-level protocols which are used on the Internet, and conveniently (for our “communications data” purposes) one of the few left which isn’t routinely encrypted between remote peers.

An SMTP session looks a lot like this — it’s a two-way exchange.

220 mailin0.th.myisp.com ESMTP Exim 4.72 Wed, 22 Aug 2012 18:11:52 +0100
EHLO mystupidcomputername.localnet
250-mailin0.th.myisp.com Hello host-92-22-62-242.as13285.net [92.22.62.242]
250-SIZE 52428800
250-PIPELINING
250 HELP
MAIL FROM:<funkydude@myisp.com>
250 OK
RCPT TO:<davejones@myisp.com>
250 Accepted
250 Accepted
DATA

Aaaaand stop. We’ve got the sender (a Mr. funkydude), the recipient (a davejones), surely that’s all we need?
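Pulling the envelope out of the client's side of that exchange might look like this in Python. It is a sketch rather than a full SMTP parser: it assumes the stream has already been reassembled and split into lines, and the name `read_envelope` is hypothetical.

```python
import re

# Matches "MAIL FROM:<addr>" and "RCPT TO:<addr>", with or without the brackets.
ENVELOPE = re.compile(r"^(MAIL FROM|RCPT TO):\s*<?([^<>\s]*)>?", re.IGNORECASE)

def read_envelope(client_lines):
    """Return (sender, recipients) from the commands sent before DATA."""
    sender = None
    recipients = []
    for line in client_lines:
        line = line.strip()
        match = ENVELOPE.match(line)
        if match:
            if match.group(1).upper() == "MAIL FROM":
                sender = match.group(2)
            else:
                recipients.append(match.group(2))
        elif line.upper() == "DATA":
            break  # stop before the message itself
    return sender, recipients
```

Note that the envelope only names the recipients handled by this particular server; anyone else the message is addressed to appears only in its headers.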

Wait, you want the subject as well? And other MIME headers? Well, we’ll have to keep reading, then.

From: F. Unkydude <funkydude@myisp.com>
To: Dave <davejones@myisp.com>, Alice <alice@mit.edu>, Bob <bob.shandwick@widgets.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Mime-Version: 1.0 (Mac OS X Mail 6.0 \(1485\))
Subject: Re: Barbecue this weekend!
In-Reply-To: <CAGyDWzYrAB7cHn0aADBeF_1o=cEx9sFrgF0h6VxEiARrfrSqUQ@mail.myisp.com>
Date: Wed, 22 Aug 2012 17:36:23 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <D9E57DF5-7AF1-488A-9EE4-B08E3F1BF642@mystupidcomputername.localnet>
References: <CAGyDWzYrAB7cHn0aADBeF_1o=cEx9sFrgF0h6VxEiARrfrSqUQ@mail.myisp.com>
X-Mailer: Apple Mail (2.1485)

Aaaand stop!

But… wait, that wasn’t the message you were interested in? Hm, I guess we’d best just terminate the connection. We’ve already almost certainly read a bunch of the “content” of this communication, unless it just happened to land neatly on a packet boundary.
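The packet-boundary problem is easy to see in code. This Python sketch separates the headers from whatever else has already arrived in the buffer; it assumes CRLF line endings and uses the standard library's `email` module, and the function name `split_headers` is made up.

```python
from email import message_from_bytes

def split_headers(stream: bytes):
    """Return (parsed headers, leftover body bytes) from a reassembled DATA stream."""
    head, separator, rest = stream.partition(b"\r\n\r\n")
    if not separator:
        # No blank line yet: the headers are incomplete, so keep buffering.
        return None, b""
    headers = message_from_bytes(head + b"\r\n")
    # Anything in `rest` is message body that we have already read.
    return headers, rest
```

For example, `split_headers(b"Subject: Re: Barbecue this weekend!\r\n\r\nSee you at 3pm")` hands back the subject header along with the first bytes of the body, simply because they arrived in the same segment.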

Except we can’t do that, because to keep things nice and efficient, many protocols (SMTP amongst them) reuse connections, which means sending multiple messages down a single TCP/IP connection.

…which means that we have to keep reading, in case there’s another message which we do care about being sent after the one we don’t. Which we can’t do without processing the content, even if we promise solemnly not to keep any of it. Not that you’re allowed to confirm whether I do or not, of course, because it would breach commercial confidentiality. And in any case, I could just upgrade the units in the field anyway, changing how they work after you’ve inspected them to satisfy yourself that they’re not doing anything naughty.
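Here is a rough Python sketch of what "keep reading" involves. It follows the client's lines across a whole session, collecting an envelope per message, and counts how many content lines it had to look at along the way. It assumes already-reassembled lines, ignores dot-stuffing details and server replies, and the names `track_session` and `_address` are invented.

```python
def _address(argument: str) -> str:
    """Pull the address out of '<addr> PARAMS', tolerating missing brackets."""
    parts = argument.split()
    return parts[0].strip("<>") if parts else ""

def track_session(client_lines):
    """Return (envelopes, content_lines_seen) for one SMTP connection."""
    envelopes = []
    current = None
    in_data = False
    content_lines_seen = 0
    for line in client_lines:
        if in_data:
            if line == ".":
                in_data = False  # a lone dot ends the message
            else:
                content_lines_seen += 1  # read, then thrown away
            continue
        upper = line.upper()
        if upper.startswith("MAIL FROM:"):
            current = {"from": _address(line[10:]), "to": []}
            envelopes.append(current)
        elif upper.startswith("RCPT TO:") and current is not None:
            current["to"].append(_address(line[8:]))
        elif upper == "DATA":
            in_data = True
    return envelopes, content_lines_seen
```

The only way to know that a line reading `MAIL FROM:<...>` is a command rather than part of somebody's message is to have tracked every preceding line, content included.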

Even with a messaging-focussed high-level protocol such as SMTP, differentiation between “communications data” and “content” is really hard to pin down. With the popularity of web-based messaging systems, things get a whole lot harder.

The difficulty isn’t really in not being able to capture the sorts of information which might be useful, but in defining — from a technical perspective — precisely what differentiates that information from anything else, and only capturing that.