There’s been a lot of discussion recently about the NSA eavesdropping programme, which reportedly has been surveilling US citizens without first getting a warrant. In one of these discussions, someone asked:
What’s the worst case scenario? How big could it be?
That’s a really good question. It occurs to me that no one has really attempted to address this yet in layman’s terms, so here goes….
The short answer is: ‘Big. Really really big. Think Google, then add a bit. But fairly simple, too.’
Let’s perform a quick – and necessarily broad – assessment of the state of the art in Information Retrieval and data management technologies, then try to tease out a few reasonable suppositions.
Before I go any further, I have to preface these comments with a few caveats:
- Everything that I’m about to write is based on supposition. I am not now, nor have I ever been, a member of any intelligence agency. I have no certain knowledge about any of the details below.
- I do have experience working in the field of knowledge management, Information Retrieval and natural language processing. I’ve written software applications that have been used to search very large language-based information resources, commonly called ‘corpuses’.
- Much of my professional work requires a fairly detailed knowledge of networking technologies and architectures and their inherent strengths and weaknesses.
- I’m no expert, but my work requires a fairly solid theoretical understanding of computer and network security.
Okay, full disclosure is complete. Let’s take a quick stroll through the state of computers, computer networks and the Internet today. We’ll start by looking at the public search engines.
Google, Yahoo!, Microsoft and About.com all have fairly large and fairly well understood search engines. While they differ in implementation, they all perform essentially the same task:
- Periodically (typically once a week or so) visit every public website on the Internet, and take a copy of every HTML (web) page and image file.
- Store all these files (and metadata, information about these files) in a very large repository, often referred to as a database, but more accurately called an index.
- Perform analysis of all the files and the metadata, and use that information to determine the relation of contents of these files to data contained in other files.
- Respond to requests for information (searches) entered by users of their system. These searches are all based on key words – important terms that the searcher thinks will occur more often in the content they’re interested in.
All of the steps described here are really simple to perform. A moderately talented programmer could write her own simple searchable index in a fairly short time.
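To give you an idea of how little logic the core idea needs, here’s a toy sketch of a searchable index in Python. This is my own illustration, not how Google or anyone else actually builds theirs – real systems are spread across thousands of machines – but the basic shape is the same: record which words appear in which documents, then intersect those lists at search time.

```python
from collections import defaultdict

# A toy inverted index: for each key word, remember which documents contain it.
index = defaultdict(set)

def add_document(doc_id, text):
    """The 'crawl and store' steps, reduced to their essence."""
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Keyword search: return the documents that contain every term in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# A tiny made-up corpus, purely for illustration.
add_document("page1", "black cat crosses the road")
add_document("page2", "the cat sat on the mat")
add_document("page3", "search engines index the web")

print(search("the cat"))   # -> {'page1', 'page2'}
```

The analysis tools of each of the search services vary fairly widely, and are closely guarded trade secrets, but we know enough about them to draw some simple conclusions: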
- Google uses a system called PageRank, which in its simplest form states, ‘A lot of other people link to this page using these key words. If others think this is useful, someone searching for the same keywords will likely find it useful, too.’ (A bare-bones sketch of the idea follows this list.)
- For quite some time, Yahoo! relied heavily on people to sift through the web pages its crawler found, and to categorise the content. It worked for a while, but the Web quickly got too big for this process. It’s reasonable to assume, however, that they applied at the very least the same philosophy of broadly categorising their data in order to reduce the number of pages that have to be searched.
- All of the search engines rely on keyword searches.
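As promised, here’s roughly what the core of the PageRank idea looks like when you boil it down. The little link graph below is made up, Google’s production version is vastly more sophisticated (and secret), and the damping factor of 0.85 is the one suggested in the original PageRank paper; the only point is that ‘pages that important pages link to become important’ fits in a handful of lines.

```python
# A bare-bones PageRank via power iteration, on a made-up four-page link graph.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

damping = 0.85                      # damping factor from the original PageRank paper
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                 # iterate until the scores settle down
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # page 'c' comes out on top
```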
Let’s look at this last point in more detail. Why don’t search engines use something other than keyword searches? Well, the answer is that there are so many web pages around (billions and billions, as Carl Sagan used to say) that they need to keep the processes simple. There simply isn’t enough computing power in the world to cope with really elaborate processes for that much data.
You see, computers are fundamentally different from human brains. They’re restricted to what’s sometimes called classical logic, the kind of thing they teach you in freshman philosophy classes. Computers are really bad at doing things that humans and most animals do really well: determining patterns from multiple sources, and contextualising the information extremely rapidly.
An example: When you see a slight movement out of the corner of your eye, your brain evaluates the data and within a few tenths of a second, your body has put you in a warning stance; you focus on the movement and try to determine whether the cause of the movement requires a fight-or-flight reaction, or whether there’s no cause for alarm. If the moving thing turns out to be a black cat, you might:
- Sigh with relief and relax, because that’s just Bootsie, your house cat.
- Feel vague alarm, because cats give you allergies.
- Ponder for a moment, trying to remember if any of your neighbours owns a black cat, or if you’ve seen any lost cat posters in the neighbourhood recently.
- Cross yourself and spit, because your grandma taught you that black cats are bad luck, and with the way your day has been going, more bad luck is the last thing you need.
As you can see, each of the reactions is derived from a wealth of prior input that your brain has received. The fact that the process described above typically takes no more than a second or two is, from a computer science perspective, nothing short of a miracle. Your brain has sorted through millions of data points and resolved a very complex problem in a fraction of the time it would take a computer.
Nobody quite knows how the brain manages this feat, but we do know that doing the same thing with a computer would require reviewing every single input ever received, and performing a costly (in terms of the number of instructions) comparison between it and the new input. Then it would have to permute all these comparisons to come to some kind of useful conclusion.
It’s not so hard to do if the number of inputs is limited, but doing it with a lot of apparently unassociated inputs quickly increases the ‘problem space’ – the number of possible answers. A simple game of chess involves so many possibilities that the big computer makers use the game as a way of proving whose computer is, uh, bigger, if you know what I mean.
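To see why that gets expensive, here’s the brute-force version in code – entirely my own toy, with a deliberately crude similarity measure – where classifying one new input means comparing it against every input ever stored, so the cost grows with the whole history:

```python
def similarity(a, b):
    """Toy similarity: fraction of shared words. A stand-in for real pattern matching."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

# Everything the system has ever 'seen'. A brain holds millions of these.
memory = [
    "black cat on the fence",
    "neighbour walking a dog",
    "lost cat poster on a lamp post",
]

def classify(new_input):
    # Brute force: one comparison per stored input, then pick the closest match.
    return max(memory, key=lambda past: similarity(new_input, past))

print(classify("a black cat moves in the corner of the garden"))
```

With three memories that’s instant; with millions of them, and millions of new inputs arriving all the time, it very quickly isn’t.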
So search engines like to keep things simple. They’ve got billions of data points to consider, so they can’t really afford to do much with each of them. The thing that makes them really impressive is the fact that they can do anything useful at all with the volume of information that they deal with every day.
Now let’s consider the NSA. These guys are really good. They hire the most talented mathematicians, computer scientists and academic researchers they can find. They reward them by giving them the best toys – I mean, tools that money can buy, and they motivate them by telling them they’re doing the single most important thing an American can do: Keep the country safe by keeping it one step ahead of the Bad Guys.
Is it reasonable to think that the NSA could surveil the entire Internet, the same way Google and co. do? Sure it is. Google runs most of its operations on beefed-up PCs, not very different from the one I’m using to write this diary. The NSA has access to hardware and software that Google can only dream about. It’s reasonable to assume that they can do a much more efficient job than the public search engines.
But there’s a lot more data than just the web to think about. The volume of email messages, telephone calls and live computer links absolutely dwarfs the web, especially because it’s all happening in real time.
So let’s assume that the NSA, in order to keep up with it all, has to be a few times the size of Google. Is that a reasonable assumption? Just barely. Published estimates of the number of computers Google uses to manage its data range between 60,000 and 100,000 machines, spread out over a number of centres world-wide. They take up a lot of space. But that’s okay: if Google needs more elbow room, they just cut another check and pay a large service provider to give them more space in their server room.
The NSA on the other hand has to work in secrecy, which means that they can’t plunk servers down all over heck’s half acre. They do have very large establishments to work in, but they’re constrained by the fact that the information has to be fed into a limited number of locations. Switches these days can send tens of billions of bytes of data a second, but the volume that the NSA might want to look at could easily choke even the biggest pipe.
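A bit of back-of-the-envelope arithmetic shows the scale. If we take ‘tens of billions of bytes a second’ at face value – an illustrative figure, not anything we actually know about the NSA’s plumbing – one ingest point alone is dealing with staggering volumes:

```python
# Rough arithmetic for one very fast ingest point. The 10 GB/s figure is just
# "tens of billions of bytes a second" taken at face value, not a known quantity.
bytes_per_second = 10e9
seconds_per_day = 24 * 60 * 60

per_day = bytes_per_second * seconds_per_day
print(f"One such pipe moves roughly {per_day / 1e12:.0f} TB per day,")
print(f"or about {per_day * 365 / 1e15:.0f} PB per year, per location.")
```

And that’s just what one pipe can carry; funnelling any meaningful slice of the world’s communications into a handful of secure buildings is exactly the bottleneck described above.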
How big then, could the NSA reasonably be? Bigger than Google for sure, with the capacity to perform really simple searches on a significant percentage of the world’s communications. And there’s no reason to believe they’re not doing this. Remember: It’s the NSA’s job to keep an eye on the rest of the world. The only place where they face restrictions is when they listen to people in the US.
Hang on a second – If the NSA operation is so big, then how come we know next to nothing about it? Google and the other search engines can’t visit my website without leaving a trace; how does the NSA, which could easily be much bigger (and busier!) do all this without leaving a trace?
That’s a whole ‘nother article right there. Thankfully, the answer to that has been really well addressed by the ACLU in their article on how the NSA gathers its data.
But there’s another kind of ‘Big’ that needs to be considered: How much can the NSA reasonably do with each bit of information they receive?
The answer to that is a lot tougher. Remember: The NSA is not just running a search engine; it’s trying to recognise potential threats from a bunch of different places – Internet, telephone, satellite and a bunch more. Just like the person above, it’s trying to keep an eye on the corners and watch for suspicious movement.
How effective are they? The answer to that’s bound up in some variables that we can’t know the value of. We can make a few conservative assumptions, though.
- Given enough raw computing power, there are very few cyphers that can’t be broken. So if the NSA is interested enough, they can probably listen to anyone they want to, even if you strongly encrypt your data.
- Given the scale of their operations, they can probably intercept anything they want, but maybe not process it in any depth.
It pretty much comes down to this: You can surveil all of the people some of the time, and you can surveil some of the people all of the time. But you can’t surveil all of the people all of the time. Not even if you’re the NSA.
And this, folks, is the kicker. What the warrantless wiretapping programme seems to consist of is a data mining operation – something that looks a lot like what the public search engines do, only the key words that are being searched for are much more limited. It’s kind of like Google, only with a ‘TerroristRank’ system, instead of PageRank.
It’s reasonable to assume that it would work something like this (a toy version in code follows this list):
- Use Google-ish key word- and pattern-based methods to winnow out the wheat from the chaff. Most of LiveJournal and MySpace can quickly be discarded, for example – at least until they embark on the War On Angst.
- Perform more complex, ‘human like’ analysis on the small percentage of content that passes the first key word hurdle.
- Look for associations between that material and other known dangerous locations and content.
- Pass the last few records on to human beings, who are still vastly more efficient at pattern matching and contextualisation.
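Here’s what such a funnel might look like, reduced to a toy sketch in Python. The watch list, the scoring rules and the thresholds are all invented for illustration; the only thing worth noticing is the shape of the pipeline – cheap filters first, expensive analysis on the survivors, human beings at the very end.

```python
# A toy version of the funnel described above. Every term, score and threshold
# here is invented for illustration; only the shape of the pipeline matters.
WATCH_WORDS = {"detonator", "safehouse", "wire transfer"}
KNOWN_BAD_CONTACTS = {"suspicious@example.net"}

def keyword_filter(message):
    """Stage 1 (cheap): discard anything that never mentions a watch-list term."""
    text = message["text"].lower()
    return any(term in text for term in WATCH_WORDS)

def deep_score(message):
    """Stage 2 (expensive): stand-in for 'human-like' analysis, here a crude count."""
    text = message["text"].lower()
    return sum(text.count(term) for term in WATCH_WORDS)

def has_bad_association(message):
    """Stage 3: is the sender or recipient already linked to known bad content?"""
    return bool({message["from"], message["to"]} & KNOWN_BAD_CONTACTS)

def triage(messages, score_threshold=2):
    """Run the funnel; whatever survives lands on a human analyst's desk."""
    survivors = [m for m in messages if keyword_filter(m)]
    return [m for m in survivors
            if deep_score(m) >= score_threshold or has_bad_association(m)]

traffic = [
    {"from": "teen@example.org", "to": "friend@example.org",
     "text": "my life is so unfair, nobody understands me"},
    {"from": "suspicious@example.net", "to": "someone@example.com",
     "text": "the wire transfer for the safehouse is arranged"},
]
print(triage(traffic))   # only the second message makes it through
```

Notice that everything hangs on the watch list and the thresholds: change those, and a completely different set of people gets handed to the humans at the end.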
So how much material actually gets looked at (or listened to) by human beings? Likely a very small percentage. Unfortunately, that’s a very small percentage of a very big number. Worse still, if someone decided to game the system, or change that key word or pattern list to include questionable terms, then it’s entirely possible that a large number of the wrong people could be investigated at a distance (or up close and personal) by human beings.
And this, according to a recent story in the New York Times, is what the folks at the FBI and other domestic agencies are upset about. Apparently, they’re being asked to investigate a lot of people for the wrong reasons.
Like I said earlier, it’s not unreasonable to assume that the NSA operation is very large, and in some regards very simple. If the wrong information is drawn from it, that would make it very large… and very stupid, in that unique way that only computers can be stupid. Dangerously stupid, for the people who come under its watchful eye.
There’s one last question that needs to be asked: ‘Okay, so a lot of data is being searched. Is it reasonable to believe that my data is being searched?’
The answer to that is yes and no. Are machines likely to be processing your data? Probably, if only to make sure that you’re not worth looking at. Are humans likely to be looking at your data? That depends on what’s being fed into the key word list. The way the system is designed, the answer should be an unequivocal No. Based on what little information is leaking out, however, we can’t be sure of that, and we have some reason to believe that the answer may be yes.