{"id":37,"date":"2006-02-07T09:51:52","date_gmt":"2006-02-06T22:51:52","guid":{"rendered":"http:\/\/scriptorum.imagicity.com\/2006\/02\/07\/nsa-for-dummies\/"},"modified":"2006-02-07T09:51:52","modified_gmt":"2006-02-06T22:51:52","slug":"nsa-for-dummies","status":"publish","type":"post","link":"https:\/\/village-explainer.kabisan.com\/index.php\/2006\/02\/07\/nsa-for-dummies\/","title":{"rendered":"NSA for Dummies"},"content":{"rendered":"<p>There&#8217;s been a lot of discussion recently about the NSA eavesdropping programme, which reportedly has been surveilling US citizens without first getting a warrant. In one of these discussions, someone asked:<\/p>\n<blockquote><p>What&#8217;s the worst case scenario? How big could it be?<\/p><\/blockquote>\n<p>That&#8217;s a really good question. It occurs to me that no one has really  attempted to address this yet in layman&#8217;s terms, so here goes&#8230;.<\/p>\n<p><!--more--><\/p>\n<p>The short answer is: &#8216;<strong>Big. Really really big. Think Google, then add a bit. But fairly simple, too.<\/strong>&#8216;<\/p>\n<p>Let&#8217;s perform a quick &#8211; and necessarily broad &#8211; assessment of the state of the art in Information Retrieval and data managment  technologies, then try to tease out a few reasonable suppositions.<\/p>\n<p>Before I go any further, I have to preface these comments with a  few caveats:<\/p>\n<ul>\n<li> Everything that I&#8217;m about to write is based on supposition. I am  not now, nor have I ever been, a member of any intelligence agency.  I have no certain knowledge about any of the details below.<\/li>\n<li> I do have experience working in the field of knowledge management, Information Retrieval and natural language processing.  
I&#8217;ve written software applications that have been used to search very  large language-based information resources, commonly called &#8216;corpuses&#8217;.<\/li>\n<li> Much of my professional work requires a fairly detailed knowledge  of networking technologies and architectures and their inherent strengths and weaknesses.<\/li>\n<li> I&#8217;m no expert, but my work requires a fairly solid theoretical  understanding of computer and network security.<\/li>\n<\/ul>\n<p>Okay, full disclosure is complete. Let&#8217;s take a quick stroll through the state of computers, computer networks and the Internet  today. We&#8217;ll start by looking at the public search engines.<\/p>\n<p>Google, Yahoo!, Microsoft and About.com all have fairly large and fairly well understood search engines. While they differ in implementation, they all perform essentially the same task:<\/p>\n<ul>\n<li> Periodically (typically once a week or so) visit every public website on the Internet, and take a copy of every HTML (web) page  and image file.<\/li>\n<li> Store all these files (and <em>metadata<\/em>, information about  these files) in a very large repository, often referred to as a  database, but more accurately called an index.<\/li>\n<li> Perform analysis of all the files and the metadata, and use  that information to determine the relation of contents of these  files to data contained in other files.<\/li>\n<li> Respond to requests for information (searches) entered by users  of their system. These searches are all based on <em>key words<\/em>  &#8211; important terms that the searcher thinks will occur more often  in the content they&#8217;re interested in.<\/li>\n<\/ul>\n<p>All of the steps described here are really simple to perform.  A moderately talented programmer could write her own simple searchable  index in a fairly short time. 
The analysis tools of each of the search services vary fairly widely, and are closely guarded trade secrets, but we know enough about them to draw some simple conclusions:<\/p>\n<ul>\n<li> Google uses a system called PageRank, which in its simplest form states, &#8216;<em>A lot of other people link to this page using these key words. If others think this is useful, someone searching for the same keywords will likely find it useful, too.<\/em>&#8217;<\/li>\n<li> For quite some time, Yahoo! relied heavily on people to sift through the web pages its crawler found, and to categorise the content. It worked for a while, but the Web quickly got too big for this process. It&#8217;s reasonable to assume, however, that they applied at the very least the same philosophy of broadly categorising their data in order to reduce the number of pages that have to be searched.<\/li>\n<li> All of the search engines rely on keyword searches.<\/li>\n<\/ul>\n<p>Let&#8217;s look at this last point in more detail. Why don&#8217;t search engines use something other than keyword searches? Well, the answer is that there are so many web pages around (billions and billions, as Carl Sagan used to say) that they need to keep the processes simple. There simply isn&#8217;t enough computing power in the world to cope with really elaborate processes for that much data.<\/p>\n<p>You see, computers are fundamentally different from human brains. They&#8217;re restricted to what&#8217;s sometimes called classical logic, the kind of thing they teach you in freshman philosophy classes. 
Computers are really bad at doing things that humans and most animals do really well: determining patterns from multiple sources, and contextualising the information extremely rapidly.<\/p>\n<p>An example: When you see a slight movement out of the corner of your eye, your brain evaluates the data and within a few tenths of a second, your body has put you in a warning stance. You focus on the movement and try to determine whether the cause of the movement requires a fight or flight reaction, or whether there&#8217;s no cause for alarm. If the moving thing turns out to be a black cat, you might:<\/p>\n<ul>\n<li> Sigh with relief and relax, because that&#8217;s just Bootsie, your house cat.<\/li>\n<li> Feel vague alarm, because cats give you allergies.<\/li>\n<li> Ponder for a moment, trying to remember if any of your neighbours owns a black cat, or if you&#8217;ve seen any lost cat posters in the neighbourhood recently.<\/li>\n<li> Cross yourself and spit, because your grandma taught you that black cats are bad luck, and with the way your day has been going, more bad luck is the last thing you need.<\/li>\n<\/ul>\n<p>As you can see, each of the reactions is derived from a wealth of prior input that your brain has received. The fact that the process described above typically takes no more than a second or two is, from a computer science perspective, nothing short of a miracle. Your brain has sorted through millions of data points and resolved a very complex problem in a fraction of the time it would take a computer.<\/p>\n<p>Nobody quite knows how the brain manages this feat, but we do know that doing the same thing with a computer would require reviewing every single input ever received, and performing a costly (in terms of the number of instructions) comparison between it and the new input. 
Then it would have to permute all these comparisons to come to some kind of useful conclusion.<\/p>\n<p>It&#8217;s not so hard to do if the number of inputs is limited, but doing it with a lot of apparently unassociated inputs quickly increases the &#8216;problem space&#8217; &#8211; the number of possible answers. A simple game of chess involves so many possibilities that the big computer makers use the game as a way of proving whose computer is, uh, bigger, if you know what I mean.<\/p>\n<p>So search engines like to keep things simple. They&#8217;ve got billions of data points to consider, so they can&#8217;t really afford to do much with each of them. The thing that makes them really impressive is the fact that they can do anything useful at all with the volume of information that they deal with every day.<\/p>\n<p>Now let&#8217;s consider the NSA. These guys are really good. They hire the most talented mathematicians, computer scientists and academic researchers they can find. They reward them by giving them the best toys &#8211; I mean, <em>tools<\/em> &#8211; that money can buy, and they motivate them by telling them they&#8217;re doing the single most important thing an American can do: keep the country safe by keeping it one step ahead of the Bad Guys.<\/p>\n<p>Is it reasonable to think that the NSA could surveil the entire Internet, the same way Google and co. do? Sure it is. Google runs most of its operations on beefed-up PCs, not very different from the one I&#8217;m using to write this diary. The NSA has access to hardware and software that Google can only dream about. It&#8217;s reasonable to assume that they can do a much more efficient job than the public search engines.<\/p>\n<p>But there&#8217;s a lot more data than just the web to think about. 
The number of email messages, telephone calls and live computer links absolutely dwarfs the web, especially because it&#8217;s all happening in real time.<\/p>\n<p>So let&#8217;s assume that the NSA, in order to keep up with it all, has to be a few times the size of Google. Is that a reasonable assumption? Just barely. Published estimates of the number of computers Google uses to manage its data run from 60,000 to 100,000 machines, spread out over a number of centres world-wide. They take up a lot of space. But that&#8217;s okay: if Google needs more elbow room, they just cut another check and pay a large service provider to give them more space in their server room.<\/p>\n<p>The NSA, on the other hand, has to work in secrecy, which means that they can&#8217;t plunk servers down all over heck&#8217;s half acre. They do have very large establishments to work in, but they&#8217;re constrained by the fact that the information has to be fed into a limited number of locations. Switches these days can send tens of billions of bytes of data a second, but the volume that the NSA might want to look at could easily choke even the biggest pipe.<\/p>\n<p>How big, then, could the NSA reasonably be? Bigger than Google for sure, with the capacity to perform really simple searches on a significant percentage of the world&#8217;s communications. And there&#8217;s no reason to believe they&#8217;re not doing this. Remember: It&#8217;s the NSA&#8217;s job to keep an eye on the rest of the world. The only place where they face restrictions is when they listen to people in the US.<\/p>\n<p><strong>Hang on a second &#8211; if the NSA operation is so big, then how come we know next to nothing about it? Google and the other search engines can&#8217;t visit my website without leaving a trace; how does the NSA, which could easily be much bigger (and busier!), do all this without leaving a trace?<\/strong><\/p>\n<p>That&#8217;s a whole &#8216;nother article right there. 
Thankfully, the answer to that has been really well addressed by the ACLU in <a href=\"http:\/\/www.aclu.org\/safefree\/nsaspying\/23989res20060131.html\">their article on how the NSA gathers its data<\/a>.<\/p>\n<p>But there&#8217;s another kind of &#8216;Big&#8217; that needs to be considered: <strong>How much can the NSA reasonably do with each bit of information they receive?<\/strong><\/p>\n<p>The answer to that is a lot tougher. Remember: The NSA is not just running a search engine; it&#8217;s trying to recognise potential threats from a bunch of different places &#8211; Internet, telephone, satellite and a bunch more. Just like the person above, it&#8217;s trying to keep an eye on the corners and watch for suspicious movement.<\/p>\n<p>How effective are they? The answer to that&#8217;s bound up in some variables that we can&#8217;t know the value of. We can make a few conservative assumptions, though.<\/p>\n<ul>\n<li> Given enough raw computing power, there are very few cyphers that can&#8217;t be broken. So if the NSA is interested enough, they can probably listen to anyone they want to, even if you strongly encrypt your data.<\/li>\n<li> Given the scale of their operations, they can probably intercept anything they want, but maybe not process it in any depth.<\/li>\n<\/ul>\n<p>It pretty much comes down to this: <strong>You can surveil all of the people some of the time, and you can surveil some of the people all of the time. But you can&#8217;t surveil all of the people all of the time. Not even if you&#8217;re the NSA.<\/strong><\/p>\n<p>And this, folks, is the kicker. What the warrantless wiretapping programme seems to consist of is a data mining operation &#8211; something that looks a lot like what the public search engines do, only the key words that are being searched for are much more limited. 
It&#8217;s kind of like Google, only with a &#8216;TerroristRank&#8217; system instead of PageRank.<\/p>\n<p>It&#8217;s reasonable to assume that it would work something like this:<\/p>\n<ul>\n<li> Use Google-ish key word- and pattern-based methods to winnow out the wheat from the chaff. Most of LiveJournal and MySpace can quickly be discarded, for example &#8211; at least until they embark on the War On Angst.<\/li>\n<li> Perform more complex, &#8216;human-like&#8217; analysis on the small percentage of content that passes the first key word hurdle.<\/li>\n<li> Look for associations between that material and other known dangerous locations and content.<\/li>\n<li> Pass the last few records on to human beings, who are still vastly more efficient at pattern matching and contextualisation.<\/li>\n<\/ul>\n<p>So how much material actually gets looked at (or listened to) by human beings? Likely a <strong>very<\/strong> small percentage. Unfortunately, that&#8217;s a very small percentage of a very big number. Worse still, if someone decided to game the system, or change that key word or pattern list to include questionable terms, then it&#8217;s entirely possible that a large number of the <strong>wrong people<\/strong> could be investigated at a distance (or up close and personal) by human beings.<\/p>\n<p>And this, according to a recent story in the New York Times, is what the folks in the FBI and other domestic agencies are upset about. Apparently, they&#8217;re being asked to investigate a lot of people for the wrong reasons.<\/p>\n<p>Like I said earlier, it&#8217;s not unreasonable to assume that the NSA operation is very large, and in some regards very simple. If the wrong information is drawn from it, that would make it very large&#8230; and very stupid, in that unique way that only computers can be stupid. 
Dangerously stupid, for the people who come under its watchful eye.<\/p>\n<p>There&#8217;s one last question that needs to be asked: <strong>&#8216;Okay, so a lot of data is being searched. Is it reasonable to believe that <em>my<\/em> data is being searched?&#8217;<\/strong><\/p>\n<p>The answer to that is yes and no. Are machines likely to be processing your data? Probably, if only to make sure that you&#8217;re not worth looking at. Are humans likely to be looking at your data? That depends on what&#8217;s being fed into the key word list. The way the system is designed, the answer should be an unequivocal No. Based on what little information is leaking out, however, we can&#8217;t be sure of that, and we have some reason to believe that the answer may be yes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There&#8217;s been a lot of discussion recently about the NSA eavesdropping programme, which reportedly has been surveilling US citizens without first getting a warrant. In one of these discussions, someone asked: What&#8217;s the worst case scenario? How big could it be? That&#8217;s a really good question. 
It occurs to me that no one has really [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,5,10,12],"tags":[278,407],"class_list":["post-37","post","type-post","status-publish","format-standard","hentry","category-geek","category-journamalism","category-soft-core","category-wonk","tag-information-retrieval","tag-nsa"],"_links":{"self":[{"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/posts\/37","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/comments?post=37"}],"version-history":[{"count":0,"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/posts\/37\/revisions"}],"wp:attachment":[{"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/media?parent=37"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/categories?post=37"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/village-explainer.kabisan.com\/index.php\/wp-json\/wp\/v2\/tags?post=37"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}