October 2011 – The Village Explainer

Okay, look: Gallows humour aside (for the moment), Steve Jobs doesn’t deserve our reverence. He deserves our respect, yes, for being one of the few people in the industry to actually think about how people use hardware. He was a great hardware designer in part because of his obsession with detail and his absolute inability to compromise on a principle.

I admire him for that. And I’m more than a little disgusted to hear about Jobs’ ‘visionary’ genius from the likes of Ballmer and Gates – who, not to put too fine a point on it, wouldn’t know a good design if it slapped them in the face with a dead salmon.

Who the fuck are they to judge? And who the fuck are we to listen?

No, the thing we need to admire about Jobs – the thing we need to LEARN from Steve Jobs – is how he thought, how he never stopped trying to make things simpler, how he utterly refused to compromise, how he refused to accept ‘improvement’ as the criterion for success. Improvement was necessary, of course, and relentlessly pursued, but it was the means to another end…

And that was good design. Something the technological world knows far too little about. And with his passing, most of its collective knowledge and ability goes with him.

If you really want to show respect and admiration for Steve Jobs, understand him.

Emulate him. Let them call you arrogant and impolite if they must, but be a perfectionist. Be unforgiving, cruel even, to yourself and others. But be simple and clear, too. If you do that, then one day you might – just might – do one perfect thing.

I was stumped for a bit, trying to figure out how to help my editorial staff avoid uploading the same file twice. In a repository spanning tens of thousands of titles in over a hundred different collections, our staff can’t easily tell whether a document is already in a collection or not.

Turns out that finding duplicate attachments is fairly easy. First create the view:

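function(doc) {
  // Only documents that carry attachments are indexed.
  if (doc._attachments) {
    // Attachment names are the keys of the _attachments object.
    for (var name in doc._attachments) {
      // Composite key: [collection, attachment filename]; value: the owning document's ID.
      emit([doc.collection, name], doc._id);
    }
  }
}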

Which returns rows whose keys look like this:

["collection name", "filename.rtf"]

So all I have to do to find the duplicates is query that view using the composite key and see if it returns any rows:

http://my.couchdb.server:5984/database-name/_design/my-listings/_view/attachment-exists?key=["collection name","filename.rtf"]
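For what it's worth, here is a rough sketch of how an upload form might build that query and read the answer. The helper names attachmentExistsUrl and existingDocIds are made up for illustration, and the sketch assumes the response body has already been fetched and parsed as JSON. The key has to be valid JSON and URL-encoded before it goes on the wire.

```javascript
// Hypothetical helper: builds the view URL for one [collection, filename] key.
// baseUrl is the database URL, e.g. http://my.couchdb.server:5984/database-name
function attachmentExistsUrl(baseUrl, collection, filename) {
  // CouchDB expects the key as JSON, so serialise it and then URL-encode it.
  var key = JSON.stringify([collection, filename]);
  return baseUrl + '/_design/my-listings/_view/attachment-exists?key=' +
    encodeURIComponent(key);
}

// Hypothetical helper: given the parsed JSON body of the view response,
// returns the IDs of the documents that already hold this attachment.
// The view emits doc._id as the value, so each row's value is a document ID.
function existingDocIds(viewResponse) {
  return viewResponse.rows.map(function (row) { return row.value; });
}
```

An empty list from existingDocIds means the file name is not yet in that collection; anything else is a duplicate, and the IDs say where the existing copies live.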

I could do the same with MD5 checksums, too, but I won’t. The problem is that even a single character change is enough to make two documents different. So if someone opens their copy of a file and Word changes the metadata in it, it’s no longer byte-for-byte identical, even though the text has not changed. This means that the number of false negatives (i.e. duplicate files that are NOT found) would be too high for people to rely on.
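To make the checksum problem concrete, here is a small sketch using Node's built-in crypto module. The two strings are invented stand-ins for a file and a re-saved copy of it, not real Word documents, and md5Hex is just an illustrative helper name.

```javascript
var crypto = require('crypto');

// Returns the hex MD5 digest of a string or Buffer.
function md5Hex(data) {
  return crypto.createHash('md5').update(data).digest('hex');
}

// Two made-up stand-ins for "the same file": the body text matches, but the
// second copy has one extra byte in its metadata line.
var original = 'Title: Report\nmodified-by: alice\n\nSame body text.';
var resaved  = 'Title: Report\nmodified-by: alice \n\nSame body text.';

console.log(md5Hex(original) === md5Hex(original)); // identical bytes: digests match
console.log(md5Hex(original) === md5Hex(resaved));  // one byte differs: digests differ
```

A reader would call those two the same document; the checksum does not.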

What I’d really like to find is an algorithm that determines whether the textual content of two documents is significantly similar….


Find Duplicate File Names in CouchDB