[NTLUG:Discuss] Any "duplicate files" utilities?

Rick Renshaw bofh69 at yahoo.com
Mon Apr 14 19:09:11 CDT 2008


--- Steve Baker <steve at sjbaker.org> wrote:

> So the big time crunch is running 'cksum' a third of a million times.   
> OK - so how about this - first list only files that have a length equal 
> to some other file....
> 
> 'find -type f -size +0c -printf "%8s %p\\n" ' processes a third of a 
> million files along with their lengths in 13 seconds (and skips empty 
> files) - and piping that into 'sort -n | uniq --check-chars=8 -D'
> produces a list of files whose length equals the length of some other
> file - and runs in under 20 seconds.
> 
> Running cksum on the result is still going to take a minute per 10,000 
> files - but at least you've eliminated files that can't possibly have 
> duplicates.

I wrote a similar program for my Perl and Java classes back in college (I was
taking them at the same time, so writing a similar program in both languages
saved design time), but it only checked one directory.  I checked the sizes
first, too, since if two files aren't the same size there's no point wasting
time running checksums.  I did add one extra step to catch the occasional
checksum collision: when files matched in both size and checksum, I also ran
MD5 on them, and only reported them as duplicates if that matched as well.
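
The logic was roughly the following (a minimal Perl sketch of that
size -> cksum -> MD5 approach, not the original class program; it shells
out to cksum and only scans a single directory given on the command line):

#!/usr/bin/perl
# Rough sketch: size buckets -> cksum buckets -> MD5 confirmation.
use strict;
use warnings;
use Digest::MD5;

my $dir = shift || '.';
opendir(my $dh, $dir) or die "Can't open $dir: $!";
my @files = grep { -f $_ && -s $_ } map { "$dir/$_" } readdir($dh);
closedir($dh);

# Pass 1: bucket by size - files of different sizes can't be duplicates.
my %by_size;
push @{ $by_size{ -s $_ } }, $_ for @files;

for my $group (grep { @$_ > 1 } values %by_size) {

    # Pass 2: bucket the same-size files by cksum (cheap CRC).
    my %by_cksum;
    for my $f (@$group) {
        my ($sum) = split ' ', `cksum "$f"`;
        push @{ $by_cksum{$sum} }, $f;
    }

    # Pass 3: confirm with MD5 to weed out the rare CRC collision.
    for my $same (grep { @$_ > 1 } values %by_cksum) {
        my %by_md5;
        for my $f (@$same) {
            open(my $fh, '<', $f) or next;
            binmode($fh);
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $f;
            close($fh);
        }
        print "Duplicates: @$_\n" for grep { @$_ > 1 } values %by_md5;
    }
}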

