[NTLUG:Discuss] Any "duplicate files" utilities?
Rick Renshaw
bofh69 at yahoo.com
Mon Apr 14 19:09:11 CDT 2008
--- Steve Baker <steve at sjbaker.org> wrote:
> So the big time crunch is running 'cksum' a third of a million times.
> OK - so how about this - first list only files that have a length equal
> to some other file....
>
> 'find -type f -size +0c -printf "%8s %p\n"' processes a third of a
> million files along with their lengths in 13 seconds (and skips empty
> files) - and piping that into 'sort -n | uniq --check-chars=8 -D'
> produces a list of files whose length equals the length of some other
> file - and runs in under 20 seconds.
>
> Running cksum on the result is still going to take a minute per 10,000
> files - but at least you've eliminated files that can't possibly have
> duplicates.
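
Putting those two steps together, something like the sketch below should
dump the size-duplicate candidates and then their checksums.  This is
untested; it assumes GNU find/sort/uniq/xargs, filenames without embedded
newlines, and the intermediate file names are just placeholders:

    # Stage 1: keep only files whose size matches some other file's size.
    # %8s right-aligns the size in an 8-column field, so uniq can compare
    # just that field with --check-chars=8.
    find . -type f -size +0c -printf "%8s %p\n" \
        | sort -n \
        | uniq --check-chars=8 --all-repeated > same-size.txt

    # Stage 2: checksum only those candidates; identical CRCs sort together.
    sed 's/^ *[0-9]* //' same-size.txt \
        | xargs -d '\n' cksum \
        | sort -n > candidate-sums.txt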
I wrote a similar program for my perl and Java classes back in college (I was
taking them at the same time, so writing a similar program in both languages
saved design time), but it only checked one directory.  I checked the sizes
first as well, because if two files aren't the same size there's no point in
wasting time running checksums.  I did add one extra step to catch the
occasional checksum collision: when files matched in both size and checksum,
I also ran MD5 on them, and only if that matched too did I report them as
duplicates.
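
For what it's worth, a rough shell version of that final confirmation pass
might look like the sketch below, working from the candidate-sums.txt file
in the sketch above (untested; assumes GNU awk, xargs and md5sum):

    # Stage 3: for files whose size and CRC both matched, confirm with MD5.
    # cksum prints "CRC length filename", so group lines on the first two
    # fields and keep only groups with more than one member.
    awk '{ key = $1 " " $2; grp[key] = grp[key] $0 "\n"; n[key]++ }
         END { for (k in n) if (n[k] > 1) printf "%s", grp[k] }' candidate-sums.txt \
        | sed 's/^[0-9]* [0-9]* //' \
        | xargs -d '\n' md5sum \
        | sort \
        | uniq --check-chars=32 --all-repeated

Only the files that survive the size and CRC filters ever get hashed, so the
extra MD5 pass costs next to nothing.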