[NTLUG:Discuss] Any "duplicate files" utilities?
Steve Baker
steve at sjbaker.org
Sun Apr 13 23:25:10 CDT 2008
Well, a single all-in-one program could beat it - but probably not by
much. Almost all of the time ought to be spent opening every single
file and calculating a checksum for it. After that, the sorting and
uniq'ing is really quick.
On my rather ancient HP laptop (which has 300,000 files on its hard drive):
* 'find -type f' lists a third of a million files in 10 seconds.
* 'find -type f -exec cksum {} \;' takes 30 minutes to process the same
files.
...but doing a sort and uniq on the 300,000 line results file to find
the files with the same checksums only took another 15 seconds.
So the big time crunch is running 'cksum' a third of a million times.
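In script form, that brute-force pass boils down to something like this
(a rough sketch only - the temp file name is mine, and it assumes GNU
uniq for '--check-chars'; cksum puts the CRC in the first column, so
padding it to a fixed width lets uniq key on just that column):

  # The slow part: checksum every regular file (~30 minutes here).
  find -type f -exec cksum {} \; > /tmp/sums

  # The quick part: pad the checksum to a fixed 10-character column,
  # sort, and print every line whose checksum matches another line's.
  awk '{ printf "%10s %s\n", $1, $0 }' /tmp/sums |
      sort | uniq --check-chars=10 -D
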
OK - so how about this - first list only files whose length matches
that of some other file....
'find -type f -size +0c -printf "%8s %p\n"' processes a third of a
million files along with their lengths in 13 seconds (and skips empty
files) - and piping that into 'sort -n | uniq --check-chars=8 -D'
produces a list of files whose length equals the length of some other
file, in under 20 seconds total.
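As a single command (the output file name here is just a placeholder):

  # Candidate list: every non-empty file whose size - padded to 8
  # characters by the %8s format - matches some other file's size.
  find -type f -size +0c -printf "%8s %p\n" |
      sort -n | uniq --check-chars=8 -D > /tmp/samesize
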
Running cksum on the result is still going to take a minute per 10,000
files - but at least you've eliminated files that can't possibly have
duplicates.
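Something like the following would do it - a sketch only, continuing
from the candidate list above (it assumes GNU xargs for the -d option,
file names without embedded newlines, and no file big enough to
overflow the 8-character size column):

  # Strip the 8-character size column plus the separating space, then
  # checksum only the candidate files.
  cut -c10- /tmp/samesize | xargs -d '\n' cksum |
      awk '{ printf "%10s %s\n", $1, $0 }' |
      sort | uniq --check-chars=10 -D > /tmp/dups

  # If "almost certainly identical" isn't good enough, finish with a
  # byte-for-byte comparison ('cmp' or 'diff -s') on each matching pair.
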
Daniel Hauck wrote:
> Thanks! I have found some utilities, including some of the ones listed
> below, but they carry so much overhead that they don't actually work
> on file collections as large as mine. These utilities aren't light
> enough...a script will likely do the job better.
>
> Steve Baker wrote:
>
>> I'm not going to write the script for you - but it's not difficult. In
>> outline: I would collect a list of filenames and checksums using
>> 'find'. It's convenient because you can set its parameters to exclude
>> things you don't want to test - not looking in cross-mounted file
>> systems, or in areas of the disk where system files live...etc. You
>> can give it an '-exec' parameter to run the 'cksum' tool on every file
>> that it finds. Do that right and you now have a L-O-N-G list of
>> checksums and corresponding filenames - you can sort by whichever
>> column the checksums ended up in using 'sort' and pipe that into
>> 'uniq -d -sXX' to get a list of just those files that have duplicated
>> checksums. That gives you a very short list of files that are ALMOST
>> guaranteed to be identical...if "ALMOST" is good enough then you're
>> done - if not, use 'diff -s' to be absolutely certain. I'm not sure
>> how you intend to decide which of two identical files to remove - but
>> the list will probably be a short enough one to deal with
>> manually...or you could replace one of the files with a link to the
>> other file.
>>
>> geoffrey at justaweebitcloser.com wrote:
>>
>>>> I'm sure there must be at least half a dozen such utilities out there, I
>>>> just can't think of any for Linux.
>>>>
>>>> I seek a simple script or utility or whatever to find duplicate files on
>>>> media so that I can trim some redundancies out.
>>>
>>> A quick search for "duplicate" on Freshmeat resulted in the following:
>>>
>>> http://freshmeat.net/projects/fdmf/
>>> http://freshmeat.net/projects/freedup/
>>> http://freshmeat.net/projects/dupseek/
>>> http://freshmeat.net/projects/fdupes/
>>> http://freshmeat.net/projects/dupefinder/
>>> http://freshmeat.net/projects/duper/
>>>
>>> There were more than that, but these were on the front page.
>>>
>>> --
>>> Geoffrey
>>>