[NTLUG:Discuss] Any "duplicate files" utilities?

Steve Baker steve at sjbaker.org
Sun Apr 13 23:25:10 CDT 2008


Well, a single all-in-one program could beat it - but probably not by 
much.  Almost all of the time ought to go into opening every single file 
and calculating a checksum for it.  After that, the sorting and uniq'ing 
is really quick.

On my rather ancient HP laptop (which has 300,000 files on its hard drive):

* 'find -type f' lists a third of a million filenames in 10 seconds.
* 'find -type f -exec cksum {} \;'  takes 30 minutes to process the same 
files.

...but doing a sort and uniq on the 300,000 line results file to find 
the files with the same checksums only took another 15 seconds.
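
For what it's worth, the two steps glue together roughly like this (an 
untested sketch - /tmp/sums is just a scratch file, and I've used awk 
rather than uniq to pick out the repeated checksums, because the CRC 
field that cksum prints isn't fixed-width):

   # checksum every file (the slow part) - each output line is
   # "CRC length filename"
   find . -type f -exec cksum {} \; > /tmp/sums

   # keep only the lines whose CRC (the first field) occurs more than once
   awk 'NR == FNR { seen[$1]++; next } seen[$1] > 1' /tmp/sums /tmp/sums \
      | sort -n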

So the big time crunch is running 'cksum' a third of a million times.   
OK - so how about this: first list only the files whose length matches 
the length of some other file....

'find -type f -size +0c -printf "%8s %p\n"' lists the same third of a 
million files along with their lengths in 13 seconds (and skips empty 
files) - and piping that into 'sort -n | uniq --check-chars=8 -D' gives 
a list of files whose length equals the length of some other file - the 
whole thing runs in under 20 seconds.
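
Spelled out in one piece, that pre-filter is just the commands above 
glued together (untested - I'm parking the output in /tmp/samesize, the 
name is arbitrary, for the next step):

   # non-empty files, printed as "    size filename" with the size padded
   # to 8 characters so uniq can compare on a fixed-width field; sizes
   # over 8 digits only risk extra candidates, never missed ones
   find . -type f -size +0c -printf "%8s %p\n" \
      | sort -n \
      | uniq --check-chars=8 -D > /tmp/samesize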

Running cksum on the result is still going to take a minute per 10,000 
files - but at least you've eliminated files that can't possibly have 
duplicates.
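
Assuming the same-size list from the previous step is sitting in 
/tmp/samesize, the last pass might look something like this - again 
untested, and it leans on GNU xargs for the -d option:

   # strip the leading size field, checksum only the candidate files,
   # then keep the lines whose CRC repeats - those are the real duplicates
   sed 's/^ *[0-9]* //' /tmp/samesize \
      | xargs -d '\n' cksum > /tmp/candsums
   awk 'NR == FNR { seen[$1]++; next } seen[$1] > 1' \
      /tmp/candsums /tmp/candsums | sort -n

Filenames containing newlines will still confuse it, but for a quick 
clean-up pass that's usually an acceptable risk.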

Daniel Hauck wrote:
> Thanks!  I have found some utils, and some were the ones listed below,
> but they seem to be so heavy on overhead that they don't actually
> work on file collections as large as mine.  I think these utilities
> aren't light enough...a script will likely do the job better.
>
> Steve Baker wrote:
>   
>> I'm not going to write the script for you - but it's not difficult.  In 
>> outline: I would collect a list of filenames and checksums using 
>> 'find'.   It's convenient because you can set its parameters to exclude 
>> things you don't want to test - not look in cross-mounted file systems, 
>> or in areas of the disk where system files live...etc.   You can give it 
>> an '-exec' parameter to run the 'cksum' tool on every file that it 
>> finds.   Do that right and you now have a L-O-N-G list of checksums and 
>> corresponding filenames - you can sort by whichever column the checksums 
>> ended up in using 'sort' and pipe that into 'uniq -d -sXX' to get a list 
>> of just those files that have duplicated checksums.  That gives you a 
>> very short list of files that are ALMOST guaranteed to be identical...if 
>> "ALMOST" is good enough then you're done - if not, use 'diff -s' to be 
>> absolutely certain.  I'm not sure how you intend to decide which of two 
>> identical files to remove - but the list will probably be a short enough 
>> one to deal with manually...or you could replace one of the files with a 
>> link to the other file.
>>
>> geoffrey at justaweebitcloser.com wrote:
>>     
>>>> I'm sure there must be at least half a dozen such utilities out there; I
>>>> just can't think of any for Linux.
>>>>
>>>> I seek a simple script or utility or whatever to find duplicate files on
>>>> media so that I can trim some redundancies out.
>>>>
>>> A quick search for "duplicate" on Freshmeat resulted in the following:
>>>
>>> http://freshmeat.net/projects/fdmf/
>>> http://freshmeat.net/projects/freedup/
>>> http://freshmeat.net/projects/dupseek/
>>> http://freshmeat.net/projects/fdupes/
>>> http://freshmeat.net/projects/dupefinder/
>>> http://freshmeat.net/projects/duper/
>>>
>>> There were more than that, but these were on the front page.
>>>
>>> --
>>> Geoffrey
>>>