[NTLUG:Discuss] Simple file dedupe script

Sun Jul 18 11:13:15 CDT 2010

On Sat, Jul 17, 2010 at 11:23:06PM -0500, Chris Cox wrote:
> I was answering a forum post where somebody wanted to find all duplicate
> images within a directory.  I had always wanted to write a file dedupe script...
> so here's my take:
> 
> http://endlessnow.com/ten/Source/dedupe-sh.txt

Impressive.  This is almost exactly how I've planned to solve this
problem in the past (size, inode, md5sum, cmp), but I never realized it
could be done in a single pipe.
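
For anyone following along, here is a rough sketch of the staged idea
as a single pipe.  It is not Chris's script: find_dupes is a made-up
name, it assumes GNU find, xargs, md5sum and uniq, it assumes file
names without newlines or backslashes, and it leaves out the inode
(hard link) check and the final cmp confirmation.

```shell
#!/bin/sh
# Sketch only: list files under a directory that share both size and MD5.
# Assumes GNU tools and file names without newlines or backslashes.
find_dupes() {
    dir=$1
    # Stage 1: emit "size<TAB>path" for every regular file (no contents read).
    find "$dir" -type f -printf '%s\t%p\n' |
    # Stage 2: keep only the paths whose size occurs more than once.
    awk -F'\t' '
        { count[$1]++; size[NR] = $1; line[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++)
                if (count[size[i]] > 1)
                    print substr(line[i], length(size[i]) + 2)
        }' |
    # Stage 3: checksum the remaining candidates, one path per input line.
    xargs -d '\n' -r md5sum |
    sort |
    # Stage 4: print groups whose first 32 characters (the digest) repeat.
    uniq -w 32 --all-repeated=separate
}
```

Calling find_dupes with a directory prints one group of "digest  path"
lines per set of likely duplicates, with a blank line between groups.
Note that the awk stage holds the whole file list in memory, so this is
only meant to show the shape of the pipe.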

I have disks with 7-digit inode numbers, so I would change -D 6 to -D 7
(on a 6-digit inode it would also compare the trailing space).

99% of the expense is in md5sum and cmp, so reducing the number of
false size matches is very important.

Since file names could have spaces, your cut command could truncate file names.

I'd replace:

sed 's/^[ ][ ]*//' | cut -f2- -d' '

with 

sed 's/^[ ][ ]*[0-9]*[ ]*//'
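
To see what the two forms do, here is a small test on a made-up line.
The line format (padding, a numeric field, two spaces, then a name
containing a space) is my assumption for illustration, not necessarily
what the script produces at that point.

```shell
#!/bin/sh
# Hypothetical input: leading padding, a number, two spaces, a name with a space.
line='  123456  my file.jpg'

# Original form: strip leading spaces, then keep everything after the first space.
orig=$(printf '%s\n' "$line" | sed 's/^[ ][ ]*//' | cut -f2- -d' ')

# Suggested form: strip leading spaces, the digits, and all spaces after them.
new=$(printf '%s\n' "$line" | sed 's/^[ ][ ]*[0-9]*[ ]*//')

# Brackets make any stray leading space visible.
printf 'cut: [%s]\n' "$orig"
printf 'sed: [%s]\n' "$new"
```

On this sample the cut form prints "[ my file.jpg]" with a stray
leading space, and the sed form prints "[my file.jpg]".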

> Feel free to modify, etc.
> 
> The result requires the final step... that is, making SURE they truly are
> duplicates (likely) and then ... well... what to do with the information.
> 
> Enjoy,
> Chris

Thank you!  It's already running on 1.1 TB / 7 million inodes (45M
directory entries).

My modified version is at http://gist.github.com/480511 .  It includes
all the original text, and a pointer to your location for the file.

-- 

Wayne Walker
wwalker at solid-constructs.com
(512) 633-8076
Senior Consultant
Solid Constructs, LLC