[NTLUG:Discuss] Simple file dedupe script

Chris Cox cjcox at acm.org
Mon Jul 19 10:30:28 CDT 2010



On Sun, 2010-07-18 at 11:13 -0500, Wayne Walker wrote:
> On Sat, Jul 17, 2010 at 11:23:06PM -0500, Chris Cox wrote:
> > I was answering a forum post where somebody wanted to find all duplicate
> > images within a directory.  I had always wanted to write a file dedupe script...
> > so here's my take:
> > 
> > http://endlessnow.com/ten/Source/dedupe-sh.txt
> 
> Impressive.  This is almost exactly how I've planned to solve this
> problem in the past (size, inode, md5sum, cmp); I never realized it
> could be done in a single pipe.
> 
> I have disks with 7-digit inode #'s.  So, I would change -D 6 to -D 7
> (with 6-digit inodes it would compare the trailing space).
> 
> The expense is 99% md5sum and cmp, so reducing the number of false
> size matches is very important.
> 
> Since file names could have spaces, your cut command could truncate file names.
> 
> I'd replace:
> 
> sed 's/^[ ][ ]*//' | cut -f2- -d' '
> 
> with 
> 
> sed 's/^[ ][ ]*[0-9]*[ ]*//'
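To illustrate Wayne's point, here is a contrived input line (not
necessarily the exact format my script emits) run through both
versions.  Once there is more than one space after the size, the cut
version leaves a stray leading space on the name, while the sed
replacement strips the whole numeric field cleanly:

$ printf '%s\n' '   4096  two  spaces.jpg' | sed 's/^[ ][ ]*//' | cut -f2- -d' '
 two  spaces.jpg
$ printf '%s\n' '   4096  two  spaces.jpg' | sed 's/^[ ][ ]*[0-9]*[ ]*//'
two  spaces.jpg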
> 
Well.. there are some errors in my logic with regard to the sorts and
such.  Unfortunately, the tools don't output fixed field widths, and
things like uniq don't have a delimiter option... so...
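
For what it's worth, GNU uniq can at least skip leading blank-separated
fields or limit how many characters it compares; it just can't change
the delimiter:

# assuming GNU coreutils:
#   uniq -f N   skip the first N blank-separated fields
#   uniq -w N   compare at most the first N characters of each line
#   uniq -D     print every member of each group of duplicate lines
# e.g. group files by checksum and show only the collisions:
md5sum ./*.jpg | sort | uniq -D -w 32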

The questions I have are about large sizes and large inode numbers.
I know that some tools, when a value overflows the presumed field
width, will just run it together with the rest of the fields; other
times you'll still get the whitespace delimiter.

It's LIKELY that the elements will have to be isolated, possibly
padded, and sometimes also delimited in order for this to work
correctly... let me ponder for a moment... unless somebody else
already has the fixes in mind.
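
Something along these lines is the sort of thing I'm picturing.  Just
a rough sketch (assuming GNU find/sort/uniq, and not the script at the
URL above), with the size padded to a fixed width so the sort/uniq
comparisons can't shift when sizes get long; the inode/hardlink
handling and the final cmp verification pass are left out:

#!/bin/sh
# rough sketch, assuming GNU find/sort/uniq; filenames containing
# newlines are not handled
find . -type f -printf '%20s %p\n' |
  LC_ALL=C sort |             # group lines by the padded size field
  uniq -D -w 20 |             # keep only files whose sizes collide
  sed 's/^[ ]*[0-9]*[ ]//' |  # strip the size field, keep the name
  while IFS= read -r name; do
    md5sum "$name"
  done |
  LC_ALL=C sort |             # group by checksum
  uniq -D -w 32               # report the md5 collisions

Padding the size to a fixed 20 characters is what keeps the
uniq -w 20 comparison honest no matter how many digits the size has.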

Thanks for the updates (and for making me rethink this).

