[NTLUG:Discuss] Simple file dedupe script
Chris Cox
cjcox at acm.org
Mon Jul 19 10:45:02 CDT 2010
Ok... my changes (and I'll update the file link as well):
find "$@" -type f -print0 |
xargs -0 ls -sd | sort -k1bn |
awk '{num=$1;$1="";printf("%10d%s\n",num,$0);}' | uniq -w 10 -D |
sed 's/^[ ]*[0-9]* //' | tr '\012' '\000' |
xargs -0 ls -id | sort |
awk '{num=$1;$1="";printf("%10d%s\n",num,$0);}' | uniq -w 10 |
sed 's/^[ ]*[0-9]* //' | tr '\012' '\000' |
xargs -0 md5sum | sort | uniq -w 32 -D
Yes... I know.. I used awk... sigh....
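
For anyone wondering why the awk step is there: uniq has no delimiter option, so the leading number is padded to a fixed 10 columns and uniq -w 10 then compares only that number. A minimal sketch of just that step follows, assuming GNU uniq (-w and -D are GNU options). The sample file names and sizes are made up for illustration.

```shell
# Hypothetical sample input: "<size> <name>" lines, already sorted by size.
sample='4 a.jpg
4 b c.jpg
12 d.jpg
120 e.jpg
120 f.jpg'

# Pad the leading number to a fixed 10 columns so that uniq -w 10 compares
# only the number, then keep every line whose number repeats (-D, GNU uniq).
dupes=$(printf '%s\n' "$sample" |
  awk '{num=$1;$1="";printf("%10d%s\n",num,$0);}' |
  uniq -w 10 -D)

printf '%s\n' "$dupes"
```

With this input, the two size-4 lines and the two size-120 lines survive and d.jpg is dropped. Without the padding, a narrow -w would treat 12 and 120 as the same prefix.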
On Mon, 2010-07-19 at 10:30 -0500, Chris Cox wrote:
>
> On Sun, 2010-07-18 at 11:13 -0500, Wayne Walker wrote:
> > On Sat, Jul 17, 2010 at 11:23:06PM -0500, Chris Cox wrote:
> > > I was answering a forum post where somebody wanted to find all duplicate
> > > images within a directory. I had always wanted to write a file dedupe script...
> > > so here's my take:
> > >
> > > http://endlessnow.com/ten/Source/dedupe-sh.txt
> >
> > Impressive. This is almost exactly how I've planned to solve this
> > problem in the past (size, inode, md5sum, cmp), I never realized it
> > could be done in a single pipe.
> >
> > I have disks with 7 digit inode #'s. So, I would change -D 6 to -D 7
> > (on 6 digit inode it would compare the trailing space).
> >
> > The expense is 99% md5sum and cmp, so decreasing the invalid size
> > matches is very important.
> >
> > Since file names could have spaces, your cut command could truncate file names.
> >
> > I'd replace:
> >
> > sed 's/^[ ][ ]*//' | cut -f2- -d' '
> >
> > with
> >
> > sed 's/^[ ][ ]*[0-9]*[ ]*//'
> >
> Well.. there are some errors in my logic with regards to the sorts, and
> such. Unfortunately, the tools don't output fixed field lengths and
> things like uniq don't have a delimiter option... so...
>
> Questions I have are with regards to large sizes and large inode sizes.
> I know that in some tools, they'll just run them together with the rest
> of the fields on overflow of the presumed field size, other times,
> you'll still get the whitespace delimiter.
>
> It's LIKELY that elements will have to be isolated, possibly padded and
> sometimes also delimited in order for this to work correctly... let me
> ponder for a moment... unless somebody else already has the fixes in
> mind.
>
> Thanks for the updates (and for making me rethink this).
>
>
>
>