[NTLUG:Discuss] Simple file dedupe script
Wayne Walker
wwalker at solid-constructs.com
Sun Jul 18 11:13:15 CDT 2010
On Sat, Jul 17, 2010 at 11:23:06PM -0500, Chris Cox wrote:
> I was answering a forum post where somebody wanted to find all duplicate
> images within a directory. I had always wanted to write a file dedupe script...
> so here's my take:
>
> http://endlessnow.com/ten/Source/dedupe-sh.txt
Impressive. This is almost exactly how I've planned to solve this
problem in the past (size, inode, md5sum, cmp), but I never realized it
could be done in a single pipe.
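For anyone following along without the script open, here is a minimal
sketch of the size-then-checksum idea as one pipe. It is not Chris's
script: candidate_dupes is a made-up name, it assumes GNU find, xargs,
md5sum and uniq, and it assumes file names contain no newlines, tabs or
backslashes.

```shell
# Sketch only: print groups of files that share both size and md5 sum.
# candidate_dupes is a hypothetical helper, not part of the original script.
candidate_dupes() {
    # Stage 1: emit "size<TAB>path" for every regular file.
    find "$1" -type f -printf '%s\t%p\n' |
        # Stage 2: keep only paths whose size occurs more than once.
        awk -F'\t' '
            { size[NR] = $1; path[NR] = substr($0, length($1) + 2); count[$1]++ }
            END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print path[i] }
        ' |
        # Stage 3: checksum the survivors (NUL-separated so spaces are safe).
        tr '\n' '\0' |
        xargs -0 -r md5sum |
        sort |
        # Stage 4: compare only the 32 hex digits and print repeated groups.
        uniq -w32 --all-repeated=separate
}
```

Calling it as candidate_dupes /some/dir prints blank-line-separated
groups of "md5sum  path" lines. The groups are still only candidates
until cmp confirms them.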
I have disks with 7-digit inode numbers, so I would change -D 6 to -D 7
(on a 6-digit inode it would compare the trailing space).
99% of the expense is md5sum and cmp, so reducing the number of false
size matches is very important.
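To illustrate the ordering of cheap checks before expensive ones, here
is a rough sketch of a pairwise check. same_content is a made-up helper
name, and the stat -c format strings assume GNU stat.

```shell
# Sketch only: confirm two paths hold identical content, cheapest checks first.
# same_content is a hypothetical helper; stat -c assumes GNU coreutils.
same_content() {
    # Different sizes can never be duplicates, so skip reading any data.
    [ "$(stat -c %s -- "$1")" = "$(stat -c %s -- "$2")" ] || return 1
    # Same device and inode means a hard link: identical by definition.
    [ "$(stat -c %d:%i -- "$1")" = "$(stat -c %d:%i -- "$2")" ] && return 0
    # Only now pay for the byte-by-byte comparison.
    cmp -s -- "$1" "$2"
}
```

Usage would be something like: same_content a.jpg b.jpg && echo duplicate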
Since file names could have spaces, your cut command could mangle file
names (for example, when more than one space follows the number).
I'd replace:
sed 's/^[ ][ ]*//' | cut -f2- -d' '
with
sed 's/^[ ][ ]*[0-9]*[ ]*//'
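A quick way to compare the two forms is to run both on one line. The
sample line below is made up (a padded number, two spaces, then a name
containing a space); it is not taken from the script's real output.

```shell
# Hypothetical sample line: padded number, two spaces, a name with a space.
line='  123456  my photo.jpg'

# Original form: strip leading spaces, then keep everything after the first space.
old=$(printf '%s\n' "$line" | sed 's/^[ ][ ]*//' | cut -f2- -d' ')

# Suggested form: strip leading spaces, the digits, and any spaces after them.
new=$(printf '%s\n' "$line" | sed 's/^[ ][ ]*[0-9]*[ ]*//')

# Brackets make stray leading or trailing spaces visible.
printf 'old=[%s]\nnew=[%s]\n' "$old" "$new"
# prints: old=[ my photo.jpg]
#         new=[my photo.jpg]
```

One caveat with the sed form: it also removes spaces that belong to the
start of a file name, and it only strips anything when the line begins
with at least one space.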
> Feel free to modify, etc.
>
> The result requires the final step... that is, making SURE they truly are
> duplicates (likely) and then ... well... what to do with the information.
>
> Enjoy,
> Chris
Thank you! It's already running on 1.1 TB / 7 million inodes (45M
directory entries).
My modified version is at http://gist.github.com/480511 . It includes
all the original text, and a pointer to your location for the file.
--
Wayne Walker
wwalker at solid-constructs.com
(512) 633-8076
Senior Consultant
Solid Constructs, LLC