[NTLUG:Discuss] eliminating lines with the same information
Carl Haddick
sysmail at glade.net
Mon Jul 2 14:39:48 CDT 2007
Howdy from a lurker,
I rarely post here, mostly due to time constraints, but I appreciate the
chance to see the discussion.
This is pretty much the same thing I did a few weeks ago. Being
somewhat of a nutcase for Python, here's what I used:
#!/usr/local/bin/python
import re,sys,sets

# Addresses we've already seen.
addrset=sets.Set()
# Loose pattern: user@host, with letters, digits, '_', '-', '.' on both sides.
addrre=re.compile('([0-9a-z_\-\.]+@[0-9a-z_\-\.]+)',re.I)
for l in file(sys.argv[1]):
    m=addrre.search(l)
    if m:
        if not m.group(1) in addrset:
            addrset.add(m.group(1))
            print l,    # first line seen for this address
Save that in a file, and run that file with the argument being the name
of your email list. It prints the first line it sees for each unique
address and skips the rest.
Note that it just finds the first email address on each line, and
rejects lines with no email addresses at all. You may need to modify it
for your purposes, to say the least.
Run it at your risk, of course.
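For anyone reading this on a newer Python: the script above is Python 2 (the `sets` module, the `file()` builtin, and the `print` statement were all removed in Python 3). A rough Python 3 equivalent of the same idea, as a sketch:

```python
# Python 3 sketch of the same approach: keep only the first line seen
# for each distinct email address. Uses the built-in set and open()
# in place of sets.Set() and file().
import re
import sys

# Same loose address pattern as the original script.
addrre = re.compile(r'([0-9a-z_.-]+@[0-9a-z_.-]+)', re.I)

def unique_lines(lines):
    """Yield only the first line seen for each distinct address."""
    seen = set()
    for line in lines:
        m = addrre.search(line)
        if m and m.group(1) not in seen:
            seen.add(m.group(1))
            yield line

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        for line in unique_lines(f):
            print(line, end='')
```

Like the original, this compares addresses case-sensitively, so `JSmith@abc.org` and `jsmith@abc.org` count as different; lowercasing `m.group(1)` before the lookup would fold those together.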
Regards,
Carl
On Mon, Jul 02, 2007 at 12:55:33PM -0500, Lance Simmons wrote:
> I have a text file with several thousand email addresses, many of
> which are duplicates. I've used "sort" and "uniq" to make the list
> smaller, but there are still almost a thousand lines.
>
> But I still have many duplicates. For example, three lines in the file might be
>
> jsmith@abc.org
> "John Smith" <jsmith@abc.org>
> "Mr. John Smith" <jsmith@abc.org>
>
> Obviously, I'd like to get rid of two of those lines without having to
> manually go through and decide which to keep. And I don't care about
> keeping names, I'm only interested in addresses.
>
> Also, the duplicates are not all on lines near each other, so even if
> I wanted to do it manually, it would be a huge hassle.
>
> Any suggestions?
>
> --
> Lance Simmons
>
> _______________________________________________
> http://www.ntlug.org/mailman/listinfo/discuss