[NTLUG:Discuss] eliminating lines with the same information

Stuart Johnston saj at thecommune.net
Mon Jul 2 13:14:44 CDT 2007


Extract just the addresses, then run the result through sort | uniq 
again.  I'd use Perl with Email::Address from CPAN.

# Print every address found on each line; lines with no address are skipped.
perl -MEmail::Address -ne 'print $_->address, "\n" for Email::Address->parse($_);' < emails | sort | uniq
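
If installing a CPAN module is not an option, a rougher alternative is 
plain grep.  This is only a sketch: the regex is a loose approximation of 
an address rather than a real parser, it relies on the -o option (present 
in GNU grep, not required by POSIX), and folding to lowercase assumes that 
case differences do not matter for your list.  The function name 
extract_addresses and the sample input are made up for illustration.

```shell
# Hypothetical helper: read lines on stdin, print each unique bare address.
extract_addresses() {
    # -o prints only the matching part of each line, one match per output line.
    grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' |
        tr '[:upper:]' '[:lower:]' |   # assumption: treat addresses case-insensitively
        sort -u                        # sort and drop duplicates in one step
}

# Sample input modeled on the example in the question.
extract_addresses <<'EOF'
jsmith@abc.org
"John Smith" <jsmith@abc.org>
"Mr. John Smith" <JSmith@abc.org>
EOF
# prints: jsmith@abc.org
```

For the real file, run "extract_addresses < emails" instead of the 
here-document.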


Lance Simmons wrote:
> I have a text file with several thousand email addresses, many of
> which are duplicates. I've used "sort" and "uniq" to make the list
> smaller, but there are still almost a thousand.
> 
> But I still have many duplicates.  For example, three lines in the file might be
> 
>   jsmith@abc.org
>   "John Smith" <jsmith@abc.org>
>   "Mr. John Smith" <jsmith@abc.org>
> 
> Obviously, I'd like to get rid of two of those lines without having to
> manually go through and decide which to keep.  And I don't care about
> keeping names, I'm only interested in addresses.
> 
> Also, the duplicates are not all on lines near each other, so even if
> I wanted to do it manually, it would be a huge hassle.
> 
> Any suggestions?
> 


