[NTLUG:Discuss] eliminating lines with the same information

Mon Jul 2 13:42:14 CDT 2007

Lance Simmons wrote:

>I have a text file with several thousand email addresses, many of
>which are duplicates. I've used "sort" and "uniq" to make the list
>smaller, but there are still almost a thousand.
>
>But I still have many duplicates.  For example, three lines in the file might be
>
>  jsmith at abc.org
>  "John Smith" <jsmith at abc.org>
>  "Mr. John Smith" <jsmith at abc.org>
>
>Obviously, I'd like to get rid of two of those lines without having to
>manually go through and decide which to keep.  And I don't care about
>keeping names, I'm only interested in addresses.
>
>Also, the duplicates are not all on lines near each other, so even if
>I wanted to do it manually, it would be a huge hassle.
>
>Any suggestions?
>
>  
>
Lance Simmons
(1) the awk script below will print every whitespace-separated field 
that contains a '@' character (as in email addresses), so the names 
around the address are dropped
(2) the sed script below will then remove the '<' and '>' characters, if any

An example command line would be ...
 ># gawk -f sandbox.awk inputfile | sed -f sandbox.sed

An example of the output from your file would be ...
jsmith at abc.org
jsmith at abc.org
jsmith at abc.org

... and then of course sort and uniq would apply very nicely ... 
something like ...
 ># gawk -f sandbox.awk inputfile | sed -f sandbox.sed | sort | uniq > outputfile
... would do it for you, I should think.

Hope this helps
Regards
Fred James

[AWK script]
# Print every whitespace-separated field that contains a '@'.
{
        for (i = 1; i <= NF; i++) {
                if (index($i, "@")) {
                        print $i
                }
        }
}

[SED script]
# Strip the angle brackets around an address.
s/[<>]//g
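
If you would rather not keep two separate script files, the same steps can probably be folded into one pipeline. This is only a sketch: the function name dedupe_addresses is made up for illustration, it assumes the real file has '@' where this archive shows " at ", and it assumes no address contains whitespace. "sort -u" stands in for "sort | uniq".

```shell
# Hypothetical helper: read lines on stdin, print each unique address once.
# Assumes every address contains '@' and has no embedded whitespace.
dedupe_addresses() {
    awk '{ for (i = 1; i <= NF; i++) if (index($i, "@")) print $i }' |
        sed 's/[<>]//g' |
        sort -u
}

# Example: three variants of one address collapse to a single line.
printf '%s\n' \
    'jsmith@abc.org' \
    '"John Smith" <jsmith@abc.org>' \
    '"Mr. John Smith" <jsmith@abc.org>' |
    dedupe_addresses
# prints: jsmith@abc.org
```

With the function defined in your shell, "dedupe_addresses < inputfile > outputfile" would be the equivalent of the command line given earlier.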