[NTLUG:Discuss] Need help debugging simple script commands

Rick Matthews RedHat.Linux at verizon.net
Wed Jul 31 20:02:36 CDT 2002


This problem is driving me crazy! (Please help, because that's a short
trip from where I live!)

I've got a file of domain names (one per line) that contains duplicates.
I've been removing the duplicates with:

cat domains | sort | uniq > domains.uniq
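
(I gather the one-step equivalent is

sort -u domains > domains.uniq

in case that makes any difference.)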

That has stopped working. It stopped once before and I found some
garbage in the file. I cleaned out blank lines and trailing spaces and
tabs, and it started working again. (uniq wouldn't treat 'domain-name'
and 'domain-name<TAB>' as duplicates.) Now the cleanup is a standard part of my
routine, and it has worked fine for months, until about a week ago.
The file has been growing and is now about 7.5 MB and roughly
500,000 lines before removing duplicates.
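
In case the exact commands matter, the cleanup pass looks roughly like
this (GNU sed; the \t escape may need to be a literal tab with other
seds):

sed -e 's/[ \t]*$//' -e '/^$/d' domains > domains.clean

The first expression strips trailing spaces and tabs, and the second
deletes blank lines.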

When I say that it doesn't work, I don't mean that it abends with an
error. It takes the same amount of time before it completes, and it 
is removing some of the duplicates, but it is leaving most of them.

I copied about 5k of the file into a test file and it successfully
removed all of the duplicates. That same section of the file is not
deduped when included in the big file.
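
I've been assuming a spot check like this would expose any invisible
difference between two 'duplicate' lines (cat -A is GNU cat; it shows
tabs as ^I and marks each line end with $):

grep -n 'example.com' domains | cat -A

Here example.com just stands in for one of the surviving duplicates.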

I think the problem is one of two things:

a) Something is blowing up and I'm not looking in the right place
for the error messages.

b) The file contains some other kind of garbage besides what I am
cleaning out. (See the check sketched below.)
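
If it is (b), I would expect a check like this to flag any line that
contains something outside plain letters, digits, dots, and hyphens
(my guess at the legal set, so treat the class as an assumption):

grep -n '[^A-Za-z0-9.-]' domains | head

It should also catch stray control characters or high-bit bytes.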

Does that sound right, or does anyone have a better explanation? Suggestions?

Does anyone have a grep, sed, or perl command or two that I can use
to remove everything that is not legal in a domain name? (Not the
http: prefix, just the rest.)
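
Something along these lines is what I have in mind, though the
character class is my own guess (plain ASCII names, no
internationalized characters):

tr -cd 'A-Za-z0-9.\n-' < domains > domains.clean

or, the same thing with sed:

sed 's/[^A-Za-z0-9.-]//g' domains > domains.clean

Both delete every character that is not a letter, digit, dot, or
hyphen (tr -cd deletes the complement of the listed set; the \n keeps
the line breaks).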

Thanks for your help!

Rick




