[NTLUG:Discuss] removing text: fighting tr and regex

Robert Thompson ntlug at thorshammer.org
Fri Jul 7 14:20:02 CDT 2006


> I just wish I understood why it doesn't work in bash/sed/tr.

There are a few things wrong.

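	The first is that you need the -r option to get sed to use extended regular expressions. Without it sed uses basic regular expressions, where \( and \) mark a group and bare ( and ) are literal parens. With -r the roles are swapped: \( and \) are the literal parens and bare ( and ) do the grouping.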

$ echo "jdoe at example.com (John Doe)" | sed -re s/\(.*?\)//g

$

	Okay, we're making progress. Now it deletes everything instead of just the text between parens, so sed is not seeing the parens the way we expected. We escaped the parens for the regexp, but remember that the shell sees the string before sed does. The shell interprets the backslashes and strips them, so what sed receives is s/(.*?)//g. With -r, bare parens are a group rather than literal characters, so sed matches anything and replaces it with nothing.

	The best solution is to put your regexp in single quotes, so the shell passes it to sed untouched. This is a good habit, and it is how I write all of my sed scripts.

$ echo "jdoe at example.com (John Doe)" | sed -re 's/\(.*?\)//g'
jdoe at example.com 
$ 
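
	One way to see what the shell actually hands to sed is to print the argument with printf instead of running sed. This is only a sketch: it assumes bash or another POSIX shell, and that no file in the current directory happens to glob-match the unquoted pattern. The variable names are just for illustration.

```shell
# Capture the argument as the shell passes it along: unquoted vs. single-quoted.
unquoted=$(printf '%s\n' s/\(.*?\)//g)
quoted=$(printf '%s\n' 's/\(.*?\)//g')

printf 'unquoted: %s\n' "$unquoted"   # backslashes already removed by the shell
printf 'quoted:   %s\n' "$quoted"     # backslashes survive and reach sed
```

	The first line should come out without backslashes and the second with them, which is the whole difference between the two sed runs.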

	So we have it working somewhat. It fails on your second example with two email addresses though:

$ echo "jdoe at example.com (John Doe), jane at example.com (Jane Doe)" | sed -re 's/\(.*?\)//g'
jdoe at example.com 
$ 

	The regexp does not work as expected. It is being greedy and grabbing everything from the first open paren to the last close paren, which here is the end of the input. The ? does not stop this because sed does not have Perl's non-greedy *? quantifier. To sed, ? just means '0 or 1 of the previous thing', and 0 or 1 of .* matches the same text as .* does.
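
	To see exactly what the greedy match grabs, one trick is to wrap the match in brackets instead of deleting it. A sketch, assuming GNU sed (where -E is the same switch as -r); & in the replacement stands for the matched text, and the variable names are only for illustration.

```shell
input="jdoe at example.com (John Doe), jane at example.com (Jane Doe)"

# Mark the matched text with [ ] rather than removing it.
matched=$(echo "$input" | sed -E 's/\(.*\)/[&]/')
echo "$matched"
# jdoe at example.com [(John Doe), jane at example.com (Jane Doe)]
```

	There is a single match, running from the first ( to the last ).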

	At this point, to get it to work, I would redefine what we are searching for. We are not searching for anything between parens; we are searching for words (alphanumerics) between parens. The regexp for an alphanumeric character is [A-Za-z0-9] (among others).

$ echo "jdoe at example.com (John Doe), jane at example.com (Jane Doe)" | sed -re 's/\([A-Za-z0-9]*\)//g'
jdoe at example.com (John Doe), jane at example.com (Jane Doe)
$

	Wait, that didn't work. Why not? Apparently there's more than words we are looking for. The string '(John Doe)' also has a space, which the character class does not cover. So we should search for words, spaces, and punctuation. The text between the parens may also have commas, dashes, and other punctuation. The regexp we should use is [A-Za-z0-9 ,-] (note that the dash has to come first or last inside the brackets; anywhere else it defines a range).

$ echo "jdoe at example.com (John Doe), jane at example.com (Jane Doe)" | sed -re 's/\([A-Za-z0-9 ,-]*\)//g'
jdoe at example.com , jane at example.com 
$
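
	The point about where the dash sits is easy to check on a toy string. A minimal sketch, assuming a sed that accepts -E (GNU sed treats it like -r); the variable names are made up for the example.

```shell
# Dash between two characters: a range, so a, b and c are removed.
range=$(echo "a-b-c" | sed -E 's/[a-c]//g')
# Dash last in the brackets: a literal dash, so a, c and - are removed.
literal=$(echo "a-b-c" | sed -E 's/[ac-]//g')

echo "$range"     # --
echo "$literal"   # b
```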

	And since I like it pretty, let the match also swallow the spaces in front of the open paren:

$ echo "jdoe at example.com (John Doe), jane at example.com (Jane Doe)" | sed -re 's/ +\([A-Za-z0-9 ,-]*\)//g'
jdoe at example.com, jane at example.com
$
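
	A different technique from the one above, offered as a sketch rather than a correction: instead of listing every character that may appear between the parens, match anything that is not a close paren. [^)]* cannot run past the first ), so greediness stops being a problem, and names with dots or apostrophes are covered too. The function name strip_comments is made up for this example, and -E assumes a sed that accepts it (GNU sed treats it like -r).

```shell
# Remove " (anything without a close paren)" from each line of stdin.
strip_comments() {
    sed -E 's/ +\([^)]*\)//g'
}

echo "jdoe at example.com (John Q. Doe), jane at example.com (Jane O'Doe)" | strip_comments
# jdoe at example.com, jane at example.com
```

	This does not handle parens nested inside parens; for flat input like the examples here that should not matter.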

	You get good at regexps by reading "man perlre" many many times :).

Robert Thompson


On Thu, Jul 06, 2006 at 11:31:24AM -0500, Richard Geoffrion wrote:
> Victor Brilon wrote:
> > On Jul 6, 2006, at 3:03 AM, Richard Geoffrion wrote:
> >
> >   
> >> echo "jdoe at example.com (John Doe)" | sed --expression s/\(.*?\)//g
> >>     
> >
> > You're almost there:
> > echo "jdoe at example.com (John Doe)" | sed -e s/\(.*\)//g
> >
> > To badly paraphrase "The Princess Bride": You keep using the ?  
> > qualifier, I don't think it does what you think it does :)
> >
> >   
> Actually... I think I DO understand what the "?" does.. it prevents the 
> following from happening....
> 
> $ echo "jdoe at example.com (John Doe), jane at example.com (Jane Doe)" | sed s/\(.*\)//g
> jdoe at example.com
> 
> Without the ?, the expression is GREEDY and takes more than I want..... 
> but obviously the expression WITH the "?" still doesn't work at the bash 
> prompt. :(
> 
> The PERL statement DOES work with the multiple parenths "( )" though.... 
> I just wish I understood why it doesn't work in bash/sed/tr.
> 
> -- 
> Richard


