[NTLUG:Discuss] scratching head
Steve Baker
sjbaker1 at airmail.net
Thu Feb 27 23:17:20 CST 2003
Fred James wrote:
>
> Very close - the "..." in this case just means any string of characters
> that you want to search for, though I admit that isn't standard
> notation. For example:
> find . -type f | xargs file | grep -i text | cut -f1 -d: | xargs grep "hello"
Ah - that makes more sense...but 'grep -i text' matches against the whole
line of 'file' output, filename included, so it's still going to search
non-text files whose names contain the string 'text'...which will cause
some interesting problems in some cases.
> ...as a side note, experience has shown me that on
> certain systems you may want to include something like "grep -v proc"
> (this seems to be system dependent), especially if you start at / in
> your search, to avoid getting hung up in some endless mess. Maybe
> someone could shed some light on that?
Well, the '/proc' directory on Linux (at least) is not really a set
of files on disk somewhere - each 'file' is generated on-the-fly by
the kernel. So when you open the file and read it, those I/O requests
get routed to whichever kernel module produces that piece of status
information.
Since a number of the 'files' contain things like the complete contents
of a program's address space, 'find' and 'grep' are very likely to turn
up some "interesting" things that'll tie your program into knots.
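Rather than filtering names with "grep -v proc" afterwards, 'find' can be
told not to descend into the directory at all. A minimal sketch, assuming a
'find' that supports -path and -prune (GNU find does); 'find_skipping' is a
hypothetical helper name:

```shell
# List regular files under $1 without ever descending into directory $2.
find_skipping() {
    # -prune stops find from entering the matched directory;
    # everything else falls through to the -type f -print branch.
    find "$1" -path "$2" -prune -o -type f -print
}
```

For a search starting at the root that would be: find_skipping / /proc.
GNU find's -xdev option is another way to go, since it keeps 'find' on the
filesystem it started on.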
Reading these 'files' can also screw up some programs. In some older
versions of the kernel, you couldn't run 'more' or 'less' on the files
in /proc. That seems to have been fixed in more recent kernels...but
these are still VERY strange files. 'ls -l' says most of them are of
zero length - yet 'wc -c' does not agree!
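If you want to see that disagreement for yourself, here is a small sketch
that puts the two numbers side by side. 'size_vs_read' is a made-up name,
and the file is piped through 'cat' so that the bytes really are read
rather than taken from the reported size:

```shell
# Compare the size that 'ls' reports with the number of bytes that
# actually come back when the file is read from start to end.
size_vs_read() {
    stat_size=$(ls -ln "$1" | awk '{ print $5 }')
    read_size=$(cat "$1" | wc -c | tr -d ' ')
    echo "$1: ls says $stat_size bytes, read $read_size bytes"
}
```

On a Linux box, try it on something like /proc/cpuinfo and then on an
ordinary file.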
The problem with your script is that the 'file' program's idea of what
is 'text' is pretty broad and 'find's idea of what is a "regular file"
is also rather lax - so you are hitting a bunch of things that are patently
NOT simple text files - then stuffing a bunch of random garbage into grep.
If you get any kind of match for your test string in an essentially
binary file, you get a LOT of crap appearing on your output. Possibly
gigabytes of crap.
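One partial defence is to let 'grep' itself skip anything it considers
binary and to print only the names of matching files. A sketch, assuming a
'grep' that supports -I (GNU and BSD grep do, but it is not in POSIX) and
GNU-style find/xargs; 'search_text_only' is a hypothetical name. It relies
on grep's own guess about what is binary, so it is no more of a complete
answer than 'file' is:

```shell
# Print the names of files under $2 (default .) that contain pattern $1,
# ignoring any file that grep classifies as binary.
search_text_only() {
    # -I treats binary files as non-matching; -l prints names, not lines.
    find "${2:-.}" -type f -print0 | xargs -0 -r grep -I -l -- "$1"
}
```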
Doing this right is really a hard problem since there is a fine line
between what is 'text' and what is crap...and computers are not good
at telling the difference!
---------------------------- Steve Baker -------------------------
HomeEmail: <sjbaker1 at airmail.net> WorkEmail: <sjbaker at link.com>
HomePage : http://www.sjbaker.org
Projects : http://plib.sf.net http://tuxaqfh.sf.net
http://tuxkart.sf.net http://prettypoly.sf.net
-----BEGIN GEEK CODE BLOCK-----
GCS d-- s:+ a+ C++++$ UL+++$ P--- L++++$ E--- W+++ N o+ K? w--- !O M- V-- PS++ PE- Y-- PGP-- t+ 5 X R+++ tv b++ DI++ D G+ e++ h--(-) r+++ y++++
-----END GEEK CODE BLOCK-----