[NTLUG:Discuss] bash question
Fred James
fredjame at concentric.net
Wed Sep 12 22:11:06 CDT 2001
A tip of the hat, and a "from the heart" thank you.
This thread brought forth such cool stuff - I may smile for a week.
Richard Cobbe wrote:
> Lo, on Wednesday, September 12, Wrenn, Bobby J. did write:
>
>
>>If I can get an answer to this I will finally be able to use Linux at
>>work.
>>
>>I need to take 209 pdf files with spaces in the file names and convert
>>them into text. I am very new to scripting and know nothing about regular
>>expressions.
>>
>
> While the available documentation on regular expressions tends to be pretty
> opaque, I'd highly recommend taking the time to read up on them and figure
> out how they work. They show up in lots of different contexts and are very
> useful. I think you will find that it's time well spent.
>
>
>>Is there an easy way to remove the spaces from the file names? Then how
>>do I recursively submit the files to pdftotext with the same name except
>>for the .pdf changed to .txt?
>>
>
> Well, I'm sure all of the solutions that have been posted are quite nice,
> but they're also *way* overcomplicated. tr? sed? awk? Oy! You can do
> this all in the shell, except of course for the pdftotext bit.
>
> First, as many people have suggested, you don't necessarily have to get rid
> of the spaces in your filenames; you can either surround the entire
> filename with quotes or backslash each of the space characters. If,
> however, you think that this is a real headache, you can get rid of the
> spaces pretty easily using a bash parameter expansion goodie:
>
> (This assumes that you want to process all of the .pdf files in the current
> directory.)
>
> for file in *.pdf ; do
>     mv "$file" "${file// /-}"
>     # quotes around the first file are necessary to handle spaces correctly;
>     # quoting the second guards against any other special characters
>     # inside the curly braces, that's f i l e slash slash space slash hyphen
> done
>
> In English, this means:
> For each file in the current directory which matches *.pdf:
> set $file to the filename
> mv $file to $file-with-all-spaces-replaced-by-hyphens
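
As a rough illustration of the expansion described above, here is a small sketch. The sample filename and the function name rename_spaces are made up for the example, and it assumes bash, since ${var//pattern/replacement} is not plain POSIX sh.

```shell
#!/usr/bin/env bash
# Sample filename, invented for illustration.
file="Site Survey 12 Main St.pdf"

# Double slash: replace every match of the pattern (here, a space).
all_replaced="${file// /-}"
# Single slash: replace only the first match.
first_replaced="${file/ /-}"

printf '%s\n' "$all_replaced"    # Site-Survey-12-Main-St.pdf
printf '%s\n' "$first_replaced"  # Site-Survey 12 Main St.pdf

# rename_spaces DIR (hypothetical helper): apply the same expansion to
# every .pdf in DIR, skipping names that contain no spaces so that mv is
# never asked to move a file onto itself.
rename_spaces() {
    local dir="$1" file base new
    for file in "$dir"/*.pdf; do
        [ -e "$file" ] || continue      # no matches: the pattern stayed literal
        base="${file##*/}"              # strip the directory part
        new="${base// /-}"
        [ "$new" = "$base" ] && continue
        mv -- "$file" "$dir/$new"
    done
}
```

Calling rename_spaces . would process the current directory, which is what the loop above does.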
>
> To run them through pdftotext, the following will work nicely (even if
> you've still got spaces in your filenames):
>
> for file in *.pdf ; do
>     pdftotext "$file" "${file/%.pdf/.txt}"
>     # pdftotext takes the output file as an optional second argument; it
>     # only writes the text to stdout when that argument is "-".
> done
>
> In English:
> for each file in the current directory which matches *.pdf:
>     set $file to the filename
>     run pdftotext on $file (quoting it to protect any spaces in the
>     filename), writing the text to
>     $file-with-a-final-.pdf-replaced-with-.txt
>     (again quoting to protect any spaces in the filename).
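
To make the suffix replacement concrete, here is a small sketch of how the output name is derived. The filenames are invented for the example, and bash is assumed for the ${var/%pattern/replacement} form.

```shell
#!/usr/bin/env bash
# Sample filename, invented for illustration.
file="Pump Station 4.pdf"

# The % anchors the pattern to the end of the value, so only a
# trailing ".pdf" is replaced.
anchored="${file/%.pdf/.txt}"

# An equivalent that also works in plain POSIX sh: strip the suffix,
# then append the new one.
stripped="${file%.pdf}.txt"

# A ".pdf" in the middle of the name is left alone.
tricky="my.pdf.notes.pdf"
tricky_txt="${tricky/%.pdf/.txt}"

printf '%s\n' "$anchored"     # Pump Station 4.txt
printf '%s\n' "$stripped"     # Pump Station 4.txt
printf '%s\n' "$tricky_txt"   # my.pdf.notes.txt
```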
>
> See the `Parameter Expansion' section of the bash man page---and in fact
> the bash man page in general---for more information. I think there's also
> an O'Reilly book out on the various shells that you may want to look into.
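
The loops above only cover the current directory, while the original question asked about doing this recursively. One possible sketch for walking subdirectories follows. The function name convert_tree and its optional second parameter are hypothetical, and it assumes bash plus a find that supports -print0 (GNU and BSD find both do). It also assumes the converter accepts "input output" as its two arguments, as pdftotext does.

```shell
#!/usr/bin/env bash
# convert_tree DIR [CONVERTER]
# Hypothetical helper: run CONVERTER (default: pdftotext) on every .pdf
# under DIR, writing the text next to each PDF with a .txt suffix.
convert_tree() {
    local dir="$1" converter="${2:-pdftotext}" pdf
    # -print0 and read -d '' pass names NUL-separated, so spaces and
    # even newlines in filenames survive intact.
    find "$dir" -type f -name '*.pdf' -print0 |
        while IFS= read -r -d '' pdf; do
            "$converter" "$pdf" "${pdf%.pdf}.txt"
        done
}
```

Running convert_tree . would convert everything under the current directory. The second argument is only there so the loop can be tried out with a stand-in command before pointing it at real files.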
>
>
>>Just getting that much done will be a big help. The next step may be
>>trickier. I need to extract a name, address, and equipment list from each of
>>the files and get it into some kind of database where I can query for total
>>by item or item by location.
>>
>
> This is almost certainly possible, but as another poster said, it depends
> heavily on the format of the text files. He suggested awk; I'd go with
> Perl, but that's primarily because I know it better. You can most likely
> do it with either.
>
> (And, to those of you who recall my rants several months back about why I
> don't like Perl, no, I still don't like Perl. <grin> This is, however,
> one spot where it's most likely the best tool for the job.)
>
> Back to the point: Bobby, while my code above will do what you need, you
> will get a lot more out of this in the long run if you sit down with the
> bash man page or the O'Reilly book and figure out exactly why and how it
> works. I'd highly recommend investing the time and effort; it will pay off
> bigtime down the road. You may also want to do the same for the other
> posters' suggestions.
>
> Richard
--
...make every program a filter...