[NTLUG:Discuss] bash question

Wed Sep 12 18:00:59 CDT 2001

Lo, on Wednesday, September 12, Wrenn, Bobby J. did write:

> If I can get an answer to this I will finally be able to use Linux at
> work.
> 
> I need to take 209 pdf files with spaces in the file names and convert
> them into text.  I am very new to scripting and know nothing about regular
> expressions.

While the available documentation on regular expressions tends to be pretty
opaque, I'd highly recommend taking the time to read up on them and figure
out how they work.  They show up in lots of different contexts and are very
useful.  I think you will find that it's time well spent.

> Is there an easy way to remove the spaces from the file names?  Then how
> do I recursively submit the files to pdftotext with the same name except
> for the .pdf changed to .txt?

Well, I'm sure all of the solutions that have been posted are quite nice,
but they're also *way* overcomplicated.  tr?  sed?  awk?  Oy!  You can do
this all in the shell, except of course for the pdftotext bit.

First, as many people have suggested, you don't necessarily have to get rid
of the spaces in your filenames; you can either surround the entire
filename with quotes or backslash each of the space characters.  If,
however, you think that this is a real headache, you can get rid of the
spaces pretty easily using a bash parameter expansion goodie:

(This assumes that you want to process all of the .pdf files in the current
directory.)

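for file in *.pdf ; do
    mv "$file" "${file// /-}"
    # quotes around the first name are necessary to handle spaces correctly;
    # quoting the second is just good habit, in case the name contains other
    # characters the shell treats specially
    # inside the curly braces, that's f i l e slash slash space slash hyphen
done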

In English, this means:
    For each file in the current directory which matches *.pdf:
        set $file to the filename
        mv $file to $file-with-all-spaces-replaced-by-hyphens
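If you want to see what the expansion does before letting it loose on real files, here is a small sketch you can paste into bash.  The helper name hyphenate and the sample filename are made up for illustration; nothing here touches the disk.

```shell
# hyphenate: hypothetical helper; prints its argument with every space
# replaced by a hyphen, using only bash parameter expansion.
hyphenate() {
    local name=$1
    printf '%s\n' "${name// /-}"
}

file="pump station 12.pdf"      # made-up filename
hyphenate "$file"               # double slash: every space is replaced
printf '%s\n' "${file/ /-}"     # single slash: only the first space is replaced
```

The double slash is what makes the replacement global; with a single slash, bash stops after the first match.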

To run them through pdftotext, the following will work nicely (even if
you've still got spaces in your filenames):

for file in *.pdf ; do
    pdftotext "$file" - > "${file/%.pdf/.txt}"
    # The "-" makes pdftotext write to stdout instead of its default output
    # file.  I don't have it installed, so check the manpage for your version.
done

In English:
    for each file in the current directory which matches *.pdf:
        set $file to the filename
        run pdftotext on $file (quoted, so any spaces in the filename survive),
            redirecting output to $file-with-a-final-.pdf-replaced-with-.txt
            (again quoted).
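A variation on the same idea, as a rough sketch: pdftotext accepts the output file as an optional second argument, so the redirect isn't strictly needed.  The function names txt_name and convert_all are my own inventions, and I've used the ${var%suffix} form of the expansion, which strips a trailing .pdf before .txt is appended.

```shell
# txt_name: hypothetical helper; prints the output name for a PDF by
# stripping a trailing .pdf and appending .txt.
txt_name() {
    local pdf=$1
    printf '%s\n' "${pdf%.pdf}.txt"
}

# convert_all: hypothetical wrapper around the loop; converts every .pdf
# in the current directory.  Nothing runs until you call it.
convert_all() {
    local file
    for file in *.pdf ; do
        # if the glob matched nothing, $file is the literal "*.pdf"; skip it
        [ -e "$file" ] || continue
        # second argument is the text file pdftotext should write
        pdftotext "$file" "$(txt_name "$file")"
    done
}
```

Run convert_all from the directory holding the PDFs.  Note one difference from ${file/%.pdf/.txt}: if the name doesn't end in .pdf, this version still tacks .txt on the end rather than leaving the name alone.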

See the `Parameter Expansion' section of the bash man page---and in fact
the bash man page in general---for more information.  I think there's also
an O'Reilly book out on the various shells that you may want to look into.

> Just getting that much done will be a big help. The next step may be
> trickier. I need to extract a name, address, and equipment list from each of
> the files and get it into some kind of database where I can query for total
> by item or item by location.

This is almost certainly possible, but as another poster said, it depends
heavily on the format of the text files.  He suggested awk; I'd go with
Perl, but that's primarily because I know it better.  You can most likely
do it with either.
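Just to give the flavor of the awk route, here's a very rough sketch.  It ASSUMES each text file has lines that look like "Name: ..." and "Address: ..."; I have no idea whether yours do, so treat the labels, the function name extract_fields, and the tab-separated output as placeholders.  It ignores the equipment list entirely.

```shell
# extract_fields: hypothetical helper.  ASSUMES the file contains lines of
# the form "Name: ..." and "Address: ..."; real pdftotext output may look
# nothing like this.  Prints name and address separated by a tab.
extract_fields() {
    awk '
        /^Name:/    { sub(/^Name: */, "");    name = $0 }
        /^Address:/ { sub(/^Address: */, ""); addr = $0 }
        END         { printf "%s\t%s\n", name, addr }
    ' "$1"
}
```

Something like extract_fields "some file.txt" would then give you one line per file, which most databases and spreadsheets can import.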

(And, to those of you who recall my rants several months back about why I
don't like Perl, no, I still don't like Perl.  <grin>  This is, however,
one spot where it's most likely the best tool for the job.)

Back to the point: Bobby, while my code above will do what you need, you
will get a lot more out of this in the long run if you sit down with the
bash man page or the O'Reilly book and figure out exactly why and how it
works.  I'd highly recommend investing the time and effort; it will pay off
big time down the road.  You may also want to do the same for the other
posters' suggestions.

Richard