[NTLUG:Discuss] Re: good "book" format for html? -- DocBook is more simple, more universal

Mon Nov 29 01:21:02 CST 2004

On Mon, 2004-11-29 at 01:23, Kevin Brannen wrote:
> Used to be that way; IIRC, you had to have at least 4 different 
> packages, plus all dependencies by said packages, to run docbook.  I 
> think it's down to 3 now.  Still too much effort for me. :-)

To each his own.  I've just never had a problem in 5+ years.

> Ah, I think we just found the source of the misunderstanding.  
> [...re-reading original post...]  Nope, I never mentioned the word 
> "parser" or anything close to it. (Please take that as a correction. :-)
> I have multiple inputs:  ASCII text, HTML, and PDF; I'm the author of 
> some, for most I'm not.  I store them all for future usage.

How do you plan on getting PDF into HTML?
There is a converter, although it's not foolproof.

> I transport them to various devices (different OS's) for reading.
> The goal is to future-proof them, and make them convenient to transport.
> To me, HTML is the most guaranteed format for that which still has
> markup/formatting capabilities.

I'm still scratching my head here.  Are you saying you want to build
an HTML page with links to the source documents?  Now that's totally
different.  HTML is what you want for that.

> Plus it's easy to grep for text in HTML if I'm searching for
> something. :-)

Again, are you saying you simply want to build an HTML page with links
to the source documents?  Again, if that's what you want, then you _do_
want HTML.

> Plus my PDA program wants HTML as input.  (PDF comes close to
> fulfilling my needs for future-proofing,

Yes, HTML and PDF are published formats.  For _viewing_, they are
_perfect_.  I thought you meant editing.

> but I'm bothered by the "closed nature" or proprietariness of it. 

Some Postscript/PDF is encumbered by _some_ IP.
But it is _very_open_ from a "standards" viewpoint.

Remember, the concept of "proprietary" isn't bad.
It simply means the IP owner holds it in value.

In fact, the reason why MS Office is so horrific is because it _lacks_
proprietary standards.  If it was proprietary, then it would be backward
compatible.

That's the difference between what I call "Commerceware" (closed
standard, closed source) and "Hostageware" (unmaintainable standard
and/or unmaintainable source).

In fact, Postscript/PDF is really more of either "Standardware" (open
standard, closed source) or "Sourceware" (closed standard via IP, but
open source implementation).

> Alas, PDF displays very poorly on my PDA too, while HTML comes out
> very nice.)

Of course.  PDF is built for 600dpi publication, although it's tolerable
at 75-100dpi on-screen.  Your PDA is only a subset size though, so you
have to zoom.

> Now to be completely honest, when I'm transforming PDF into HTML, I do 
> run the output of "pdftohtml" thru a perl script I wrote to "clean up" 
> the output, but that's to simplify the HTML and to find where P tags 
> should be inserted and to do so.

But how does it come out?

Converting from HTML or PDF which are "publication" formats absolutely
stinks, because there is absolutely _no_ structure to those formats.  So
you're stuck, unless you can get the original author to give you the
original.

Otherwise, that's not something you can help, because you cannot tell
the writer to give you something else.  That's a constraint I didn't
catch.

In reality, you just want HTML to link to more HTML that is already
written.  Again, that's 100% different.

Now if you are writing something _new_, I'd recommend you consider
looking elsewhere, or create your own.  Because you want to use a
long-term, _editable_ form that can be converted to _any_ format that
may be preferred.

> But that has nothing to do with my original question. :-)
> To restate my original question a bit differently:  Are there any 
> (preferably open) *standard* formats and/or programs that can 
> create/utilize HTML files with their supporting data files (e.g. images) 
> embedded in them?

For _new_ documentation, correct?

You're basically talking _any_ of the major XML standards:  DocBook,
OpenOffice XML, etc..., along with the "mother of all open typeset," TeX
(and its most popular macro, LaTeX).

TeX is not parser friendly, and more of a legacy option.  I've been
using LyX for over 5 years, so I'm kinda used to it.  But I can convert
to just about anything else, because LaTeX conversion is priority #1 for
any new language.

Which leaves us with things like DocBook, OpenOffice XML, etc... 
OpenOffice XML is huge and overkill.  I'm sure people will come up with
converters from Adobe-Distilled MS Word documents to OpenOffice XML in the
future, but it's really overkill.

Which leaves us with DocBook.  If you write anything _new_ or have
others write anything _new_, then DocBook is straight-forward.  And it's
really the staple of simple, open documentation.
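
To give a feel for how little markup that involves, here is a minimal
sketch in Python that emits a bare-bones DocBook book.  The function
name make_docbook and the sample titles are made up for illustration,
and the sketch leaves out the DOCTYPE declaration a validating
toolchain would normally expect.

```python
import xml.etree.ElementTree as ET


def make_docbook(book_title, chapters):
    """Build a minimal DocBook <book>.

    chapters is a list of (chapter_title, [paragraph, ...]) pairs.
    make_docbook is a hypothetical helper, not part of any DocBook tool.
    """
    book = ET.Element("book")
    ET.SubElement(book, "title").text = book_title
    for chapter_title, paragraphs in chapters:
        chapter = ET.SubElement(book, "chapter")
        ET.SubElement(chapter, "title").text = chapter_title
        for text in paragraphs:
            ET.SubElement(chapter, "para").text = text
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(book, encoding="unicode")


if __name__ == "__main__":
    print(make_docbook("My Notes", [("Intro", ["First paragraph."])]))
```

The point is that the source only says "book, chapter, title, para" and
nothing about fonts or layout, which is what lets it be converted later.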

But if you're going to be integrating massive amounts of text _already_
in a "publication" format, and not a "structured" documentation
language, then that doesn't matter.  You probably just want to "glue"
them together with HTML.  That _does_ make sense.

> Then I can uphold my standard of 1 book 1 file,

So you really just want 1, massive file you can search through, and you
don't care about structure or the ability to reformat?

Then HTML is your baby.
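
If "1 book 1 file" has to include the images, one possible approach is
to inline them as base64 "data:" URIs (RFC 2397).  The Python below is
only a sketch under assumptions: inline_images is a made-up helper, it
only handles double-quoted src attributes, and whether your browser or
PDA reader actually renders data: URIs is something you would have to
check yourself.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="..."> with a double-quoted src (a simplification).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace local <img src="..."> references with base64 data: URIs."""

    def replace(match):
        src = match.group(2)
        path = Path(base_dir) / src
        # Leave remote, already-inlined, or missing images untouched.
        if src.startswith(("http:", "https:", "data:")) or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

Keep in mind that base64 grows each image by roughly a third, so the
single file gets bigger than the HTML plus images did separately.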

> yet still have it all in a standard easy to create & use format that
> will be useful for a long time to come

I would really deter you from _creating_ anything in HTML.  HTML is
unmaintainable.  I would have 2 files.

1.  HTML "dump" of non-original stuff

A single file "dump" of everything you have pieced together from
"published" formats (HTML, PDF converted to HTML, etc...).  This could
be basic, no consistency, just "raw" format and sectioned.  As you
mentioned, you would search by keyword, etc...

2.  Standard Documentation Language of all _original_ stuff

This would be stuff that you can control -- you either wrote it, or you
were given the "raw content" without the crap.  You could link to the
other file of "non-original" stuff that is generally disorganized.

Over time, as you find crucial pieces that you want to "reformat" into
your standard documentation language, then you could re-write it/quote
it from #1 to #2.  But that would be at your leisure.  You'd still have
the references to #1 in #2 as necessary.
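
For the keyword search over the #1 "dump", plain grep works until a
phrase is split by inline tags.  A rough Python sketch that strips the
markup first is below; TextExtractor and search_html are names I made
up, and it does not try to skip script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def search_html(html, keyword):
    """Return the text lines that contain keyword (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = "".join(parser.chunks)
    needle = keyword.lower()
    return [line.strip() for line in text.splitlines()
            if needle in line.lower()]


if __name__ == "__main__":
    page = "<h1>Intro</h1>\n<p>DocBook is <b>simple</b>.</p>"
    print(search_html(page, "is simple"))
```

Here "is simple" is found even though a <b> tag sits in the middle of
the phrase in the raw HTML, which a grep on the file would miss.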

> (and when it stops being a standard, be easy enough to update to the "new
> long term thing").

The idea is that "publication" standards might change, or the end-user
applications that read them may.  Documentation markups don't.  They are
not only eternal, but they can be converted _to_ any other, contemporary
"publication" standard.

The IEEE has stuff written over 20 years ago that is quickly converted
into _native_ PDF format -- not scanned, not "bitmapped," but _native_
vector _with_ TOC, indexes, inter-document references, etc... with
near-0 re-writing effort.  How?  TeX.  A standard documentation language
allows parsers to be written decades down the road.

Especially when it comes to stuff like equations.  It's the main reason
why the American Mathematical Society (AMS) and the IEEE started using
Knuth's TeX over 2 decades ago.  Today, almost anything in the IEEE's
longstanding LaTeX templates can be converted to OpenOffice XML, which
means it can be converted to MS Word fairly well.

> [XHTML is probably going to be the next long term thing, but as it's
> mostly HTML4.0 compatible, I'm not too worried right now.]

DocBook, OpenOffice XML and LaTeX can be converted into _your_choice_ of
HTML version, including with CSS, XHTML, etc..., even IBM SGML.

Again, it's your choice.

For non-original documents you don't have any control over, definitely
go the HTML route.  Piece them together, add a few <A NAME> tags as well
as <A HREF> links and search as necessary.
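
As a rough illustration of that "glue" step, the Python sketch below
joins already-converted HTML fragments into one page, with a named
anchor per document and a list of links at the top.  The helper names
slugify and glue are hypothetical, and the fragments are assumed to be
body content only (no <html> or <body> wrappers of their own).

```python
import html
import re


def slugify(title):
    """Turn a title into a simple anchor name: lowercase, dashes."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def glue(documents):
    """Join (title, html_fragment) pairs into a single HTML page."""
    toc, body = [], []
    for title, fragment in documents:
        name = slugify(title)
        safe = html.escape(title)
        # Link in the table of contents, matching named anchor in the body.
        toc.append(f'<li><a href="#{name}">{safe}</a></li>')
        body.append(f'<h1><a name="{name}">{safe}</a></h1>\n{fragment}')
    return ("<html><body>\n<ul>\n" + "\n".join(toc) + "\n</ul>\n"
            + "\n".join(body) + "\n</body></html>")


if __name__ == "__main__":
    print(glue([("My Book", "<p>text</p>"), ("Other Notes", "<p>more</p>")]))
```

Two documents with the same title would collide on the anchor name, so
a real version would need to make the names unique.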

But for anything new, consider a standard documentation language. 
DocBook is straight-forward.  And it lets you convert into any published
typeset, be it HTML, PDF, etc... as you wish.

-- 
Bryan J. Smith                                    b.j.smith at ieee.org 
-------------------------------------------------------------------- 
Subtotal Cost of Ownership (SCO) for Windows being less than Linux
Total Cost of Ownership (TCO) assumes experts for the former, costly
retraining for the latter, omitted "software assurance" costs in 
compatible desktop OS/apps for the former, no free/legacy reuse for
latter, and no basic security, patch or downtime comparison at all.