[NTLUG:Discuss] Re: good "book" format for html? -- problem from a different angle, your 3 options ...

Sun Nov 28 18:21:41 CST 2004

On Sun, 2004-11-28 at 18:55, Bryan J. Smith wrote:
> I just gotta revisit this.  I'm not trying to be critical, but DocBook
> is even _simpler_ than HTML.  You focus on the structure, not the
> combined structure/format like HTML.  And it's cake to piece together.

Let me show you the problem from a different angle.  Let's assume you
are going to use HTML.  As I mentioned, HTML does _not_ have a structure
built for anything but individual page formatting -- if you can call
that structure at all.  In fact, HTML's default tags are really
formatting, not structural ones.

As such, your parsers have to handle all sorts of complex combinations. 
I don't want to even think of what the state diagrams look like for them
-- let alone how far they will "snake out."

Okay, let's say you want to avoid that.  How?  You are going to have to
"assign" structure "assumptions" that all of your authors must now
follow.  E.g., <H1> is the book title, <H2> is the chapter heading,
etc...  What about the book subtitle?  Well, let's assume you figure all
those out so they don't conflict.

Let's assume you add all these "constraints" in the "proper HTML" so
this book format will work.  Let's assume that's fine.

The most obvious issue is that your authors might not follow them.
Okay, let's assume they will.  But even more problematic, they may be
using an HTML editor that doesn't write the markup in a way your parsers
like, especially if your constraints were written to keep your parsers
from going nuts over the endless combinations.
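
To make that concrete, here is a minimal Python sketch, assuming the
convention above (<H1> = book title, <H2> = chapter heading).  The
OutlineParser class, the outline() helper and both sample inputs are
made up for illustration; only html.parser is a real standard library
module.  The hand-written page parses fine, while markup of the kind a
WYSIWYG editor might emit yields no structure at all.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (tag, text) pairs for <h1>/<h2> headings.

    Assumed convention: h1 = book title, h2 = chapter heading.
    """

    def __init__(self):
        super().__init__()
        self.outline = []      # finished (tag, text) pairs
        self._current = None   # heading tag we are currently inside
        self._text = []        # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives as "h1".
        if tag in ("h1", "h2"):
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.outline.append((tag, "".join(self._text).strip()))
            self._current = None


def outline(html):
    """Return the heading outline found in an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


# Hypothetical inputs: same visible result, very different markup.
hand_written = "<H1>My Book</H1><H2>Chapter One</H2><p>Text.</p>"
editor_output = (
    '<p><font size="6"><b>My Book</b></font></p>'
    "<p><b>Chapter One</b></p><p>Text.</p>"
)

print(outline(hand_written))
print(outline(editor_output))
```

The second call finds nothing, because the editor expressed the
"headings" purely as formatting.  Handling every such variation is
where the parser starts to snake out.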

Again, I wouldn't even want to start looking at the state diagrams of
how such a parser would handle HTML.  This is added overhead that would
personally drive me nuts.

Which brings us back to something like DocBook.  HTML was originally
defined as an application of SGML (the ISO standard that grew out of
IBM's old GML), and in practice focuses largely on format.  DocBook is
also an SGML application, but completely separates content from style.

Ultimately, when it comes to writing and submission of portions of a
document, content is king.  We want the pieces from different authors to
fit together.  Ideally we'd like a language where there are the fewest
assumptions, and a small, tight and common language.  That's DocBook!
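
As a rough illustration of "pieces from different authors fitting
together", here is a small Python sketch.  The <book>, <chapter>,
<title> and <para> elements are real DocBook elements, but the chapter
text and the assemble() and chapter_titles() helpers are invented for
this example, and the DOCTYPE declaration is left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Two chapters, as they might arrive from two different authors.
chapter_a = """<chapter><title>Getting Started</title>
<para>Written by author A.</para></chapter>"""

chapter_b = """<chapter><title>Going Further</title>
<para>Written by author B.</para></chapter>"""


def assemble(book_title, chapter_sources):
    """Build a <book> element from separately written chapters."""
    book = ET.Element("book")
    ET.SubElement(book, "title").text = book_title
    for source in chapter_sources:
        book.append(ET.fromstring(source))
    return book


def chapter_titles(book):
    """List chapter titles; no guessing about what a heading means."""
    return [ch.findtext("title") for ch in book.findall("chapter")]


book = assemble("Example Book", [chapter_a, chapter_b])
print(chapter_titles(book))
```

Because the markup only says what each piece is, the code never has to
care how either author (or their editor) wanted it to look.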

DocBook originally started out as SGML.  But with the advent of XML for
creating standards (remember, XML is not a standard, but a standard for
creating standards ;-), DocBook is now typically written in its
OASIS-standardized XML form.  Of course, conversion between the two is
straightforward with existing tools, but XML is the way to go since it
makes life easier for parsers versus SGML (and especially versus
something like old TeX-based typesetting ;-).

I guess what I'm saying is that you have 3 choices (at least from my
angle, I could be wrong).

1.  Use HTML -- constraints required, parser hell

HTML (in its XML form, XHTML) is a simple XML instance designed for
basic web publication -- formatting only, very minimal style.  It is the
worst for storing content in a consistently editable form.

2.  Use another XML instance -- DocBook, OpenOffice XML, etc...

Leverage the existing base of XML languages that focus strictly on
content and structure, have a wealth of stylesheets and, best of all,
can be converted to/from other popular XML instances (as well as TeX),
and to just about any major "publication" format (HTML, PDF, RTF,
etc...).  The OpenOffice XML programs also leverage MathML for
equations, so just following their lead would be a good move if
equations are king in your needs.

Anything XML-based is ideal, although DocBook is the most
straightforward and lets you focus 100% on content from the parser
aspect.  Anything TeX-based is probably the least ideal, since writing
parsers for TeX is not fun -- although it's nice to know that if you can
get to SGML, DocBook, OpenOffice XML or similar, you can convert to at
least LaTeX if need be.  That matters because TeX is old, and there is a
lot out there for it (including lots of feature-rich parsers, like for
PDF publication).

3.  Create your own XML instance -- Start with HTML, then perfect it 

If you're like most people, you hate the thought of using anything "not
invented here" (NIH).  As such, understand there is _nothing_ stopping
you from taking
HTML and changing it into something you like!  Give it structure.  Give
it format.  Have fun!  Build it around your parser ideals and run with
it!

And the best way to do that is to follow strict guidelines on creating
any new markup language -- hence XML.  Start with XHTML (HTML's XML
instance), and "cut the fat" you don't want.  Then add in your own tags,
structure, etc...  Now create parsers to take your "modified HTML" and
turn it into "pure HTML" that can be rendered in a browser.

And you're there!  If others like it, damn, you might just invent
something the world likes!  Fame, babes, fortune, etc... (okay, maybe
only the first one).  And if not, you have something for yourself and
others -- something that _can_ be converted to/from other XML
standards with the development of other parsers.
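
For option 3, the "modified HTML" to "pure HTML" step could look
something like the Python sketch below.  Everything here is
hypothetical: the tag names (booktitle, subtitle, chaptertitle, etc.),
the TAG_MAP table and the to_html() function are invented for
illustration, not taken from any existing format.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from invented structural tags to plain HTML tags.
TAG_MAP = {
    "book": "body",
    "booktitle": "h1",
    "subtitle": "h2",
    "chapter": "div",
    "chaptertitle": "h3",
    "para": "p",
}


def to_html(source):
    """Turn a document in the invented markup into plain HTML text."""
    root = ET.fromstring(source)
    for elem in root.iter():
        if elem.tag not in TAG_MAP:
            # Strict by design: unknown tags are an error, not a guess.
            raise ValueError(f"unknown tag: {elem.tag}")
        elem.tag = TAG_MAP[elem.tag]
    html = ET.Element("html")
    html.append(root)
    return ET.tostring(html, encoding="unicode")


source = (
    "<book><booktitle>Example Book</booktitle>"
    "<subtitle>A Sketch</subtitle>"
    "<chapter><chaptertitle>One</chaptertitle>"
    "<para>Hello.</para></chapter></book>"
)

print(to_html(source))
```

Note that the subtitle question from earlier goes away: the source
markup has its own subtitle tag, and only the output step decides which
HTML heading it becomes.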

-- 
Bryan J. Smith                                    b.j.smith at ieee.org 
-------------------------------------------------------------------- 
Subtotal Cost of Ownership (SCO) for Windows being less than Linux
Total Cost of Ownership (TCO) assumes experts for the former, costly
retraining for the latter, omitted "software assurance" costs in 
compatible desktop OS/apps for the former, no free/legacy reuse for
latter, and no basic security, patch or downtime comparison at all.