[NTLUG:Discuss] convert HTML to SGML

Kendall Clark kclark at ntlug.org
Wed Aug 4 14:21:04 CDT 1999


>>>>> "Mark" == Mark Bickel <exumb at exu.ericsson.se> writes:

    Mark> Yes, of course HTML is a subset of SGML.  

Sorry to have made a big deal of this, but not everyone seems aware of 
this fact, so I was just making sure.

    Mark>  I have a bunch of
    Mark> M$ Word docs that need converting into SGML. The Word docs
    Mark> all have a corporate standard header and footer. There
    Mark> exist corporate-standard SGML DTDs that incorporate
    Mark> equivalent headers, footers, page layout, etc. 

I'm not sure I follow this, i.e., how does an SGML DTD incorporate
page layout? Ah, never mind, probably doesn't matter that I follow
it. :>

    Mark> So I can
    Mark> export Word -> HTML.  Matching/replacing tags can be
    Mark> accomplished using roll-your-own scripts to hammer the HTML
    Mark> into SGML whose output looks close to (or better than) the
    Mark> original M$Word docs. I would prefer a more "out of the box"
    Mark> solution that would streamline the conversion process, as
    Mark> there are thousands of pages that need this conversion.

Yeah, which SGML DTD you're targeting is what I was trying to get you
to tell us. :>

Is this something internal to Ericsson? I think the Erlang system has
some SGML tools, but I may be misremembering.

If this is an internal DTD, then there isn't likely to be much of an
'out of the box' solution that doesn't already exist inside Ericsson.

If this is DocBook, or a modification thereof, or some other
industry-standard DTD, there very well may be a tool that fits well.

At any rate, OmniMark is a good thing to use for DTD-to-DTD
conversions. It's a commercial language, but one designed specifically
for markup-language work. There's a crippleware version that's rather
capable.

You could also look at using DSSSL, but the learning curve is rather
steep; it's probably better to write ad hoc Perl scripts than to pile
into DSSSL.
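
For what it's worth, here is a rough sketch of the ad hoc script
approach. It is in Python rather than Perl, but the idea is the same.
The element names in TAG_MAP are made up for illustration, since I
don't know the target DTD; tags with no mapping are dropped and their
text is kept.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to element names in the target DTD.
# These names are assumptions; substitute the ones your DTD defines.
TAG_MAP = {"h1": "title", "p": "para", "b": "emph"}


class TagRenamer(HTMLParser):
    """Rewrite mapped HTML tags; drop unmapped tags but keep their text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        name = TAG_MAP.get(tag)
        if name:
            self.out.append("<%s>" % name)

    def handle_endtag(self, tag):
        name = TAG_MAP.get(tag)
        if name:
            self.out.append("</%s>" % name)

    def handle_data(self, data):
        # Character references were decoded on input, so re-escape on output.
        self.out.append(escape(data, quote=False))


def convert(html_text):
    parser = TagRenamer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Something like convert(open("doc.html").read()) would then give you a
first cut to validate against the DTD. Word's HTML export is messy, so
expect to grow the mapping as you go.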

Last, there is, iirc, a rules-based DTD-to-DTD tool for *XML* DTDs at
IBM's alphaWorks site. This is really, imo, what you want to use,
i.e., a rules-based DTD-to-DTD transformation tool. The one from IBM
can probably either be adapted to work on SGML or, if your SGML is of
a certain kind (or can be transformed into a certain kind), then your
SGML will just *be* XML and you can use this tool straight away. I
think it's written in Java; if not, C++.
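
To make "rules-based" a little more concrete, here is a minimal sketch
in Python. This is not the IBM tool, whose interface I don't remember;
it only shows the general shape: a table of rules applied over the
parsed tree. It assumes the input is already well-formed XML, and the
element names in RULES are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: source element name -> target element name.
# Elements without a rule keep their original name.
RULES = {
    "document": "report",
    "heading": "title",
    "paragraph": "para",
}


def transform(elem, rules):
    """Return a copy of elem with element names rewritten per rules."""
    new = ET.Element(rules.get(elem.tag, elem.tag), dict(elem.attrib))
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(transform(child, rules))
    return new


def convert_xml(source, rules=RULES):
    """Parse an XML string, apply the rules, and serialize the result."""
    root = ET.fromstring(source)
    return ET.tostring(transform(root, rules), encoding="unicode")
```

The point of keeping the rules in a table is that adding a mapping
means adding a line of data, not writing more code. A real tool would
also need rules for restructuring, not just renaming.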

    Mark> The goal is worthy: elimination of Word as a document format
    Mark> for an entire library of documents that must be maintained
    Mark> and updated on a regular basis, and having one standard -
    Mark> SGML.

Absolutely. Elimination of Word as a document format is an unambiguous 
social good. Elimination of Word altogether would be even better.

Best,
Kendall
--
The doll's trying to kill me, and the toaster's been laughing at me.

		-- Homer Simpson
		   Treehouse of Horror III



