6. DocBook to DocBook Transformations

6.1. XML and SGML DocBook

There are a few differences between DocBook XML and DocBook SGML. Handling these differences should be relatively easy for most small documents, and many authors will not need to change anything to convert their documents other than the XML and DocBook declarations at the start of the document.
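
As a rough sketch of what those opening declarations might look like, assuming an article written against DocBook 4.2 and saved as UTF-8 (the version, the encoding and the article root element are assumptions for illustration; substitute whatever your document uses):

    SGML:

    <!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.2//EN">

    XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
      "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

Note that the XML form adds the XML declaration as the first line and names the XML version of the DTD, including a system identifier (the URL) after the public identifier.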

For others, here is a list of what you should keep in mind when converting your documents from SGML to XML.

Differences between XML and SGML elements

An XML element typically has three parts: the start tag, the content (your words) and the end tag. Qualifiers are added in the start tag and are known as attributes. They will always have a name and a quoted value.

<filename class="directory">/usr/local</filename>
	

The start tag contains one attribute (class) with a value of directory. The end tag (also filename) must not contain any attributes.

  • Element names (tags) and their attributes are case-sensitive and are typically lowercase. The following will not validate because the end tag </PARA> is uppercase:

    <para>This part will fail XML validation</PARA>
    
  • All attribute values in the start tag must be quoted. You may use either single (') or double (") quotes, but not backquotes (`) or smart quotes. The quote used to start a value in a name="value" pair must be the same quote used to end it. In other words: "this" would validate, but 'that" would not.

  • Elements that have a start tag but no end tag are referred to as empty because they do not contain (wrap around) anything. These elements must still be closed with a trailing slash (/). For example, xref must be written as <xref linkend="software"/>. You may not have any spaces between the / and the >, although you may have a space after the final attribute: <xref linkend="foo" />.

  • Processing instructions are sent to the transformation engine (DSSSL or XSLT). In XML they must have a question mark at the end of the tag as well as at the start. All processing instructions are removed from the output stream. The XML version of a processing instruction looks like this:

    <?dbhtml filename="foo"?>
    
  • If you're converting from SGML to XML, be sure file names refer to .xml files instead of .sgml. Some tools may get confused if a .sgml file contains XML.

  • SGML allowed tag minimization, in which the element name is left out of the end tag. Example: <para>This is foo.</> Tag minimization is not supported in XML, and its use is discouraged in DocBook.
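
Taken together, the rules above can be illustrated with a small made-up fragment. The first version shows how it might have been written in SGML, with uppercase tags, unquoted attribute values, an unclosed empty element, a minimized end tag and a processing instruction without the closing question mark. The second version is a sketch of the same fragment after conversion to XML. The linkend and filename values are hypothetical.

    SGML:

    <?dbhtml filename="foo">
    <PARA>See <XREF LINKEND=software> for the files
    in <FILENAME CLASS=directory>/usr/local</>.</PARA>

    XML:

    <?dbhtml filename="foo"?>
    <para>See <xref linkend="software"/> for the files
    in <filename class="directory">/usr/local</filename>.</para>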

6.2. Changing DTDs

The significant changes between versions of the DTD involve changes to the elements (tags). Elements may be deprecated (which means they will be removed in a future version), removed, modified, or added. Almost all authors will run into a changed or deprecated tag when moving from a lower version of DocBook to a higher one.

DocBook: The Definitive Guide does an excellent job of showing you how elements fit together. For each element it tells you what the element must contain (its content model) and what it may be contained in (who its parents are). For example: a note must contain a para. If you try to write <note>Content in a note</note> your document will not validate. Learning how elements are assembled will make it a lot easier to understand any validation errors that are thrown at you. If you get truly stuck you can also email the LDP's docbook mailing list for extra hints. Information on subscribing is available from Section 2, “Mailing Lists”.
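
For the note example above, a version that should validate simply wraps the text in a para:

    <note>
      <para>Content in a note</para>
    </note>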

All tags that have been deprecated or changed for 4.x are listed in DocBook: The Definitive Guide, published by O'Reilly and Associates. This book is also available online from http://www.docbook.org.

6.2.1. Differences between version 3.x and 4.x

Here are a few elements that are of particular relevance to LDP authors:

  • artheader has been changed to articleinfo. Most other header elements have been renamed to info.

  • graphic has been deprecated and will be removed as of DocBook 5.x. To prepare for this, start using mediaobject. There is more information about mediaobject in Section 5, “Inserting Pictures”.

  • imagedata file formats must now be written in UPPERCASE letters. If you use lowercase or mixed-case spellings for your file formats, validation will fail.

    Valid:

    <imagedata format="EPS" fileref="foo.eps"/>
    

    Invalid:

    <imagedata format="eps" fileref="foo.eps"/>
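
The three changes listed above can be sketched side by side. The following made-up article is shown first with 3.x element names and then with their 4.x replacements, both in XML syntax. The title, file name and description text are hypothetical, and the textobject is an optional addition rather than a requirement.

    3.x element names:

    <article>
      <artheader>
        <title>Sample HOWTO</title>
      </artheader>
      <para>Some text.</para>
      <graphic fileref="foo.eps" format="EPS"/>
    </article>

    4.x element names:

    <article>
      <articleinfo>
        <title>Sample HOWTO</title>
      </articleinfo>
      <para>Some text.</para>
      <mediaobject>
        <imageobject>
          <imagedata fileref="foo.eps" format="EPS"/>
        </imageobject>
        <textobject>
          <phrase>Description of foo</phrase>
        </textobject>
      </mediaobject>
    </article>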