HTML, XML, XSLT, DOM and how they work

Background

HTML has had a colourful history -- starting out as a simple hypertext markup language, then being extended by different browsers, and later more or less standardized through an ongoing, iterative process.

In today's world, HTML is on the verge of yet another change. A hypertext document is no longer just HTML. HTML has been enhanced, or even superseded, by the following:

Cascading Style Sheets (CSS)
Document Object Model (DOM)

Basically, an HTML document is actually a DOM tree, with various CSS properties attached to each node of the tree, describing how it should be rendered.
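To make the "document is a tree" idea concrete, here is a minimal sketch that parses a small document and prints its node hierarchy. It uses Python's standard xml.dom.minidom rather than a browser, so the markup has to be well-formed XML; the sample document and the outline helper are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample markup; it must be well-formed because minidom is an XML parser.
SOURCE = (
    "<html><head><title>Hi</title></head>"
    "<body><h1>Hi</h1><p>Text</p></body></html>"
)

def outline(node, depth=0):
    """Return one indented line per element or non-empty text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

document = minidom.parseString(SOURCE)
tree_lines = outline(document.documentElement)
print("\n".join(tree_lines))
```

Each level of indentation in the output corresponds to one level of nesting in the tree, which is roughly what a browser's DOM inspector shows.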

DOM

Here's how this document is laid out in Mozilla Firefox's DOM browser:

http://bisqwit.iki.fi/jutut/kuvat/xmlhtml/domthumb.png

DOM trees can be created with various means; HTML is not the only way. Arguably not even the best way.

JavaScript

You can create a DOM tree from scratch with JavaScript alone.
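As a rough sketch of what building a tree from scratch looks like, the following uses Python's standard xml.dom.minidom instead of a browser. The method names (createElement, createTextNode, setAttribute, appendChild) come from the DOM specification, so the browser version in JavaScript looks much the same; the element names here are only examples.

```python
from xml.dom import minidom

# Create an empty document whose root element is <message>.
impl = minidom.getDOMImplementation()
document = impl.createDocument(None, "message", None)
root = document.documentElement

# Build <hello lang="en">Hello, World</hello> node by node.
hello = document.createElement("hello")
hello.setAttribute("lang", "en")
hello.appendChild(document.createTextNode("Hello, World"))
root.appendChild(hello)

# Serialize the tree back to markup to see what was built.
markup = root.toxml()
print(markup)
```

No markup was parsed at any point; the serialized string at the end is produced from the tree, not the other way around.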

However, JS is intended to create functionality, not content. Let's concentrate on the things that create content:

XML

XHTML

Many HTML authors have heard of XHTML. XHTML is basically HTML that has been formalized to the degree that it is also valid XML.

In XHTML, HTML is a subset of the XML universe, with its own namespace. In an XHTML document, you can also import other subsets of the XML universe, such as MathML and SVG.
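To illustrate how namespaces keep the vocabularies apart, here is a small sketch using Python's standard xml.dom.minidom. The sample document is invented for this example; the two namespace URIs are the standard ones for XHTML and SVG.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical XHTML fragment that embeds a small SVG drawing.
SOURCE = f"""<html xmlns="{XHTML_NS}">
 <body>
  <p>A circle:</p>
  <svg xmlns="{SVG_NS}" width="20" height="20">
   <circle cx="10" cy="10" r="8"/>
  </svg>
 </body>
</html>"""

document = minidom.parseString(SOURCE)

# Look elements up by (namespace, local name) instead of by bare tag name.
paragraphs = document.getElementsByTagNameNS(XHTML_NS, "p")
circles = document.getElementsByTagNameNS(SVG_NS, "circle")
html_circles = document.getElementsByTagNameNS(XHTML_NS, "circle")

print(len(paragraphs), len(circles), len(html_circles))
print(circles[0].namespaceURI)
```

The circle element inherits the SVG namespace from its parent, so it is found under the SVG namespace and not under the XHTML one.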

Here's an example XHTML document (ex.xhtml):

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
   <head><title>Example document</title></head>
   <body><h1>Example document</h1>
    <div>Yay, it works!</div>
   </body>
  </html>

As you can see, XHTML is not much different from HTML.

The important differences are:

Tags are written in lowercase, except for the DOCTYPE declaration.
To embed non-XML data, such as scripts and stylesheets, instead of <!-- content -->, you should use <![CDATA[ content ]]>. This works for plain text too, and is especially handy for putting code samples between <pre> and </pre>.
Errors in structure and syntax are not tolerated. For example, <br> must be written as <br />, or otherwise the parser expects to see a </br> somewhere nearby. <p> must have a matching </p>.
Attribute values must be quoted. For example, <table width=100%> does not work; it must be written as <table width="100%">.
The HTML namespace must be introduced via the xmlns attribute.
You can include MathML, SVG, and even FoobarXYZ1234 if you define its namespace!
document.write does not work in JavaScript. You need to use the DOM functions.
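The strictness described in the list above can be tried out with any XML parser. The sketch below feeds a few snippets to Python's standard xml.dom.minidom and records which ones parse; the snippets and the is_well_formed helper are made up for illustration, and the check covers well-formedness only, not validity against the XHTML DTD.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False

# Hypothetical snippets, each paired with its XML-safe spelling.
samples = {
    "unclosed br": "<p>line<br></p>",
    "self-closed br": "<p>line<br /></p>",
    "unquoted attribute": "<table width=100%></table>",
    "quoted attribute": '<table width="100%"></table>',
    "bare script": "<script>if (a < b) go();</script>",
    "CDATA script": "<script><![CDATA[if (a < b) go();]]></script>",
}

results = {name: is_well_formed(markup) for name, markup in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'rejected'}")
```

An HTML parser would quietly recover from all three rejected snippets; an XML parser stops at the first error.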

But XHTML is not all there is to XML + HTML.

Let's cover the other possibilities.

XML + CSS

You can create DOM documents without HTML. All you need is an XML document and a CSS file.

Try this file, test.xml:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <?xml-stylesheet href="test.css" type="text/css"?>
  <message>
   <hello>Hello, World</hello>
  </message>

And this file, test.css:

  message { display:block; background:#FFF; color:#000 }
  hello { display:block; margin:1em; padding:1em;
   background:#DDF; color:#000; font-size:200% }

This produces a DOM document even though there is not a single HTML tag in it.

The most important CSS feature for this purpose is the display property. Using it, you can create your own tags for tables, lists, and everything else.

This combination is mostly useful for visualizing data that is primarily provided in XML format, such as reports and statistics.

XML + XSLT

XSLT is a method of transforming an XML document into another XML document. You can transform an XML document into an HTML document for example.

Try this file, rec.xml:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <?xml-stylesheet href="rec.xsl" type="text/xsl"?>
  <records>
   <title>SMB2j records</title>
   <record player="Mario">
    <time>08:21</time>
    <screenshot>http://w-create.com/~bisqwit/nesvideos/smb2jpg.png</screenshot>
   </record>
   <record player="Luigi">
    <time>08:30</time>
    <screenshot>http://w-create.com/~bisqwit/nesvideos/smb2jl.png</screenshot>
   </record>
  </records>

And this file, rec.xsl:

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="html" indent="no"
    doctype-public="-//W3C//DTD HTML 4.01//EN"
    doctype-system="http://www.w3.org/TR/html4/strict.dtd" />

   <xsl:template match="/records">
    <html>
     <head>
      <title><xsl:value-of select="title" /></title>
     </head>
     <body>
      <h1><xsl:value-of select="title" /></h1>
      <table border="1" width="400">
       <xsl:for-each select="record">
        <tr>
         <td rowspan="2" width="256"><img src="{screenshot}" alt="screenshot" /></td>
         <th>Player name:</th> <td><xsl:value-of select="@player" /></td>
        </tr><tr>
         <th>Completion time:</th> <td><xsl:value-of select="time" /></td>
        </tr>
       </xsl:for-each>
      </table>
     </body>
    </html>
   </xsl:template>
  </xsl:stylesheet>

Every tag that begins with xsl: here is an XSLT tag; all others are HTML tags. (The xsl prefix is actually arbitrary; you can use whatever you like as long as you bind it to the XSLT namespace with an xmlns:prefix attribute.)
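To see what the select expressions in the stylesheet pick out, here is the same selection logic spelled out by hand in Python's standard xml.etree.ElementTree. This is not XSLT and not how a browser applies the stylesheet; it is only a sketch of which nodes each expression refers to, using an abbreviated copy of rec.xml without the screenshot elements.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of rec.xml (screenshot elements left out for brevity).
SOURCE = """<records>
 <title>SMB2j records</title>
 <record player="Mario"><time>08:21</time></record>
 <record player="Luigi"><time>08:30</time></record>
</records>"""

records = ET.fromstring(SOURCE)        # the node matched by match="/records"
title = records.findtext("title")      # <xsl:value-of select="title" />

rows = [
    (record.get("player"),             # select="@player" reads an attribute
     record.findtext("time"))          # select="time" reads a child element
    for record in records.findall("record")   # <xsl:for-each select="record">
]

print(title)
for player, time in rows:
    print(player, time)
```

Inside the for-each loop, paths such as time and @player are resolved relative to the current record, just as record.findtext and record.get are here.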

Of course, you can also include CSS and JavaScript in the HTML, if you wish.

A more complex example of an XML+XSLT document can be found here.

Using XSLT is especially useful for developing online applications. Instead of developing a server-side template engine, you can create a client-side XSLT file which defines the rendering rules, and have the application output plain XML. This can save considerable bandwidth, because the rendering rules are downloaded once and only the data changes. It also saves you the trouble of implementing a separate XML interface for those users who want data directly from your site.
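A minimal sketch of the application side of this arrangement, assuming a Python backend and the rec.xsl stylesheet from the example above: the program emits only data, plus an xml-stylesheet processing instruction that tells the browser where the rendering rules are. The render_records and add_text_element functions are hypothetical helpers, not part of any framework.

```python
from xml.dom import minidom

def add_text_element(document, parent, name, text):
    """Append <name>text</name> to parent and return the new element."""
    element = document.createElement(name)
    element.appendChild(document.createTextNode(text))
    parent.appendChild(element)
    return element

def render_records(title, records, stylesheet="rec.xsl"):
    """Serialize (player, time) pairs as XML that points at an XSLT file."""
    impl = minidom.getDOMImplementation()
    document = impl.createDocument(None, "records", None)
    root = document.documentElement

    # The processing instruction must come before the root element.
    pi = document.createProcessingInstruction(
        "xml-stylesheet", f'href="{stylesheet}" type="text/xsl"')
    document.insertBefore(pi, root)

    add_text_element(document, root, "title", title)
    for player, time in records:
        record = document.createElement("record")
        record.setAttribute("player", player)
        add_text_element(document, record, "time", time)
        root.appendChild(record)

    return document.toxml(encoding="UTF-8").decode("UTF-8")

xml_text = render_records("SMB2j records", [("Mario", "08:21")])
print(xml_text)
```

Building the output through DOM calls instead of string concatenation means that special characters in the data are escaped automatically.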

XSLT can even implement algorithms. For example, see this Mandelbrot fractal renderer. A larger collection of cool XSLT tricks can be found at IncrementalDevelopment.

References

W3C specifications for CSS2, DOM, XSLT, XML and XHTML

The author is a software engineer. This page was created in 2008.

-Joel Yliluoma