Saturday, September 25, 2010

DTDs and XML Schemas

What DTDs and XML Schemas Do

A Document Type Definition (DTD) is a set of markup declarations that define a document type for SGML-family markup languages (SGML, XML, HTML). DTDs were a precursor to XML Schema and serve a similar function, although with different capabilities.
DTDs use a terse formal syntax that declares precisely which elements and references may appear where in a document of a particular type, and what the elements’ contents and attributes are. DTDs also declare entities that may be used in the instance document.
XML uses a subset of the SGML DTD syntax.

Document Type Definitions and XML Schemas both provide descriptions of document structures. The emphasis is on making those descriptions readable to automated processors such as parsers, editors, and other XML-based tools. They can also carry information for human consumption, describing what different elements should contain, how they should be used, and what interactions may take place between parts of a document. Although they use very different syntax to achieve this task, they both create documentation.

Perhaps the most important thing DTDs and XML Schemas do is set expectations, using a formal vocabulary and other information to lay ground rules for document structures. Two parsers, given a document and a DTD, should have the same opinions about whether that document is valid, and different schema processors should similarly agree on whether or not a document conforms to the rules in a given schema. XML editing applications can use DTDs and schemas as frameworks, letting users create documents that meet these expectations. Similarly, developers can use DTDs and XML Schemas as a foundation on which to plan transformations from one format to another. By agreeing to a given DTD or schema, a group of developers has accepted a set of rules about document vocabulary and structure. While this doesn't solve all the problems of application development, it does at least mean that independent development of tools that process these documents is a lot easier.
Schemas and DTDs provide a number of additional functions that make contributions to document content:
  • Providing defaults for attributes: in addition to providing constraints on attribute content, DTDs and XML Schemas allow developers to specify default values that should be used if no value was set in the content explicitly.
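  • Entity declaration: DTDs provide for the declaration of parsed entities, which can be referenced from within documents to include content. (The XML Schema language as finally published by the W3C has no equivalent declaration, so documents that need entities still declare them in a DTD.)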
Schemas and DTDs may also describe "notations" and "unparsed entities", adding information to documents that applications may use to interpret their content.
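Both effects can be observed with an ordinary non-validating parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the price element, its currency attribute, and the store entity are made up for illustration. It assumes the underlying expat parser reads the internal DTD subset, which is what lets it fill in the attribute default and expand the entity even though it does not validate.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares a default value for
# the "currency" attribute and a parsed entity named "store".
DOC = """<!DOCTYPE price [
  <!ELEMENT price (#PCDATA)>
  <!ATTLIST price currency CDATA "USD">
  <!ENTITY store "Mike's Store">
]>
<price>&store;: 10</price>"""

root = ET.fromstring(DOC)

# The attribute was never written in the document; it comes from the ATTLIST default.
print(root.get("currency"))

# The entity reference has been replaced by its declared content.
print(root.text)
```

Neither value appears literally in the element itself, which is the sense in which a DTD contributes to document content rather than merely constraining it.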

Where DTDs and Schemas Come From

The main thrust of development work, initially for XML 1.0 and its DTDs, and now for XML Schemas, is taking place at the World Wide Web Consortium (W3C). However, the W3C is not the only source for schema languages. At least five other schema proposals have been developed and many of them are in actual use -- notably, Microsoft's XML-Data, which is used for its BizTalk initiative. Most of these proposals are feeding into the main W3C-sanctioned development process. The main contenders in the schema arena, including DTDs, are listed below:
  • DTDs - Document Type Definitions were originally developed for XML's predecessor, SGML. They use a very compact syntax and provide document-oriented data typing. XML DTDs are a subset of those available in SGML, and the rules for using XML DTDs provide much of the complexity of XML 1.0. Complete XML DTD support is (or should be) built into all validating XML parsers, and some XML DTD support is built into all XML parsers.
  • XML-Data/XML-Data Reduced - Based on a proposal that Microsoft and others submitted to the W3C even before XML 1.0 was completed, this schema proposal is used in Microsoft's BizTalk framework. XML-Data provides a large set of data types more appropriate to database and program interchange. XML-Data support is built into Microsoft's XML parser.
  • Document Content Description (DCD) - Created in a joint effort between IBM and Microsoft, DCD uses some ideas from XML-Data and some syntax from another W3C project, Resource Description Framework (RDF).
  • Schema for Object-Oriented XML (SOX) - SOX was developed by Veo Systems (now acquired by CommerceOne) and provides functionality like inheritance to XML structures. SOX has gone through multiple versions. The latest is SOX version 2.
  • Document Description Markup Language (DDML) - DDML was developed on the XML-dev mailing list, creating a schema language with a subset of DTD functionality. Development of DDML (which was once known as XSchema) has halted since the W3C Activity began.
Although you can start work with any of the above tools today -- DTDs being widely supported -- when the specification is complete, using the W3C XML Schemas is probably the safest long-term solution. Fortunately, converting among different schema formats isn't especially difficult, and tools are available to help you in the process.

How Schemas Differ from DTDs

The first, and probably most significant, difference between XML Schemas and XML DTDs is that XML Schemas use XML document syntax. While transforming the syntax to XML doesn't automatically improve the quality of the description, it does make those descriptions far more extensible than they were in the original DTD syntax. Declarations can have richer and more complex internal structures than declarations in DTDs, and schema designers can take advantage of XML's containment hierarchies to add extra information where appropriate -- even sophisticated information like documentation. There are a few other benefits from this approach. XML Schemas can be stored along with other XML documents in XML-oriented data stores, referenced, and even styled, using tools like XLink, XPointer, and XSL.
The largest addition XML Schemas provide to the functionality of the descriptions is a vastly improved data typing system. XML Schemas provide data-oriented data types in addition to the more document-oriented data types XML 1.0 DTDs support, making XML more suitable for data interchange applications. Built-in datatypes include strings, booleans, and time values, and the XML Schemas draft provides a mechanism for generating additional data types. Using that system, the draft provides support for all of the XML 1.0 data types (NMTOKENS, IDREFS, etc.) as well as data-specific types like decimal, integer, date, and time. Using XML Schemas, developers can build their own libraries of easily interchanged data types and use them inside schemas or across multiple schemas.
The current draft of XML Schemas also declares elements and attributes in a very different style from DTDs. In addition to declaring elements and attributes individually, developers can create models -- archetypes -- that can be applied to multiple elements and refined if necessary. This provides a lot of the functionality SOX had developed to support object-oriented concepts like inheritance. Archetype development and refinement will probably become the mark of the high-end schema developer, much as the effective use of parameter entities was the mark of the high-end DTD developer. Archetypes should be easier to model and use consistently, however.
XML Schemas also support namespaces, a key feature of the W3C's vision for the future of XML. While it probably wouldn't be impossible to integrate DTDs and namespaces, the W3C has decided to move on, supporting namespaces in its newer developments and not retrofitting XML 1.0. In many cases, provided that namespace prefixes don't change or simply aren't used, DTDs can work just fine with namespaces, and should be able to interoperate with namespaces and schema processing that relies on namespaces. There will be a few cases, however, where namespaces may force developers to use the newer schemas rather than the older DTDs.

Alternative Approaches

As exciting as XML Schemas are, there have been a few suggestions for very different approaches that also hold promise. Both Rick Jelliffe's Schematron and the Document Structure Description (DSD), from AT&T Labs and the University of Aarhus, look at documents from a more complex perspective than containment, and use tools derived from style languages -- Schematron is based on XSL, while DSD works from CSS -- to examine documents more closely.
Schematron allows developers to ask about the existence and contents of paths through documents rather than specify containment structures, and places great importance on producing human-readable results. Schematron processing, which can use XSL tools, can produce complete reports on the content and structure of documents, rather than a simple yes/no validation with error reporting.
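The idea of asking questions about paths and reporting the answers can be sketched in a few lines. This is not Schematron itself, which expresses its rules in XML and is typically run through XSL tools; it is only an illustration in Python's standard library, and the rules, messages, and sample document are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the second book is deliberately incomplete.
DOC = """<bookstore>
  <name>Mike's Store</name>
  <topic>
    <name>XML</name>
    <book isbn="123-456-789"><title>A Guide</title><author>Mike Jervis</author></book>
    <book><title>Untitled Draft</title></book>
  </topic>
</bookstore>"""

# Each rule: (path selecting context elements, test applied to each, message).
RULES = [
    (".//book", lambda e: e.find("author") is not None, "book has no author"),
    (".//book", lambda e: e.get("isbn") is not None, "book has no isbn attribute"),
    (".//topic", lambda e: e.find("name") is not None, "topic has no name"),
]

def report(root):
    """Return one human-readable line for every failed assertion."""
    lines = []
    for path, test, message in RULES:
        for element in root.findall(path):
            if not test(element):
                label = element.findtext("title") or element.findtext("name") or "(unnamed)"
                lines.append(f"{label}: {message}")
    return lines

findings = report(ET.fromstring(DOC))
for line in findings:
    print(line)
```

Every rule is evaluated and every failure is listed, so the result reads as a report on the document rather than a single pass/fail verdict.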
DSD comes from somewhat similar origins, but uses its own vocabulary to create document descriptions rather than building on the XSL processing model. DSD schemas look much more like the W3C's XML Schemas, but support a different set of tests and have a much greater focus on tasks like providing default content for attributes and elements. DSD allows for context-sensitive rules, where the required usage of a given element may change depending on how it is used in a document. Attributes which are optional in one context may be required in another context. Declarations may impose order on some elements but not on others, making it possible to create 'floating' elements. An open-source implementation in C is available, which adds error information to the document as it is processed, giving applications or users a chance to react to the errors.
It isn't clear at this point whether these approaches will be integrated with XML Schemas at some level, or if they'll be useful tools for supplementing or replacing XML Schemas on particular kinds of projects. In any case, both of these projects are worth further investigation.

Planning Around DTDs and Schemas

Transitioning from one technology to another is often difficult, but at least the transition from DTDs to schemas only involves descriptions of documents, requiring only minor changes to the documents themselves. It is uncertain if it's time yet to begin the transition, as the latest public draft of XML Schemas came with a warning on the XML-dev mailing list that there may be significant changes in future drafts. XML Schemas are still far from stable, so probably only the most enthusiastic early adopters should be considering them at this point.
Although XML Schemas may not yet be ready, XML-based projects should be prepared for their eventual arrival. There are several strategies for handling this transition that may be appropriate to different kinds of projects and different developer needs.
  • Develop DTDs with an eye toward future conversion to schemas. Automated tools for converting among schema formats, like Extensibility's XML Authority, are already available and are likely to grow to include the final W3C XML Schemas.
  • Use other schema formats, like XML-Data and SOX. This lets developers take advantage of features like data typing immediately, and conversions from these experimental schema formats to the new XML Schemas shouldn't be prohibitively difficult.
  • Create well-formed documents for now, ignoring DTDs and schemas in their current incarnation. It's not always easy to retrofit a schema onto a set of documents, but it may be appropriate for some cases where the format of existing data sources (like databases) ensures that there won't be wild variations in structure. When schemas arrive, you can add them to your processing.
  • Ignore DTDs and schemas completely, and only work with well-formed documents. If you don't need structure checking, this may be a perfectly appropriate strategy.
  • Plan to stick to DTDs. They're here now, they'll be here later. If your XML has to be processed by SGML tools, this may be the best route. Keeping your DTDs around, even if you supplement them with equivalent XML Schemas, will preserve interoperability.
There is no single answer for handling this transition that applies to all XML projects. If all your XML work involves documents, DTDs may be a perfectly adequate tool for your needs, and schemas might only be a distraction. If you're trying to manage data interchange between databases of different kinds, the data typing functionality that schemas provide may drive you to use XML-Data or SOX today, and XML Schemas when they arrive.

The Future for DTDs and Schemas

Right now there are too many options for describing your data, but in the future, they will probably slim down to three: DTDs, for legacy XML 1.0 applications and integration with SGML; XML Schemas; and plain old well-formed documents, for situations where describing document structures is unnecessary or counterproductive. Whatever you do with DTDs and XML Schemas, remember that their usage should be considered a part of document format specification and documentation. Where documentation is important, these tools will be important, both to set expectations and spare applications the task of checking document structures themselves.
The DSD and Schematron approaches will probably receive more attention in future development as well; Schematron is already an easy and useful supplement to both DTD and XML Schema processing. Both of these tools provide functionality that goes beyond anything the W3C has currently released, demonstrating that there are multiple useful approaches to describing document structures. While it seems unlikely that developers will want to create a DTD, an XML Schema, a Schematron schema, and a DSD, all for the same document, they are all important new tools in the XML developer's toolkit.

XML is a very handy format for storing and communicating your data between disparate systems in a platform-independent fashion. XML is more than just a format for computers -- a guiding principle in its creation was that it should be Human Readable and easy to create.
XML allows UNIX systems written in C to communicate with Web Services that, for example, run on the Microsoft .NET architecture and are written in ASP.NET. XML is, however, only the meta-language that the systems understand -- and they both need to agree on the format that the XML data will be in. Typically, one of the partners in the process will offer a service to the other: one is in charge of the format of the data.
The definition serves two purposes: the first is to ensure that the data that makes it past the parsing stage is at least in the right structure. As such, it's a first level at which 'garbage' input can be rejected. Secondly, the definition documents the protocol in a standard, formal way, which makes it easier for developers to understand what's available.

DTD - The Document Type Definition

The first method used to provide this definition was the DTD, or Document Type Definition. This defines the elements that may be included in your document, what attributes these elements have, and the ordering and nesting of the elements.
The DTD is declared in a DOCTYPE declaration beneath the XML declaration contained within an XML document:
Inline Definition:

<?xml version="1.0"?>
<!DOCTYPE documentelement [definition]>

External Definition:
<?xml version="1.0"?>
<!DOCTYPE documentelement SYSTEM "documentelement.dtd">

The actual body of the DTD itself contains definitions in terms of elements and their attributes. For example, the following short DTD defines a bookstore. It states that a bookstore has a name, and stocks books on at least one topic.
Each topic has a name and 0 or more books in stock. Each book has a title, author and ISBN. The name of the topic and the name of the bookstore are defined as the same type of element. That element, like the title and author of the book, holds PCDATA (parsed character data): plain text that the XML parser still scans for markup and entity references. The ISBN is stored as an attribute of the book, declared as CDATA, the attribute type for an arbitrary string:

<!DOCTYPE bookstore [
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">
 ]>
An example of a book store's inline definition might be:
<?xml version="1.0"?>
<!DOCTYPE bookstore [
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">
 ]>
<bookstore>
 <name>Mike's Store</name>
 <topic>
   <name>XML</name>
   <book isbn="123-456-789">
     <title>Mike's Guide To DTD's and XML Schemas</title>
     <author>Mike Jervis</author>
   </book>
 </topic>
</bookstore>
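One way to see what the content models in those ELEMENT declarations mean is to treat each one as a regular expression over the sequence of child element names. The sketch below does this by hand for the bookstore DTD using Python's standard library. It is an illustration of the idea only, not how a validating parser is implemented, and the CONTENT_MODELS table and check_children helper are names invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the bookstore DTD, hand-translated into regular
# expressions over a comma-terminated list of child tag names.
CONTENT_MODELS = {
    "bookstore": r"name,(topic,)+",  # (name,topic+)
    "topic": r"name,(book,)*",       # (name,book*)
    "book": r"title,author,",        # (title,author)
}

def check_children(root):
    """Return the tags of elements whose children violate their content model."""
    problems = []
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            continue  # text-only elements such as name, title and author
        child_tags = "".join(child.tag + "," for child in element)
        if not re.fullmatch(model, child_tags):
            problems.append(element.tag)
    return problems

valid = ET.fromstring(
    "<bookstore><name>Mike's Store</name><topic><name>XML</name></topic></bookstore>"
)
invalid = ET.fromstring("<bookstore><name>Mike's Store</name></bookstore>")

print(check_children(valid))    # no violations expected
print(check_children(invalid))  # the bookstore has a name but no topic
```

The comma, plus, and star in the DTD map directly onto sequence, one-or-more, and zero-or-more in the regular expression, which is why DTD content models are often described as a small regular language over element names.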

Using an inline definition is handy when you only have a few documents and they're offline, as the definition is always in the file. However, if, for example, your DTD defines the XML protocol used to talk between two separate systems, re-transmitting the DTD with each document adds an overhead to the communications. Having an external DTD eliminates the need to re-send it each time. We could remove the DTD from the document, and place it in a DTD file on a Web server that's accessible by the two systems:

<?xml version="1.0"?>
<!DOCTYPE bookstore SYSTEM "http://webserver/bookstore.dtd">
<bookstore>
 <name>Mike's Store</name>
 <topic>
   <name>XML</name>
   <book isbn="123-456-789">
     <title>Mike's Guide To DTD's and XML Schemas</title>
     <author>Mike Jervis</author>
   </book>
 </topic>
</bookstore>

The file bookstore.dtd would contain the full definition in a plain text file:
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">

The lowest level of definition in a DTD is plain text: element content is declared as PCDATA (Parsed Character Data), and a free-form attribute value as CDATA (Character Data). With this limitation it is not possible, for example, to force an element to be numeric. Attributes can be restricted to an enumerated list of values, but they can't be forced to be numeric either.
So, for example, if you stored your application's settings in an XML file, it could be manually edited so that the window's start coordinates were arbitrary strings, and you'd still need to validate this in your code rather than have the parser do it for you.
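The kind of check that ends up in application code might look like the following sketch. The settings layout (a window element with x and y attributes) and the read_window_position helper are hypothetical, chosen only to match the example above. Python's int() is used as the test for "numeric", which is slightly more lenient than a schema integer type would be.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings file; a DTD can only say that x and y are text.
SETTINGS = """<settings>
  <window x="120" y="eighty"/>
</settings>"""

def read_window_position(xml_text):
    """Return (x, y) as integers, or raise ValueError naming the bad attribute."""
    window = ET.fromstring(xml_text).find("window")
    if window is None:
        raise ValueError("missing <window> element")
    position = []
    for name in ("x", "y"):
        raw = window.get(name, "")
        try:
            position.append(int(raw))
        except ValueError:
            raise ValueError(f"window/@{name} is not an integer: {raw!r}") from None
    return tuple(position)

try:
    read_window_position(SETTINGS)
except ValueError as error:
    print(error)
```

Every application that reads the file has to repeat a check like this, which is the gap that typed schema languages are meant to close.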

XML Schemas

XML Schemas provide a much more powerful means by which to define your XML document structure and limitations. XML Schemas are themselves XML documents. They reference the XML Schema Namespace (detailed here [1]), and even have their own DTD [2].
What XML Schemas do is provide an Object Oriented approach to defining the format of an XML document. XML Schemas provide a set of basic types. These types are much wider ranging than the basic PCDATA and CDATA of DTDs. They include most basic programming types such as integer, byte, string and floating point numbers, but they also expand into Internet data types such as ISO country and language codes (en-GB for example). A full list can be found here [3].
The author of an XML Schema then uses these core types, along with various operators and modifiers, to create complex types of their own. These complex types are then used to define an element in the XML Document.
As a simple example, let's try to create a basic XML Schema for defining the bookstore that we used as an example for DTDs. Firstly, we must declare this as an XSD document, and, as we want this to be very user friendly, we're going to add some basic documentation to it. The xsd:schema element opened here stays open until every definition below has been added, and is then closed with </xsd:schema>:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
 <xsd:documentation xml:lang="en">
   XML Schema for a Bookstore as an example.
 </xsd:documentation>
</xsd:annotation>


Now, in the previous example, the bookstore consisted of the sequence of a name and at least one topic. We can easily do that in an XML Schema:

<xsd:element name="bookstore" type="bookstoreType"/>
<xsd:complexType name="bookstoreType">
 <xsd:sequence>
   <xsd:element name="name" type="xsd:string"/>
   <xsd:element name="topic" type="topicType" minOccurs="1" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:complexType>


In this example, we've defined an element, bookstore, that will equate to an XML element in our document. We've defined it as type bookstoreType, which is not a standard type, and so we provide a definition of that type next.
We then define a complexType, which defines bookstoreType as a sequence of name and topic elements. Our "name" element is an xsd:string, a type defined by the XML Schema Namespace, and so we've fully defined that element.
The topic element, however, is of type topicType, another custom type that we must define. We've also defined our topic element with minOccurs="1", which means there must be at least one topic at all times, and maxOccurs="unbounded", which means there is no upper limit to the number of topics that might be included. Both attributes default to 1, so if we had specified neither, the element would have to appear exactly once, as is the case for the name element. Next, we define the schema for the topicType.
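The occurrence rule itself is small enough to write down directly. In XML Schema both minOccurs and maxOccurs default to 1 when omitted, and an unlimited upper bound has to be requested with maxOccurs="unbounded". The Python sketch below mirrors that rule; occurs_ok is a made-up helper, and using None to stand for "unbounded" is a convention of this sketch only.

```python
def occurs_ok(count, min_occurs=1, max_occurs=1):
    """Check a repeat count against minOccurs/maxOccurs.

    Both bounds default to 1, as in XML Schema. max_occurs=None stands in
    for maxOccurs="unbounded" (a convention of this sketch).
    """
    if count < min_occurs:
        return False
    return max_occurs is None or count <= max_occurs

print(occurs_ok(1))           # one name element, defaults apply
print(occurs_ok(3))           # three elements where the defaults allow only one
print(occurs_ok(3, 1, None))  # three topics with minOccurs="1" maxOccurs="unbounded"
print(occurs_ok(0, 0, None))  # no books with minOccurs="0" maxOccurs="unbounded"
```

Read this way, the DTD operators map onto pairs of bounds: a plain name is (1, 1), name+ is (1, unbounded), name* is (0, unbounded), and name? is (0, 1).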

<xsd:complexType name="topicType">
 <xsd:sequence>
   <xsd:element name="name" type="xsd:string"/>
   <xsd:element name="book" type="bookType" minOccurs="0" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:complexType>


This is all similar to the declaration of the bookstoreType, but note that we have to re-define our name element within the scope of this type. If we'd declared a named type for name, such as nameType, which defined only an xsd:string -- and defined it outside our types, we could re-use it in both. However, to illustrate the point, I decided to define it within each section. XML Schema gets interesting when we get to defining our bookType:

<xsd:complexType name="bookType">
 <xsd:sequence>
   <xsd:element name="title" type="xsd:string"/>
   <xsd:element name="author" type="xsd:string"/>
 </xsd:sequence>
 <xsd:attribute name="isbn" type="isbnType"/>
</xsd:complexType>
<xsd:simpleType name="isbnType">
 <xsd:restriction base="xsd:string">
   <xsd:pattern value="[0-9]{3}[-][0-9]{3}[-][0-9]{3}"/>
 </xsd:restriction>
</xsd:simpleType>

So the definition of the bookType is not particularly interesting. But the definition of its attribute "isbn" is. Not only does XML Schema support the use of types such as xsd:nonNegativeInteger, but we can also create our own simple types from these basic types using various modifiers. In the example for isbnType above, we base it on a string, and restrict it to match a given regular expression. Excusing my poor regex, that should limit any isbn attribute to this example's simplified format of three groups of three digits separated by dashes (real ISBNs are longer and grouped differently).
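The effect of that restriction can be tried out with an ordinary regular expression library. The sketch below uses Python's re module with the three-groups-of-three-digits pattern described above. XML Schema patterns must match the whole value, which re.fullmatch mirrors; is_valid_isbn is a name invented for the example.

```python
import re

# Three groups of three digits separated by dashes, as in isbnType.
ISBN_PATTERN = re.compile(r"[0-9]{3}[-][0-9]{3}[-][0-9]{3}")

def is_valid_isbn(value):
    """Return True if the whole value matches the isbnType pattern."""
    return ISBN_PATTERN.fullmatch(value) is not None

print(is_valid_isbn("123-456-789"))   # the value used in the bookstore example
print(is_valid_isbn("123-456-7890"))  # one digit too many in the last group
print(is_valid_isbn("123456789"))     # dashes missing
```

For the simple character classes and counts used here the two regular expression dialects agree, but they are not identical in general, so a more elaborate pattern should be checked against a real schema processor.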
This is just a simple example, but it should give you a taste of the many things you can do to control the content of an attribute or an element. You have far more control over what is considered a valid XML document using a schema. You can even
  • extend your types from other types you've created,
  • require uniqueness within scope, and
  • provide lookups.
It's a nicely object oriented approach. You could build a library of complexTypes and simpleTypes for re-use throughout many projects, and even find other definitions of common types (such as an "address", for example) from the Internet and use these to provide powerful definitions of your XML documents.

DTD vs XML Schema

The DTD provides a basic grammar for defining an XML Document in terms of the metadata that comprise the shape of the document. An XML Schema provides this, plus a detailed way to define what the data can and cannot contain. It provides far more control for the developer over what is legal, and it provides an Object Oriented approach, with all the benefits this entails.
So, if XML Schemas provide an Object Oriented approach to defining an XML document's structure, and if XML Schemas give us the power to define re-useable types such as an ISBN number based on a wide range of pre-defined types, why would we use a DTD? There are in fact several good reasons for using the DTD instead of the schema.
Firstly, and rather an important point, is that XML Schema is a new technology. This means that whilst some XML Parsers support it fully, many still don't. If you use XML to communicate with a legacy system, perhaps it won't support the XML Schema.
Many systems interfaces are already defined as a DTD. They are mature definitions, rich and complex. The effort in re-writing the definition may not be worthwhile.
DTD is also established, and examples of common objects defined in a DTD abound on the Internet -- freely available for re-use. A developer may be able to use these to define a DTD more quickly than they would be able to accomplish a complete re-development of the core elements as a new schema.
Finally, you must also consider the fact that the XML Schema is an XML document. It has an XML Namespace to refer to, and an XML DTD to define it. This is all overhead. When a parser examines the document, it may have to link this all in, interpret the DTD for the Schema, load the namespace, and validate the schema, etc., all before it can parse the actual XML document in question. If you're using XML as a protocol between two systems that are in heavy use, and need a quick response, then this overhead may seriously degrade performance.
Then again, if your system is available for third party developers as a Web service, then the detailed enforcement of the XML Schema may protect your application a lot more effectively from malicious -- or just plain bad -- XML packets. As an example, Muse.net is an interesting technology. They have a publicly-available SOAP API defined with an XML Schema that provides their developers more control over what they receive from the user community.
On the other hand, I was recently involved in designing a system to handle incoming transactions from multiple devices. In order to scale the system, the chosen service that processes requests is a SOAP server. However, the system is completely closed, and a simple DTD on the server is enough to ensure that the packets sent from the clients arrive complete and uncorrupted, without the additional overhead of XML Schema.
For details, see http://www.brics.dk/~amoeller/XML/schemas
