Saturday, October 16, 2010

CSS

XML and CSS

The simplicity of document creation was a key element in the astonishingly rapid development of the Web. This article describes XML and CSS: the "one-two" punch that will not only bring back that level of simplicity, but also enable the construction of complex applications that are either difficult or impossible using HTML. In this article we outline the steps for using a CSS style sheet in an XML document; we discuss the limitations of CSS in complex applications; and we present a real-life example.
HTML provides limited possibilities for the explicit formatting and positioning of text. The mechanisms that are provided--such as the FONT element or the ALIGN attribute--force the page designer to embed presentation-specific information within the document, which makes it difficult to prepare documents for a variety of screen sizes, presentation modalities, and types of audiences. Because these limited features are not sufficient to achieve the formatting results desired by many Web designers, they commonly resort to using tables and various HTML coding "tricks." This has many negative consequences, particularly because it is so difficult to maintain information content in HTML documents: the content is inextricably intertwined with the format-related encoding. More sophisticated formatting capabilities have long been needed to support the many document types, ranging from marketing froufrou to legal documents to scientific journals.
Cascading Style Sheets (CSS) is a style sheet mechanism specifically developed to meet the needs of Web designers and users. CSS provides HTML with far greater control over document presentation in a way that is independent of document content. CSS style sheets can be used to set fonts, colors, whitespace, positioning, backgrounds, and many other presentational aspects of a document. It is also possible for several documents to share the same style sheet, which allows users to maintain consistent presentation within a collection of related documents without having to modify each document separately.
The rationale for XML is discussed at length in other papers in this issue; we will thus restrict our introduction to the observation that, in combination, XML and CSS can once again simplify document creation. XML uses markup to describe the structure and data content of a document, making it easy both for authors to write it and for computer programs to process it. CSS, on the other hand, makes it possible to present that document to the user in a browser. CSS or some type of style sheet mechanism is, in fact, a requisite for browsing XML on the Web. If CSS works well with HTML, it will work wonders with XML.
In this article we will illustrate the use of CSS and XML with two examples. We do not describe the syntax of CSS, but refer the reader to the CSS documentation on the W3C Web site.[1]

A Thespian Example

The steps in the following sections illustrate the use of a CSS style sheet for an XML document (an extract from the Shakespeare play "Much Ado about Nothing").


Figure 1

Step 1: The Document Source

Figure 1 shows much_ado.xml; preparing the document source is the first step in using a CSS style sheet for an XML document.

Step 2: Define Style Sheet Rules

The style rules for a document can be stored either within the document itself, or within a separate text file. In this example, we will save the style sheet as a separate file, shaksper.css, so that it can be easily applied to other Shakespeare plays we might like to publish.
You can create the style sheet manually using your favorite text editor, as shown in Example 1.
Alternatively, you can create the style sheet using an editor that supports CSS, such as Grif's Symposia, as shown in Figure 2.
Example 1
PLAY { background-color : white }
FM { font-style : italic;
     font-size : 14;
     color : #400040;
     text-align : right }
SPEAKER { font-weight : bold;
          color : #ff0080 }
LINE { color : #800040;
       left : 15 }
PERSONA { font-style : italic;
          font-size : -1 }
PERSONAE { color : #800040 }
SCNDESCR { margin-top : 30;
           font-size : 18;
           color : #0000a0;
           height : 20 }
STAGEDIR { font-weight : bold;
           font-style : italic;
           height : 20 }
PLAYSUBT { font-weight : bold;
           font-size : 16;
           text-decoration : underline;
           height : 20 }
SPEECH { margin-top : 5 }



Figure 2

Step 3: Link the Style Sheet to the Document

At the time of writing, there is not yet a standardized method for linking an XML document with a style sheet. This subject forms part of the work, currently in progress by the W3C, that is due to be published in December 1997 as part III of the XML specification: XML-Style.
The method we will use in this example is therefore based on a draft proposal for style sheet linking in XML, which consists of inserting the XML processing instruction <?XML-stylesheet?> at the top of the document. The processing instruction has two required attributes, type and href, which respectively specify the type of style sheet and its address. In our example, we thus need to add the following line to our XML document:

<?XML-stylesheet type="text/css" href="shaksper.css"?> 
Symposia supports this XML style sheet linking mechanism and inserts the processing instruction for you automatically when you specify an external style sheet using the "Create new Style Sheet/External" command.
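As a rough illustration of what a consuming program does with such a link, the following Java sketch reads the processing instruction through the JDK's DOM parser and extracts the href. It is only a sketch under assumptions: it uses the lowercase target xml-stylesheet that the W3C later standardized rather than the draft spelling shown above, and the class name StylesheetLink, the helper stylesheetHref, and the one-line sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetLink {
    // The PI data is plain text made of pseudo-attributes such as href="...".
    private static final Pattern HREF = Pattern.compile("href\\s*=\\s*\"([^\"]*)\"");

    /** Returns the href of the first xml-stylesheet PI in the prolog, or null if there is none. */
    static String stylesheetHref(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Processing instructions before the root element are children of the Document node.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.PROCESSING_INSTRUCTION_NODE) {
                    continue;
                }
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if ("xml-stylesheet".equals(pi.getTarget())) {
                    Matcher m = HREF.matcher(pi.getData());
                    if (m.find()) {
                        return m.group(1);
                    }
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document, assumed for illustration only.
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"shaksper.css\"?>"
                + "<PLAY><TITLE>Much Ado about Nothing</TITLE></PLAY>";
        System.out.println(stylesheetHref(xml));
    }
}
```

A browser or editor would then fetch the returned address and apply its rules; the sketch stops at locating the link.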

Step 4: Publish the Document and its Style Sheet

Once the style sheet is linked to the document, it can be published. When the document is opened in an XML browser (or in Symposia, as shown in Figure 3), the style rules are applied to the different XML elements in the document.
Figure 3


 

Limitations of CSS for Complex Applications

Although CSS style sheets can be very effective for improving the presentation of HTML documents, the CSS1 standard has a number of important omissions that can limit the effectiveness of CSS style sheets for more complex applications. The following list, taken from Jon Bosak's presentation at WWW6 on April 11, 1997[2], describes just a few of the major limitations of the CSS standard:

  • CSS cannot grab an item (such as a chapter title) from one place and use it again in another place (such as a page header).
  • CSS has no concept of sibling relationships. For example, it is impossible to write a CSS stylesheet that will render every other paragraph in bold.
  • CSS is not a programming language; it does not support decision structures and cannot be extended by the stylesheet designer.
  • CSS cannot calculate quantities or store variables. This means, at the very least, that it cannot store commonly used parameters in one location that is easy to update.
  • CSS cannot generate text (page numbers, etc.).
  • CSS uses a simple box-oriented formatting model that works for current Web browsers but will not extend to more advanced applications of the markup, such as multiple column sets.
  • CSS is oriented toward Western languages and assumes a horizontal writing direction.
It is for these reasons that the W3C working group responsible for the XML standard is concentrating its efforts on the implementation of a simplified version of the DSSSL standard, DSSSL Online (DSSSL-O). DSSSL-O promises to provide far more layout and document presentation features than CSS, but is a considerably more complex standard and may prove difficult to implement.
What is clear, however, is that the limitations of CSS1 do represent a serious hindrance to its use in the more complex types of application that are possible with XML. As the following section shows, even for a fairly simple XML application, the CSS standard needs to acquire a certain number of essential features if it is to realize its full potential.

A Real Life Example

The following example is taken from a Marketing Contact application which is actually in use at Grif. The Marketing Contact application allows our staff to record and update information about our business contacts, and to extract reports and find existing records based on specific criteria using a search engine. Though this type of application is often implemented using a database such as Access, a more Web-centric approach would be to create an interface to a database using HTML forms. We chose instead to simply create our database records directly using Symposia. Our search engine is XML-capable, so the equivalent of field-specific searches can be performed within XML elements. Among the advantages of this approach is that our application is entirely Intranet-based; as a result we were required neither to write a line of CGI code, nor to develop HTML or database forms.
The DTD for our application is shown in Example 2.
Example 2
<?xml version="1.0"?>
<!DOCTYPE ContactRec [
<!ELEMENT ContactRec (Name, Company, Address,
Product, Contacts)>
<!ELEMENT Name (Honorific?, First,
Middle?, Last)>
<!ELEMENT Company (JobTitle?, CompanyName)>
<!ELEMENT Address (Street+, City, Region?,
PostCode, Country, Phone,
Internet)>
<!ELEMENT Product EMPTY>
<!ATTLIST Product
SGMLEditor (Yes|No) #REQUIRED
SGMLEditorKorean (Yes|No) #REQUIRED
SGMLEditorJapanese (Yes|No) #REQUIRED
ActiveViews (Yes|No) #REQUIRED
SymposiaPro (Yes|No) #REQUIRED
SymposiaDocPlus (Yes|No) #REQUIRED
XMLProducts (Yes|No) #REQUIRED
General (Yes|No) #REQUIRED>
<!ELEMENT Contacts (Language, History)>
<!ELEMENT Honorific (#PCDATA)>
<!ATTLIST Honorific
Title (Mr|Ms|Mrs|Miss|Dr|Professor|M|Mme|Mlle|SeeContent) "SeeContent">
<!ELEMENT First (#PCDATA)>
<!ELEMENT Middle (#PCDATA)>
<!ELEMENT Last (#PCDATA)>
<!ELEMENT JobTitle (#PCDATA)>
<!ELEMENT CompanyName (#PCDATA)>
<!ELEMENT Street (#PCDATA)>
<!ELEMENT City (#PCDATA)>
<!ELEMENT Region (#PCDATA)>
<!ELEMENT PostCode (#PCDATA)>
<!ELEMENT Country (#PCDATA)>
<!ELEMENT Phone (DayTime, Fax?)>
<!ELEMENT Internet (Email, Web)>
<!ELEMENT Language EMPTY>
<!ATTLIST Language
Preference (English|French) "English">
<!ELEMENT History (Events+)>
<!ELEMENT DayTime (#PCDATA)>
<!ELEMENT Fax (#PCDATA)>
<!ELEMENT Email (#PCDATA)>
<!ELEMENT Web (#PCDATA)>
<!ELEMENT Events (Date, Venue, Notes)>
<!ELEMENT Date (Day, Month, Year)>
<!ELEMENT Venue (#PCDATA)>
<!ELEMENT Notes (#PCDATA)>
<!ELEMENT Day (#PCDATA)>
<!ELEMENT Month (#PCDATA)>
<!ELEMENT Year (#PCDATA)>
]>
A form-based interface provides many user-friendly features that make data-entry easier for the user, such as the ability to select from a predetermined list of options, or the possibility to select or unselect different options, as appropriate.
To provide an equivalent level of "comfort," we need to provide a predefined document template in which the user need only fill in the required data. Using the DTD as our guide, we can produce a document instance that contains all required document elements. Without a style sheet, however, this template is not of great value. The editor will simply display the document as a list of tags. Users need to know what information to insert where without displaying these tags--the only way we can accomplish this is by using our style sheet.
With the HTML form, it was possible to simply label each input zone with some text. Unfortunately, the CSS standard does not provide a means of displaying predefined text before or after an element. This style sheet feature was left out of CSS1 because of the inherent difficulty of implementing it in today's Web browsers (which, of course, by the time you read these words, will already be yesterday's Web browsers). Structure-based editor/browsers such as Symposia, or indeed the thoroughbred XML structured editors to come, would have no problem implementing this type of feature if it were included in the CSS standard.
The only viable solution for the moment seems to be to provide default text content for each data field that is then replaced by the user as he/she enters the information in the page (see Figure 4).
Figure 4


This layout is still not very satisfactory, however. Figure 5 improves the layout of the page using our style sheet to highlight certain elements and group related items together.
We can achieve quite a pleasing presentation for our data entry screen using the style sheet. However, there are two elements in the DTD which we cannot display in this same way: Product and Language. These two elements are empty and have no text content.

  • The Product element takes a number of attributes to indicate the contact's interest in the company's different products (the value of each attribute is "Yes" or "No," accordingly).
  • The Language element takes the Preference attribute, which can take the values "English" or "French," and for which the default value is "English"--indicating that any communication with or documentation sent to the contact should be in English.
One way around this problem, although not very elegant, would be to specify a background image for each element using the following style rule:

Product { background-image : url(image.gif) }
Using our authoring tool, we would then be able to select the element by clicking on the image and pull up a list of the element's attributes for modification, as shown in Figure 6.
Figure 5


Figure 6


Of course, we are still left with the following problem: using CSS, it is not possible to display the attribute names or values for an element in the document itself. In our example, it would have been nice to be able to pull up the list of attributes for the Product element, supply a value for each attribute, then see these changes appear in the document (display some text or an image to indicate which products the contact was interested in, for example). Neither does CSS provide the possibility to apply conditional style rules based on an element's attribute values (one might imagine a different image to be displayed for "Yes" or for "No" values).
Despite the limitations of CSS described here, we have found that the language comes close to meeting our needs. This, combined with the fact that CSS is already implemented for HTML in the major Web browsers--and our sense that the simplicity of CSS will appeal to Web designers over more complex (albeit powerful) approaches--leads us to believe that CSS will be the dominant mechanism for displaying XML documents on the Web.

For further notes visit

http://www.w3schools.com/xml/default.asp
http://developer.apple.com/internet/webcontent/xmltransformations.html
http://www.quackit.com/xml/tutorial/xml_css.cfm

Wednesday, October 13, 2010

DOM AND SAX PARSERS

The Document Object Model (DOM) is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Aspects of the DOM (such as its "Elements") may be addressed and manipulated within the syntax of the programming language in use. The public interface of a DOM is specified in its Application Programming Interface (API).
The DOM is a programming interface for HTML and XML documents. It defines the way a document can be accessed and manipulated. Using a DOM, a programmer can create a document, navigate its structure, and add, modify, or delete its elements.
As a W3C specification, one important objective for the DOM has been to provide a standard programming interface that can be used in a wide variety of environments and applications.

What is the DOM?

The W3C Document Object Model (DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, style and structure of a document.
The W3C DOM provides a standard set of objects for representing HTML and XML documents, and a standard interface for accessing and manipulating them.
The DOM is separated into different parts (Core, XML and HTML) and also has different levels (DOM Level 1/2/3):
  • Core DOM - defines a standard set of objects for any structured document.
  • XML DOM - defines a standard set of objects for XML documents only.
  • HTML DOM - defines a standard set of objects for HTML documents only.

What is the XML DOM?

  • The XML DOM is the Document Object Model for XML only.
  • The XML DOM is language- and platform-independent.
  • The XML DOM defines a standard set of objects for XML.
  • The XML DOM defines a standard way to access XML documents.
  • The XML DOM defines a standard way to manipulate XML documents.
  • The XML DOM is a W3C standard.
The DOM views an XML document as a tree structure. All elements, the text they contain, and their attributes can be accessed through the DOM tree. Their contents can be deleted or modified, and new elements can be created. The elements, their text, and their attributes are all known as nodes.
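To make the node tree concrete, here is a minimal Java sketch that parses a small document with the JDK's DOM parser and prints one line per element, attribute, and text node. The class name NodeTreeDump, the helper outline, and the sample markup are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTreeDump {
    /** Parses the XML text and returns an indented outline of its node tree. */
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Depth-first walk: report the node itself, then recurse into its children.
    private static void walk(Node node, String indent, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append("element ").append(node.getNodeName()).append('\n');
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                out.append(indent).append("  attribute ")
                   .append(a.getNodeName()).append('=').append(a.getNodeValue()).append('\n');
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(indent).append("text ").append(node.getNodeValue()).append('\n');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, indent + "  ", out);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample markup.
        System.out.print(outline("<SPEECH id=\"s1\"><SPEAKER>BEATRICE</SPEAKER>"
                + "<LINE>I wonder that you will still be talking</LINE></SPEECH>"));
    }
}
```

Each line of the outline corresponds to one node, which shows that attributes and text are nodes in their own right rather than properties hidden inside elements.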
The XML DOM is:
  • A standard object model for XML
  • A standard programming interface for XML
  • Platform- and language-independent
  • A W3C standard
The XML DOM defines the objects and properties of all XML elements, and the methods (interface) to access them.
In other words: The XML DOM is a standard for how to get, change, add, or delete XML elements.
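The four operations named here (get, change, add, delete) can be sketched in Java with the standard org.w3c.dom interfaces. The element names are loosely borrowed from an address record; the class AddressEdit, its method edit, and the values written into the tree are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AddressEdit {
    /** Applies get/change/add/delete to the tree and summarizes the root's child elements. */
    static String edit(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            // Get an element, then change its text content.
            Node city = root.getElementsByTagName("City").item(0);
            if (city != null) {
                city.setTextContent("Paris");
            }

            // Add a new element at the end of the root.
            Element country = doc.createElement("Country");
            country.setTextContent("France");
            root.appendChild(country);

            // Delete an element by asking its parent to remove it.
            Node fax = root.getElementsByTagName("Fax").item(0);
            if (fax != null) {
                fax.getParentNode().removeChild(fax);
            }

            // Summarize what is left as "name=text" pairs.
            StringBuilder out = new StringBuilder();
            for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    if (out.length() > 0) {
                        out.append(", ");
                    }
                    out.append(c.getNodeName()).append('=').append(c.getTextContent());
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(edit("<Address><City>Lyon</City><Fax>none</Fax></Address>"));
    }
}
```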




What You Should Already Know

Before you continue with this tutorial, you should have a basic understanding of the following:
  • HTML / XHTML
  • XML
  • JavaScript
                                                                    

SAX PARSER
What is SAX?

The Simple API for XML, SAX, was invented in late 1997/early 1998 when Peter Murray-Rust and several authors of XML parsers written in Java decided there wasn’t much point to maintaining multiple similar yet incompatible APIs to do exactly the same thing. Murray-Rust was the first to suggest what he called “YAXPAPI”. The reason Murray-Rust wanted Yet Another XML Parser API was that he was thoroughly sick of supporting multiple, incompatible XML parsers for his parser-client application JUMBO. Instead, he wanted a standard API everyone could agree on. Parser authors Tim Bray and David Megginson quickly signed on to the project, and work began in public on the xml-dev mailing list where many people participated. Megginson wrote the initial draft of SAX. After a short beta period, SAX 1.0 was released on May 11, 1998.
SAX was designed around abstract interfaces rather than concrete classes so it could be layered on top of parsers’ existing native APIs. SAX is not the most sophisticated XML API imaginable, but that’s part of its beauty. The ease with which SAX could be implemented by many parser vendors with very different architectures contributed to its success and rapid standardization.
SAX (Simple API for XML) is a sequential access parser API for XML. SAX provides a mechanism for reading data from an XML document. It is a popular alternative to the Document Object Model (DOM).

Introduction

This web page publishes SAX Parser code that reads XML formatted data into Java objects. A class is included that will allocate and initialize the SAX Parser. If a boolean flag is true, the parser will be initialized as a validating parser. The XML schema that the XML documents are validated against is published here as well.

SAX: Ass Backward Parsing

With SAX and XML Schema validation as examples, I am left with the impression that the people who developed these technologies never took a compiler implementation class, or if they did, the class left no impression on them.
Parsing is usually done by two logical components: a parser and a scanner. The scanner reads the text and classifies it as "tokens". A token is a category that is recognized by the parser. For example, a scanner for the Java programming language might return tokens that include: identifier, integer, for (a reserved word), mult (an operator). An important point, relative to SAX, is that the parser calls the scanner. As the parser processes the tokens returned by the scanner it performs operations, like building a syntax tree. An example of a parser that reads assignment statements and arithmetic expressions and builds XML can be found here. This is part of the DOM parsing software mentioned above.
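The conventional arrangement, in which the parser asks the scanner for the next token, can be sketched in a few lines of Java. This is not the expression parser referred to above; it is a hypothetical toy (class PullStyleParser) that evaluates sums and products of integers instead of building a syntax tree.

```java
class PullStyleParser {
    /** Scanner: turns characters into tokens, one per call to next(). */
    static final class Scanner {
        private final String text;
        private int pos = 0;

        Scanner(String text) {
            this.text = text;
        }

        /** Returns "+", "*", an integer literal, or null at end of input. */
        String next() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                return null;
            }
            char c = text.charAt(pos);
            if (c == '+' || c == '*') {
                pos++;
                return String.valueOf(c);
            }
            if (!Character.isDigit(c)) {
                throw new IllegalArgumentException("unexpected character '" + c + "'");
            }
            int start = pos;
            while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
                pos++;
            }
            return text.substring(start, pos);
        }
    }

    private final Scanner scanner;
    private String lookahead;

    PullStyleParser(String text) {
        scanner = new Scanner(text);
        lookahead = scanner.next(); // the parser calls the scanner, never the reverse
    }

    // expr := term ('+' term)*
    int expr() {
        int value = term();
        while ("+".equals(lookahead)) {
            lookahead = scanner.next();
            value += term();
        }
        return value;
    }

    // term := integer ('*' integer)*
    private int term() {
        int value = integer();
        while ("*".equals(lookahead)) {
            lookahead = scanner.next();
            value *= integer();
        }
        return value;
    }

    private int integer() {
        if (lookahead == null || !Character.isDigit(lookahead.charAt(0))) {
            throw new IllegalArgumentException("integer expected");
        }
        int value = Integer.parseInt(lookahead);
        lookahead = scanner.next();
        return value;
    }

    /** Evaluates a whole expression and rejects trailing tokens. */
    static int evaluate(String text) {
        PullStyleParser parser = new PullStyleParser(text);
        int value = parser.expr();
        if (parser.lookahead != null) {
            throw new IllegalArgumentException("unexpected token " + parser.lookahead);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("2 + 3 * 4")); // prints 14
    }
}
```

Because the grammar methods decide when to fetch a token, the parse state lives on the call stack. A callback design has to keep that state in fields instead.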
In the case of SAX, the scanner (the SAXParser object) calls the parser. This makes parsing with SAX needlessly awkward and complicates the architecture of the software. For this reason, the DOMParser is frequently used for parsing complicated XML documents.

SAX is not without its virtues (maybe)

The SAXParser does have two notable advantages over the DOMParser: the SAXParser is faster and it uses less memory. While the SAXParser is difficult to use for processing complex XML documents, perhaps it is appropriate for processing simple XML documents? This web page grew out of an experiment to see if this is true.

A Prototype Application

The prototype code published on this web page is motivated by a real application. This is a software system I call a Trade Engine, which is diagrammed in Figure 1. The Trade Engine is designed to process order and control messages for trading applications. These might be computer driven trading programs for the stock, options or foreign exchange markets. The trading applications submit XML formatted orders and control messages to the Trade Engine. The Trade Engine parses and validates these messages and builds internal Java objects. The market orders are called "aim orders" because they specify a trading goal. Depending on the processing instructions, the Trade Engine may execute the order over a period of time (e.g., the trading day).

Figure 1

Parsing Trade Engine Messages using SAX

SAX uses "call backs". When the SAXParser object recognizes a component in an XML document (e.g., a start Element, an end Element, the characters between tags), it calls a method that may be supplied by the application to process the XML component. In Java this is done by subclassing a handler class, like the DefaultHandler. This can be seen in the method signature in the javax.xml.parsers.SAXParser object for the parse method used in this example:
parse(InputStream is, DefaultHandler dh) 
In this example a MessageProcessor subclass is derived from the DefaultHandler class. The MessageProcessor class overrides the methods associated with the XML components that are of interest. For example, startElement is overridden, but the processingInstruction() method is not. The MessageProcessor class is diagrammed in Figure 2.
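A minimal sketch of the same pattern follows: a DefaultHandler subclass that overrides only the callbacks it needs and is passed to parse(InputStream, DefaultHandler). It is not the MessageProcessor code itself. The class names, the aimOrder message layout, and the field names are assumptions invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class OrderHandlerDemo {
    /** Hypothetical handler: records the text of every leaf element by tag name. */
    static final class FieldCollector extends DefaultHandler {
        final Map<String, String> fields = new LinkedHashMap<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per text run
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                fields.put(qName, value);
            }
            text.setLength(0);
        }
        // processingInstruction() is deliberately not overridden; the inherited no-op is used.
    }

    /** Parses the message and returns the collected leaf fields in document order. */
    static Map<String, String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            FieldCollector handler = new FieldCollector();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse message", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical message layout.
        System.out.println(parse("<aimOrder><symbol>IBM</symbol><shares>100</shares></aimOrder>"));
    }
}
```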





Figure 2
When the SAXParser calls the methods overridden by the MessageProcessor class, the result is an object built from the data. All Trade Engine objects share a common set of fields indicating the product that sent the message, the user and a local ID that is used by the product. The specific messages have unique data that is associated with that message type. This is experimental prototype code, so only two message types, Control and AimOrder, are supported. This is shown in Figure 3.

Figure 3
A Trade Engine message may enclose multiple sub-messages. For example, one Trade Engine message may include multiple aim orders. When the MessageProcessor class recognizes the start of a sub-message it allocates a sub-message processor. The MessageProcessor then calls the sub-message processor methods to process each of the XML components and build the object. The class diagram for the sub-message processors is shown in Figure 4. Again, this is just a prototype, so there are only two message processors.

Figure 4
The MessageBaseMessage base class processes the common data fields in the MessageBase object, which is the base class for the Control and AimOrder objects.

Conclusion

The software published here builds message objects from XML formatted data. In theory using the SAXParser for this is faster than using the DOMParser to build a DOM object and then traversing the DOM tree to build a message object. But the call back architecture of SAX introduces complexity that does not exist for a parser which calls a scanner. The awkwardness of SAX and the overhead of DOMParsing are some of the motivations behind the XML Pull Parser, which is called by a parsing application. An example that applies XML Pull Parsing to the Trade Engine messages described on this web page can be found here.
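For comparison, the sketch below shows pull-style parsing with StAX (javax.xml.stream), the pull API that ships with the JDK. It is offered as an analogue and is not necessarily the XML Pull Parser library mentioned above; the class PullDemo and its method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullDemo {
    /** Returns the names of all elements in document order. */
    static List<String> elementNames(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> names = new ArrayList<>();
            // The application drives the loop and asks for the next event when it is ready.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<message><control/><aimOrder>buy</aimOrder></message>"));
    }
}
```

The loop plays the role of a parser calling a scanner: no handler object is needed, and control flow stays in the application.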
One advantage that both the SAX and DOM parsers have is that they are validating. The structure of the XML document can be verified against an XML schema. However, the computational cost of this validation is unknown (at least to me). SAX validation may reduce the computational advantage of the SAX parser compared to the DOM parser.

At its core, SAX, the Simple API for XML, is based on just two interfaces, the XMLReader interface that represents the parser and the ContentHandler interface that receives data from the parser. These two interfaces alone suffice for 90% of what you need to do with SAX. This chapter shows the basic operation of XMLReader and discusses ContentHandler in detail. The next chapter explores a variety of ways to customize the parsing process through the more advanced features of the XMLReader interface.
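The two interfaces can be seen working together in this small Java sketch. An XMLReader is obtained from the JDK's SAXParserFactory, and a ContentHandler (here an anonymous DefaultHandler subclass, which implements that interface) receives the character data. The class name TextExtractor is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TextExtractor {
    /** Returns all character data of the document with the markup stripped. */
    static String text(String xml) {
        try {
            // XMLReader represents the parser.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            StringBuilder out = new StringBuilder();
            // ContentHandler receives the data; only characters() is of interest here.
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text("<a>Hello <b>world</b>!</a>"));
    }
}
```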



Saturday, September 25, 2010

DTDs and XML Schemas

What DTDs and XML Schemas Do

A Document Type Definition (DTD) is a set of markup declarations that define a document type for SGML-family markup languages (SGML, XML, HTML). DTDs were a precursor to XML Schema and have a similar function, although different capabilities.
DTDs use a terse formal syntax that declares precisely which elements and references may appear where in the document of the particular type, and what the elements’ contents and attributes are. DTDs also declare entities which may be used in the instance document.
XML uses a subset of the SGML DTD syntax.

Document Type Definitions and XML Schemas both provide descriptions of document structures. The emphasis is on making those descriptions readable to automated processors such as parsers, editors, and other XML-based tools. They can also carry information for human consumption, describing what different elements should contain, how they should be used, and what interactions may take place between parts of a document. Although they use very different syntax to achieve this task, they both create documentation.

Perhaps the most important thing DTDs and XML Schemas do is set expectations, using a formal vocabulary and other information to lay ground rules for document structures. Two parsers, given a document and a DTD, should have the same opinions about whether that document is valid, and different schema processors should similarly agree on whether or not a document conforms to the rules in a given schema. XML editing applications can use DTDs and schemas as frameworks, letting users create documents that meet these expectations. Similarly, developers can use DTDs and XML Schemas as a foundation on which to plan transformations from one format to another. By agreeing to a given DTD or schema, a group of developers has accepted a set of rules about document vocabulary and structure. While this doesn't solve all the problems of application development, it does at least mean that independent development of tools that process these documents is a lot easier.
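As an illustration of a parser forming an opinion about validity, the following Java sketch turns on DTD validation in the JDK's DOM parser and reports whether any validity error was raised. The three-element DTD, the class DtdValidation, and the sample names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidation {
    // Hypothetical internal DTD subset: a Name is a First followed by a Last.
    static final String DTD = "<!DOCTYPE Name [\n"
            + "<!ELEMENT Name (First, Last)>\n"
            + "<!ELEMENT First (#PCDATA)>\n"
            + "<!ELEMENT Last (#PCDATA)>\n"
            + "]>\n";

    /** Returns true if the document is well-formed and valid against its own DTD. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // request DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] valid = {true};
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false; // validity errors are recoverable, so record and continue
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return valid[0];
        } catch (SAXException e) {
            return false; // fatal error: the document is not even well-formed
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(DTD + "<Name><First>Jane</First><Last>Doe</Last></Name>"));
        System.out.println(isValid(DTD + "<Name><Last>Doe</Last></Name>"));
    }
}
```

Any other conforming validating parser should give the same two answers for these two documents, which is the expectation-setting role described above.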
Schemas and DTDs provide a number of additional functions that make contributions to document content:
  • Providing defaults for attributes: in addition to providing constraints on attribute content, DTDs and XML Schemas allow developers to specify default values that should be used if no value was set in the content explicitly.
  • Entity declaration: DTDs and XML Schemas provide for the declaration of parsed entities, which can be referenced from within documents to include content.
Schemas and DTDs may also describe "notations" and "unparsed entities", adding information to documents that applications may use to interpret their content.
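The effect of attribute defaults and parsed entities on document content can be observed with a short Java sketch, shown here for a DTD only. The Language declaration mirrors a contact-record style attribute with a default of "English"; the Note root element, the vendor entity, and the class DtdDefaults are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DtdDefaults {
    // Hypothetical document: the attribute is never written out and the text uses an entity.
    static final String SAMPLE = "<!DOCTYPE Note [\n"
            + "<!ELEMENT Note (#PCDATA | Language)*>\n"
            + "<!ELEMENT Language EMPTY>\n"
            + "<!ATTLIST Language Preference (English|French) \"English\">\n"
            + "<!ENTITY vendor \"Grif\">\n"
            + "]>\n"
            + "<Note>Contact of &vendor;<Language/></Note>";

    /** Returns the Preference attribute and the text of the root, as the application sees them. */
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            Element language = (Element) root.getElementsByTagName("Language").item(0);
            // The default comes from the ATTLIST declaration; the entity is already expanded.
            return language.getAttribute("Preference") + " / " + root.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE));
    }
}
```

The application never sees that the attribute was missing from the instance, so the DTD is contributing content here, not only constraining it.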

Where DTDs and Schemas Come From

The main thrust of development work, initially for XML 1.0 and its DTDs, and now for XML Schemas, is taking place at the World Wide Web Consortium (W3C). However, the W3C is not the only source for schema languages. At least five other schema proposals have been developed and many of them are in actual use -- notably, Microsoft's XML-Data, which is used for its BizTalk initiative. Most of these proposals are feeding into the main W3C-sanctioned development process. The main contenders in the schema arena, including DTDs, are listed below:
  • DTDs - Document Type Definitions were originally developed for XML's predecessor, SGML. They use a very compact syntax and provide document-oriented data typing. XML DTDs are a subset of those available in SGML, and the rules for using XML DTDs provide much of the complexity of XML 1.0. Complete XML DTD support is (or should be) built into all validating XML parsers, and some XML DTD support is built into all XML parsers.
  • XML-Data/XML-Data Reduced - Based on a proposal that Microsoft and others submitted to the W3C even before XML 1.0 was completed, this schema proposal is used in Microsoft's BizTalk framework. XML-Data provides a large set of data types more appropriate to database and program interchange. XML-Data support is built into Microsoft's XML parser.
  • Document Content Description (DCD) - Created in a joint effort between IBM and Microsoft, DCD uses some ideas from XML-Data and some syntax from another W3C project, Resource Description Framework (RDF).
  • Schema for Object-Oriented XML (SOX) - SOX was developed by Veo Systems (now acquired by CommerceOne) and provides functionality like inheritance to XML structures. SOX has gone through multiple versions. The latest is SOX version 2.
  • Document Description Markup Language (DDML) - DDML was developed on the XML-dev mailing list, creating a schema language with a subset of DTD functionality. Development of DDML (which was once known as XSchema) has halted since the W3C Activity began.
Although you can start work with any of the above tools today -- DTDs being widely supported -- when the specification is complete, using the W3C XML Schemas is probably the safest long-term solution. Fortunately, converting among different schema formats isn't especially difficult, and tools are available to help you in the process.

How Schemas Differ from DTDs

The first, and probably most significant, difference between XML Schemas and XML DTDs is that XML Schemas use XML document syntax. While transforming the syntax to XML doesn't automatically improve the quality of the description, it does make those descriptions far more extensible than they were in the original DTD syntax. Declarations can have richer and more complex internal structures than declarations in DTDs, and schema designers can take advantage of XML's containment hierarchies to add extra information where appropriate -- even sophisticated information like documentation. There are a few other benefits from this approach. XML Schemas can be stored along with other XML documents in XML-oriented data stores, referenced, and even styled, using tools like XLink, XPointer, and XSL.
The largest addition XML Schemas provide to the functionality of the descriptions is a vastly improved data typing system. XML Schemas provide data-oriented data types in addition to the more document-oriented data types XML 1.0 DTDs support, making XML more suitable for data interchange applications. Built-in datatypes include strings, booleans, and time values, and the XML Schemas draft provides a mechanism for generating additional data types. Using that system, the draft provides support for all of the XML 1.0 data types (NMTOKENS, IDREFS, etc.) as well as data-specific types like decimal, integer, date, and time. Using XML Schemas, developers can build their own libraries of easily interchanged data types and use them inside schemas or across multiple schemas.
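A hedged sketch of data-oriented typing in practice follows, using the javax.xml.validation API in the JDK. Note the assumptions: the schema is written in the syntax of the final W3C XML Schema Recommendation rather than the draft discussed here, and the Event vocabulary and the class TypedValidation are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class TypedValidation {
    // Hypothetical schema: Date must be an xs:date and Attendees an xs:integer.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='Event'><xs:complexType><xs:sequence>"
            + "<xs:element name='Date' type='xs:date'/>"
            + "<xs:element name='Attendees' type='xs:integer'/>"
            + "</xs:sequence></xs:complexType></xs:element>"
            + "</xs:schema>";

    /** Returns true if the document conforms to the schema above. */
    static boolean conforms(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            // With no error handler installed, validate() throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("could not read input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(conforms("<Event><Date>1998-05-11</Date><Attendees>12</Attendees></Event>"));
        System.out.println(conforms("<Event><Date>May 11</Date><Attendees>12</Attendees></Event>"));
    }
}
```

A DTD could only say that Date contains character data; the schema rejects the second document because "May 11" is not a lexically valid xs:date.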
The current draft of XML Schemas also uses a very different style for declaring elements and attributes to DTDs. In addition to declaring elements and attributes individually, developers can create models -- archetypes -- that can be applied to multiple elements and refined if necessary. This provides a lot of the functionality SOX had developed to support object-oriented concepts like inheritance. Archetype development and refinement will probably become the mark of the high-end schema developer, much as the effective use of parameter entities was the mark of the high-end DTD developer. Archetypes should be easier to model and use consistently, however.
XML Schemas also support namespaces, a key feature of the W3C's vision for the future of XML. While it probably wouldn't be impossible to integrate DTDs and namespaces, the W3C has decided to move on, supporting namespaces in its newer developments and not retrofitting XML 1.0. In many cases, provided that namespace prefixes don't change or simply aren't used, DTDs can work just fine with namespaces, and should be able to interoperate with schema processing that relies on namespaces. There will be a few cases, however, where namespaces may force developers to use the newer schemas rather than the older DTDs.

Alternative Approaches

As exciting as XML Schemas are, there have been a few suggestions for very different approaches that also hold promise. Both Rick Jelliffe's Schematron and the Document Structure Description (DSD), from AT&T Labs and the University of Aarhus, look at documents from a more complex perspective than containment, and use tools derived from style languages -- Schematron is based on XSL, while DSD works from CSS -- to examine documents more closely.
Schematron allows developers to ask about the existence and contents of paths through documents rather than specify containment structures, and places great importance on producing human-readable results. Schematron processing, which can use XSL tools, can produce complete reports on the content and structure of documents, rather than a simple yes/no validation with error reporting.
DSD comes from somewhat similar origins, but uses its own vocabulary to create document descriptions rather than building on the XSL processing model. DSD schemas look much more like the W3C's XML Schemas, but support a different set of tests and have a much greater focus on tasks like providing default content for attributes and elements. DSD allows for context-sensitive rules, where the required usage of a given element may change depending on how it is used in a document. Attributes which are optional in one context may be required in another context. Declarations may impose order on some elements but not on others, making it possible to create 'floating' elements. An open-source implementation in C is available, which adds error information to the document as it is processed, giving applications or users a chance to react to the errors.
It isn't clear at this point whether these approaches will be integrated with XML Schemas at some level, or if they'll be useful tools for supplementing or replacing XML Schemas on particular kinds of projects. In any case, both of these projects are worth further investigation.

Planning Around DTDs and Schemas

Transitioning from one technology to another is often difficult, but at least the transition from DTDs to schemas only involves descriptions of documents, requiring only minor changes to the documents themselves. It is uncertain if it's time yet to begin the transition, as the latest public draft of XML Schemas came with a warning on the XML-dev mailing list that there may be significant changes in future drafts. XML Schemas are still far from stable, so probably only the most enthusiastic early adopters should be considering them at this point.
Although XML Schemas may not yet be ready, XML-based projects should be prepared for their eventual arrival. There are several strategies for handling this transition that may be appropriate to different kinds of projects and different developer needs.
  • Develop DTDs with an eye toward future conversion to schemas. Automated tools for converting among schema formats, like Extensibility's XML Authority, are already available and are likely to grow to include the final W3C XML Schemas.
  • Use other schema formats, like XML-Data and SOX. This lets developers take advantage of features like data typing immediately, and conversions from these experimental schema formats to the new XML Schemas shouldn't be prohibitively difficult.
  • Create well-formed documents for now, ignoring DTDs and schemas in their current incarnation. It's not always easy to retrofit a schema onto a set of documents, but it may be appropriate for some cases where the format of existing data sources (like databases) ensures that there won't be wild variations in structure. When schemas arrive, you can add them to your processing.
  • Ignore DTDs and schemas completely, and only work with well-formed documents. If you don't need structure checking, this may be a perfectly appropriate strategy.
  • Plan to stick to DTDs. They're here now, they'll be here later. If your XML has to be processed by SGML tools, this may be the best route. Keeping your DTDs around, even if you supplement them with equivalent XML Schemas, will preserve interoperability.
There is no single answer for handling this transition that applies to all XML projects. If all your XML work involves documents, DTDs may be a perfectly adequate tool for your needs, and schemas might only be a distraction. If you're trying to manage data interchange between databases of different kinds, the data typing functionality that schemas provide may drive you to use XML-Data or SOX today, and XML Schemas when they arrive.

The Future for DTDs and Schemas

Right now there are too many options for describing your data, but in the future they will probably slim down to three: DTDs, for legacy XML 1.0 applications and integration with SGML; XML Schemas; and plain old well-formed documents, for situations where describing document structures is unnecessary or counterproductive. Whatever you do with DTDs and XML Schemas, remember that their usage should be considered a part of document format specification and documentation. Where documentation is important, these tools will be important, both to set expectations and spare applications the task of checking document structures themselves.
The DSD and Schematron approaches will probably receive more attention in future development as well; Schematron is already an easy and useful supplement to both DTD and XML Schema processing. Both of these tools provide functionality that goes beyond anything the W3C has currently released, demonstrating that there are multiple useful approaches to describing document structures. While it seems unlikely that developers will want to create a DTD, an XML Schema, a Schematron schema, and a DSD, all for the same document, they are all important new tools in the XML developer's toolkit.

XML is a very handy format for storing and communicating your data between disparate systems in a platform-independent fashion. XML is more than just a format for computers -- a guiding principle in its creation was that it should be Human Readable and easy to create.
XML allows UNIX systems written in C to communicate with Web Services that, for example, run on the Microsoft .NET architecture and are written in ASP.NET. XML is, however, only the meta-language that the systems understand -- and they both need to agree on the format that the XML data will be in. Typically, one of the partners in the process will offer a service to the other: one is in charge of the format of the data.
The definition serves two purposes: the first is to ensure that the data that makes it past the parsing stage is at least in the right structure. As such, it's a first level at which 'garbage' input can be rejected. Secondly, the definition documents the protocol in a standard, formal way, which makes it easier for developers to understand what's available.
DTD - The Document Type Definition
The first method used to provide this definition was the DTD, or Document Type Definition. This defines the elements that may be included in your document, what attributes these elements have, and the ordering and nesting of the elements.
The DTD is declared in a DOCTYPE declaration beneath the XML declaration contained within an XML document:
Inline Definition:

<?xml version="1.0"?>
<!DOCTYPE documentelement [definition]>

External Definition:
<?xml version="1.0"?>
<!DOCTYPE documentelement SYSTEM "documentelement.dtd">

The actual body of the DTD itself contains definitions in terms of elements and their attributes. For example, the following short DTD defines a bookstore. It states that a bookstore has a name, and stocks books on at least one topic.
Each topic has a name and zero or more books in stock. Each book has a title, an author and an ISBN. The name of the topic and the name of the bookstore are the same element, name, which holds #PCDATA: parsed character data, that is, plain text with no child elements. The title and author of the book are declared the same way, because #PCDATA is the only text type a DTD offers for element content. The ISBN is stored as an attribute of the book, declared with the attribute type CDATA (character data):

<!DOCTYPE bookstore [
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">
 ]>
An example of a book store's inline definition might be:
<?xml version="1.0"?>
<!DOCTYPE bookstore [
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">
 ]>
<bookstore>
 <name>Mike's Store</name>
 <topic>
   <name>XML</name>
   <book isbn="123-456-789">
     <title>Mike's Guide To DTD's and XML Schemas</title>
     <author>Mike Jervis</author>
   </book>
 </topic>
</bookstore>

Using an inline definition is handy when you only have a few documents and they're offline, as the definition is always in the file. However, if, for example, your DTD defines the XML protocol used to talk between two separate systems, re-transmitting the DTD with each document adds an overhead to the communications. Having an external DTD eliminates the need to re-send it each time. We could remove the DTD from the document, and place it in a DTD file on a Web server that's accessible by the two systems:

<?xml version="1.0"?>
<!DOCTYPE bookstore SYSTEM "http://webserver/bookstore.dtd">
<bookstore>
 <name>Mike's Store</name>
 <topic>
   <name>XML</name>
   <book isbn="123-456-789">
     <title>Mike's Guide To DTD's and XML Schemas</title>
     <author>Mike Jervis</author>
   </book>
 </topic>
</bookstore>

The file bookstore.dtd would contain the full definition in a plain text file:
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">

The lowest level of definition in a DTD is character data: #PCDATA (parsed character data) for element content, and CDATA (character data) for attribute values. We can only define an element's content as text, and with this limitation it is not possible, for example, to force an element to be numeric. Attributes can be restricted to an enumerated list of values, but they can't be forced to be numeric either.
So for example, if you stored your application's settings in an XML file, it could be manually edited so that the window's start coordinates were arbitrary strings -- and you'd still need to validate this in your code, rather than have the parser do it for you.
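Because a DTD cannot constrain a value to be numeric, that check falls to the application. A minimal sketch in JavaScript, assuming the settings have already been read from the XML file into plain strings; the names isValidCoordinate, x and y are hypothetical and not part of any XML API:

```javascript
// Hypothetical helper: accepts only strings made up entirely of digits.
function isValidCoordinate(value) {
  return typeof value === "string" && /^[0-9]+$/.test(value);
}

// Values as they might arrive from attributes that the DTD declares as CDATA.
var settings = { x: "120", y: "top" };

var xOk = isValidCoordinate(settings.x); // true
var yOk = isValidCoordinate(settings.y); // false, although the DTD accepts "top"
```

The parser would pass both values through unchanged; only the application-level check rejects the second one.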
XML Schemas
XML Schemas provide a much more powerful means by which to define your XML document structure and limitations. XML Schemas are themselves XML documents. They reference the XML Schema Namespace (detailed here [1]), and even have their own DTD [2].
What XML Schemas do is provide an Object Oriented approach to defining the format of an XML document. XML Schemas provide a set of basic types. These types are much wider ranging than the basic PCDATA and CDATA of DTDs. They include most basic programming types such as integer, byte, string and floating point numbers, but they also expand into Internet data types such as ISO country and language codes (en-GB for example). A full list can be found here [3].
The author of an XML Schema then uses these core types, along with various operators and modifiers, to create complex types of their own. These complex types are then used to define an element in the XML Document.
As a simple example, let's try to create a basic XML Schema for defining the bookstore that we used as an example for DTDs. Firstly, we must declare this as an XSD Document, and, as we want this to be very user friendly, we're going to add some basic documentation to it:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
 <xsd:documentation xml:lang="en">
   XML Schema for a Bookstore as an example.
 </xsd:documentation>
</xsd:annotation>


Now, in the previous example, the bookstore consisted of the sequence of a name and at least one topic. We can easily do that in an XML Schema:

<xsd:element name="bookstore" type="bookstoreType"/>
<xsd:complexType name="bookstoreType">
 <xsd:sequence>
   <xsd:element name="name" type="xsd:string"/>
   <xsd:element name="topic" type="topicType" minOccurs="1" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:complexType>


In this example, we've defined an element, bookstore, that will equate to an XML element in our document. We've defined it of type bookstoreType, which is not a standard type, and so we provide a definition of that type next.
We then define a complexType, which defines bookstoreType as a sequence of name and topic elements. Our name element is of type xsd:string, a type defined by the XML Schema namespace, and so we've fully defined that element.
The topic element, however, is of type topicType, another custom type that we must define. We've also defined our topic element with minOccurs="1" and maxOccurs="unbounded", which means there must be at least one topic element and there is no upper limit to the number that might be included. Both attributes default to 1, so if we had specified neither, exactly one instance would be required, as is the case for the name element. Next, we define the schema for the topicType.

<xsd:complexType name="topicType">
 <xsd:sequence>
   <xsd:element name="name" type="xsd:string"/>
   <xsd:element name="book" type="bookType" minOccurs="0" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:complexType>


This is all similar to the declaration of the bookstoreType, but note that we have to re-define our name element within the scope of this type. If we'd used a named type for name, such as nameType, which defined only an xsd:string -- and defined it outside our types -- we could re-use it in both. However, to illustrate the point, I decided to define it within each section. XML Schema gets interesting when we get to defining our bookType:

<xsd:complexType name="bookType">
 <xsd:sequence>
   <xsd:element name="title" type="xsd:string"/>
   <xsd:element name="author" type="xsd:string"/>
 </xsd:sequence>
 <xsd:attribute name="isbn" type="isbnType"/>
</xsd:complexType>
<xsd:simpleType name="isbnType">
 <xsd:restriction base="xsd:string">
   <xsd:pattern value="[0-9]{3}-[0-9]{3}-[0-9]{3}"/>
 </xsd:restriction>
</xsd:simpleType>

So the definition of the bookType is not particularly interesting. But the definition of its attribute "isbn" is. Not only does XML Schema support the use of types such as xsd:nonNegativeInteger, but we can also create our own simple types from these basic types using various modifiers. In the example for isbnType above, we base it on a string, and restrict it to match a given regular expression. Excusing my poor regex, that should limit any isbn attribute to match the format of three groups of three digits separated by dashes.
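The three-groups-of-three-digits rule can be tried out in JavaScript before committing it to a schema. This is only a sketch of the same idea, not how a schema processor works internally: an XML Schema pattern must match the whole value, so the JavaScript version needs explicit ^ and $ anchors to behave the same way. The variable names are made up for illustration.

```javascript
// Three groups of three digits separated by dashes, anchored at both ends
// because JavaScript's test() otherwise matches anywhere inside the string.
var isbnPattern = /^[0-9]{3}-[0-9]{3}-[0-9]{3}$/;

var good = isbnPattern.test("123-456-789");    // true
var tooShort = isbnPattern.test("123-456");    // false
var letters = isbnPattern.test("abc-456-789"); // false
var padded = isbnPattern.test(" 123-456-789"); // false: leading space
```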
This is just a simple example, but it should give you a taste of the many things you can do to control the content of an attribute or an element. You have far more control over what is considered a valid XML document using a schema. You can even
  • extend your types from other types you've created,
  • require uniqueness within scope, and
  • provide lookups.
It's a nicely object oriented approach. You could build a library of complexTypes and simpleTypes for re-use throughout many projects, and even find other definitions of common types (such as an "address", for example) from the Internet and use these to provide powerful definitions of your XML documents.
DTD vs XML Schema
The DTD provides a basic grammar for defining an XML Document in terms of the metadata that comprise the shape of the document. An XML Schema provides this, plus a detailed way to define what the data can and cannot contain. It provides far more control for the developer over what is legal, and it provides an Object Oriented approach, with all the benefits this entails.
So, if XML Schemas provide an Object Oriented approach to defining an XML document's structure, and if XML Schemas give us the power to define re-useable types such as an ISBN number based on a wide range of pre-defined types, why would we use a DTD? There are in fact several good reasons for using the DTD instead of the schema.
Firstly, and rather an important point, is that XML Schema is a new technology. This means that whilst some XML Parsers support it fully, many still don't. If you use XML to communicate with a legacy system, perhaps it won't support the XML Schema.
Many systems interfaces are already defined as a DTD. They are mature definitions, rich and complex. The effort in re-writing the definition may not be worthwhile.
DTD is also established, and examples of common objects defined in a DTD abound on the Internet -- freely available for re-use. A developer may be able to use these to define a DTD more quickly than they would be able to accomplish a complete re-development of the core elements as a new schema.
Finally, you must also consider the fact that the XML Schema is an XML document. It has an XML Namespace to refer to, and an XML DTD to define it. This is all overhead. When a parser examines the document, it may have to link this all in, interpret the DTD for the Schema, load the namespace, and validate the schema, etc., all before it can parse the actual XML document in question. If you're using XML as a protocol between two systems that are in heavy use, and need a quick response, then this overhead may seriously degrade performance.
Then again, if your system is available for third party developers as a Web service, then the detailed enforcement of the XML Schema may protect your application a lot more effectively from malicious -- or just plain bad -- XML packets. As an example, Muse.net is an interesting technology. They have a publicly-available SOAP API defined with an XML Schema that provides their developers more control over what they receive from the user community.
On the other hand, I was recently involved in designing a system to handle incoming transactions from multiple devices. In order to scale the system, the chosen service that processes requests is a SOAP server. However, the system is completely closed, and a simple DTD on the server is enough to ensure that the packets sent from the clients arrive complete and uncorrupted, without the additional overhead of XML Schema.
    for details  http://www.brics.dk/~amoeller/XML/schemas

Friday, September 24, 2010

Overview of XML

Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards.
XML's design goals emphasize simplicity, generality, and usability over the Internet. It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.
Many application programming interfaces (APIs) have been developed that software developers use to process XML data, and several schema systems exist to aid in the definition of XML-based languages.
As of 2009, hundreds of XML-based languages have been developed, including RSS, Atom, SOAP, and XHTML. XML-based formats have become the default for most office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple's iWork.


Key terminology

 The material in this section is based on the XML Specification. This is not an exhaustive list of all the constructs which appear in XML; it provides an introduction to the key constructs most often encountered in day-to-day use.
(Unicode) Character
By definition, an XML document is a string of characters. Almost every legal Unicode character may appear in an XML document.
Processor and Application
The processor analyzes the markup and passes structured information to an application. The specification places requirements on what an XML processor must do and not do, but the application is outside its scope. The processor (as the specification calls it) is often referred to colloquially as an XML parser.
Markup and Content
The characters which make up an XML document are divided into markup and content. Markup and content may be distinguished by the application of simple syntactic rules. All strings which constitute markup either begin with the character "<" and end with a ">", or begin with the character "&" and end with a ";". Strings of characters which are not markup are content.
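The two syntactic rules above are simple enough to sketch with a regular expression. The following JavaScript is an illustration only, with a hypothetical function name; it ignores comments, CDATA sections and any ">" inside attribute values, all of which a real XML processor has to handle.

```javascript
// Split a string into markup and content:
// markup is "<" ... ">" or "&" ... ";", everything else is content.
function splitMarkup(xml) {
  var pattern = /<[^>]*>|&[^;]*;/g;
  var pieces = [];
  var last = 0;
  var match;
  while ((match = pattern.exec(xml)) !== null) {
    if (match.index > last) {
      pieces.push({ kind: "content", text: xml.slice(last, match.index) });
    }
    pieces.push({ kind: "markup", text: match[0] });
    last = pattern.lastIndex;
  }
  if (last < xml.length) {
    pieces.push({ kind: "content", text: xml.slice(last) });
  }
  return pieces;
}

// Yields five pieces: <note>, "Fish ", &amp;, " chips", </note>
var pieces = splitMarkup("<note>Fish &amp; chips</note>");
```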
Tag
A markup construct that begins with "<" and ends with ">". Tags come in three flavors: start-tags, for example <section>, end-tags, for example </section>, and empty-element tags, for example <line-break/>.
Element
A logical component of a document which either begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag. The characters between the start- and end-tags, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example of an element is <Greeting>Hello, world.</Greeting> (see hello world). Another is <line-break/>.
Attribute
A markup construct consisting of a name/value pair that exists within a start-tag or empty-element tag. In the example (below) the element img has two attributes, src and alt: <img src="madonna.jpg" alt='by Raphael'/>. Another example would be <step number="3">Connect A to B.</step> where the name of the attribute is "number" and the value is "3".
XML Declaration
XML documents may begin by declaring some information about themselves, as in the following example.
<?xml version="1.0" encoding="UTF-8" ?>

Example

Here is a small, complete XML document, which uses all of these constructs and concepts.
<?xml version="1.0" encoding="UTF-8" ?>
<painting>
  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
  <caption>This is Raphael's "Foligno" Madonna, painted in
    <date>1511</date>-<date>1512</date>.
  </caption>
</painting>
There are five elements in this example document: painting, img, caption, and two dates. The date elements are children of caption, which is a child of the root element painting. img has two attributes, src and alt.
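The element count can be checked mechanically. The sketch below, in JavaScript, counts start-tags and empty-element tags in the example document with a regular expression; it is a rough illustration that works for this small document, not a substitute for an XML processor, and the variable names are made up.

```javascript
var doc =
  '<?xml version="1.0" encoding="UTF-8" ?>\n' +
  '<painting>\n' +
  "  <img src=\"madonna.jpg\" alt='Foligno Madonna, by Raphael'/>\n" +
  '  <caption>This is Raphael\'s "Foligno" Madonna, painted in\n' +
  '    <date>1511</date><date>1512</date>.\n' +
  '  </caption>\n' +
  '</painting>';

// Match start-tags and empty-element tags only: "<" followed by a name character.
// End-tags begin with "</" and the XML declaration with "<?", so neither matches.
var openTags = doc.match(/<[A-Za-z_][^>]*>/g);

// Pull the element name out of each matched tag.
var names = openTags.map(function (tag) {
  return tag.match(/^<([A-Za-z_][\w.-]*)/)[1];
});
// names: painting, img, caption, date, date
```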

XML in 10 points

1. XML is for structuring data

 Structured data includes things like spreadsheets, address books, configuration parameters, financial transactions, and technical drawings. XML is a set of rules (you may also think of them as guidelines or conventions) for designing text formats that let you structure your data. XML is not a programming language, and you don't have to be a programmer to use it or learn it. XML makes it easy for a computer to generate data, read data, and ensure that the data structure is unambiguous. XML avoids common pitfalls in language design: it is extensible, platform-independent, and it supports internationalization and localization. XML is fully Unicode-compliant.

2. XML looks a bit like HTML

Like HTML, XML makes use of tags (words bracketed by '<' and '>') and attributes (of the form name="value"). While HTML specifies what each tag and attribute means, and often how the text between them will look in a browser, XML uses the tags only to delimit pieces of data, and leaves the interpretation of the data completely to the application that reads it. In other words, if you see "<p>" in an XML file, do not assume it is a paragraph. Depending on the context, it may be a price, a parameter, a person, a p... (and who says it has to be a word with a "p"?).

3. XML is text, but isn't meant to be read

Programs that produce spreadsheets, address books, and other structured data often store that data on disk, using either a binary or text format. One advantage of a text format is that it allows people, if necessary, to look at the data without the program that produced it; in a pinch, you can read a text format with your favorite text editor. Text formats also allow developers to more easily debug applications. Like HTML, XML files are text files that people shouldn't have to read, but may when the need arises. Compared to HTML, the rules for XML files allow fewer variations. A forgotten tag, or an attribute without quotes makes an XML file unusable, while in HTML such practice is often explicitly allowed. The official XML specification forbids applications from trying to second-guess the creator of a broken XML file; if the file is broken, an application has to stop right there and report an error.
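To make the "stop and report an error" rule concrete, here is a rough JavaScript sketch of one check an XML processor performs: every start-tag must be closed by a matching end-tag, in order. The function name is hypothetical, and the code deliberately skips everything else a real processor verifies (attribute quoting, comments, CDATA sections, declarations).

```javascript
// Returns null when the tags balance, or a message describing the first problem.
function checkTags(xml) {
  var tagPattern = /<(\/?)([A-Za-z_][\w.-]*)[^>]*?(\/?)>/g;
  var open = []; // names of elements that are currently open
  var match;
  while ((match = tagPattern.exec(xml)) !== null) {
    var isEnd = match[1] === "/";
    var name = match[2];
    var isEmpty = match[3] === "/";
    if (isEmpty) {
      continue; // an empty-element tag opens and closes itself
    }
    if (!isEnd) {
      open.push(name);
    } else if (open.pop() !== name) {
      return "unexpected </" + name + ">";
    }
  }
  return open.length === 0 ? null : "missing </" + open.pop() + ">";
}

var fine = checkTags("<a><b>text</b><br/></a>"); // null
var forgotten = checkTags("<a><b>text</a>");     // "unexpected </a>"
```

An HTML browser would typically render the second input anyway; an XML processor has to reject it.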

4. XML is verbose by design

Since XML is a text format and it uses tags to delimit the data, XML files are nearly always larger than comparable binary formats. That was a conscious decision by the designers of XML. The advantages of a text format are evident (see point 3), and the disadvantages can usually be compensated at a different level. Disk space is less expensive than it used to be, and compression programs like zip and gzip can compress files very well and very fast. In addition, communication protocols such as modem protocols and HTTP/1.1, the core protocol of the Web, can compress data on the fly, saving bandwidth as effectively as a binary format.
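The effect of compression on repetitive markup is easy to observe. The sketch below assumes a Node.js runtime, whose built-in zlib module provides gzipSync and gunzipSync; the catalog data is invented purely for illustration, and actual savings depend on the document.

```javascript
var zlib = require("zlib"); // Node.js built-in module

// Build a deliberately repetitive XML document (made-up data).
var rows = [];
for (var i = 0; i < 1000; i++) {
  rows.push("<entry><name>Item " + i + "</name><price>9.99</price></entry>");
}
var xml = "<catalog>" + rows.join("") + "</catalog>";

var original = Buffer.from(xml, "utf8");
var compressed = zlib.gzipSync(original);
var restored = zlib.gunzipSync(compressed).toString("utf8");

console.log(original.length + " bytes before, " + compressed.length + " bytes after gzip");
```

The repeated tag names are exactly the kind of redundancy that gzip removes, which is why the verbosity of XML costs less on the wire than the raw file size suggests.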

5. XML is a family of technologies

XML 1.0 is the specification that defines what "tags" and "attributes" are. Beyond XML 1.0, "the XML family" is a growing set of modules that offer useful services to accomplish important and frequently demanded tasks. XLink describes a standard way to add hyperlinks to an XML file. XPointer is a syntax in development for pointing to parts of an XML document. An XPointer is a bit like a URL, but instead of pointing to documents on the Web, it points to pieces of data inside an XML file. CSS, the style sheet language, is applicable to XML as it is to HTML. XSL is the advanced language for expressing style sheets. It is based on XSLT, a transformation language used for rearranging, adding and deleting tags and attributes. The DOM is a standard set of function calls for manipulating XML (and HTML) files from a programming language. XML Schemas 1 and 2 help developers to precisely define the structures of their own XML-based formats. There are several more modules and tools available or under development. Keep an eye on W3C's technical reports page.

6. XML is new, but not that new

Development of XML started in 1996 and it has been a W3C Recommendation since February 1998, which may make you suspect that this is rather immature technology. In fact, the technology isn't very new. Before XML there was SGML, developed in the early '80s, an ISO standard since 1986, and widely used for large documentation projects. The development of HTML started in 1990. The designers of XML simply took the best parts of SGML, guided by the experience with HTML, and produced something that is no less powerful than SGML, and vastly more regular and simple to use. Some evolutions, however, are hard to distinguish from revolutions... And it must be said that while SGML is mostly used for technical documentation and much less for other kinds of data, with XML it is exactly the opposite.

7. XML leads HTML to XHTML

There is an important XML application that is a document format: W3C's XHTML, the successor to HTML. XHTML has many of the same elements as HTML. The syntax has been changed slightly to conform to the rules of XML. A format that is "XML-based" inherits the syntax from XML and restricts it in certain ways (e.g., XHTML allows "<p>", but not "<r>"); it also adds meaning to that syntax (XHTML says that "<p>" stands for "paragraph", and not for "price", "person", or anything else).

8. XML is modular

XML allows you to define a new document format by combining and reusing other formats. Since two formats developed independently may have elements or attributes with the same name, care must be taken when combining those formats (does "<p>" mean "paragraph" from this format or "person" from that one?). To eliminate name confusion when combining formats, XML provides a namespace mechanism. XSL and RDF are good examples of XML-based formats that use namespaces. XML Schema is designed to mirror this support for modularity at the level of defining XML document structures, by making it easy to combine two schemas to produce a third which covers a merged document structure.

9. XML is the basis for RDF and the Semantic Web

W3C's Resource Description Framework (RDF) is an XML text format that supports resource description and metadata applications, such as music playlists, photo collections, and bibliographies. For example, RDF might let you identify people in a Web photo album using information from a personal contact list; then your mail client could automatically start a message to those people stating that their photos are on the Web. Just as HTML integrated documents, images, menu systems, and forms applications to launch the original Web, RDF provides tools to integrate even more, to make the Web a little bit more into a Semantic Web. Just like people need to have agreement on the meanings of the words they employ in their communication, computers need mechanisms for agreeing on the meanings of terms in order to communicate effectively. Formal descriptions of terms in a certain area (shopping or manufacturing, for example) are called ontologies and are a necessary part of the Semantic Web. RDF, ontologies, and the representation of meaning so that computers can help people do work are all topics of the Semantic Web Activity.

10. XML is license-free, platform-independent and well-supported

By choosing XML as the basis for a project, you gain access to a large and growing community of tools (one of which may already do what you need!) and engineers experienced in the technology. Opting for XML is a bit like choosing SQL for databases: you still have to build your own database and your own programs and procedures that manipulate it, but there are many tools available and many people who can help you. And since XML is license-free, you can build your own software around it without paying anybody anything. The large and growing support means that you are also not tied to a single vendor. XML isn't always the best solution, but it is always worth considering.
  for details  http://www.ibiblio.org/bosak/pres/9707ja/sld02000.htm

Regular Expressions

A regular expression is an object that describes a pattern of characters.
Regular expressions are used to perform pattern-matching and "search-and-replace" functions on text.

Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec and test methods of RegExp, and with the match, replace, search, and split methods of String. This section describes JavaScript regular expressions. Regular expressions are not available in JavaScript 1.1 and earlier.
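As a quick illustration of the six methods just named, the following sketch runs each of them against one sample sentence. The sentence and variable names are chosen only for the example.

```javascript
var text = "XML 1.0 became a W3C Recommendation in 1998";
var year = /[0-9]{4}/;

// RegExp methods
var found = year.test(text);        // true: the text contains four digits in a row
var first = year.exec(text)[0];     // "1998"

// String methods
var position = text.search(year);               // index where "1998" starts
var numbers = text.match(/[0-9]+/g);            // ["1", "0", "3", "1998"]
var replaced = text.replace(year, "the 1990s"); // "... Recommendation in the 1990s"
var words = text.split(/\s+/);                  // eight words
```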

For details, see:
http://www.regular-expressions.info/javascript.html
http://www.evolt.org/regexp_in_javascript
http://www.learn-javascript-tutorial.com/RegularExpressions.cfm


Syntax

var txt=new RegExp(pattern,modifiers);

or more simply:

var txt=/pattern/modifiers;
  • pattern specifies the pattern of an expression
  • modifiers specify if a search should be global, case-insensitive, etc.
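
The two forms above can be compared side by side. This is only a sketch: makeWordPattern is a hypothetical helper written for this example, and it assumes the word passed in contains no characters that are special in a pattern.

```javascript
// Literal form: the pattern is written between slashes.
const literal = /\d+/g;

// Constructor form: the pattern is a string, so each backslash must be doubled.
const constructed = new RegExp("\\d+", "g");

// Both objects describe the same pattern with the same modifier.
const samePattern = literal.source === constructed.source; // true
const sameGlobal = literal.global === constructed.global; // true

// The constructor is handy when the pattern is only known at run time.
// Hypothetical helper; assumes `word` has no special pattern characters.
function makeWordPattern(word) {
  return new RegExp("\\b" + word + "\\b", "i");
}

console.log(samePattern, sameGlobal);
console.log(makeWordPattern("css").test("XML and CSS")); // true
console.log(makeWordPattern("xml").test("xmlns")); // false
```

The literal form is usually easier to read for a fixed pattern, while the constructor suits patterns assembled from variables.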

Modifiers

Modifiers are used to perform case-insensitive, global, and multiline searches:
Modifier Description
i Perform case-insensitive matching
g Perform a global match (find all matches rather than stopping after the first match)
m Perform multiline matching (^ and $ also match at the start and end of each line)
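
To see what each modifier changes, the same text can be searched several ways. The sample text and variable names below are assumptions made for this sketch.

```javascript
// Hypothetical three-line sample text.
const text = "Cat\ncat\nCAT";

// No modifiers: case-sensitive, and only the first match is reported.
const first = text.match(/cat/); // first[0] is "cat", first.index is 4

// g alone: all matches, but still case-sensitive.
const globalOnly = text.match(/cat/g); // ["cat"]

// g and i together: all matches, ignoring case.
const globalIgnoreCase = text.match(/cat/gi); // ["Cat", "cat", "CAT"]

// Without m, ^ and $ only match at the start and end of the whole string.
const anchoredNoMultiline = text.match(/^cat$/gi); // null

// With m, ^ and $ also match at the start and end of each line.
const anchoredMultiline = text.match(/^cat$/gim); // ["Cat", "cat", "CAT"]

console.log(first.index, globalOnly, globalIgnoreCase, anchoredNoMultiline, anchoredMultiline);
```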

Brackets

Brackets are used to find a range of characters:
Expression Description
[abc] Find any character between the brackets
[^abc] Find any character not between the brackets
[0-9] Find any digit from 0 to 9
[A-Z] Find any character from uppercase A to uppercase Z
[a-z] Find any character from lowercase a to lowercase z
[A-z] Find any character from uppercase A to lowercase z (this range also includes the characters [ \ ] ^ _ and `, so use [A-Za-z] to match letters only)
[adgk] Find any character in the given set
[^adgk] Find any character outside the given set
(red|blue|green) Find any of the alternatives specified
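
A few of the bracket expressions above in action. The sample strings are assumptions chosen for this sketch.

```javascript
// [0-9] matches any single digit; [^0-9] matches anything that is not a digit.
const digits = "R2-D2".match(/[0-9]/g); // ["2", "2"]
const nonDigits = "R2-D2".match(/[^0-9]/g); // ["R", "-", "D"]

// [A-z] spans the character codes between Z and a as well,
// so it also accepts the underscore; [A-Za-z] does not.
const wideRangeHitsUnderscore = /[A-z]/.test("_"); // true
const lettersOnlyHitsUnderscore = /[A-Za-z]/.test("_"); // false

// Alternatives: match any one of the listed words.
const colors = "red, blue, pink".match(/(red|blue|green)/g); // ["red", "blue"]

console.log(digits, nonDigits, wideRangeHitsUnderscore, lettersOnlyHitsUnderscore, colors);
```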

Metacharacters

Metacharacters are characters with a special meaning:
Metacharacter Description
. Find a single character, except newline or line terminator
\w Find a word character
\W Find a non-word character
\d Find a digit
\D Find a non-digit character
\s Find a whitespace character
\S Find a non-whitespace character
\b Find a match at the beginning/end of a word
\B Find a match not at the beginning/end of a word
\0 Find a NUL character
\n Find a new line character
\f Find a form feed character
\r Find a carriage return character
\t Find a tab character
\v Find a vertical tab character
\xxx Find the character specified by an octal number xxx
\xdd Find the character specified by a hexadecimal number dd
\uxxxx Find the Unicode character specified by a hexadecimal number xxxx
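
The following sketch exercises some of the metacharacters listed above. The sample line is an assumption made up for this example.

```javascript
// Hypothetical sample line containing spaces and a tab.
const line = "Posted 16 Oct 2010\tby admin";

// \d matches a digit; with + it picks up whole numbers.
const numbers = line.match(/\d+/g); // ["16", "2010"]

// \s matches any whitespace character, including the tab.
const words = line.split(/\s+/); // ["Posted", "16", "Oct", "2010", "by", "admin"]

// \b marks a word boundary, so \b\w is the first character of each word.
const initials = line.match(/\b\w/g); // ["P", "1", "O", "2", "b", "a"]

// \b keeps "cat" from matching inside "concat".
const wholeWord = "cat concat".match(/\bcat\b/g); // ["cat"]

// \xdd and \uxxxx name a character by its hexadecimal code (41 is "A").
const hexMatchesA = /\x41/.test("A"); // true
const unicodeMatchesA = /\u0041/.test("A"); // true

console.log(numbers, words, initials, wholeWord, hexMatchesA, unicodeMatchesA);
```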

Quantifiers

Quantifier Description
n+ Matches any string that contains at least one n
n* Matches any string that contains zero or more occurrences of n
n? Matches any string that contains zero or one occurrences of n
n{X} Matches any string that contains a sequence of X n's
n{X,Y} Matches any string that contains a sequence of X to Y n's
n{X,} Matches any string that contains a sequence of at least X n's
n$ Matches any string with n at the end of it
^n Matches any string with n at the beginning of it
(?=n) Matches any string that is followed by a specific string n (n itself is not included in the match)
(?!n) Matches any string that is not followed by a specific string n
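
The quantifiers, anchors, and lookaheads above can be tried on short strings. This is a sketch with made-up sample values. Note that in a real pattern the lookaheads are written inside parentheses.

```javascript
// ? makes the preceding character optional; ^ and $ anchor the whole string.
const spelling = /^colou?r$/;
const acceptsColor = spelling.test("color"); // true
const acceptsColour = spelling.test("colour"); // true
const acceptsColouur = spelling.test("colouur"); // false

// {4} means exactly four; {2,} means at least two.
const isYear = /^\d{4}$/.test("2010"); // true
const hasTwoDigits = /^\d{2,}$/.test("7"); // false

// {2,3} means two to three occurrences.
const runs = "a aa aaa aaaa".match(/\ba{2,3}\b/g); // ["aa", "aaa"]

// Lookahead: digits followed by "px"; the "px" is not part of the match.
const sizes = "100px 2em 30px";
const pixelValues = sizes.match(/\d+(?=px)/g); // ["100", "30"]

// Negative lookahead: digits not followed by another digit or by "px".
const otherValues = sizes.match(/\d+(?!\d|px)/g); // ["2"]

console.log(acceptsColor, acceptsColour, acceptsColouur, isYear, hasTwoDigits);
console.log(runs, pixelValues, otherValues);
```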

RegExp Object Properties

Property Description
global Specifies if the "g" modifier is set
ignoreCase Specifies if the "i" modifier is set
lastIndex The index at which to start the next match
multiline Specifies if the "m" modifier is set
source The text of the RegExp pattern
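
The properties can be read directly from a RegExp object. The sketch below, using a made-up pattern and string, also shows how lastIndex moves when a pattern with the g modifier is tested repeatedly.

```javascript
// Hypothetical pattern with the g and i modifiers set.
const pattern = /css/gi;

const flags = {
  global: pattern.global, // true
  ignoreCase: pattern.ignoreCase, // true
  multiline: pattern.multiline, // false
  source: pattern.source, // "css"
};

// lastIndex starts at 0 and advances past each successful match.
const positions = [pattern.lastIndex]; // [0]
pattern.test("CSS and css");
positions.push(pattern.lastIndex); // 3, just after "CSS"
pattern.test("CSS and css");
positions.push(pattern.lastIndex); // 11, just after "css"
const thirdTry = pattern.test("CSS and css"); // false, nothing left to match
positions.push(pattern.lastIndex); // reset to 0 after a failed match

console.log(flags, positions, thirdTry);
```

Because lastIndex is remembered between calls, reusing one global pattern on different strings can give surprising results. Setting lastIndex back to 0, or dropping the g modifier, avoids this.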

RegExp Object Methods

Method Description
compile() Compiles a regular expression (deprecated; create a new RegExp object instead)
exec() Tests for a match in a string. Returns an array describing the first match, or null if there is no match
test() Tests for a match in a string. Returns true or false
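
A minimal sketch of exec() and test() follows. The date pattern, the sample strings, and the findAll helper are all assumptions written for this example, not part of any library.

```javascript
// Hypothetical pattern with three groups: year, month, day.
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;

// test() only answers yes or no.
const hasDate = datePattern.test("Posted on 2010-10-16"); // true

// exec() returns an array: the whole match first, then each group.
const result = datePattern.exec("Posted on 2010-10-16");
// result[0] is "2010-10-16", result[1] is "2010", result.index is 10

// When nothing matches, exec() returns null.
const noResult = datePattern.exec("no date here"); // null

// Hypothetical helper: collect every match by calling exec() in a loop.
// The pattern must have the g modifier, otherwise the loop would never end.
function findAll(pattern, text) {
  if (!pattern.global) {
    throw new Error("findAll needs a pattern with the g modifier");
  }
  pattern.lastIndex = 0; // start from the beginning of the text
  const found = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    found.push({ text: match[0], index: match.index });
    // Step past an empty match so the loop always moves forward.
    if (match[0] === "") pattern.lastIndex++;
  }
  return found;
}

console.log(hasDate, result[0], result[1], result.index, noResult);
console.log(findAll(/\d+/g, "a1b22c333"));
```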