
Saturday, September 25, 2010

DTDs and XML Schemas

What DTDs and XML Schemas Do

Document Type Definition (DTD) is a set of markup declarations that define a document type for SGML-family markup languages (SGML, XML, HTML). DTDs were a precursor to XML schema and have a similar function, although different capabilities.
DTDs use a terse formal syntax that declares precisely which elements and references may appear where in the document of the particular type, and what the elements’ contents and attributes are. DTDs also declare entities which may be used in the instance document.
XML uses a subset of SGML DTD.

Document Type Definitions and XML Schemas both provide descriptions of document structures. The emphasis is on making those descriptions readable to automated processors such as parsers, editors, and other XML-based tools. They can also carry information for human consumption, describing what different elements should contain, how they should be used, and what interactions may take place between parts of a document. Although they use very different syntax to achieve this task, they both create documentation.

Perhaps the most important thing DTDs and XML Schemas do is set expectations, using a formal vocabulary and other information to lay ground rules for document structures. Two parsers, given a document and a DTD, should have the same opinions about whether that document is valid, and different schema processors should similarly agree on whether or not a document conforms to the rules in a given schema. XML editing applications can use DTDs and schemas as frameworks, letting users create documents that meet these expectations. Similarly, developers can use DTDs and XML Schemas as a foundation on which to plan transformations from one format to another. By agreeing to a given DTD or schema, a group of developers has accepted a set of rules about document vocabulary and structure. While this doesn't solve all the problems of application development, it does at least mean that independent development of tools that process these documents is a lot easier.
Schemas and DTDs provide a number of additional functions that make contributions to document content:
  • Providing defaults for attributes: in addition to providing constraints on attribute content, DTDs and XML Schemas allow developers to specify default values that should be used if no value was set in the content explicitly.
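  • Entity declaration: DTDs provide for the declaration of parsed entities, which can be referenced from within documents to include content. XML Schema as published by the W3C has no entity declarations of its own, so documents that rely on entities still need a DTD for them.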
Schemas and DTDs may also describe "notations" and "unparsed entities", adding information to documents that applications may use to interpret their content.
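As a rough illustration of attribute defaults and parsed entities, the sketch below feeds a document with an internal DTD subset to Python's standard-library ElementTree parser. The note element, its priority attribute, and the sig entity are invented for this example. The parser does not validate, but it does apply the declarations it reads from the internal subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element, attribute, and entity names are
# assumptions made up for this sketch.
DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note priority CDATA "normal">
  <!ENTITY sig "The XML Team">
]>
<note>Regards, &sig;</note>"""

root = ET.fromstring(DOC)

# The priority attribute was never written in the document itself;
# it comes from the default in the ATTLIST declaration.
print(root.attrib)

# The &sig; reference is replaced by the entity's declared text.
print(root.text)
```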

Where DTDs and Schemas Come From

The main thrust of development work, initially for XML 1.0 and its DTDs, and now for XML Schemas, is taking place at the World Wide Web Consortium (W3C). However, the W3C is not the only source for schema languages. At least five other schema proposals have been developed and many of them are in actual use -- notably, Microsoft's XML-Data, which is used for its BizTalk initiative. Most of these proposals are feeding into the main W3C-sanctioned development process. The main contenders in the schema arena, including DTDs, are listed below:
  • DTDs - Document Type Definitions were originally developed for XML's predecessor, SGML. They use a very compact syntax and provide document-oriented data typing. XML DTDs are a subset of those available in SGML, and the rules for using XML DTDs provide much of the complexity of XML 1.0. Complete XML DTD support is (or should be) built into all validating XML parsers, and some XML DTD support is built into all XML parsers.
  • XML-Data/XML-Data Reduced - Based on a proposal that Microsoft and others submitted to the W3C even before XML 1.0 was completed, this schema proposal is used in Microsoft's BizTalk framework. XML-Data provides a large set of data types more appropriate to database and program interchange. XML-Data support is built into Microsoft's XML parser.
  • Document Content Description (DCD) - Created in a joint effort between IBM and Microsoft, DCD uses some ideas from XML-Data and some syntax from another W3C project, Resource Description Framework (RDF).
  • Schema for Object-Oriented XML (SOX) - SOX was developed by Veo Systems (now acquired by CommerceOne) and provides functionality like inheritance to XML structures. SOX has gone through multiple versions. The latest is SOX version 2.
  • Document Description Markup Language (DDML) - DDML was developed on the XML-dev mailing list, creating a schema language with a subset of DTD functionality. Development of DDML (which was once known as XSchema) has halted since the W3C Activity began.
Although you can start work with any of the above tools today -- DTDs being widely supported -- when the specification is complete, using the W3C XML Schemas is probably the safest long-term solution. Fortunately, converting among different schema formats isn't especially difficult, and tools are available to help you in the process.

How Schemas Differ from DTDs

The first, and probably most significant, difference between XML Schemas and XML DTDs is that XML Schemas use XML document syntax. While transforming the syntax to XML doesn't automatically improve the quality of the description, it does make those descriptions far more extensible than they were in the original DTD syntax. Declarations can have richer and more complex internal structures than declarations in DTDs, and schema designers can take advantage of XML's containment hierarchies to add extra information where appropriate -- even sophisticated information like documentation. There are a few other benefits from this approach. XML Schemas can be stored along with other XML documents in XML-oriented data stores, referenced, and even styled, using tools like XLink, XPointer, and XSL.
The largest addition XML Schemas provide to the functionality of the descriptions is a vastly improved data typing system. XML Schemas provide data-oriented data types in addition to the more document-oriented data types XML 1.0 DTDs support, making XML more suitable for data interchange applications. Built-in datatypes include strings, booleans, and time values, and the XML Schemas draft provides a mechanism for generating additional data types. Using that system, the draft provides support for all of the XML 1.0 data types (NMTOKENS, IDREFS, etc.) as well as data-specific types like decimal, integer, date, and time. Using XML Schemas, developers can build their own libraries of easily interchanged data types and use them inside schemas or across multiple schemas.
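To make the idea of data-oriented types concrete, here is a minimal Python sketch that converts element or attribute text into typed values. The mapping from schema type names to Python converters is an assumption chosen for illustration, and it only approximates the lexical rules of the real types (for example, it ignores time zones on dates).

```python
from datetime import date
from decimal import Decimal

# Map a few XML Schema built-in type names to Python converters.
# The selection and the Python equivalents are illustrative assumptions.
CONVERTERS = {
    "xsd:string": str,
    "xsd:integer": int,
    "xsd:decimal": Decimal,
    "xsd:date": date.fromisoformat,
    "xsd:boolean": lambda text: {"true": True, "1": True,
                                 "false": False, "0": False}[text],
}

def convert(type_name, text):
    """Turn element or attribute text into a typed Python value.

    Raises an exception (ValueError, KeyError, or decimal.InvalidOperation)
    when the text does not fit the type.
    """
    return CONVERTERS[type_name](text.strip())

print(convert("xsd:integer", "42"))
print(convert("xsd:date", "2010-09-25"))
```

With a DTD, every one of these values would reach the application as plain text, and a conversion step like this would have to live in application code.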
The current draft of XML Schemas also uses a very different style for declaring elements and attributes to DTDs. In addition to declaring elements and attributes individually, developers can create models -- archetypes -- that can be applied to multiple elements and refined if necessary. This provides a lot of the functionality SOX had developed to support object-oriented concepts like inheritance. Archetype development and refinement will probably become the mark of the high-end schema developer, much as the effective use of parameter entities was the mark of the high-end DTD developer. Archetypes should be easier to model and use consistently, however.
XML Schemas also support namespaces, a key feature of the W3C's vision for the future of XML. While it probably wouldn't be impossible to integrate DTDs and namespaces, the W3C has decided to move on, supporting namespaces in its newer developments and not retrofitting XML 1.0. In many cases, provided that namespace prefixes don't change or simply aren't used, DTDs can work just fine with namespaces, and should be able to interoperate with namespaces and schema processing that relies on namespaces. There will be a few cases, however, where namespaces may force developers to use the newer schemas rather than the older DTDs.

Alternative Approaches

As exciting as XML Schemas are, there have been a few suggestions for very different approaches that also hold promise. Both Rick Jelliffe's Schematron and the Document Structure Description (DSD), from AT&T Labs and the University of Aarhus, look at documents from a more complex perspective than containment, and use tools derived from style languages -- Schematron is based on XSL, while DSD works from CSS -- to examine documents more closely.
Schematron allows developers to ask about the existence and contents of paths through documents rather than specify containment structures, and places great importance on producing human-readable results. Schematron processing, which can use XSL tools, can produce complete reports on the content and structure of documents, rather than a simple yes/no validation with error reporting.
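The path-oriented style can be imitated in a few lines of Python. This is only a sketch of the idea, not Schematron itself: real Schematron rules are written in XML and evaluated with XPath, while the rules, messages, and sample document below are hypothetical and use the limited path syntax of the standard-library ElementTree module.

```python
import xml.etree.ElementTree as ET

# Each rule pairs a path with a human-readable message.
# Both the rules and the sample document are invented for illustration.
RULES = [
    ("./name", "A bookstore should have a name."),
    ("./topic/book/title", "At least one book should have a title."),
    ("./topic/book[@isbn]", "At least one book should carry an isbn attribute."),
]

def report(root):
    """Return one report line per rule instead of a single yes/no answer."""
    lines = []
    for path, message in RULES:
        status = "ok" if root.findall(path) else "FAILED"
        lines.append(f"{status}: {path} - {message}")
    return lines

doc = ET.fromstring(
    "<bookstore><name>Mike's Store</name>"
    "<topic><name>XML</name><book><title>A Guide</title></book></topic>"
    "</bookstore>"
)
for line in report(doc):
    print(line)
```

Every rule is evaluated and reported, so the reader sees which expectations were met as well as which were not.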
DSD comes from somewhat similar origins, but uses its own vocabulary to create document descriptions rather than building on the XSL processing model. DSD schemas look much more like the W3C's XML Schemas, but support a different set of tests and have a much greater focus on tasks like providing default content for attributes and elements. DSD allows for context-sensitive rules, where the required usage of a given element may change depending on how it is used in a document. Attributes which are optional in one context may be required in another context. Declarations may impose order on some elements but not on others, making it possible to create 'floating' elements. An open-source implementation in C is available, which adds error information to the document as it is processed, giving applications or users a chance to react to the errors.
It isn't clear at this point whether these approaches will be integrated with XML Schemas at some level, or if they'll be useful tools for supplementing or replacing XML Schemas on particular kinds of projects. In any case, both of these projects are worth further investigation.

Planning Around DTDs and Schemas

Transitioning from one technology to another is often difficult, but at least the transition from DTDs to schemas only involves descriptions of documents, requiring only minor changes to the documents themselves. It is uncertain if it's time yet to begin the transition, as the latest public draft of XML Schemas came with a warning on the XML-dev mailing list that there may be significant changes in future drafts. XML Schemas are still far from stable, so probably only the most enthusiastic early adopters should be considering them at this point.
Although XML Schemas may not yet be ready, XML-based projects should be prepared for their eventual arrival. There are several strategies for handling this transition that may be appropriate to different kinds of projects and different developer needs.
  • Develop DTDs with an eye toward future conversion to schemas. Automated tools for converting among schema formats, like Extensibility's XML Authority, are already available and are likely to grow to include the final W3C XML Schemas.
  • Use other schema formats, like XML-Data and SOX. This lets developers take advantage of features like data typing immediately, and conversions from these experimental schema formats to the new XML Schemas shouldn't be prohibitively difficult.
  • Create well-formed documents for now, ignoring DTDs and schemas in their current incarnation. It's not always easy to retrofit a schema onto a set of documents, but it may be appropriate for some cases where the format of existing data sources (like databases) ensures that there won't be wild variations in structure. When schemas arrive, you can add them to your processing.
  • Ignore DTDs and schemas completely, and only work with well-formed documents. If you don't need structure checking, this may be a perfectly appropriate strategy.
  • Plan to stick to DTDs. They're here now, they'll be here later. If your XML has to be processed by SGML tools, this may be the best route. Keeping your DTDs around, even if you supplement them with equivalent XML Schemas, will preserve interoperability.
There is no single answer for handling this transition that applies to all XML projects. If all your XML work involves documents, DTDs may be a perfectly adequate tool for your needs, and schemas might only be a distraction. If you're trying to manage data interchange between databases of different kinds, the data typing functionality that schemas provide may drive you to use XML-Data or SOX today, and XML Schemas when they arrive.

The Future for DTDs and Schemas

Right now there are too many options for describing your data, but in the future, they will probably slim down to three: DTDs, for legacy XML 1.0 applications and integration with SGML; XML Schemas; and plain old well-formed documents, for situations where describing document structures is unnecessary or counterproductive. Whatever you do with DTDs and XML Schemas, remember that their usage should be considered a part of document format specification and documentation. Where documentation is important, these tools will be important, both to set expectations and to spare applications the task of checking document structures themselves.
The DSD and Schematron approaches will probably receive more attention in future development as well; Schematron is already an easy and useful supplement to both DTD and XML Schema processing. Both of these tools provide functionality that goes beyond anything the W3C has currently released, demonstrating that there are multiple useful approaches to describing document structures. While it seems unlikely that developers will want to create a DTD, an XML Schema, a Schematron schema, and a DSD, all for the same document, they are all important new tools in the XML developer's toolkit.

XML is a very handy format for storing and communicating your data between disparate systems in a platform-independent fashion. XML is more than just a format for computers -- a guiding principle in its creation was that it should be Human Readable and easy to create.
XML allows UNIX systems written in C to communicate with Web Services that, for example, run on the Microsoft .NET architecture and are built with ASP.NET. XML is, however, only the meta-language that the systems understand -- they both still need to agree on the format that the XML data will be in. Typically, one of the partners in the process will offer a service to the other: that partner is in charge of the format of the data.
The definition serves two purposes: the first is to ensure that the data that makes it past the parsing stage is at least in the right structure. As such, it's a first level at which 'garbage' input can be rejected. Secondly, the definition documents the protocol in a standard, formal way, which makes it easier for developers to understand what's available.
DTD - The Document Type Definition
The first method used to provide this definition was the DTD, or Document Type Definition. This defines the elements that may be included in your document, what attributes these elements have, and the ordering and nesting of the elements.
The DTD is declared in a DOCTYPE declaration beneath the XML declaration contained within an XML document:
Inline Definition:

<?xml version="1.0"?>
<!DOCTYPE documentelement [definition]>

External Definition:
<?xml version="1.0"?>
<!DOCTYPE documentelement SYSTEM "documentelement.dtd">
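For readers who want to see what a parser makes of the two forms, this sketch uses the expat bindings in Python's standard library to report the DOCTYPE of a document. The doctype_info helper and the two sample strings are assumptions written for this example. By default expat only records the external identifier and does not fetch the file.

```python
import xml.parsers.expat

def doctype_info(xml_text):
    """Return (root name, system id, has internal subset) for a DOCTYPE.

    Returns None when the document has no DOCTYPE declaration.
    """
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["info"] = (name, system_id, bool(has_internal_subset))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found.get("info")

# Hypothetical sample documents, one per declaration style.
inline = "<!DOCTYPE bookstore [<!ELEMENT bookstore (#PCDATA)>]><bookstore/>"
external = '<!DOCTYPE bookstore SYSTEM "bookstore.dtd"><bookstore/>'

print(doctype_info(inline))
print(doctype_info(external))
```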

The actual body of the DTD itself contains definitions in terms of elements and their attributes. For example, the following short DTD defines a bookstore. It states that a bookstore has a name, and stocks books on at least one topic.
Each topic has a name and 0 or more books in stock. Each book has a title, author and ISBN number. The name of the topic and the name of the bookstore are defined as the same type of element: this stores PCDATA, parsed character data, which is just text. The title and author of the book are stored as PCDATA as well, because #PCDATA is the only text content model a DTD offers for elements; CDATA is an attribute type, not an element content model. The ISBN number is stored as a CDATA attribute of the book:

<!DOCTYPE bookstore [
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">
 ]>
An example of a book store's inline definition might be:
<?xml version="1.0"?>
<!DOCTYPE bookstore [
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">
 ]>
<bookstore>
 <name>Mike's Store</name>
 <topic>
   <name>XML</name>
   <book isbn="123-456-789">
     <title>Mike's Guide To DTD's and XML Schemas</title>
     <author>Mike Jervis</author>
   </book>
 </topic>
</bookstore>
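A validating parser enforces the content models for you. To show what that saves, here is a hand-written Python sketch of the same three rules from the inline DTD above: bookstore (name,topic+), topic (name,book*), and book (title,author). The check_bookstore function is hypothetical, and the standard-library parser it relies on checks well-formedness only.

```python
import xml.etree.ElementTree as ET

def child_tags(element):
    """List the tag names of an element's direct children, in order."""
    return [child.tag for child in element]

def check_bookstore(root):
    """Return a list of problems; an empty list means the structure matches."""
    problems = []
    tags = child_tags(root)
    # bookstore: (name, topic+)
    if (root.tag != "bookstore" or tags[:1] != ["name"] or len(tags) < 2
            or any(tag != "topic" for tag in tags[1:])):
        problems.append("bookstore must contain a name followed by one or more topics")
    for topic in root.findall("topic"):
        tags = child_tags(topic)
        # topic: (name, book*)
        if tags[:1] != ["name"] or any(tag != "book" for tag in tags[1:]):
            problems.append("topic must contain a name followed by zero or more books")
        for book in topic.findall("book"):
            # book: (title, author)
            if child_tags(book) != ["title", "author"]:
                problems.append("book must contain exactly a title and an author")
    return problems

sample = ET.fromstring(
    "<bookstore><name>Mike's Store</name><topic><name>XML</name>"
    "<book isbn='123-456-789'><title>A Guide</title><author>Mike Jervis</author></book>"
    "</topic></bookstore>"
)
print(check_bookstore(sample))
```

Every rule added to the DTD would need a matching branch here, which is the maintenance cost a declarative definition avoids.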

Using an inline definition is handy when you only have a few documents and they're offline, as the definition is always in the file. However, if, for example, your DTD defines the XML protocol used to talk between two separate systems, re-transmitting the DTD with each document adds an overhead to the communications. Having an external DTD eliminates the need to re-send it each time. We could remove the DTD from the document, and place it in a DTD file on a Web server that's accessible by the two systems:

<?xml version="1.0"?>
<!DOCTYPE bookstore SYSTEM "http://webserver/bookstore.dtd">
<bookstore>
 <name>Mike's Store</name>
 <topic>
   <name>XML</name>
   <book isbn="123-456-789">
     <title>Mike's Guide To DTD's and XML Schemas</title>
     <author>Mike Jervis</author>
   </book>
 </topic>
</bookstore>

The file bookstore.dtd would contain the full definition in a plain text file:
 <!ELEMENT bookstore (name,topic+)>
 <!ELEMENT topic (name,book*)>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT book (title,author)>
 <!ELEMENT title (#PCDATA)>
 <!ELEMENT author (#PCDATA)>
 <!ATTLIST book isbn CDATA "0">

The lowest level of definition in a DTD is plain text: element content is declared as #PCDATA (Parsed Character Data), and attribute values as CDATA (Character Data) or one of a few token types. We can only define an element as text, and with this limitation it is not possible, for example, to force an element to be numeric. Attributes can be restricted to an enumerated list of values, but they can't be forced to be numeric either.
So for example, if you stored your application's settings in an XML file, it could be manually edited so that the window's start coordinates were non-numeric strings -- and you'd still need to validate this in your code, rather than have the parser do it for you.
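The kind of check that ends up in application code might look like the following Python sketch. The settings layout, with a window element carrying x and y attributes, is an assumption invented for this example, as is the read_window_position helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings file: element and attribute names are invented.
# Someone has hand-edited the y coordinate into a word.
SETTINGS = '<settings><window x="120" y="eighty"/></settings>'

def read_window_position(xml_text):
    """Return (x, y) as integers, or raise ValueError naming the bad attribute."""
    window = ET.fromstring(xml_text).find("window")
    if window is None:
        raise ValueError("missing <window> element")
    position = []
    for name in ("x", "y"):
        raw = window.get(name, "")
        try:
            position.append(int(raw))
        except ValueError:
            raise ValueError(f"window/@{name} is not an integer: {raw!r}") from None
    return tuple(position)

try:
    read_window_position(SETTINGS)
except ValueError as error:
    print("rejected:", error)
```

A DTD would accept this file as valid, since both attributes are simply text as far as it is concerned.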
XML Schemas
XML Schemas provide a much more powerful means by which to define your XML document structure and limitations. XML Schemas are themselves XML documents. They reference the XML Schema Namespace (detailed here [1]), and even have their own DTD [2].
What XML Schemas do is provide an Object Oriented approach to defining the format of an XML document. XML Schemas provide a set of basic types. These types are much wider ranging than the basic PCDATA and CDATA of DTDs. They include most basic programming types such as integer, byte, string and floating point numbers, but they also expand into Internet data types such as ISO country and language codes (en-GB for example). A full list can be found here [3].
The author of an XML Schema then uses these core types, along with various operators and modifiers, to create complex types of their own. These complex types are then used to define an element in the XML Document.
As a simple example, let's try to create a basic XML Schema for defining the bookstore that we used as an example for DTDs. Firstly, we must declare this as an XSD Document, and, as we want this to be very user friendly, we're going to add some basic documentation to it:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
 <xsd:documentation xml:lang="en">
   XML Schema for a Bookstore as an example.
 </xsd:documentation>
</xsd:annotation>


Now, in the previous example, the bookstore consisted of the sequence of a name and at least one topic. We can easily do that in an XML Schema:

<xsd:element name="bookstore" type="bookstoreType"/>
<xsd:complexType name="bookstoreType">
 <xsd:sequence>
   <xsd:element name="name" type="xsd:string"/>
   <xsd:element name="topic" type="topicType" minOccurs="1" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:complexType>


In this example, we've defined an element, bookstore, that will equate to an XML element in our document. We've defined it of type bookstoreType, which is not a standard type, and so we provide a definition of that type next.
We then define a complexType, which defines bookstoreType as a sequence of name and topic elements. Our "name" element is an xsd:string, a type defined by the XML Schema Namespace, and so we've fully defined that element.
The topic element, however, is of type topicType, another custom type that we must define. We've also defined our topic element with minOccurs="1" and maxOccurs="unbounded", which means there must be at least one topic element and there is no upper limit to the number that might be included. If we had specified neither, the default for both would be 1, meaning exactly one instance, as is used in the name element. Next, we define the schema for the topicType.
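As a sketch of how occurrence bounds are checked, assuming the defaults from the XML Schema Recommendation (both minOccurs and maxOccurs are 1 when omitted, and maxOccurs="unbounded" removes the upper limit), a hypothetical helper might look like this:

```python
def occurs_ok(count, min_occurs=1, max_occurs=1):
    """Check a repeat count against minOccurs/maxOccurs.

    Both bounds default to 1, as in the XML Schema Recommendation;
    pass max_occurs="unbounded" to remove the upper limit.
    """
    if count < min_occurs:
        return False
    return max_occurs == "unbounded" or count <= max_occurs

# name: no bounds given, so exactly one is required.
print(occurs_ok(1), occurs_ok(2))

# topic: one or more, the equivalent of topic+ in the DTD.
print(occurs_ok(5, min_occurs=1, max_occurs="unbounded"))
```

In these terms the DTD operators map onto bounds as follows: "+" is 1 to unbounded, "*" is 0 to unbounded, and "?" is 0 to 1.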

<xsd:complexType name="topicType">
 <xsd:sequence>
   <xsd:element name="name" type="xsd:string"/>
   <xsd:element name="book" type="bookType" minOccurs="0" maxOccurs="unbounded"/>
 </xsd:sequence>
</xsd:complexType>


This is all similar to the declaration of the bookstoreType, but note that we have to re-define our name element within the scope of this type. If we'd used a named type for name, such as nameType, which defined only an xsd:string -- and defined it outside our types -- we could re-use it in both. However, to illustrate the point, I decided to define it within each section. XML Schema gets interesting when we get to defining our bookType:

<xsd:complexType name="bookType">
 <xsd:sequence>
   <xsd:element name="title" type="xsd:string"/>
   <xsd:element name="author" type="xsd:string"/>
 </xsd:sequence>
 <xsd:attribute name="isbn" type="isbnType"/>
</xsd:complexType>
<xsd:simpleType name="isbnType">
 <xsd:restriction base="xsd:string">
   <xsd:pattern value="[0-9]{3}-[0-9]{3}-[0-9]{3}"/>
 </xsd:restriction>
</xsd:simpleType>

So the definition of the bookType is not particularly interesting. But the definition of its attribute "isbn" is. Not only does XML Schema support the use of types such as xsd:nonNegativeInteger, but we can also create our own simple types from these basic types using various modifiers. In the example for isbnType above, we base it on a string, and restrict it to match a given regular expression. Excusing my poor regex, that should limit any isbn attribute to the format used in this example: three groups of three digits separated by dashes. (Real ISBNs have a different shape; the pattern is only an illustration.)
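The restriction that isbnType is meant to express can be tried out with Python's re module. This is a sketch under one assumption worth stating: a schema pattern has to match the whole value, so fullmatch is used here rather than search. The is_valid_isbn helper is hypothetical.

```python
import re

# The pattern intended by isbnType in this example: three groups of three
# digits separated by dashes. Real ISBNs have a different shape.
ISBN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{3}")

def is_valid_isbn(value):
    # Schema patterns must match the entire value, so fullmatch is the
    # closest equivalent in Python.
    return ISBN_PATTERN.fullmatch(value) is not None

print(is_valid_isbn("123-456-789"))
print(is_valid_isbn("123-456-78"))
```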
This is just a simple example, but it should give you a taste of the many things you can do to control the content of an attribute or an element. You have far more control over what is considered a valid XML document using a schema. You can even
  • extend your types from other types you've created,
  • require uniqueness within scope, and
  • provide lookups.
It's a nicely object oriented approach. You could build a library of complexTypes and simpleTypes for re-use throughout many projects, and even find other definitions of common types (such as an "address", for example) from the Internet and use these to provide powerful definitions of your XML documents.
DTD vs XML Schema
The DTD provides a basic grammar for defining an XML Document in terms of the metadata that comprise the shape of the document. An XML Schema provides this, plus a detailed way to define what the data can and cannot contain. It provides far more control for the developer over what is legal, and it provides an Object Oriented approach, with all the benefits this entails.
So, if XML Schemas provide an Object Oriented approach to defining an XML document's structure, and if XML Schemas give us the power to define re-useable types such as an ISBN number based on a wide range of pre-defined types, why would we use a DTD? There are in fact several good reasons for using the DTD instead of the schema.
Firstly, and rather an important point, is that XML Schema is a new technology. This means that whilst some XML Parsers support it fully, many still don't. If you use XML to communicate with a legacy system, perhaps it won't support the XML Schema.
Many systems interfaces are already defined as a DTD. They are mature definitions, rich and complex. The effort in re-writing the definition may not be worthwhile.
DTD is also established, and examples of common objects defined in a DTD abound on the Internet -- freely available for re-use. A developer may be able to use these to define a DTD more quickly than they would be able to accomplish a complete re-development of the core elements as a new schema.
Finally, you must also consider the fact that the XML Schema is an XML document. It has an XML Namespace to refer to, and an XML DTD to define it. This is all overhead. When a parser examines the document, it may have to link this all in, interpret the DTD for the Schema, load the namespace, and validate the schema, etc., all before it can parse the actual XML document in question. If you're using XML as a protocol between two systems that are in heavy use, and need a quick response, then this overhead may seriously degrade performance.
Then again, if your system is available for third party developers as a Web service, then the detailed enforcement of the XML Schema may protect your application a lot more effectively from malicious -- or just plain bad -- XML packets. As an example, Muse.net is an interesting technology. They have a publicly-available SOAP API defined with an XML Schema that provides their developers more control over what they receive from the user community.
On the other hand, I was recently involved in designing a system to handle incoming transactions from multiple devices. In order to scale the system, the chosen service that processes requests is a SOAP server. However, the system is completely closed, and a simple DTD on the server is enough to ensure that the packets sent from the clients arrive complete and uncorrupted, without the additional overhead of XML Schema.
For details, see http://www.brics.dk/~amoeller/XML/schemas

Friday, September 24, 2010

Overview of XML

Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards.
XML's design goals emphasize simplicity, generality, and usability over the Internet. It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.
Many application programming interfaces (APIs) have been developed that software developers use to process XML data, and several schema systems exist to aid in the definition of XML-based languages.
As of 2009, hundreds of XML-based languages have been developed,  including RSS, Atom, SOAP, and XHTML. XML-based formats have become the default for most office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple's iWork.


Key terminology

 The material in this section is based on the XML Specification. This is not an exhaustive list of all the constructs which appear in XML; it provides an introduction to the key constructs most often encountered in day-to-day use.
(Unicode) Character
By definition, an XML document is a string of characters. Almost every legal Unicode character may appear in an XML document.
Processor and Application
The processor analyzes the markup and passes structured information to an application. The specification places requirements on what an XML processor must do and not do, but the application is outside its scope. The processor (as the specification calls it) is often referred to colloquially as an XML parser.
Markup and Content
The characters which make up an XML document are divided into markup and content. Markup and content may be distinguished by the application of simple syntactic rules. All strings which constitute markup either begin with the character "<" and end with a ">", or begin with the character "&" and end with a ";". Strings of characters which are not markup are content.
Tag
A markup construct that begins with "<" and ends with ">". Tags come in three flavors: start-tags, for example <section>, end-tags, for example </section>, and empty-element tags, for example <line-break/>.
Element
A logical component of a document which either begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag. The characters between the start- and end-tags, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example of an element is <Greeting>Hello, world.</Greeting>. Another is <line-break/>.
Attribute
A markup construct consisting of a name/value pair that exists within a start-tag or empty-element tag. In the example (below) the element img has two attributes, src and alt: <img src="madonna.jpg" alt='by Raphael'/>. Another example would be <step number="3">Connect A to B.</step> where the name of the attribute is "number" and the value is "3".
XML Declaration
XML documents may begin by declaring some information about themselves, as in the following example.
<?xml version="1.0" encoding="UTF-8" ?>

Example

Here is a small, complete XML document, which uses all of these constructs and concepts.
<?xml version="1.0" encoding="UTF-8" ?>
<painting>
  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
  <caption>This is Raphael's "Foligno" Madonna, painted in
    <date>1511</date>-<date>1512</date>.
  </caption>
</painting>
There are five elements in this example document: painting, img, caption, and two dates. The date elements are children of caption, which is a child of the root element painting. img has two attributes, src and alt.
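The element count is easy to confirm with Python's standard-library ElementTree parser. The snippet below works on a copy of the example document, with the XML declaration left out because the text is passed in as an already-decoded string.

```python
import xml.etree.ElementTree as ET

# A copy of the example document, minus the XML declaration.
PAINTING = """<painting>
  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
  <caption>This is Raphael's "Foligno" Madonna, painted in
    <date>1511</date>-<date>1512</date>.
  </caption>
</painting>"""

root = ET.fromstring(PAINTING)

# iter() walks the root and all of its descendants in document order.
tags = [element.tag for element in root.iter()]
print(tags)

# Attributes are exposed as a dictionary on each element.
print(root.find("img").attrib)
```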

XML in 10 points

1. XML is for structuring data

 Structured data includes things like spreadsheets, address books, configuration parameters, financial transactions, and technical drawings. XML is a set of rules (you may also think of them as guidelines or conventions) for designing text formats that let you structure your data. XML is not a programming language, and you don't have to be a programmer to use it or learn it. XML makes it easy for a computer to generate data, read data, and ensure that the data structure is unambiguous. XML avoids common pitfalls in language design: it is extensible, platform-independent, and it supports internationalization and localization. XML is fully Unicode-compliant.

2. XML looks a bit like HTML

Like HTML, XML makes use of tags (words bracketed by '<' and '>') and attributes (of the form name="value"). While HTML specifies what each tag and attribute means, and often how the text between them will look in a browser, XML uses the tags only to delimit pieces of data, and leaves the interpretation of the data completely to the application that reads it. In other words, if you see "<p>" in an XML file, do not assume it is a paragraph. Depending on the context, it may be a price, a parameter, a person, a p... (and who says it has to be a word with a "p"?).

3. XML is text, but isn't meant to be read

Programs that produce spreadsheets, address books, and other structured data often store that data on disk, using either a binary or text format. One advantage of a text format is that it allows people, if necessary, to look at the data without the program that produced it; in a pinch, you can read a text format with your favorite text editor. Text formats also allow developers to more easily debug applications. Like HTML, XML files are text files that people shouldn't have to read, but may when the need arises. Compared to HTML, the rules for XML files allow fewer variations. A forgotten tag, or an attribute without quotes makes an XML file unusable, while in HTML such practice is often explicitly allowed. The official XML specification forbids applications from trying to second-guess the creator of a broken XML file; if the file is broken, an application has to stop right there and report an error.
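This strictness is easy to observe. In the Python sketch below, the standard-library parser raises an error for a missing end-tag and for an unquoted attribute value instead of guessing what was meant. The is_well_formed helper and the sample strings are invented for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>Ann</to></note>"))   # complete document
print(is_well_formed("<note><to>Ann</note>"))        # forgotten </to>
print(is_well_formed("<note lang=en/>"))             # attribute without quotes
```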

4. XML is verbose by design

Since XML is a text format and it uses tags to delimit the data, XML files are nearly always larger than comparable binary formats. That was a conscious decision by the designers of XML. The advantages of a text format are evident (see point 3), and the disadvantages can usually be compensated at a different level. Disk space is less expensive than it used to be, and compression programs like zip and gzip can compress files very well and very fast. In addition, communication protocols such as modem protocols and HTTP/1.1, the core protocol of the Web, can compress data on the fly, saving bandwidth as effectively as a binary format.

5. XML is a family of technologies

XML 1.0 is the specification that defines what "tags" and "attributes" are. Beyond XML 1.0, "the XML family" is a growing set of modules that offer useful services to accomplish important and frequently demanded tasks. XLink describes a standard way to add hyperlinks to an XML file. XPointer is a syntax in development for pointing to parts of an XML document. An XPointer is a bit like a URL, but instead of pointing to documents on the Web, it points to pieces of data inside an XML file. CSS, the style sheet language, is applicable to XML as it is to HTML. XSL is the advanced language for expressing style sheets. It is based on XSLT, a transformation language used for rearranging, adding and deleting tags and attributes. The DOM is a standard set of function calls for manipulating XML (and HTML) files from a programming language. XML Schemas 1 and 2 help developers to precisely define the structures of their own XML-based formats. There are several more modules and tools available or under development. Keep an eye on W3C's technical reports page.

6. XML is new, but not that new

Development of XML started in 1996 and it has been a W3C Recommendation since February 1998, which may make you suspect that this is rather immature technology. In fact, the technology isn't very new. Before XML there was SGML, developed in the early '80s, an ISO standard since 1986, and widely used for large documentation projects. The development of HTML started in 1990. The designers of XML simply took the best parts of SGML, guided by the experience with HTML, and produced something that is no less powerful than SGML, and vastly more regular and simple to use. Some evolutions, however, are hard to distinguish from revolutions... And it must be said that while SGML is mostly used for technical documentation and much less for other kinds of data, with XML it is exactly the opposite.

7. XML leads HTML to XHTML

There is an important XML application that is a document format: W3C's XHTML, the successor to HTML. XHTML has many of the same elements as HTML. The syntax has been changed slightly to conform to the rules of XML. A format that is "XML-based" inherits the syntax from XML and restricts it in certain ways (e.g., XHTML allows "<p>", but not "<r>"); it also adds meaning to that syntax (XHTML says that "<p>" stands for "paragraph", and not for "price", "person", or anything else).

8. XML is modular

XML allows you to define a new document format by combining and reusing other formats. Since two formats developed independently may have elements or attributes with the same name, care must be taken when combining those formats (does "<p>" mean "paragraph" from this format or "person" from that one?). To eliminate name confusion when combining formats, XML provides a namespace mechanism. XSL and RDF are good examples of XML-based formats that use namespaces. XML Schema is designed to mirror this support for modularity at the level of defining XML document structures, by making it easy to combine two schemas to produce a third which covers a merged document structure.

9. XML is the basis for RDF and the Semantic Web

W3C's Resource Description Framework (RDF) is an XML text format that supports resource description and metadata applications, such as music playlists, photo collections, and bibliographies. For example, RDF might let you identify people in a Web photo album using information from a personal contact list; then your mail client could automatically start a message to those people stating that their photos are on the Web. Just as HTML integrated documents, images, menu systems, and forms applications to launch the original Web, RDF provides tools to integrate even more, to make the Web a little bit more into a Semantic Web. Just like people need to have agreement on the meanings of the words they employ in their communication, computers need mechanisms for agreeing on the meanings of terms in order to communicate effectively. Formal descriptions of terms in a certain area (shopping or manufacturing, for example) are called ontologies and are a necessary part of the Semantic Web. RDF, ontologies, and the representation of meaning so that computers can help people do work are all topics of the Semantic Web Activity.

10. XML is license-free, platform-independent and well-supported

By choosing XML as the basis for a project, you gain access to a large and growing community of tools (one of which may already do what you need!) and engineers experienced in the technology. Opting for XML is a bit like choosing SQL for databases: you still have to build your own database and your own programs and procedures that manipulate it, but there are many tools available and many people who can help you. And since XML is license-free, you can build your own software around it without paying anybody anything. The large and growing support means that you are also not tied to a single vendor. XML isn't always the best solution, but it is always worth considering.
For details, see http://www.ibiblio.org/bosak/pres/9707ja/sld02000.htm

Regular Expressions


A regular expression is an object that describes a pattern of characters.
Regular expressions are used to perform pattern-matching and "search-and-replace" functions on text.

Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec and test methods of RegExp, and with the match, replace, search, and split methods of String. This chapter describes JavaScript regular expressions. Regular expressions are not available in JavaScript 1.1 and earlier.

Visit below for details
http://www.regular-expressions.info/javascript.html

http://www.evolt.org/regexp_in_javascript
http://www.learn-javascript-tutorial.com/RegularExpressions.cfm 


Syntax

var txt=new RegExp(pattern,modifiers);

or more simply:

var txt=/pattern/modifiers;
  • pattern specifies the pattern of an expression
  • modifiers specify if a search should be global, case-insensitive, etc.
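
As a small sketch of the two forms above (the variable names and the sample text are made up for illustration), both of the following build the same pattern:

```javascript
// Two equivalent ways to build the same regular expression.
// The names fromConstructor, fromLiteral and sample are illustrative only.
var fromConstructor = new RegExp("colou?r", "i");
var fromLiteral = /colou?r/i;

var sample = "My favorite COLOR is blue";
var constructorMatches = fromConstructor.test(sample); // true
var literalMatches = fromLiteral.test(sample);         // true
```

The constructor form is mainly useful when the pattern is only known at run time, for example when it is assembled from a string. Note that a backslash inside such a string has to be doubled ("\\d" rather than "\d").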

Modifiers

Modifiers are used to perform case-insensitive and global searches:
Modifier Description
 i Perform case-insensitive matching
g Perform a global match (find all matches rather than stopping after the first match)
m Perform multiline matching
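
The effect of each modifier can be seen in a short sketch like the following (the sample text and variable names are assumptions chosen for illustration):

```javascript
var text = "Cat\ncat\nCAT";

// No modifier: the search is case-sensitive and stops at the first match.
var first = text.match(/cat/);        // "cat", found at index 4

// g: find every match; i: ignore case.
var all = text.match(/cat/gi);        // ["Cat", "cat", "CAT"]

// m: ^ and $ match at the start and end of each line,
// not only at the start and end of the whole string.
var lineStarts = text.match(/^c/gim); // ["C", "c", "C"]
var withoutM = text.match(/^c/gi);    // ["C"]
```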

Brackets

Brackets are used to find a range of characters:
Expression Description
[abc] Find any character between the brackets
[^abc] Find any character not between the brackets
[0-9] Find any digit from 0 to 9
[A-Z] Find any character from uppercase A to uppercase Z
[a-z] Find any character from lowercase a to lowercase z
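[A-z] Find any character from uppercase A to lowercase z (this range also includes the characters [ \ ] ^ _ ` that sit between Z and a in the character table)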
[adgk] Find any character in the given set
[^adgk] Find any character outside the given set
(red|blue|green) Find any of the alternatives specified
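
A few of the bracket expressions above, applied to a made-up sample string (the string and variable names are illustrative assumptions):

```javascript
var input = "Order 66: red Apple, green pear";

var digits = input.match(/[0-9]/g);                // ["6", "6"]
var capitals = input.match(/[A-Z]/g);              // ["O", "A"]
var notLetterOrSpace = input.match(/[^A-Za-z ]/g); // ["6", "6", ":", ","]
var colors = input.match(/(red|blue|green)/g);     // ["red", "green"]

// [A-z] is wider than it looks: it also matches the punctuation
// characters that lie between "Z" and "a", such as the underscore.
var underscoreMatches = /[A-z]/.test("_");         // true
```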

Metacharacters

Metacharacters are characters with a special meaning:
Metacharacter Description
. Find a single character, except newline or line terminator
\w Find a word character
\W Find a non-word character
\d Find a digit
\D Find a non-digit character
\s Find a whitespace character
\S Find a non-whitespace character
\b Find a match at the beginning/end of a word
\B Find a match not at the beginning/end of a word
\0 Find a NUL character
\n Find a new line character
\f Find a form feed character
\r Find a carriage return character
\t Find a tab character
\v Find a vertical tab character
\xxx Find the character specified by an octal number xxx
\xdd Find the character specified by a hexadecimal number dd
\uxxxx Find the Unicode character specified by a hexadecimal number xxxx
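
The following sketch tries several of the metacharacters above on a sample line (the line itself and the variable names are invented for illustration):

```javascript
var line = "Call 555-0199 at 9am\tor email bob_1@example.com";

var digitRuns = line.match(/\d+/g);            // ["555", "0199", "9", "1"]
var words = line.match(/\w+/g);                // runs of letters, digits and underscores
var hasTab = /\t/.test(line);                  // true
var wholeWordAt = /\bat\b/.test(line);         // true: "at" stands alone as a word
var insideWord = /\bat\b/.test("category");    // false: "at" is inside a longer word
var hexA = /\x41/.test("A");                   // true: hexadecimal 41 is "A"
var copyright = /\u00A9/.test("\u00A9 2010");  // true: U+00A9 is the copyright sign
```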

Quantifiers

Quantifier Description
n+ Matches any string that contains at least one n
n* Matches any string that contains zero or more occurrences of n
n? Matches any string that contains zero or one occurrences of n
n{X} Matches any string that contains a sequence of X n's
n{X,Y} Matches any string that contains a sequence of X to Y n's
n{X,} Matches any string that contains a sequence of at least X n's
n$ Matches any string with n at the end of it
^n Matches any string with n at the beginning of it
?=n Matches any string that is followed by a specific string n
?!n Matches any string that is not followed by a specific string n
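
As an illustration of the quantifiers above, here is a minimal sketch (all patterns, strings and variable names are assumptions picked for the example). The anchors ^ and $ are used so that each test covers the whole string:

```javascript
var oneOrMore = /^go+d$/.test("good");      // true: "o" occurs at least once
var zeroOrMore = /^go*d$/.test("gd");       // true: "o" may be absent
var optional = /^colou?r$/.test("color");   // true: "u" is optional
var exactlyThree = /^\d{3}$/.test("1234");  // false: four digits, not three
var twoToFour = /^\d{2,4}$/.test("123");    // true: between two and four digits
var atLeastTwo = /^\d{2,}$/.test("12345");  // true: two or more digits

// Lookahead checks what follows without including it in the match.
var prices = "10 USD, 20 EUR";
var beforeUsd = prices.match(/\d+(?= USD)/)[0];        // "10"
var notBeforeUsd = prices.match(/\b\d+\b(?! USD)/)[0]; // "20"
```

The word boundaries (\b) in the last pattern keep the engine from matching only part of "10"; without them the negative lookahead would accept the single digit "1".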

RegExp Object Properties

Property Description
global Specifies if the "g" modifier is set
ignoreCase Specifies if the "i" modifier is set
lastIndex The index at which to start the next match
multiline Specifies if the "m" modifier is set
source The text of the RegExp pattern

RegExp Object Methods

Method Description
compile() Compiles a regular expression
exec() Tests for a match in a string. Returns the first match
test() Tests for a match in a string. Returns true or false
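
The properties and methods above work together when a pattern with the "g" modifier is applied repeatedly. The following is a sketch only; the pattern, the sample text and the variable names are assumptions made for the example:

```javascript
var pattern = /(\w+)@(\w+)\.com/g;
var text = "ann@example.com, bob@test.com";

// With the g modifier, each exec() call resumes at pattern.lastIndex
// and returns null once no further match is found.
var found = [];
var match;
while ((match = pattern.exec(text)) !== null) {
    found.push({
        user: match[1],           // first parenthesized group
        domain: match[2],         // second parenthesized group
        next: pattern.lastIndex   // where the next search will start
    });
}

var isGlobal = pattern.global;       // true
var ignoresCase = pattern.ignoreCase; // false
var patternText = pattern.source;     // the pattern without slashes and modifiers
```

After the loop ends, lastIndex is reset to 0, so the same pattern object can be reused on another string.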

JavaScript Objects


This section does not describe the handling of objects in JavaScript since it is assumed the reader possesses object oriented knowledge. It provides information about the creation of JavaScript objects, templates, constructors, and more.

Object Creation

Generally objects may be created using the following syntax:
name = new Object()
For instance Array, Date, Number, Boolean, and String objects can be created this way as in the example below. Most objects may be created this way except for the Math object which cannot be instantiated (an instance of it may not be created).
ratings = new Array(6,9,8,4,5,7,8,10)
var home = new String("Residence")
var futdate = new Date()
var num1 = new Number()
Strings may also be created with a string literal, which is the more common form:
var string = "This is a test."
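
The two ways of creating a string are not quite the same thing, as the following sketch suggests (variable names are illustrative). The constructor form produces an object that wraps the text, while the literal form produces a plain string value:

```javascript
var wrapped = new String("Residence");   // a String object
var primitive = "Residence";             // a plain string value

var wrappedType = typeof wrapped;        // "object"
var primitiveType = typeof primitive;    // "string"
var looseEqual = (wrapped == primitive);   // true: the object is converted to its text
var strictEqual = (wrapped === primitive); // false: an object is never identical to a plain value

// Array has a similar subtlety: a single numeric argument is a length,
// several arguments are the elements themselves.
var ratings = new Array(6, 9, 8, 4, 5, 7, 8, 10); // eight elements
var emptySlots = new Array(3);                    // length 3, no elements set
```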

Object Template

Objects are created using templates. Templates to objects are like cookie cutters to cookies. Cookie cutters are used to create instances of cookies and object templates are used to create instances of objects. For now, the template is shown. The following creates an object, olist.
function olist(elements)
{
   this.elements = elements
   this.listitems = new Array(elements)
   this.getItem = list_getItem
   this.setItem = list_setItem
}
The methods list_setItem and list_getItem are written as follows:
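function list_getItem(element)
{
   // Array elements are read with square brackets
   return this.listitems[element]
}

function list_setItem(element, stringval)
{
   // Store the value at the given position in the array
   this.listitems[element] = stringval
}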



Creating the Object

var list1 = new olist(10)

Changing the Object

list1.setItem(0,"This is the first item in the list")
list1.setItem(1,"This is the second item in the list")
list1.setItem(2,"This is the third item in the list")
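
Put together, the template, its two methods, and the calls above can be sketched as one runnable piece. This version assumes that items are stored and read with square-bracket indexing; the variables firstItem and emptySlot are added only to show the results:

```javascript
function list_getItem(element) {
    return this.listitems[element];
}

function list_setItem(element, stringval) {
    this.listitems[element] = stringval;
}

// The template (constructor): "this" refers to the object being created.
function olist(elements) {
    this.elements = elements;
    this.listitems = new Array(elements);
    this.getItem = list_getItem;
    this.setItem = list_setItem;
}

var list1 = new olist(10);
list1.setItem(0, "This is the first item in the list");
list1.setItem(1, "This is the second item in the list");

var firstItem = list1.getItem(0); // "This is the first item in the list"
var emptySlot = list1.getItem(5); // undefined: nothing has been stored there yet
```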

The Prototype property

The prototype property of a constructor function can be used to add new properties to all objects created from it (whether the constructor is user defined or system defined) as follows:
olist.prototype.type = "1"
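
A minimal sketch of how this behaves, assuming a cut-down olist constructor (the variables list2, ownProperty and instancePrototype are added for illustration). The property is set on the constructor's prototype, not on an individual object:

```javascript
function olist(elements) {
    this.elements = elements;
    this.listitems = new Array(elements);
}

var list1 = new olist(10);
var list2 = new olist(3);

// Adding a property to the constructor's prototype makes it visible
// on every object created from that constructor, including existing ones.
olist.prototype.type = "1";

var type1 = list1.type;                           // "1"
var type2 = list2.type;                           // "1"
var ownProperty = list1.hasOwnProperty("type");   // false: the value is shared, not copied
var instancePrototype = typeof list1.prototype;   // "undefined": instances have no prototype property
```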

The Function Constructor

Creation:
minus2 = new Function("x","return x-2")
Usage:
y = minus2(10)
The value of y is now 8.
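
For comparison, here is a sketch that places the Function constructor next to an ordinary function declaration doing the same work (the name minus2Declared is an assumption made for this example):

```javascript
// The last argument is the function body; the earlier ones are parameter names.
var minus2 = new Function("x", "return x - 2");

// The same function written as a normal declaration.
function minus2Declared(x) {
    return x - 2;
}

var y = minus2(10);              // 8
var yDeclared = minus2Declared(10); // 8
```

The declaration is usually easier to read and debug. The constructor form builds the function from text at run time, which is only needed when the body is not known in advance.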

Object use

When writing JavaScript, you are embedding the JavaScript in an object. These objects may have functions. An example function is the alert() function. It may be called as follows:
alert("An error occurred!")
However, the window object is the highest level JavaScript object, therefore the following code does the same thing:
window.alert("An error occurred!")
At the top level of a script, "this" refers to the window object, so the following code does the same:
this.alert("An error occurred!")

Object and Property Reference

Objects and their properties are normally referenced using a dot between each object and its property or method, as follows:
document.write("This is a test.")
However, an alternative called array notation may be used. Array notation is required when the first character of a property name is a digit, since dot notation may not be used in this case. The above example in array notation is:
document ["write"] ("This is a test.")
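
The digit case can be sketched with a plain object (the object scores and its property names are invented for this example):

```javascript
var scores = {};

// A property name that starts with a digit can only be used with
// array notation; writing scores.1stPlace would be a syntax error.
scores["1stPlace"] = "Ann";

// For ordinary names, both notations reach the same property.
scores.runnerUp = "Bob";
var sameProperty = (scores["runnerUp"] === scores.runnerUp); // true

// Array notation also allows the name to be computed at run time.
var key = "runner" + "Up";
var computed = scores[key]; // "Bob"
```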

Levels of Objects

JavaScript has basically three types of objects which are:
  • Top level objects
  • Objects that are properties of other objects
  • Objects that are not properties of other objects  


  1. Objects

  2. Object Creation and Use
  3. Object Hierarchy

    Independent Objects

  4. Array Object
  5. Date Object
  6. Math Object
  7. Number and Boolean Object
  8. String Object

    Main Object

  9. Navigator Object

    Navigator Objects

  10. Window Object
  11. MimeType Object
  12. Plugin Object

    Window Sub Objects

  13. Document Object
  14. Location Object
  15. History Object
  16. Frame Object

    Document Sub Objects

  17. Anchor Object
  18. Applet Object
  19. Area Object
  20. Form Object
  21. Image Object
  22. Layer Object
  23. Link Object

    Form Sub Objects

  24. Button Object
  25. Checkbox Object
  26. FileUpload Object
  27. Hidden Object
  28. Option Object
  29. Password Object
  30. Radio Object
  31. Reset Object
  32. Select Object
  33. Submit Object
  34. Text Object
  35. Textarea Object

Hierarchy Objects

Window
  Properties: defaultStatus, frames, opener, parent, scroll, self, status, top, window
  Methods: alert, blur, close, confirm, focus, open, prompt, clearTimeout, setTimeout
  Event handlers: onLoad, onUnload, onBlur, onFocus
Frame
  Properties: defaultStatus, frames, opener, parent, scroll, self, status, top, window
  Methods: alert, blur, close, confirm, focus, open, prompt, clearTimeout, setTimeout
  Event handlers: none (the onLoad and onUnload event handlers belong to the Window object)
Location
  Properties: hash, host, hostname, href, pathname, port, protocol, search
  Methods: reload, replace
  Event handlers: none
History
  Properties: length
  Methods: forward, go, back
  Event handlers: none
Navigator
  Properties: appCodeName, appName, appVersion, mimeTypes, plugins, userAgent
  Methods: javaEnabled
  Event handlers: none
document
  Properties: alinkColor, anchors, applets, area, bgColor, cookie, fgColor, forms, images, lastModified, linkColor, links, location, referrer, title, vlinkColor
  Methods: clear, close, open, write, writeln
  Event handlers: none (the onLoad and onUnload event handlers belong to the Window object)
image
  Properties: border, complete, height, hspace, lowsrc, name, src, vspace, width
  Methods: none
  Event handlers: none
form
  Properties: action, elements, encoding, FileUpload, method, name, target
  Methods: submit, reset
  Event handlers: onSubmit, onReset
text
  Properties: defaultValue, name, type, value
  Methods: focus, blur, select
  Event handlers: onBlur, onChange, onFocus, onSelect

Built-in Objects

Array
  Properties: length
  Methods: join, reverse, sort
  Event handlers: none
Date
  Properties: prototype
  Methods: getDate, getDay, getHours, getMinutes, getMonth, getSeconds, getTime, getTimezoneOffset, getYear, parse, setDate, setHours, setMinutes, setMonth, setSeconds, setTime, setYear, toGMTString, toLocaleString, UTC
  Event handlers: none
String
  Properties: length, prototype
  Methods: anchor, big, blink, bold, charAt, fixed, fontcolor, fontsize, indexOf, italics, lastIndexOf, link, small, split, strike, sub, substring, sup, toLowerCase, toUpperCase
Window

   

Overview of JavaScript

What is JavaScript?

JavaScript is a cross-platform, object-oriented scripting language. JavaScript is a small, lightweight language; it is not useful as a standalone language, but is designed for easy embedding in other products and applications, such as web browsers. Inside a host environment, JavaScript can be connected to the objects of its environment to provide programmatic control over them.
Core JavaScript contains a core set of objects, such as Array, Date, and Math, and a core set of language elements such as operators, control structures, and statements. Core JavaScript can be extended for a variety of purposes by supplementing it with additional objects; for example:
  • Client-side JavaScript extends the core language by supplying objects to control a browser (Navigator or another web browser) and its Document Object Model (DOM). For example, client-side extensions allow an application to place elements on an HTML form and respond to user events such as mouse clicks, form input, and page navigation.
  • Server-side JavaScript extends the core language by supplying objects relevant to running JavaScript on a server. For example, server-side extensions allow an application to communicate with a relational database, provide continuity of information from one invocation to another of the application, or perform file manipulations on a server.
Through JavaScript's LiveConnect functionality, you can let Java and JavaScript code communicate with each other. From JavaScript, you can instantiate Java objects and access their public methods and fields. From Java, you can access JavaScript objects, properties, and methods.
Netscape invented JavaScript, and JavaScript was first used in Netscape browsers.

JavaScript started life as LiveScript, but Netscape changed the name to JavaScript, possibly because of the excitement being generated by Java. It made its first appearance in Netscape 2.0 in 1995 under the name LiveScript.

JavaScript is a lightweight, interpreted programming language with object-oriented capabilities that allows you to build interactivity into otherwise static HTML pages.
The general-purpose core of the language has been embedded in Netscape, Internet Explorer, and other web browsers.
The ECMA-262 specification defines a standard version of the core JavaScript language.
JavaScript is:
  • A lightweight, interpreted programming language
  • Designed for creating network-centric applications
  • Complementary to and integrated with Java
  • Complementary to and integrated with HTML
  • Open and cross-platform

Client-side JavaScript:

Client-side JavaScript is the most common form of the language. The script should be included in or referenced by an HTML document for the code to be interpreted by the browser.
It means that a web page need no longer be static HTML, but can include programs that interact with the user, control the browser, and dynamically create HTML content.
The JavaScript client-side mechanism features many advantages over traditional CGI server-side scripts. For example, you might use JavaScript to check if the user has entered a valid e-mail address in a form field.
The JavaScript code is executed when the user submits the form, and the entries are submitted to the web server only if all of them are valid.
JavaScript can be used to trap user-initiated events such as button clicks, link navigation, and other actions that the user explicitly or implicitly initiates.
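
The e-mail check described above could be sketched roughly as follows. The function names, the field name email, and the pattern are all assumptions made for this example; the pattern is deliberately loose and only checks the general shape of an address:

```javascript
// Loose shape check: something, an "@", something, a dot, something,
// with no spaces or extra "@" signs. Not a full address validator.
function looksLikeEmail(value) {
    return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value);
}

// Intended to be called when the form is submitted. Returning false
// tells the browser not to send the form to the server.
// "form" is assumed to have a field named "email".
function checkBeforeSubmit(form) {
    if (!looksLikeEmail(form.email.value)) {
        return false;
    }
    return true;
}
```

In a page, this might be wired up with an attribute such as onsubmit="return checkBeforeSubmit(this)" on the form element. The server should still check the data itself, since a visitor can have scripting switched off.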

Advantages of JavaScript:

The merits of using JavaScript are:
  • Less server interaction: You can validate user input before sending the page off to the server. This saves server traffic, which means less load on your server.
  • Immediate feedback to the visitors: They don't have to wait for a page reload to see if they have forgotten to enter something.
  • Increased interactivity: You can create interfaces that react when the user hovers over them with a mouse or activates them via the keyboard.
  • Richer interfaces: You can use JavaScript to include such items as drag-and-drop components and sliders to give a Rich Interface to your site visitors.

Limitations with JavaScript:

JavaScript cannot be treated as a full-fledged programming language. It lacks the following important features:
  • Client-side JavaScript does not allow the reading or writing of files. This restriction exists for security reasons.
  • JavaScript cannot be used for networking applications because there is no such support available.
  • JavaScript doesn't have any multithreading or multiprocess capabilities.
Once again, JavaScript is a lightweight, interpreted programming language that allows you to build interactivity into otherwise static HTML pages.

JavaScript Development Tools:

One of JavaScript's strengths is that expensive development tools are not usually required. You can start with a simple text editor such as Notepad.
Since it is an interpreted language inside the context of a web browser, you don't even need to buy a compiler.
To make our lives simpler, various vendors have come up with very nice JavaScript editing tools. A few of them are listed here:
  • Microsoft FrontPage: Microsoft has developed a popular HTML editor called FrontPage. FrontPage also provides web developers with a number of JavaScript tools to assist in the creation of an interactive web site.
  • Macromedia Dreamweaver MX: Macromedia Dreamweaver MX is a very popular HTML and JavaScript editor in the professional web development crowd. It provides several handy prebuilt JavaScript components, integrates well with databases, and conforms to new standards such as XHTML and XML.
  • Macromedia HomeSite 5: A well-liked HTML and JavaScript editor that can manage a personal web site just fine.

Where JavaScript is Today?

The ECMAScript Edition 4 standard will be the first update to be released in over four years. JavaScript 2.0 conforms to Edition 4 of the ECMAScript standard, and the difference between the two is extremely minor.
The specification for JavaScript 2.0 can be found on the following site: http://www.ecmascript.org/
Today, Netscape's JavaScript and Microsoft's JScript conform to the ECMAScript standard, although each language still supports features that are not part of the standard.

Client side programming

Client-side programming generally refers to the class of computer programs on the web that are executed client-side, by the user's web browser, instead of server-side (on the web server). This type of computer programming is an important part of the Dynamic HTML (DHTML) concept, enabling web pages to be scripted; that is, to have different and changing content depending on user input, environmental conditions (such as the time of day), or other variables.
Web authors write client-side scripts in languages such as JavaScript (Client-side JavaScript) and VBScript.

Client-side JavaScript (CSJS) is JavaScript that runs on the client side. While JavaScript was originally created to run this way, the term was coined because the language is no longer limited to the client side; for example, server-side JavaScript (SSJS) is now available.

Environment

The most common Internet media type attribute for JavaScript source code is text/javascript, which follows HTML 4.01 and HTML 5 specifications and is supported by all major browsers. In 2006, application/javascript was also registered, though Internet Explorer versions 6 through 8 do not recognize scripts with this attribute. When no type attribute is specified in a script tag, the type value is by default "text/javascript" per HTML 5 specification.
To embed JavaScript code in an HTML document, it must be preceded with the <script> tag and followed with </script> (possible attribute options omitted).

Older browsers typically require JavaScript to begin with:
 
<script language="JavaScript" type="text/javascript">
<!--

and end with:
// --> 
</script>

The <!-- ... --> comment markup is required in order to ensure that the code is not rendered as text by very old browsers which do not recognize the <script> tag in HTML documents (although script-tags contained within the head-tag will never be rendered, thus the comment markup is not always necessary), and the LANGUAGE attribute is a deprecated HTML attribute which may be required for old browsers. However, <script> tags in XHTML/XML documents will not work if commented out, as conformant XHTML/XML parsers ignore comments and also may encounter problems with --, < and > signs in scripts (for example, the integer decrement operator and the comparison operators). XHTML documents should therefore have scripts included as XML CDATA sections, by preceding them with
 <script type="text/javascript">
//<![CDATA[

and following them with
 
//]]>
</script>

(A double-slash // at the start of a line marks a JavaScript comment, which prevents the <![CDATA[ and ]]> from being parsed by the script.)
The easiest way to avoid this problem (and also a best practice) is to use an external script,

e.g.:
<script type="text/javascript" src="hello.js"></script>

Historically, the non-standard (non-W3C) language attribute was used in the following way:
 
 <script language="JavaScript" src="hello.js"></script>

HTML elements may contain intrinsic events to which you can associate a script handler. To write valid HTML 4.01, the web server should return a 'Content-Script-Type' header with value 'text/javascript'. If the web server cannot be so configured, the website author can optionally insert the following declaration for the default scripting language in the header section of the document.
 
<meta http-equiv="Content-Script-Type" content="text/javascript" />

 Hello World example

The easiest Hello world program uses popular browsers' support for the virtual 'javascript' protocol to execute JavaScript code. Enter the following as an Internet address (usually by pasting into the address box):
 
javascript:alert('Hello, world!');

 DOM binding

User interaction

Most interaction with the user is done by using HTML forms which can be accessed through the HTML DOM. However, there are also some very simple means of communicating with the user, such as the alert, confirm, and prompt dialog boxes of the Window object.
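
As a rough sketch of such simple means, the function below uses the confirm, prompt, and alert methods of the Window object. The function name askForName and its parameter win are assumptions for this example; in a browser the global window object would be passed in:

```javascript
// "win" is expected to offer confirm(), prompt() and alert(),
// as the browser's window object does.
function askForName(win) {
    // confirm() returns true for OK and false for Cancel.
    if (!win.confirm("Would you like to introduce yourself?")) {
        return null;
    }
    // prompt() returns the typed text, or null if the user cancels.
    var name = win.prompt("What is your name?", "");
    if (name === null || name === "") {
        return null;
    }
    // alert() only displays a message.
    win.alert("Hello, " + name + "!");
    return name;
}
```

In a page this would be called as askForName(window). Passing the object in, rather than using the global directly, is a design choice made here so the function can also be tried outside a browser with a stand-in object.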

Events

Element nodes may be the source of various events which can cause an action if a JavaScript event handler is registered. These event handler functions are often defined as anonymous functions directly within the element node.
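
Registering an anonymous function as an event handler might look like the following sketch. The function name attachClickCounter and the counter it keeps are invented for illustration; "element" stands for any element node, for example one obtained with document.getElementById:

```javascript
function attachClickCounter(element) {
    var state = { clicks: 0 };

    // The anonymous function is stored directly on the element node
    // and is run by the browser each time the element is clicked.
    element.onclick = function () {
        state.clicks += 1;
        return false; // for a link, returning false suppresses the navigation
    };

    return state;
}
```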

Incompatibilities

Note: Most incompatibilities are not JavaScript issues but Document Object Model (DOM) specific. The JavaScript implementations of the most popular web browsers usually adhere to the ECMAScript standard, such that most incompatibilities are part of the DOM implementation. Some incompatibility issues that exist across JavaScript implementations include the handling of certain primitive values like "undefined", and the availability of methods introduced in later versions of ECMAScript, such as the .pop(), .push(), .shift(), and .unshift() methods of arrays.
JavaScript, like HTML, is often not compliant to standards, instead being built to work with specific web browsers. The current ECMAScript standard should be the base for all JavaScript implementations in theory, but in practice the Mozilla family of browsers (Mozilla, Firefox and Netscape Navigator) use JavaScript, Microsoft Internet Explorer uses JScript, and other browsers such as Opera and Safari use other ECMAScript implementations, often with additional nonstandard properties to allow compatibility with JavaScript and JScript.
JavaScript and JScript contain several properties which are not part of the official ECMAScript standard, and may also miss several properties. As such, they are incompatible at some points, which requires script authors to work around these bugs. JavaScript is more standards-compliant than Microsoft's JScript, which means that a script file written according to the ECMA standards is less likely to work on browsers based on Internet Explorer. However, since there are relatively few points of nonconformance, this is very unlikely.
This also means every browser may treat the same script differently, and what works for one browser may fail in another browser, or even in a different version of the same browser. As with HTML, it is thus advisable to write standards-compliant code.

Combating incompatibilities

There are two primary techniques for handling incompatibilities: browser sniffing and object detection. When there were only two browsers that had scripting capabilities (Netscape and Internet Explorer), browser sniffing was the most popular technique. By testing a number of "client" properties that returned information on computer platform, browser, and versions, it was possible for a scripter's code to discern exactly which browser the code was being executed in. Later, the techniques for sniffing became more difficult to implement, as Internet Explorer began to "spoof" its client information, that is, to provide browser information that was increasingly inaccurate (the reasons why Microsoft did this are often disputed). Later still, browser sniffing became something of a difficult art form, as other scriptable browsers came onto the market, each with its own platform, client, and version information.
Object detection relies on testing for the existence of a property of an object.
function set_image_source ( imageName, imageURL )
{
    // Test whether the 'document' object has a property called 'images'
    // whose value type-converts to boolean true (as object references do).
    if ( document.images )
    {
        // Only executed if there is an 'images' collection.
        document.images[imageName].src = imageURL;
    }
}
A more complex example relies on using joined boolean tests:
if ( document.body && document.body.style )
In the above, the statement "document.body.style" would ordinarily cause an error in a browser that does not have a "document.body" property, but using the boolean operator "&&" ensures that "document.body.style" is never called if "document.body" doesn't exist. This technique is called minimal evaluation.
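
Minimal evaluation can be tried out with a small helper like the one sketched below. The function name canStyleBody and its parameter doc are assumptions for this example; in a browser the global document object would be passed in:

```javascript
// Each test is only evaluated if everything to its left was truthy,
// so doc.body.style is never touched when doc.body is missing.
function canStyleBody(doc) {
    return Boolean(doc && doc.body && doc.body.style);
}

var noBody = canStyleBody({});                       // false, and no error is raised
var noStyle = canStyleBody({ body: {} });            // false
var styled = canStyleBody({ body: { style: {} } });  // true
```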
Today, a combination of browser sniffing, object detection, and reliance on standards such as the ECMAScript specification and Cascading Style Sheets are all used to varying degrees to try to ensure that a user never sees a JavaScript error message.