Validating XML: A Pretty Complete Primer

Understanding XML Structure and the Need for Validation

When the Web grew beyond simple hypertext links, developers began looking for a way to describe data that could travel cleanly between systems of all shapes and sizes. Extensible Markup Language, or XML, answered that call by offering a syntax that looks like HTML but is much more flexible. Because every device that can read a text file can read XML, the language is a natural candidate for exchanging weather updates, GPS coordinates, financial feeds, and more.

Imagine a handheld GPS unit that connects to the Internet to fetch the latest weather for the next leg of a trip. That device has limited memory, a small processor, and a very small screen. It cannot afford to spend cycles correcting typos, filling in missing tags, or guessing how a badly formed file should be interpreted. If the GPS were to receive a file that breaks its expectations, it might crash or display nonsense. The only practical solution is to guarantee that the data it receives is already clean before it even arrives.

This guarantee falls to the server that sends the data. Before a GPS, a web browser, or any other client pulls down an XML document, the server must make sure that the file respects the rules of XML and that it follows the stricter constraints of a particular data model. If the file passes that scrutiny, the client can safely assume that its own parser will not have to do any heavy error handling.

The first layer of checking asks whether the file is well‑formed. A well‑formed XML document obeys a handful of syntax rules: a single root element encloses everything, every opening tag has a matching closing tag, tags nest properly, and attribute values sit inside quotes. Well‑formedness can be determined by examining only the document itself; no external knowledge is required. For example, a simple weather report might look like this:

    <report>
      <datestamp>2026-05-12</datestamp>
      <station fullname="Mountain View" abbrev="MV">
        <latitude>37.386</latitude>
        <longitude>-122.083</longitude>
      </station>
    </report>

If that markup fails to follow the rules above, an XML parser will immediately flag it as a syntax error. However, well‑formedness is only the surface of the story. A document can be syntactically correct but semantically meaningless for a particular application.
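As a rough illustration of a well‑formedness check, the sketch below uses Python's standard xml.etree.ElementTree parser, which rejects malformed input with a ParseError. The helper name is_well_formed and the sample strings are made up for this example; a non‑validating parser like this one checks syntax only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample inputs for illustration.
good = '<station fullname="Mountain View" abbrev="MV"><latitude>37.386</latitude></station>'
bad_nesting = "<station><latitude>37.386</station></latitude>"  # tags overlap
bad_quotes = "<station abbrev=MV></station>"                    # unquoted attribute

print(is_well_formed(good))         # True
print(is_well_formed(bad_nesting))  # False
print(is_well_formed(bad_quotes))   # False
```

Nothing here consults a DTD or schema, which is exactly the point: the check needs only the document itself.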

Document Type Definitions: The First Step Beyond Well‑Formedness

Once you know a file is well‑formed, the next question is whether its structure and content match a business‑level specification. In other words, is the <station> element allowed to appear where it does? Does every <temperature> contain the four expected child elements in the right order? These are questions of validity, and answering them requires an external contract that describes the permissible combinations of elements, attributes, and values.

Consider drafting a plain‑English description of a weather report: a report must contain a datestamp, a station, a temperature, and a wind section, in that order. The station element must have a fullname and abbrev attribute and must house latitude and longitude subelements. The temperature section must contain min, max, forecast‑low, and forecast‑high, while wind may include a direction if the speed is non‑zero. This description serves as a blueprint, but computers cannot parse natural language. We need a machine‑readable form.

Choosing between an element and an attribute is a common design dilemma. A simple rule helps: if the item is a constituent part of a larger structure, make it an element; if it describes the structure itself, make it an attribute. In our example, fullname and abbrev are identifiers that characterize the station, so they belong as attributes. Latitude and longitude are part of the station’s physical description, so they are elements. This choice keeps the markup tidy and follows a convention familiar from HTML, where attributes qualify an element and child elements carry its content.
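The element‑versus‑attribute choice also shapes how client code reads the data. The following is a minimal sketch, assuming Python's standard ElementTree API and a station fragment with the names used in this article; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

station = ET.fromstring(
    '<station fullname="Mountain View" abbrev="MV">'
    "<latitude>37.386</latitude>"
    "<longitude>-122.083</longitude>"
    "</station>"
)

# Attributes describe the station itself: read them with .get().
fullname = station.get("fullname")
abbrev = station.get("abbrev")

# Child elements are constituent parts: read their text with .findtext().
latitude = float(station.findtext("latitude"))
longitude = float(station.findtext("longitude"))

print(f"{fullname} ({abbrev}): {latitude}, {longitude}")
# prints: Mountain View (MV): 37.386, -122.083
```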

The Document Type Definition (DTD) formalizes the blueprint. A DTD declares what elements exist, how they nest, and which attributes they accept. It also dictates whether elements may appear multiple times or are optional. Here’s a concise DTD for the weather report:

    <!ELEMENT report (datestamp, station, temperature, wind)>
    <!ELEMENT datestamp (#PCDATA)>
    <!ELEMENT station (latitude, longitude)>
    <!ATTLIST station
      fullname CDATA #REQUIRED
      abbrev   CDATA #REQUIRED>
    <!ELEMENT latitude (#PCDATA)>
    <!ELEMENT longitude (#PCDATA)>
    <!ELEMENT temperature (min, max, forecast-low, forecast-high)>
    <!ELEMENT min (#PCDATA)>
    <!ELEMENT max (#PCDATA)>
    <!ELEMENT forecast-low (#PCDATA)>
    <!ELEMENT forecast-high (#PCDATA)>
    <!ELEMENT wind (speed, direction?)>
    <!ELEMENT speed (#PCDATA)>
    <!ELEMENT direction (#PCDATA)>

When a validating parser reads the XML document together with this DTD, it can confirm that each element appears where it should and that required attributes are present. If the document deviates, for example because the wind element is missing its speed child or the station carries an undeclared altitude attribute, the parser will report a validity error. Thus, DTDs move XML from mere syntactic correctness to a stricter form of structural compliance.
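To make the idea concrete, here is a hand‑rolled sketch of the kind of structural check a validating parser performs. It is not a DTD parser: the CONTENT_MODELS and REQUIRED_ATTRS tables are transcribed by hand from the content models in the DTD above, and the function name structure_errors is hypothetical. Each content model is treated as a regular expression over the sequence of child element names, which is essentially what a DTD content model is.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written for this sketch: each content model becomes a regular
# expression over the child names, each name followed by one space.
CONTENT_MODELS = {
    "station": r"latitude longitude ",
    "temperature": r"min max forecast-low forecast-high ",
    "wind": r"speed (direction )?",
}
# Declared attributes per element; all of them are #REQUIRED here.
REQUIRED_ATTRS = {"station": {"fullname", "abbrev"}}


def structure_errors(root):
    """Collect structural problems for the elements listed above."""
    errors = []
    for elem in root.iter():
        if elem.tag not in CONTENT_MODELS:
            continue  # elements outside the sketch are not checked
        children = "".join(child.tag + " " for child in elem)
        if not re.fullmatch(CONTENT_MODELS[elem.tag], children):
            errors.append(f"<{elem.tag}>: bad children [{children.strip()}]")
        required = REQUIRED_ATTRS.get(elem.tag, set())
        present = set(elem.attrib)
        for name in sorted(required - present):
            errors.append(f"<{elem.tag}>: missing attribute {name}")
        for name in sorted(present - required):
            errors.append(f"<{elem.tag}>: undeclared attribute {name}")
    return errors


# A well-formed but invalid document: no abbrev, extra altitude, no speed.
doc = ET.fromstring(
    '<report><station fullname="Mountain View" altitude="32">'
    "<latitude>37.386</latitude><longitude>-122.083</longitude></station>"
    "<wind><direction>NE</direction></wind></report>"
)
for problem in structure_errors(doc):
    print(problem)
```

In practice you would hand the document and DTD to a real validating parser rather than maintain tables like these; the sketch only shows why the sample document is well‑formed yet invalid.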

However, DTDs have notable blind spots. Attribute values can be limited to an enumerated list or to a handful of token types, but the text inside an element is just untyped character data. A DTD cannot state that a latitude must lie between -90 and 90 degrees or that the text of a direction element must be one of N, NE, E, and so on; numeric ranges and string patterns are out of its reach. These limitations point to the need for richer validation mechanisms that treat data types and value constraints as first‑class citizens.

Advanced Validation: XML Schema and RELAX NG

XML Schema was designed to fill the gaps left by DTDs. It extends validation to include data types, restrictions, and more expressive structure. XML Schema treats every element and attribute as an instance of a type, such as xsd:string, xsd:decimal, or a user‑defined type that imposes limits or patterns. With this framework, the same weather report could declare that latitude must be a decimal between -90 and 90, that direction must match a regular expression such as [NS][EW]?|[EW] (a looser pattern like [NSEW]{1,2} would also accept nonsense such as NN), and that speed must be a non‑negative number.

Defining a schema for the weather example involves several steps. First, you create simple types that capture basic constraints. Then you declare complex types that group elements together, specifying order, optionality, and multiplicity. Finally, you bind those types to the actual element names. A simplified schema fragment looks like this:

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <!-- Simple types: decimals restricted to a valid coordinate range -->
      <xsd:simpleType name="latitudeType">
        <xsd:restriction base="xsd:decimal">
          <xsd:minInclusive value="-90"/>
          <xsd:maxInclusive value="90"/>
        </xsd:restriction>
      </xsd:simpleType>
      <xsd:simpleType name="longitudeType">
        <xsd:restriction base="xsd:decimal">
          <xsd:minInclusive value="-180"/>
          <xsd:maxInclusive value="180"/>
        </xsd:restriction>
      </xsd:simpleType>
      <!-- Complex type: two ordered children plus two required attributes -->
      <xsd:complexType name="stationType">
        <xsd:sequence>
          <xsd:element name="latitude" type="latitudeType"/>
          <xsd:element name="longitude" type="longitudeType"/>
        </xsd:sequence>
        <xsd:attribute name="fullname" type="xsd:string" use="required"/>
        <xsd:attribute name="abbrev" type="xsd:string" use="required"/>
      </xsd:complexType>
      <!-- Bind the type to the element name -->
      <xsd:element name="station" type="stationType"/>
    </xsd:schema>
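To show what the minInclusive and maxInclusive facets mean in practice, the sketch below imitates the latitudeType restriction by hand in Python. It is only an approximation under stated assumptions: the function names are hypothetical, and Python's Decimal accepts some spellings (such as exponent notation) that xsd:decimal does not, so a real schema validator remains the authority.

```python
from decimal import Decimal, InvalidOperation


def in_inclusive_range(text: str, low: str, high: str) -> bool:
    """Approximate an xsd:decimal restricted by minInclusive/maxInclusive."""
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        return False  # not a number at all
    if not value.is_finite():
        return False  # reject NaN and Infinity, which xsd:decimal lacks
    return Decimal(low) <= value <= Decimal(high)


def is_valid_latitude(text: str) -> bool:
    """Hand-written stand-in for the latitudeType simple type."""
    return in_inclusive_range(text, "-90", "90")


print(is_valid_latitude("37.386"))   # True
print(is_valid_latitude("90"))       # True (bounds are inclusive)
print(is_valid_latitude("90.0001"))  # False
print(is_valid_latitude("north"))    # False
```

Declaring the range in the schema means every consumer gets this check for free, rather than each client reimplementing it.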

XML Schema offers a unified, XML‑based approach that many developers prefer over the older, more cryptic DTD syntax. It integrates smoothly with tools that generate code, validate documents, and enforce business rules. Because it is XML, editors and parsers treat it as just another XML file, making the learning curve gentler.

Another validation notation that has gained traction is RELAX NG, the successor to the earlier RELAX and TREX schema languages. A RELAX NG schema can be written either in XML or in a more compact non‑XML syntax, and it supports features such as attribute value patterns, choice constructs, and recursion without requiring XML Schema. The RELAX tutorial at
