Introduction to the Basics of XML

Notes from a lecture I gave circa 2000 on what XML is. The discussion of DTDs is still valid, although DTDs have largely been superseded by XML Schemas.

  1. Pedantic Overview

    What we will learn today:

    1. Why use XML?

    2. What is a tag?

    3. What is an element?

    4. What does it mean that an XML document is "well formed"?

    5. What does it mean that an XML document is "valid"?

    Disclaimer: These lessons are streamlined to teach how to read and write XML files. The real definition of XML is at http://www.w3.org/TR/REC-xml

  2. Introduction

    XML is an acronym for "eXtensible Markup Language". XML has many advantages over other data formats:

    1. Error checking is done automatically

    2. Data can be shared by many different programs

    3. Extensive programming libraries are available for free

    4. Microsoft is making it the foundation of ".NYET". (Speaker at XMLOne: "If you cut someone at Microsoft, they bleed XML").

  3. Naming

    Names of the basic building blocks of XML, elements and attributes, are restricted by the following rules:

    1. Must be composed only of letters (upper and lower case), digits, hyphens (-), underscores (_), colons (:), and periods (.)
    2. Must start with a letter, '_', or ':'
    3. Cannot begin with "XML" (lower or uppercase or any combination)

    (Although not formally required, colons are used for namespaces)

    Quiz - which are good names?:
    1. survey1
    2. red car
    3. stoplight4_1
    4. -go
    5. go!
    6. _go
    7. 4me
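
    The three naming rules above can be checked mechanically. The following is a minimal sketch in Python, assuming ASCII-only names (the XML specification itself also allows many non-ASCII letters); the function name is_good_name is made up for this illustration.

```python
import re

# ASCII-only approximation of the naming rules above: first character is a
# letter, "_" or ":"; the rest are letters, digits, ".", "_", ":" or "-".
NAME_PATTERN = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")


def is_good_name(name):
    """Return True if `name` passes the simplified naming rules."""
    if name.lower().startswith("xml"):
        return False  # rule 3: names beginning with "xml" in any case
    return NAME_PATTERN.fullmatch(name) is not None


# The quiz candidates from this section.
for candidate in ["survey1", "red car", "stoplight4_1", "-go", "go!", "_go", "4me"]:
    print(repr(candidate), is_good_name(candidate))
```

    Running it prints one True/False verdict per quiz candidate.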
  4. Start Tags

    A start tag is a name preceded by "<" and followed by ">". For example, the paragraph tag, "<p>".

    Note: Tags are case sensitive. <House> is different from <house> or <housE>

    Quiz - which are nice start tags?:
    1. <car>
    2. <Car>
    3. >Car<
    4. <caR>
    5. <test one>
    6. <test_2>
    7. <test3
  5. End Tags

    An end tag is a name preceded by "</" and followed by ">". For example, the paragraph end tag, "</p>".

    Quiz - which are nice end tags?:
    1. </car>
    2. <Car/>
    3. >Car<
  6. Elements

    Elements have a start tag, optional content, and an end tag. The end tag has the same name as the start tag, preceded by a "/". e.g.,

    <p> I'm an element</p>
    <car>...content...</car>

    In common usage "tag" and "element" are often interchanged, although technically an element is composed of a start tag, content, and an end tag.

    If an element has no content, you may collapse the start and end tags into a single tag by placing a "/" before the closing ">" of the start tag. For example:

    <br />
    <hr />
    <meta/>
    Quiz - which are nice elements?:
    1. <car>
    2. <art>...contents...</art>
    3. <dna></dna>
    4. <coffee>...contents...</Coffee>
    5. <tea>...contents...<tea/>
    6. <hr/>
    7. </ol>
  7. Element Nesting

    Elements are typically nested inside one another.

    <book name="The Persian Expedition">
       <author name="Xenophon" />
       <chapter title="Persia Awaits">
         <p>It was the spring of the year...</p>
       </chapter>
    </book>

    An important thing to remember is that elements may not overlap:

    The following is incorrect:
    <p> The <em>most <b>important </em> thing </b> is ...
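
    Because elements nest without overlapping, a parser can hand the document back as a tree. Here is a small sketch using Python's standard xml.etree.ElementTree module on the book example above; the helper name outline is made up for this illustration.

```python
import xml.etree.ElementTree as ET

BOOK = """<book name="The Persian Expedition">
   <author name="Xenophon" />
   <chapter title="Persia Awaits">
     <p>It was the spring of the year...</p>
   </chapter>
</book>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK)
print("\n".join(outline(root)))
```

    The indentation of the printed names mirrors the nesting: author and chapter sit inside book, and p sits inside chapter.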
    
  8. Attributes

    "Attributes" are name-value pairs inside start tags. An example is the "border" attribute of the table tag,

     <table border="1">
    

    The general format is

    
     <tag name1="value1" name2="value2" ...  >
    
    General Notes
    1. The attribute value must be surrounded by quotes (unlike in HTML)

    2. Either single or double quotes may be used to surround the attribute value. Single quotes may be embedded inside double quotes and vice versa, but the same type of quote must open and close the value.

    3. Almost any character may be placed inside the attribute value. The exceptions are "<" and "&", which must be written as "&lt;" and "&amp;" (a literal ">" is allowed).

    4. XML purists advocate using very few attributes and putting most information in child elements.
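
    The quoting rules are easy to try out with a parser. This is a minimal sketch using Python's standard xml.etree.ElementTree; the element and attribute names (note, author, remark) and the helper parses are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# One value in single quotes, one in double quotes; each value may contain
# the other kind of quote unescaped.
element = ET.fromstring("""<note author='Tom "TJ" Smith' remark="Let's go" />""")
print(element.attrib)


def parses(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses("<table border=1 />"))           # unquoted value
print(parses("<img src='mypic.png\" />"))     # opening and closing quotes differ
print(parses('<prefix prefixtype=">" />'))    # a literal ">" inside a value
print(parses('<prefix prefixtype="<" />'))    # a literal "<" inside a value
```

    The parser returns attributes as an ordinary dictionary of names to values, with the surrounding quotes removed.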

  9. Quiz - which are nice attributes inside these start tags?:
    1. <table bgcolor=red>
    2. </table width="80%">
    3. <img src='mypic.png" />
    4. <img src=mypic.png" />
    5. <br clear="all">
    6. <div class="codeblock" align="left">
    7. <meta http-equiv="pragma" content="no-cache" />
    8. <prefix prefixtype=">" />
  10. Entities

    Entities are string variables. Entities come in two flavors, general and parameter. They are both defined in the DTD.

    1. General

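      Although general entities are defined in the DTD, they are referenced in your XML document. HTML coders will recognize "&amp;" as an example of a general entity reference. Another example from HTML is the copyright symbol, ©, written "&copy;". XML itself predefines only five entities (&amp;, &lt;, &gt;, &apos;, and &quot;); anything else, including &copy;, must be declared in the DTD.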

      Entity references start with "&" and end with ";". The syntax for creating your own Entity in the DTD is

      
      <!ENTITY Name EntityDefinition>
      
      An example would be
      
      <!ENTITY Computer "DELL">
      

      Now everywhere "&Computer;" appears in your document, it will be replaced with "DELL".

    2. Parameter

      These live only in your DTD. If you repeat the same series of attributes inside many different elements, it may be worth defining a parameter entity for them. For example, if many of your elements have an x, y, and z dimension, you can replace the tedious (and error-prone) repetition by putting the three attribute definitions in a parameter entity and referencing it as "%dimensions;" in each attribute list.

      
      <!ENTITY % dimensions "x CDATA #IMPLIED y CDATA #IMPLIED z CDATA #IMPLIED">
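
    General entity replacement can be watched in action with any parser that reads the DTD. The sketch below uses Python's standard xml.etree.ElementTree with the DTD embedded in the document itself; the note element and its text are made up for this illustration, while the Computer entity is the one declared earlier in this section.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE note [
  <!ENTITY Computer "DELL">
]>
<note>My &Computer; costs &lt; $1000 &amp; works fine.</note>"""

root = ET.fromstring(DOCUMENT)
# The parser has already replaced &Computer; and the predefined
# &lt; and &amp; references by the time the text is available.
print(root.text)
```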
      
  11. Well Formed

    An XML document is "Well Formed" when it follows the general rules of XML. For simple documents these include:

    1. All elements have a start and an end tag.

    2. All elements are properly nested.

    3. All name-value pairs in attributes are properly formatted with the appropriate quotes.

    How to test for well formed documents? Open the document in IE 5.0; it will report an error if the document is not well formed.
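
    Any XML parser can serve as a well-formedness checker, since it must refuse a document that breaks these rules. Here is a minimal sketch with Python's standard xml.etree.ElementTree; the helper name well_formed_error and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def well_formed_error(text):
    """Return None if `text` is well formed, else the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None


SAMPLES = {
    "properly nested": "<p>The <em>most <b>important</b></em> thing is ...</p>",
    "missing end tag": "<car><wheel></car>",
    "overlapping": "<p>The <em>most <b>important </em> thing </b> is ...</p>",
    "unquoted attribute": "<table bgcolor=red></table>",
    "case mismatch": "<coffee>...contents...</Coffee>",
}
for label, text in SAMPLES.items():
    print(label, "->", well_formed_error(text) or "well formed")
```

    Only the first sample passes; each of the others breaks one of the rules listed above.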

  12. Valid

    "Valid" XML documents are well formed and also conform to a specific Document Type Definition (DTD). The DTD includes rules on

    1. How elements may be nested

    2. What attributes an element may have

    3. The contents of those attribute values

    4. Variables that have been defined

    A single DTD may be used for many documents. Millions of XHTML documents may all use the same DTD.

    How to test for valid documents? Our friends at Microsoft have a plugin for IE5 to validate files. Visit http://msdn.microsoft.com/downloads/default.asp?URL=/code/topic.asp?URL=/msdn-files/028/000/072/topic.xml and download the "Internet Explorer Tools for Validating XML and Viewing XSLT Output" package. When you right click on an xml document it will have an option to validate the document.

  13. Document Type Definition (DTD)

    The two major components of a DTD are the "Element Type Declarations" and "Attribute List Declarations".

  14. Element Type Declarations - general

    One of the great things about XML is that you can define exactly what each element may contain and the order of its contents.

    Inside a DTD, 'Element Type Declarations' describe what an element may contain.

    The general syntax is
    
    <!ELEMENT elementname  contents >
    
    The contents can be one of the following:
    1. list of elements - e.g., (apple|banana|pear)
    2. EMPTY
      No value is contained inside the element.
    3. ANY
      Anything can be inside. Dangerous, in a way, but useful sometimes.
    4. mixed-content - character data and elements
    5. character data

    Examples:

    
    <!ELEMENT br EMPTY>
    <!ELEMENT container ANY>
    (The use of ANY is discouraged, but sometimes it could be helpful).
  15. Element Type Declarations - with children

    Example:

    
    <!ELEMENT book (author,chapter+)>
    

    The allowable children are listed in a group inside parentheses.

    Special operators tell the parser how many of each type are allowed and in what order.

  16. Element Type Declarations - exercise 1:
    Operator  Description
    ()        groups elements
    ,         separates items that must appear in this order
    |         or operator
    ?         0 or 1 elements
    *         0 or more elements
    +         1 or more elements

    Given the following ETD:

    
    <!ELEMENT survey (Head*,Page+)>
    
    Quiz - which are valid contents of survey?:
    1. <Head /><Page /><Page /><Page />
    2. <Head /><Page /><Page />
    3. <Head />
    4. <Page /><Page /><Page />
    5. <Page /><Page /><Page /><Head />
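
    The operators behave much like those of regular expressions, so a content model can be translated by hand into a pattern over the sequence of child names. The following is a sketch of that idea in Python, not how a real validating parser is implemented; the names SURVEY_MODEL and fits_model are made up for this illustration.

```python
import re

# Hand translation of (Head*,Page+): every child name becomes "Name " and
# the "," operator becomes plain concatenation. The operators | ? * +
# carry over to the regular expression unchanged.
SURVEY_MODEL = re.compile(r"(?:Head )*(?:Page )+")


def fits_model(model, child_names):
    """Return True if the sequence of child names satisfies the model."""
    sequence = "".join(name + " " for name in child_names)
    return model.fullmatch(sequence) is not None


# The five quiz candidates above, written as lists of child element names.
QUIZ = [
    ["Head", "Page", "Page", "Page"],
    ["Head", "Page", "Page"],
    ["Head"],
    ["Page", "Page", "Page"],
    ["Page", "Page", "Page", "Head"],
]
for number, children in enumerate(QUIZ, start=1):
    print(number, fits_model(SURVEY_MODEL, children))
```

    Candidates 3 and 5 are rejected: the first has no Page at all, and the second places Head after the Page elements.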
  17. Element Type Declarations - exercise 2:

    Given the following ETD:

    
    <!ELEMENT Page ((Question*|p*)*,Buttons*)>
    
    Quiz - which are valid Page elements?:
    1. <Page></Page>
    2. <Page><Buttons /></Page>
    3. <Page><Question /><Buttons /></Page>
    4. <Page><Question /><Question /><Buttons /></Page>
    5. <Page><Question /><Question /><Buttons /><Question /></Page>
    6. <Page><Question /><p /><p /><Question /><Buttons /></Page>
  18. Element Type Declarations - plain ol' text

    Some children will contain just regular old plain text. You declare these with "#PCDATA".

    For Example:
    
    <!ELEMENT QuestionText (#PCDATA)>
    
    Given the following DTD fragment:
    
    <!ELEMENT survey1 (Head*,Page+)>
    <!ELEMENT Page ((Question*|p*)*,Buttons*)>
    <!ELEMENT Question (QuestionText*)>
    <!ELEMENT QuestionText (#PCDATA)>
    Quiz - which are valid contents of survey1?:
    1. <Page><Question><QuestionText>What is your name?</QuestionText></Question></Page>
    2. <Page><Question><QuestionText><p>What is your name?</p></QuestionText></Question></Page>
    3. <Page><Question><QuestionText><p>What is your name?</p></QuestionText></Question><Buttons /></Question></Page>
    4. <Page><p>Thanks for taking our survey!</p></Page>
  19. Attribute List Declarations

    What attributes can my element have, and what can be in their values?

    
    <!ATTLIST elementName
    name1 type1 default1
    name2 type2 default2
    ...
    >

    Where type is one of
    1. CDATA - character data
    2. ID - a name that must be unique within the document (at most one ID attribute per element)
    3. IDREF or IDREFS - points to one or more IDs
    4. ENTITY or ENTITIES
    5. NMTOKEN or NMTOKENS - name tokens (built from the same characters as XML names)
    6. enumeration - list of valid strings
    7. NOTATION
    Default value is one of
    1. #REQUIRED - the element in the XML document must supply this attribute
    2. #IMPLIED - this attribute is optional
    3. #FIXED value - the attribute always has this value
    4. value - if no value is specified, this one is used
  20. Attribute List Declarations 2

    The two most common types are CDATA and enumeration.

    Examples:
    
    <!ATTLIST ChoiceList 
    tableAttributes CDATA #IMPLIED
    HTMLWidget (dropdownlist|radio|checkbox) "dropdownlist"
    debug (yes|no) #IMPLIED
    choicelistdef IDREF #IMPLIED
    >

    The default value appears after an enumeration of choices. If the element in the xml document does not supply a value for this attribute, the default is used.

    When "#IMPLIED" is used, it means the name-value pair is optional.
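
    Default values can be observed even with a non-validating parser, provided it reads the attribute list declarations. The sketch below assumes Python's standard xml.etree.ElementTree (built on the expat parser) and embeds a shortened version of the ChoiceList declaration in the document itself. Note that this parser fills in defaults but does not check that supplied values are among the enumerated choices.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE ChoiceList [
  <!ATTLIST ChoiceList
    tableAttributes CDATA #IMPLIED
    HTMLWidget (dropdownlist|radio|checkbox) "dropdownlist"
    debug (yes|no) #IMPLIED
  >
]>
<ChoiceList debug="no" />"""

root = ET.fromstring(DOCUMENT)
# HTMLWidget was not written in the document, so its declared default is
# supplied; tableAttributes is #IMPLIED and has no default, so it is absent.
print(root.get("HTMLWidget"))
print(root.get("debug"))
print(root.get("tableAttributes"))
```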

  21. Attribute List Declarations - quiz

    Given the following Attribute List Declaration:

    
    <!ATTLIST survey1
       name NMTOKEN #REQUIRED
       host CDATA #IMPLIED
       test (yes|no) "yes"
       debug (yes|no) "yes"
    >
    Quiz - which are valid attributes?:
    1. <survey1 host="dizzy" />
    2. <survey1 name="ae4" host="dizzy" />
    3. <survey1 name="ae4" host="dizzy" test="null" />
    4. <survey1 name=" xyz " test="null" />
  22. Attribute List Declarations with Entities

    Typically a DTD makes use of entities. For Example:

    
    <!ENTITY % YesNo "(yes|no)">
    <!ENTITY % Integer "CDATA">
    <!ATTLIST survey1
       show %YesNo; #IMPLIED
       border %Integer; "1"
    >
  23. XML documents

    XML documents start with a declaration of version and encoding,


    
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    

    This is typically followed by a document type declaration that says where the DTD resides. The word "Coins" below names the top-level (outermost) element of the document. The actual DTD file may be across the Internet or on the same machine.


    
    <!DOCTYPE Coins SYSTEM "Coins1.dtd">
    

    In smaller documents the dtd may be embedded in the actual xml document.
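
    A parser exposes the pieces of the document type declaration separately from the document content. Here is a small sketch using Python's standard xml.dom.minidom, which records the declaration without fetching Coins1.dtd; the empty Coins element is a stand-in made up for this illustration.

```python
from xml.dom import minidom

DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE Coins SYSTEM "Coins1.dtd">
<Coins />"""

document = minidom.parseString(DOCUMENT)
doctype = document.doctype

print(doctype.name)       # the declared top-level element name
print(doctype.systemId)   # where the DTD is said to reside
print(document.documentElement.tagName)  # the actual outermost element
```

    For a valid document, the name in the declaration and the actual outermost element must agree.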

  24. Sample XML document
    
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE survey1 SYSTEM "http://localhost/dtd/survey1.dtd">
    <survey1 name="test">
    <Page><Question><QuestionText>What is your name?</QuestionText></Question></Page>
    <Page><p>Thanks for taking our survey!</p></Page>
    </survey1>
  25. Sample DTD document
    
    <!ELEMENT survey1 (Head*,Page+)>
    <!ATTLIST survey1
    name NMTOKEN #REQUIRED
    host CDATA #IMPLIED
    version CDATA "1.0"
    test (yes|no) "yes"
    >

    <!ELEMENT Page ((Question*|p*)*,Buttons*)>
    <!ELEMENT Question (QuestionText*)>
    <!ELEMENT QuestionText (#PCDATA)>

    <!ELEMENT Head (#PCDATA)>
    <!ELEMENT Buttons (#PCDATA)>
    <!ELEMENT p (#PCDATA)>
  26. Sample XML and DTD in one document
    
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE survey1 [
    <!ELEMENT survey1 (Head*,Page+)>
    <!ATTLIST survey1
    name NMTOKEN #REQUIRED
    host CDATA #IMPLIED
    version CDATA "1.0"
    test (yes|no) "yes"
    >

    <!ELEMENT Page ((Question*|p*)*,Buttons*)>
    <!ELEMENT Question (QuestionText*)>
    <!ELEMENT QuestionText (#PCDATA)>

    <!ELEMENT Head (#PCDATA)>
    <!ELEMENT Buttons (#PCDATA)>
    <!ELEMENT p (#PCDATA)>
    ]>

    <survey1 name="test">
    <Page><Question /><Buttons /></Page>
    <Page><Buttons /></Page>
    <Page><Question /><Question /><Buttons /></Page>
    <Page><Question /><Question /><Buttons /></Page>
    <Page><Question /><p /><p /><Question /><Buttons /></Page>

    <Page><Question><QuestionText>What is your name?</QuestionText></Question></Page>
    <Page><Question><QuestionText>What is your name?</QuestionText></Question></Page>
    <Page><p>Thanks for taking our survey!</p></Page>
    </survey1>
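
    To get a feel for what a validating parser does with these declarations, the element content models can be checked by hand. The sketch below, in Python, translates three of the content models from the DTD above into regular expressions over child names and checks a shortened copy of the sample document. It is a teaching aid under those assumptions, not a real validator: it ignores attributes and #PCDATA elements, and the names CONTENT_MODELS and content_errors are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written patterns: each child name becomes "Name ", "," becomes
# concatenation. (Question*|p*)* accepts any mix of Question and p, so it
# is written as the simpler, equivalent (Question|p)*.
CONTENT_MODELS = {
    "survey1": re.compile(r"(?:Head )*(?:Page )+"),
    "Page": re.compile(r"(?:Question |p )*(?:Buttons )*"),
    "Question": re.compile(r"(?:QuestionText )*"),
}

SAMPLE = """<survey1 name="test">
<Page><Question /><Buttons /></Page>
<Page><Question /><p /><p /><Question /><Buttons /></Page>
<Page><Question><QuestionText>What is your name?</QuestionText></Question></Page>
<Page><p>Thanks for taking our survey!</p></Page>
</survey1>"""


def content_errors(root):
    """List elements whose sequence of children breaks their content model."""
    errors = []
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            continue  # #PCDATA elements are not checked in this sketch
        children = "".join(child.tag + " " for child in element)
        if not model.fullmatch(children):
            errors.append(f"<{element.tag}> may not contain: {children.strip()}")
    return errors


print(content_errors(ET.fromstring(SAMPLE)))

# A Page whose Buttons come before its Question breaks the Page model.
bad = ET.fromstring("<survey1><Page><Buttons /><Question /></Page></survey1>")
print(content_errors(bad))
```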
  27. How to include one DTD in another. From the XML FAQ.
    
    <!ENTITY % mylists PUBLIC 
    "-//Foo, Inc//ENTITIES Common list structures//EN"
    "dtds/listfrag.ent">
    ...
    %mylists;
  28. How to use CDATA to tell the parser to ignore markup for elements
    
    <AttributeScript attribute="firstBlock">
    <![CDATA[
    if(count < 10) {
    answer.add("Block1");
    } else {
    answer.add("Block2");
    }
    ]]> </AttributeScript>
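
    The effect of the CDATA section can be confirmed by parsing the example. This is a minimal sketch with Python's standard xml.etree.ElementTree; the script element used for the failing comparison is made up for this illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<AttributeScript attribute="firstBlock">
<![CDATA[
if(count < 10) {
answer.add("Block1");
} else {
answer.add("Block2");
}
]]> </AttributeScript>"""

script = ET.fromstring(DOCUMENT)
# The "<" inside the CDATA section arrives as ordinary text.
print(script.text)

# Without CDATA (or &lt;), the same "<" is read as the start of a tag.
try:
    ET.fromstring("<script>if(count < 10) { }</script>")
    unprotected_ok = True
except ET.ParseError:
    unprotected_ok = False
print(unprotected_ok)
```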
  29. Online References for XML:
    1. http://www.xmlaustin.org
    2. XML Notepad
    3. http://www.w3schools.com - great tutorials on XML, HTML, XSL
    4. Guide to XML software
    5. World Wide Web Consortium's standards for XML
    6. http://www.arbortext.com/index.html
    7. http://architag.com/xmlu/
    8. http://msdn.microsoft.com/xml (use ie)
    9. http://www.xml.com
    10. Yahoo's XML links
    11. Sun's development with xml notes
    12. IBM's XML site
    13. http://www.webdeveloper.com/xml/
    14. http://developerlife.com/
    15. Index of free xml tools
    16. http://www.ucc.ie/xml/ - XML FAQ
    17. XHTML
    18. http://www.w3.org/MarkUp/
    19. http://www.xhtmlquickref.com/
  30. Pedantic Review

    What we learned today:

    1. Why use XML?

    2. What is a tag?

    3. What is an element?

    4. What does it mean that an XML document is "well formed"?

    5. What does it mean that an XML document is "valid"?

    6. How do I read a simple DTD?