A gentle introduction to XML schemas

  1. Goals for Session:
    1. What is an XML Schema?
    2. Why use XML Schemas?
    3. How can Schemas be used in software development?
  2. Obligatory quotes

    "All computer science problems are, in the end, simply graph problems." -attributed to Donald Knuth

    "All XML documents are, in the end, simply graphs." -Mitch Fincher

  3. XML Schema Definition (XSD)

    The XSD language defines the structure and content of an XML document. We can specify which elements and attributes must exist, how many of each, and the types of values inside elements and attributes.
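
    As a minimal sketch of those three kinds of constraints, the schema below describes a hypothetical "library" document. The element and attribute names are invented for illustration and do not come from any real system.

    ```xml
    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="library">
        <xsd:complexType>
          <xsd:sequence>
            <!-- how many: at least one book, at most 500 -->
            <xsd:element name="book" minOccurs="1" maxOccurs="500">
              <xsd:complexType>
                <xsd:sequence>
                  <!-- types of values inside elements -->
                  <xsd:element name="title" type="xsd:string"/>
                  <xsd:element name="published" type="xsd:date" minOccurs="0"/>
                </xsd:sequence>
                <!-- which attributes must exist, and their types -->
                <xsd:attribute name="isbn" type="xsd:string" use="required"/>
                <xsd:attribute name="copies" type="xsd:positiveInteger"/>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
          <xsd:attribute name="name" type="xsd:string" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>
    ```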

  4. Why use XML Schemas

    Schemas can define objects and their relationships in a descriptive manner. This information can be used as a foundation for applications and entire systems.

    Schemas also allow us to write "contracts" between different applications. If I pass you an XML file that conforms to the "shoe purchase order" XSD, you need not spend time analysing the document with custom code; the schema checker does it for you. This is more flexible than hard-coding the checks.
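
    To make the contract idea concrete, here is a sketch of what a "shoe purchase order" XSD might look like. The structure and every name in it are assumptions made for illustration, not an actual schema.

    ```xml
    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="shoePurchaseOrder">
        <xsd:complexType>
          <xsd:sequence>
            <!-- an order must contain at least one item -->
            <xsd:element name="item" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:attribute name="style" type="xsd:string" use="required"/>
                <xsd:attribute name="size" type="xsd:decimal" use="required"/>
                <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required"/>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
          <xsd:attribute name="orderDate" type="xsd:date" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>
    ```

    A document that honors this hypothetical contract:

    ```xml
    <?xml version="1.0"?>
    <shoePurchaseOrder orderDate="2003-08-01">
      <item style="loafer" size="9.5" quantity="2"/>
    </shoePurchaseOrder>
    ```

    An order with quantity="0", a missing orderDate, or no item elements would be rejected by the schema checker, with no custom checking code on the receiving side.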

  5. What is the difference between DTDs and XSDs?
    1. XSDs are written in XML, so they are easier to manipulate with ordinary XML tools.
    2. Element and attribute types can be specified. We can declare that "quantity" must be an integer greater than 0, and that "sex" must be either 'male', 'female', or 'unknown'.
    3. Inheritance can be used in defining elements.
    4. Schemas can be decomposed into separate files, making maintenance easier.

    Schemas are what DTDs wanted to be in the beginning, but didn't know it.
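
    The typing and inheritance points can be sketched in a few declarations. This is an illustrative schema only: the type names (quantityType, sexType, personType, customerType) are hypothetical.

    ```xml
    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

      <!-- "quantity" must be an integer greater than 0 -->
      <xsd:simpleType name="quantityType">
        <xsd:restriction base="xsd:integer">
          <xsd:minExclusive value="0"/>
        </xsd:restriction>
      </xsd:simpleType>

      <!-- "sex" must be one of three values -->
      <xsd:simpleType name="sexType">
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="male"/>
          <xsd:enumeration value="female"/>
          <xsd:enumeration value="unknown"/>
        </xsd:restriction>
      </xsd:simpleType>

      <!-- a base type -->
      <xsd:complexType name="personType">
        <xsd:sequence>
          <xsd:element name="name" type="xsd:string"/>
        </xsd:sequence>
        <xsd:attribute name="sex" type="sexType" use="required"/>
      </xsd:complexType>

      <!-- a type that inherits from personType and adds to it -->
      <xsd:complexType name="customerType">
        <xsd:complexContent>
          <xsd:extension base="personType">
            <xsd:sequence>
              <xsd:element name="email" type="xsd:string"/>
            </xsd:sequence>
            <xsd:attribute name="quantity" type="quantityType"/>
          </xsd:extension>
        </xsd:complexContent>
      </xsd:complexType>

      <xsd:element name="customer" type="customerType"/>
    </xsd:schema>
    ```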

  6. Data Types

    Schemas have two kinds of data types: simple and complex.

    1. Simple types include:

      string, normalizedString, token, byte, unsignedByte, base64Binary, hexBinary, integer, positiveInteger, negativeInteger, nonNegativeInteger, nonPositiveInteger, int, unsignedInt, long, unsignedLong, short, unsignedShort, decimal, float, double, boolean, time, dateTime, duration, date, gMonth, gYear, gYearMonth, gDay, gMonthDay, Name, QName, NCName, anyURI, language

      We can use these to enforce data types, like the following:

      <xsd:attribute name="hireDate" type="xsd:date"/>
      
    2. Complex types are combinations of simple types in a particular structure.

      <complexType name="PurchaseOrderType">
        <sequence>
          <element name="shippingAddress" type="string"/>
          <element name="billingAddress" type="string"/>
          <element name="comment" minOccurs="0"/>
        </sequence>
      </complexType>
      
  7. Isn't that like a database schema?

    Yes. Schemas define members, data types and relationships. But schemas have more power than just typical database schemas. Think of XML schemas as a superset of database schemas.
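
    The "relationships" part of the analogy can be sketched with identity constraints, where xsd:key plays roughly the role of a primary key and xsd:keyref the role of a foreign key. The company, department, and employee names below are hypothetical.

    ```xml
    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="company">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="department" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:attribute name="id" type="xsd:int" use="required"/>
                <xsd:attribute name="name" type="xsd:string" use="required"/>
              </xsd:complexType>
            </xsd:element>
            <xsd:element name="employee" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:attribute name="name" type="xsd:string" use="required"/>
                <xsd:attribute name="departmentId" type="xsd:int" use="required"/>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
        <!-- like a primary key: every department id must be present and unique -->
        <xsd:key name="departmentKey">
          <xsd:selector xpath="department"/>
          <xsd:field xpath="@id"/>
        </xsd:key>
        <!-- like a foreign key: every employee must point at an existing department -->
        <xsd:keyref name="employeeDepartment" refer="departmentKey">
          <xsd:selector xpath="employee"/>
          <xsd:field xpath="@departmentId"/>
        </xsd:keyref>
      </xsd:element>
    </xsd:schema>
    ```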

  8. Custom Types and Restrictions

    Schemas are very expressive in specifying restrictions on data. Custom types can be created and reused.

    1. A string whose length is always between 1 and 256 characters

      <xsi:attribute name="name" use="required">
          <xsi:simpleType>
            <xsi:restriction base="xsi:string">
              <xsi:maxLength value="256"/>
              <xsi:minLength value="1"/>
            </xsi:restriction>
          </xsi:simpleType>
      </xsi:attribute>
      
    2. Numbers within a certain range

      <xsi:attribute name="singleDigit" use="required">
            <xsi:simpleType>
               <xsi:restriction base="xsi:int">
                  <xsi:minInclusive value="0"/>
                  <xsi:maxInclusive value="9"/>
               </xsi:restriction>
            </xsi:simpleType>
      </xsi:attribute>
      
      
    3. Only numbers in an enumeration

      <xsi:attribute name="littleEvens" use="required">
            <xsi:simpleType>
               <xsi:restriction base="xsi:int">
                  <xsi:enumeration value="0"/>
                  <xsi:enumeration value="2"/>
                  <xsi:enumeration value="4"/>
                  <xsi:enumeration value="6"/>
                  <xsi:enumeration value="8"/>
               </xsi:restriction>
            </xsi:simpleType>
      </xsi:attribute>
      
      
    4. Regular Expression

      Patterns follow the W3C's definition of regular expressions in the XML Schema specification.

      Example of enforcing a regular expression for IP Addresses (eg, 120.52.1.23)

        <xsi:simpleType name="ipAddressType">
          <xsi:restriction base="xsi:string">
            <xsi:pattern value="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"/>
          </xsi:restriction>
        </xsi:simpleType>
      

      (Why is this not a sufficient restriction for production use?)
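
      One possible answer: the pattern checks only the shape of the value, so it accepts strings such as 999.999.999.999. Below is a sketch of a tighter type, assuming IPv4 dotted-decimal addresses written without leading zeros. The name strictIpAddressType is made up for illustration, and the xsi prefix is assumed to be bound to the XML Schema namespace as in the fragment above.

      ```xml
      <xsi:simpleType name="strictIpAddressType">
        <xsi:restriction base="xsi:string">
          <!-- each octet is 0-255: 250-255, 200-249, 100-199, or 0-99 -->
          <xsi:pattern value="((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"/>
        </xsi:restriction>
      </xsi:simpleType>
      ```

      Even this only checks syntax; whether an address is reachable or allowed is still a job for application code.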

  9. Validation

    An example of validating an XML file against its schema in C#.
    (This is not efficient because the schema is parsed on each call. In a real application, the schema would be cached.)

    // Requires: using System.IO; using System.Xml; using System.Xml.Schema;
    // Note: XmlValidatingReader is marked obsolete since .NET Framework 2.0;
    // newer code normally uses XmlReader.Create with XmlReaderSettings.

    /// <summary>
    /// Validates an XML document against a schema.
    /// </summary>
    /// <param name="xmlString">string representation of the XML document</param>
    /// <param name="schemaString">string representation of the schema</param>
    /// <returns>null if the document is valid, otherwise a string containing the error messages</returns>
    public static string ValidateXML(string xmlString, string schemaString) {
        XmlTextReader xmlTextReader = new XmlTextReader(new StringReader(xmlString));
        XmlValidatingReader xmlValidatingReader = new XmlValidatingReader(xmlTextReader);
        xmlValidatingReader.ValidationType = ValidationType.Schema;
        // Parse the schema and register it with the validating reader.
        XmlSchema xmlSchema = XmlSchema.Read(new StringReader(schemaString), null);
        xmlValidatingReader.Schemas.Add(xmlSchema);
        // Collect validation errors instead of throwing on the first one.
        Validate validate = new Validate();
        xmlValidatingReader.ValidationEventHandler += new ValidationEventHandler(validate.Validator);
        // Reading the whole document drives the validation.
        while (xmlValidatingReader.Read()) { }
        return validate.errors;
    }
    /// <summary>
    /// Tiny class so we can have a ValidationEventHandler and collect the errors.
    /// </summary>
    private class Validate {
        public string errors = null;
        public void Validator(object sender, ValidationEventArgs args) {
            errors += args.Message + "\r\n";
        }
    }
    
  10. Using Namespaces

    Namespaces prevent name collisions when schemas are combined. The instance document below is followed by its schema.

    <?xml version="1.0"?>
    <sd:SurveyStatus 
    xmlns:sd="https://www.fincher.org/SurveyDirector/1.2" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" 
    xs:schemaLocation="https://www.fincher.org/SurveyDirector/1.2  https://127.0.0.1/SurveyDirector/xsd/SurveyStatus.xsd" 
    
    name="FordAA" dateStart="2003-08-01" dateEnd="2003-08-31">
      <CellStatus name="tv" pending="0" completed="50" target="100"/>
      <CellStatus name="shopper" pending="1" completed="5" target="90"/>
      <CellStatus name="gender" pending="0" completed="70" target="100"/>
      <CellStatus name="seg" pending="3" completed="70" target="80"/>
    </sd:SurveyStatus>
    

    <?xml version="1.0"?>
    <xsi:schema targetNamespace="https://www.fincher.org/SurveyDirector/1.2" 
    xmlns:sd="https://www.fincher.org/SurveyDirector/1.2" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema" 
    elementFormDefault="unqualified" 
    attributeFormDefault="unqualified">
      <xsi:element name="SurveyStatus">
        <xsi:complexType>
          <xsi:sequence>
            <xsi:element name="CellStatus" maxOccurs="unbounded">
              <xsi:complexType>
                <xsi:attribute name="name" use="required">
                  <xsi:simpleType>
                    <xsi:restriction base="xsi:string">
                      <xsi:maxLength value="256"/>
                      <xsi:minLength value="1"/>
                    </xsi:restriction>
                  </xsi:simpleType>
                </xsi:attribute>
                <xsi:attribute name="pending" type="xsi:int" use="required"/>
                <xsi:attribute name="completed" type="xsi:int" use="required"/>
                <xsi:attribute name="target" type="xsi:int" use="required"/>
              </xsi:complexType>
            </xsi:element>
          </xsi:sequence>
          <xsi:attribute name="name" type="xsi:string"/>
          <xsi:attribute name="dateStart" type="xsi:date"/>
          <xsi:attribute name="dateEnd" type="xsi:date"/>
        </xsi:complexType>
      </xsi:element>
    </xsi:schema>
    
    
  11. Using No Namespaces

    Sometimes it's more convenient to have no namespaces in a schema, like this one.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsi:schema xmlns:xsi="http://www.w3.org/2001/XMLSchema">
      <xsi:element name="Respondent">
        <xsi:annotation>
          <xsi:documentation>SurveyDirector Respondent defines all the information we know about a respondent going through the SurveyDirector system.</xsi:documentation>
        </xsi:annotation>
        <xsi:complexType>
          <xsi:sequence>
            <xsi:element name="RespondentSurvey" minOccurs="0" maxOccurs="unbounded">
              <xsi:complexType>
                <xsi:attribute name="name" type="nameType" use="required"/>
                <xsi:attribute name="surveyStatus" type="xsi:string" use="required"/>
              </xsi:complexType>
            </xsi:element>
            <xsi:element name="RespondentDemographic" type="RespondentTextType" minOccurs="0" maxOccurs="unbounded"/>
            <xsi:element name="RespondentKeyQuestion" type="RespondentSelectType" minOccurs="0" maxOccurs="unbounded"/>
            <xsi:element name="RespondentOccupation" type="RespondentSelectType" minOccurs="0" maxOccurs="unbounded"/>
            <xsi:element name="RespondentSurveysTaken" type="RespondentSelectType" minOccurs="0" maxOccurs="unbounded"/>
            <xsi:element name="RespondentAction" minOccurs="0" maxOccurs="unbounded">
              <xsi:complexType>
                <xsi:attribute name="name" type="nameType" use="required"/>
                <xsi:attribute name="value" type="xsi:string" use="required"/>
              </xsi:complexType>
            </xsi:element>
            
          </xsi:sequence>
          <xsi:attribute name="userID" type="xsi:int" use="required"/>
          <xsi:attribute name="groupName" type="nameType" use="required"/>
          <xsi:attribute name="ipAddress" type="ipAddressType" use="required"/>
          <xsi:attribute name="browser" use="required">
            <xsi:simpleType>
              <xsi:restriction base="xsi:string">
                <xsi:minLength value="0"/>
                <xsi:maxLength value="128"/>
                <xsi:whiteSpace value="preserve"/>
              </xsi:restriction>
            </xsi:simpleType>
          </xsi:attribute>
          <xsi:attribute name="OS" type="nameType" use="required"/>
          <xsi:attribute name="lang" use="required">
            <xsi:simpleType>
              <xsi:restriction base="xsi:string">
                <xsi:minLength value="5"/>
                <xsi:maxLength value="32"/>
                <xsi:whiteSpace value="preserve"/>
              </xsi:restriction>
            </xsi:simpleType>
          </xsi:attribute>
          <xsi:attribute name="status" type="nameType" use="required"/>
          <xsi:attribute name="venue" type="xsi:int" use="required"/>
          
          <xsi:attribute name="test" use="required">
            <xsi:simpleType>
              <xsi:restriction base="xsi:int">
                <xsi:enumeration value="0"/>
                <xsi:enumeration value="1"/>
              </xsi:restriction>
            </xsi:simpleType>
          </xsi:attribute>
           
           
        </xsi:complexType>
      </xsi:element>
      <xsi:element name="RespondentDemographic" type="RespondentTextType"/>
      <xsi:element name="RespondentKeyQuestion" type="RespondentSelectType"/>
      <xsi:element name="RespondentSelect" type="RespondentSelectType"/>
      <xsi:complexType name="RespondentSelectType">
        <xsi:attribute name="name" type="xsi:string" use="required"/>
        <xsi:attribute name="selected" type="xsi:int" use="required"/>
      </xsi:complexType>
      <xsi:complexType name="RespondentTextType">
        <xsi:attribute name="name" type="xsi:string" use="required"/>
        <xsi:attribute name="value" type="xsi:string" use="required"/>
      </xsi:complexType>
      <xsi:simpleType name="nameType">
        <xsi:restriction base="xsi:string">
          <xsi:minLength value="1"/>
          <xsi:maxLength value="32"/>
          <xsi:whiteSpace value="preserve"/>
        </xsi:restriction>
      </xsi:simpleType>
      <xsi:simpleType name="ipAddressType">
        <xsi:restriction base="xsi:string">
          <xsi:pattern value="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"/>
        </xsi:restriction>
      </xsi:simpleType>
    </xsi:schema>
    
    

    The XML file uses the "noNamespaceSchemaLocation" attribute to point to the schema.

    <?xml version="1.0" encoding="UTF-8"?>
    <Respondent 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:noNamespaceSchemaLocation="https://127.0.0.1/SurveyDirector/xsd/Respondent.xsd" 
    userID="2" groupName="Baseline" ipAddress="172.30.77.19" 
    browser="Unknown-R2" OS="Unknown" lang="en-UK" 
    status="SentToSurvey:MOONMANG" venue="123" test="1">
      <RespondentSurvey name="MOONMANG" surveyStatus="SentTo"/>
      <RespondentSurvey name="MOONMANG" surveyStatus="ScreenOut"/>
      <RespondentDemographic name="_age" value="43"/>
      <RespondentDemographic name="_seg" value="A"/>
      <RespondentDemographic name="_sex" value="sex_1"/>
      <RespondentKeyQuestion name="cereal" selected="1"/>
      <RespondentKeyQuestion name="diet" selected="0"/>
      <RespondentKeyQuestion name="MMB" selected="0"/>
      <RespondentKeyQuestion name="MMD" selected="1"/>
      <RespondentKeyQuestion name="MME" selected="1"/>
      <RespondentKeyQuestion name="MMF" selected="1"/>
      <RespondentKeyQuestion name="MMH" selected="1"/>
      <RespondentKeyQuestion name="MMI" selected="1"/>
      <RespondentKeyQuestion name="MMJ" selected="1"/>
      <RespondentKeyQuestion name="shampoo2" selected="1"/>
      <RespondentOccupation name="Marketing" selected="1"/>
      <RespondentOccupation name="Food" selected="0"/>
      <RespondentOccupation name="Journalism" selected="0"/>
      <RespondentSurveysTaken name="Political6Months" selected="0"/>
      <RespondentSurveysTaken name="Food6Months" selected="1"/>
      <RespondentSurveysTaken name="Auto3Months" selected="1"/>
      <RespondentAction name="SentToSurvey" value="MOONMANI"/>
      <RespondentAction name="Start" value="Group:Baseline"/>
      <RespondentAction name="ViewPage" value="Demographic1"/>
      <RespondentAction name="ViewPage" value="KeyQuestions"/>
      <RespondentAction name="ViewPage" value="Occupational"/>
      <RespondentAction name="ViewPage" value="PassThru1"/>
      <RespondentAction name="ViewPage" value="SurveysTaken"/>
    </Respondent>
    
    
  12. Include option

    Schemas can reference other schemas located in separate files - very cool. In this example, the element "SurveyStatus" is defined in another file and pulled in with an include.

    <?xml version="1.0"?>
    <xsi:schema xmlns:xsi="http://www.w3.org/2001/XMLSchema">
      <!-- include the definition for a SurveyStatus element -->
      <xsi:include id="SurveyStatus" schemaLocation="https://127.0.0.1/SurveyDirector/xsd/SurveyStatus.xsd"></xsi:include>
      <xsi:element name="GroupStatus">
        <xsi:complexType>
          <xsi:sequence>
            <xsi:element name="SurveyStatusCollection" minOccurs="1" maxOccurs="1">
              <xsi:complexType>
                <xsi:sequence>
                  <xsi:element ref="SurveyStatus" maxOccurs="unbounded"></xsi:element>
                </xsi:sequence>
              </xsi:complexType>
            </xsi:element>
          </xsi:sequence>
          <xsi:attribute name="name" type="xsi:string" use="required"/>
          <xsi:attribute name="lang" type="xsi:string" use="required"/>
        </xsi:complexType>
      </xsi:element>
    </xsi:schema>
    
    

    The referenced schema

    <?xml version="1.0"?>
    <xsi:schema xmlns:xsi="http://www.w3.org/2001/XMLSchema">
      <xsi:element name="SurveyStatus">
        <xsi:complexType>
          <xsi:sequence>
            <xsi:element name="CellStatus" maxOccurs="unbounded" minOccurs="0">
              <xsi:complexType>
                <xsi:attribute name="name" use="required">
                  <xsi:simpleType>
                    <xsi:restriction base="xsi:string">
                      <xsi:maxLength value="256"/>
                      <xsi:minLength value="1"/>
                    </xsi:restriction>
                  </xsi:simpleType>
                </xsi:attribute>
                <xsi:attribute name="value" type="xsi:string" use="required"/>
                <xsi:attribute name="pending" type="xsi:int" use="required"/>
                <xsi:attribute name="completed" type="xsi:int" use="required"/>
                <xsi:attribute name="target" type="xsi:int" use="required"/>
              </xsi:complexType>
            </xsi:element>
          </xsi:sequence>
          <xsi:attribute name="name" type="xsi:string" use="required"/>
          <xsi:attribute name="dateStart" type="xsi:string" use="required"/>
          <xsi:attribute name="dateEnd" type="xsi:string" use="required"/>
          <xsi:attribute name="quotaSetIndex" type="xsi:int" use="required"/>
        </xsi:complexType>
      </xsi:element>
    </xsi:schema>
    
    

    
    
  13. Three-fold Diagram (in SVG)

    XML, C# objects, and database objects are really quite similar. Each contains smaller objects which have types and relationships. Often we are looking at the same real object, just from a different view.



    These are three views of a single object, but what is in the middle of the box? XSD.

    (What is SVG?)

  14. Example of the Three-Fold Way

    This is C# code, but it contains all the attributes needed to generate the database.

    // Database, PrimaryKey, ForeignKey and XML are custom attributes read by the framework.
    [Database(unique=true), PrimaryKey("Project_SurveyGroup"), ForeignKey("Survey","Project")]
    public int Project_Survey {
        get {
            // look the value up lazily on first access
            if (projectNum == -1) {
                projectNum = GetSASSurveyID(name);
            }
            return projectNum;
        }
    }
    [Database(), ForeignKey("Group","Project")]
    public int Project_SurveyGroup {
        get {
            if (groupNum == -1) {
                groupNum = Group.GetGroup(groupName).Project;
            }
            return groupNum;
        }
        set { groupNum = value; }
    }
    // stored both in the XML and in the database
    [XML(), Database()]
    public int priority {
        get { return GetAttributeInt("priority"); }
        set { SetAttributeInt("priority", value); }
    }
    
  15. Future Use of Schemas in Software Development

    Currently we build Java or C# objects, database schemas, data access layers, GUIs to edit our code objects, code to check the validity of GUI-entered data, and code to import and export data from databases. Not only is this time-consuming, but errors can creep into the code because definitions appear multiple times.

    DRY Principle: Don't Repeat Yourself.

    With XSD as the single repository for metadata about objects, we can create a framework (like MDA) to make transitions between the phases of the objects automatic.

    With an XSD for each object, the framework can do the following:

    1. autogenerate the C# or Java base code for an object (or use something like Reflection.Emit in C#).
    2. generate the database schema
    3. generate all the database INSERTs and UPDATEs (either as SQL statements or stored procedures)
    4. generate all the DELETEs for the object, including deleting all the exclusive children of this object.
    5. generate the HTML code to create or edit an object and do error catching based on the XSD restrictions

    Some special constraints will, of course, have to be hand-coded, but the vast bulk of code can be generated, or automated.
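
    One way such a framework could carry the extra metadata is in xsd:appinfo annotations, which XSD provides for machine-readable notes. The sketch below is purely illustrative: the "gen" namespace, its elements, and the "Survey" declaration are assumptions, not part of any existing generator.

    ```xml
    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                xmlns:gen="urn:example:codegen">
      <xsd:element name="Survey">
        <xsd:complexType>
          <xsd:attribute name="priority" type="xsd:int" use="required">
            <xsd:annotation>
              <xsd:appinfo>
                <!-- hypothetical hints a code generator could read -->
                <gen:column name="priority" sqlType="INT" nullable="false"/>
                <gen:property name="priority" clrType="int"/>
              </xsd:appinfo>
            </xsd:annotation>
          </xsd:attribute>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>
    ```

    A schema validator ignores the contents of xsd:appinfo, so the same file can serve as both the validation contract and the input to the generator.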

  16. Example from our latest project.
  17. My favorite references:
    1. www.w3.org overview and Definition.
    2. Introduction
    3. xml.com's introduction to schemas.
  18. Conclusion:

    The future of software development will be the use of graphical tools in a framework that encode UML diagrams into XML Schemas. These Schemas will be used to generate the database schema, the data access layer (object persistence), the C# and Java code base layer, the user GUI, and data validation of user-entered GUI data. This is part of Model Driven Architecture.