CL-XML: Document-Model-Based Parsing

20030408 (v 0.949)
james anderson





The default configuration of the CL-XML processor parses an encoded document source to produce a CLOS document model instance. This model uses a graph of CLOS instances to represent the encoded document. The base CLOS model class library implements the XML Query Data Model and presents an Infoset-compatible programming interface.

This document describes the CLOS implementation for the document model and illustrates how to use the processor to produce a document model from an encoded document.



Background

Subsequent to the initial specification for XML, several specifications have appeared to describe the properties and behaviour of the entities which XML is intended to encode. The common term for these abstract entities is "document models". One such specification is the W3C XML DOM. Another is the W3C XML Query Data Model. Simpler models have been provided in connection with individual parsers (Electric XML). More abstract models have been proposed (BeechEA.99) for analytical purposes and as a basis for processing algorithms, such as validation.

CL-XML implements a document model derived from the XML Query Data Model, with an interface which conforms to the XML Infoset specification.



Program Interface

The XQDM package implements a document instance graph with nodes and operators which constitute an Infoset-compatible document model. This model comprises two levels of classes. Abstract classes implement graph relations and common properties. Concrete classes combine and augment the abstract classes to implement the properties and behaviours specified by the respective recommendations for individual model classes. An overview of these classes follows. A complete description is left to the implementation source file. Note that these lists are alphabetical and do not reflect logical dependencies.

Abstract Classes

The root class for the instance model is abstract-node. Numerous auxiliary abstract classes introduce properties (e.g. abstract-value-node, named-node) or declare type-conformance (e.g. elem-child-node, doc-child-node).
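The division of labour between property-introducing classes, marker classes, and concrete classes can be sketched in a few lines of CLOS. This is a minimal illustration only: every SKETCH- name is hypothetical, and the definitions do not reproduce the XQDM classes listed below.

```lisp
;;; Hypothetical miniature of the two-level design; the SKETCH- names are
;;; illustrative and are not the XQDM definitions.
(defclass sketch-abstract-node ()
  ((parent   :initarg :parent   :initform nil :accessor sketch-parent)
   (children :initarg :children :initform '() :accessor sketch-children)))

;; abstract classes which introduce a property
(defclass sketch-named-node (sketch-abstract-node)
  ((name :initarg :name :accessor sketch-name)))

(defclass sketch-value-node (sketch-abstract-node)
  ((value :initarg :value :initform nil :accessor sketch-value)))

;; a marker class: it declares type conformance and adds no slots
(defclass sketch-elem-child-node () ())

;; concrete classes combine the abstract ones
(defclass sketch-elem (sketch-named-node sketch-elem-child-node) ())
(defclass sketch-text (sketch-value-node sketch-elem-child-node) ())

(defun sketch-add-child (parent child)
  "Append CHILD to PARENT, checking conformance through the marker class."
  (check-type child sketch-elem-child-node)
  (setf (sketch-parent child) parent)
  (setf (sketch-children parent)
        (append (sketch-children parent) (list child)))
  child)

;; build a two-node graph
(defparameter *sketch-root* (make-instance 'sketch-elem :name "doc"))
(sketch-add-child *sketch-root* (make-instance 'sketch-text :value "hello"))
```

The marker class lets a container test whether a node may appear as its child without knowing the node's concrete class, which is presumably the role of classes such as elem-child-node.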

XQDM:abstract-attr-node
     (ordinal-node doc-child-node elem-node-interface)

[Abstract Class]

XQDM:abstract-def-node ()

[Abstract Class]

XQDM:abstract-elem-node
     (ordinal-node doc-child-node elem-node-interface)

[Abstract Class]

XQDM:abstract-node ()

[Abstract Class]

  parent
  children

XQDM:abstract-ns-node ()

[Abstract Class]

XQDM:abstract-top-level-def-node ()

[Abstract Class]

XQDM:abstract-value-node (abstract-node)

[Abstract Class]

  value

XQDM:attr-child-node ()

[Abstract Class]

XQDM:attr-node (elem-property-node abstract-attr-node)

[Abstract Class]

XQDM:def-entity
     (ncnamed-node abstract-top-level-def-node abstract-value-node)

[Abstract Class]

XQDM:def-external-entity (entity-delegate def-entity)

[Abstract Class]

XQDM:def-general-entity (def-entity)

[Abstract Class]

XQDM:def-internal-entity (def-entity)

[Abstract Class]

XQDM:def-parameter-entity (def-entity)

[Abstract Class]

XQDM:doc-child-node ()

[Abstract Class]

XQDM:doc-node-interface ()

[Abstract Class]

XQDM:doctype-child-node ()

[Abstract Class]

XQDM:document-scoped-node ()

[Abstract Class]

  document

XQDM:elem-child-node (document-scoped-node)

[Abstract Class]

XQDM:elem-node-interface ()

[Abstract Class]

XQDM:elem-property-node
     (unamed-node abstract-value-node typed-node)

[Abstract Class]

XQDM:elem-property-node-interface ()

[Abstract Class]

XQDM:entity-delegate
     (ncnamed-node abstract-value-node)

[Abstract Class]

  entity-info

XQDM:entity-information-node
     (ncnamed-node abstract-value-node)

[Abstract Class]

  encoding
  public-id
  system-id
  uri
  version

XQDM:enumerated-attr-node (trimming-node)

[Abstract Class]

XQDM:named-node (abstract-node)

[Abstract Class]

  name

XQDM:ncnamed-node (named-node)

[Abstract Class]

XQDM:normalizing-attr-node (attr-node)

[Abstract Class]

XQDM:number-value (value-node elem-child-node attr-child-node)

[Abstract Class]

XQDM:ordinal-node ()

[Abstract Class]

  ordinality

XQDM:ref-attr-node (ref-elem-property-node abstract-attr-node)

[Abstract Class]

XQDM:ref-elem-node (ref-node elem-node-interface)

[Abstract Class]

XQDM:ref-elem-property-node (ref-node abstract-elem-property-node)

[Abstract Class]

XQDM:ref-entity (ref-node named-node attr-child-node)

[Abstract Class]

XQDM:ref-ns-node (ref-elem-property-node abstract-ns-node)

[Abstract Class]

XQDM:trimming-attr-node (attr-node)

[Abstract Class]

XQDM:typed-node (abstract-node)

[Abstract Class]

  def

XQDM:unamed-node (named-node)

[Abstract Class]

Concrete Classes

The concrete classes include those which are essential when modelling a document entity and a document definition, classes useful when constructing documents, and classes useful when modelling documents with typed contents.

XQDM:boolean-value (value-node elem-child-node attr-child-node)

[Class]

XQDM:binary-value (value-node elem-child-node attr-child-node)

[Class]

XQDM:character-data-node (abstract-value-node doc-child-node)

[Class]

XQDM:comment-node (abstract-value-node doc-child-node doctype-child-node)

[Class]

XQDM:conditional-section (ref-parameter-entity)

[Class]

XQDM:decimal-attr-node (attr-node)

[Class]

XQDM:decimal-value (number-node)

[Class]

XQDM:def-attr
     (abstract-top-level-def-node unamed-node qname-context-delegate)

[Class]

XQDM:def-elem-property-node (unamed-node abstract-def-node)

[Class]

XQDM:def-external-general-entity (def-external-entity def-general-entity)

[Class]

XQDM:def-external-parameter-entity
     (def-external-entity def-parameter-entity)

[Class]

XQDM:def-internal-general-entity (def-internal-entity def-general-entity)

[Class]

XQDM:def-internal-parameter-entity
     (def-internal-entity def-parameter-entity)

[Class]

XQDM:def-notation (abstract-top-level-def-node entity-information-node)

[Class]

XQDM:def-type
     (abstract-top-level-def-node unamed-node qname-context-delegate
      abstract-value-node)

[Class]

  value :reader node-validator
  children :reader model
  node-class
  properties
  properties-required
  properties-defaulted

XQDM:doc-node (entity-delegate abstract-node)

[Class]

  attributes
  general-entities
  ids
  notations
  parameter-entities
  root
  standalone
  types
  validate
  version

XQDM:document-type-declaration-information-node (entity-information-node)

[Class]

XQDM:double-attr-node (attr-node)

[Class]

XQDM:double-value (number-node)

[Class]

XQDM:elem-node (unamed-node typed-node abstract-elem-node)

[Class]

XQDM:entity-attr-node (trimming-node)

[Class]

XQDM:entities-attr-node (attr-node)

[Class]

XQDM:entity-value (value-node attr-child-node)

[Class]

XQDM:enumeration-attr-node (enumerated-attr-node string-attr-node)

[Class]

XQDM:ext-subset-node (entity-delegate abstract-node)

[Class]

XQDM:float-value (number-node)

[Class]

XQDM:function-value (named-value-node)

[Class]

XQDM:id-value (value-node attr-child-node)

[Class]

XQDM:id-ref-value (value-node attr-child-node)

[Class]

XQDM:id-attr-node (trimming-attr-node)

[Class]

XQDM:id-ref-attr-node (trimming-attr-node)

[Class]

XQDM:id-refs-attr-node (trimming-attr-node)

[Class]

XQDM:info-item-node (abstract-node)

[Class]

XQDM:nmtoken-attr-node (trimming-attr-node)

[Class]

XQDM:nmtokens-attr-node (attr-node)

[Class]

XQDM:notation-value (value-node attr-child-node)

[Class]

XQDM:notation-attr-node (enumerated-attr-node)

[Class]

XQDM:ns-node (elem-property-node abstract-ns-node)

[Class]

XQDM:pi-node
     (named-node abstract-value-node doc-child-node doctype-child-node)

[Class]

XQDM:qname-attr-node (attr-node)

[Class]

XQDM:qname-context-delegate ()

[Class]

  qname-context

XQDM:qname-value (named-value-node attr-child-node)

[Class]

  name
  namespace
  prefix
  uri
  value

XQDM:recur-dur-attr-node (attr-node)

[Class]

XQDM:recur-dur-value (value-node elem-child-node attr-child-node)

[Class]

XQDM:ref-character-entity (ref-entity elem-child-node)

[Class]

XQDM:ref-attr-node (ref-elem-property-node abstract-attr-node)

[Class]

XQDM:ref-elem-node (ref-node abstract-elem-node)

[Class]

XQDM:ref-general-entity (ref-entity elem-child-node)

[Class]

XQDM:ref-node (abstract-value-node ordinal-node)

[Class]

XQDM:ref-ns-node (ref-elem-property-node abstract-ns-node)

[Class]

XQDM:ref-parameter-entity (ref-entity doctype-child-node)

[Class]

XQDM:string-attr-node (attr-node)

[Class]

XQDM:string-value (value-node elem-child-node attr-child-node)

[Class]

XQDM:time-attr-node (attr-node)

[Class]

XQDM:time-dur-value (value-node elem-child-node attr-child-node)

[Class]

XQDM:uri-ref-attr-node (attr-node)

[Class]

XQDM:uri-ref-value (value-node elem-child-node attr-child-node)

[Class]



Examples

The release directories "xml:tests;" and "xml:demos;" include numerous examples of how to use the XML parser. The primary interface to the parser is the function XMLP:document-parser. This function is specialized for numerous input sources (strings, streams, vectors) and source specifications (http, file, and data URLs, pathnames), and parses the specified source to produce a document instance.
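How a single entry point can accept several kinds of source is easiest to see with a generic function that dispatches on the source's class. The following is a rough sketch under stated assumptions: SKETCH-OPEN-SOURCE and SKETCH-READ-ALL are hypothetical names, not part of CL-XML, and only streams, pathnames, and a simplified "data:," string form are handled.

```lisp
;;; Hypothetical sketch: SKETCH-OPEN-SOURCE is not part of CL-XML. It only
;;; illustrates how one generic function can accept several source types.
(defgeneric sketch-open-source (source)
  (:documentation "Return a character input stream for SOURCE."))

(defmethod sketch-open-source ((source stream))
  source)

(defmethod sketch-open-source ((source pathname))
  (open source :direction :input))

(defmethod sketch-open-source ((source string))
  ;; simplification: a "data:," prefix marks an inline document,
  ;; any other string is treated as a file name
  (let ((prefix "data:,"))
    (if (and (>= (length source) (length prefix))
             (string= prefix source :end2 (length prefix)))
        (make-string-input-stream (subseq source (length prefix)))
        (sketch-open-source (pathname source)))))

(defun sketch-read-all (source)
  "Read the complete content of SOURCE into a string.
Note that the stream is closed afterwards, even if the caller supplied it."
  (let ((stream (sketch-open-source source)))
    (unwind-protect
         (with-output-to-string (out)
           (loop for char = (read-char stream nil nil)
                 while char
                 do (write-char char out)))
      (close stream))))
```

With these definitions, (sketch-read-all "data:,<a/>") and (sketch-read-all (make-string-input-stream "<a/>")) yield the same string. A new source type needs only one additional method.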

Graphs

An excerpt from the graph example demonstrates how to load the parser and parse document sources:
xml:demos;graphs;text-xml-graph.lisp
(in-package "XML-PARSER")

;; load the system definition facility
(load "entwicklung@bataille:source:lisp:xml:define-system.lisp")

;; set this pathname appropriately and load the xml system definition
(register-system-definition :xparser
                            "entwicklung@bataille:source:lisp:xml:sysdcl.lisp")
(unless (system-loaded-p :xparser)
  (execute-system-operations :xparser '(:load)))


;; load the graph generator
(load "xml:demos;graphs;xqdm-graph.lisp")

;; parse an encoded document, either from a pathname or from a file-url
(defParameter *channel-dom*
  (document-parser #P"xml:tests;xml;channel.xml"))
(defParameter *channel-dom*
  (document-parser "file://xml/tests/xml/channel.xml"))

;; write out a .dot graph of the document definition
(write-node-graph (find-def-type '||::|Channel| *channel-dom*) "channel.dot")
;; or the document itself
(write-node-graph *channel-dom* "channelD.dot")
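
What writing a node graph as a .dot file involves can be sketched independently of the parser. The code below is an assumption-laden illustration, not the implementation of write-node-graph: DOT-NODE and SKETCH-WRITE-DOT are hypothetical names, and labels are assumed to contain no double quotes.

```lisp
;;; Hypothetical sketch: DOT-NODE and SKETCH-WRITE-DOT are illustrative only
;;; and do not reproduce WRITE-NODE-GRAPH.
(defclass dot-node ()
  ((label    :initarg :label    :reader dot-label)
   (children :initarg :children :initform '() :reader dot-children)))

(defun sketch-write-dot (root &optional (stream *standard-output*))
  "Write the tree below ROOT to STREAM as a Graphviz digraph."
  (let ((counter 0))
    (labels ((walk (node)
               ;; emit the node, then one edge to each child;
               ;; return the identifier assigned to the node
               (let ((id (incf counter)))
                 (format stream "  n~D [label=\"~A\"];~%" id (dot-label node))
                 (dolist (child (dot-children node))
                   (format stream "  n~D -> n~D;~%" id (walk child)))
                 id)))
      (format stream "digraph document {~%")
      (walk root)
      (format stream "}~%"))))
```

Each node receives a numeric identifier in depth-first order, so a node and its subtree are declared before the edge which refers to them.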

Alternative Input Sources

The stream wrapper example illustrates how to provide the parser with an alternative input source. The parser requires only that a method for stream-reader be defined for the source class to yield a stream of bytes and a method for stream-element-type be defined to yield a subtype of either byte or character. Given these two methods, the parser can generate a decoding reader for the source. Alternatively, a method for decoding-stream-reader can be specialized directly on the source class and the encoding keyword, provided that the encoding is specified to the parser upon invocation. In the event of a parsing error, a method for stream-position should be available in order to generate diagnostic information.

xml:tests;tests;text-buffer-wrapper.lisp
;;; -*- Mode: lisp; Syntax: ansi-common-lisp; Base: 10; Package: xml-parser; -*-

#|
demonstrates how an input source which comprises a buffer sequence could
be re-presented to the parser as a byte stream.

the key function is that which the stream-reader method yields.
the stream-position result is needed only should the parse fail, as
information for the error message.

|#

(in-package "XML-PARSER")

(defClass buffer-wrapper (stream)
  ((buffer)
   (start :initarg :start :initform 0)
   (end :initarg :end)
   (reader )
   (position :initform 0)))

(defMethod initialize-instance ((instance buffer-wrapper)
                                &rest initargs
                                &key buffer end stream generator)
  (setf (slot-value instance 'buffer) buffer)
  (with-slots (buffer reader start end position) instance
    (setf reader
          #'(lambda (char)
              (cond ((>= start end)
                     (incf position end)
                     (multiple-value-setq (buffer start end)
                       (funcall generator stream buffer))
                     (when buffer
                       (funcall reader buffer)))
                    ((setf char (aref buffer start))
                     (incf start)
                     char)))))

  (apply #'call-next-method instance
         :end (or end (length buffer))
         initargs))

(defMethod stream-reader ((stream buffer-wrapper))
  (with-slots (reader) stream
    (values reader #\null)))

(defMethod stream-position ((stream buffer-wrapper) &optional new)
  (declare (ignore new))
  (with-slots (position start) stream
    (+ start position)))

(defMethod stream-element-type ((stream buffer-wrapper))
  'character)

(let ((buffers '("<asdf>q" "wer&" "#32" ";ty</asdf>")))
  (document-parser (make-instance 'buffer-wrapper
                     :buffer (pop buffers)
                     :generator #'(lambda (stream buffer)
                                    (declare (ignore stream))
                                    (setf buffer (pop buffers))
                                    (values buffer 0 (length buffer))))))
(write-node * *trace-output*)

:EOF
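
The core idea in the wrapper above, a reader function which moves on to the next buffer when the current one is exhausted, can be tried without the parser. The following is a minimal sketch: MAKE-CHUNK-READER and DRAIN-READER are hypothetical names, and the closure takes no argument and returns NIL at the end, which is a simplification and not the calling convention the parser expects of a stream-reader result.

```lisp
;;; Hypothetical sketch, independent of CL-XML: MAKE-CHUNK-READER is an
;;; illustrative name, not a parser function.
(defun make-chunk-reader (chunks)
  "Return a closure that yields the characters of CHUNKS, a list of strings,
one per call, and NIL once every chunk is exhausted."
  (let ((buffer "")
        (index 0))
    (lambda ()
      ;; move past exhausted (or empty) chunks
      (loop while (and (>= index (length buffer)) chunks)
            do (setf buffer (pop chunks)
                     index 0))
      (when (< index (length buffer))
        (prog1 (char buffer index)
          (incf index))))))

(defun drain-reader (reader)
  "Collect everything READER yields into one string."
  (with-output-to-string (out)
    (loop for char = (funcall reader)
          while char
          do (write-char char out))))
```

Applied to the four buffers of the example, (drain-reader (make-chunk-reader (list "<asdf>q" "wer&" "#32" ";ty</asdf>"))) reassembles the document text, including the character reference which is split across three buffers.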

Parsing Specific Components

The parser may also be invoked so as to parse specific XML components by passing the appropriate BNF term as the :start-name keyword when calling the document-parser function. Note that many terms can be parsed only in context. For example, the ExtSubset term establishes the requisite lexical context for recognizing the start of a declaration, but declaration terms, such as ElementDecl, do not. Another issue is that some processing is deferred until an entire containing term is parsed. For example, qualified names are not resolved until the entire document type definition is parsed.

xml:tests;test-xml-extsubset.lisp
;; various declarations are parsed by specifying |ExtSubset| as the start term

(document-parser "data:,<!ELEMENT subjectterm (#PCDATA)>" :start-name '|ExtSubset|)

;; in order that the parser perform qname resolution, the entire document
;; definition must be parsed
(inspect
 (document-parser "data:,<!DOCTYPE doc [
<!ELEMENT doc (a:x)* >
<!ATTLIST doc xmlns CDATA 'data:,ns-top'>
<!ELEMENT a:x EMPTY>
<!ATTLIST a:x xmlns:a CDATA 'data:,ns-a'>
]>" :start-name '|DoctypeDecl|))

;; ...
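
Why name resolution has to wait for the complete definition can be seen in a small two-pass sketch. It is illustrative only: SPLIT-QNAME, RESOLVE-QNAMES, and the two variables are hypothetical and say nothing about how CL-XML represents names or namespace bindings. The data mirrors the declarations above, where a:x is used before the xmlns:a attribute is declared.

```lisp
;;; Hypothetical sketch of deferred name resolution; none of these names
;;; belong to CL-XML.
(defun split-qname (qname)
  "Return the prefix (or NIL) and the local part of QNAME as two values."
  (let ((colon (position #\: qname)))
    (if colon
        (values (subseq qname 0 colon) (subseq qname (1+ colon)))
        (values nil qname))))

(defun resolve-qnames (qnames bindings)
  "Resolve each of QNAMES against BINDINGS, an alist of (prefix . uri)
collected while parsing. A NIL prefix stands for the default namespace."
  (mapcar (lambda (qname)
            (multiple-value-bind (prefix local) (split-qname qname)
              (let ((binding (assoc prefix bindings :test #'equal)))
                (unless binding
                  (error "No namespace binding for prefix ~S in ~S."
                         prefix qname))
                (cons (cdr binding) local))))
          qnames))

;; First pass: names are only recorded as they are read, because the
;; binding for "a" appears after the first use of "a:x".
(defparameter *pending-names* '("doc" "a:x"))
(defparameter *bindings* '((nil . "data:,ns-top") ("a" . "data:,ns-a")))

;; Second pass, once the whole definition has been seen.
(defparameter *resolved* (resolve-qnames *pending-names* *bindings*))
```

Resolving during the first pass would signal an error at "a:x", since no binding for the prefix "a" would yet be known.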



© setf.de 2003