CL-XML: Document-Model-Based Parsing
20030408 (v 0.949)
james anderson
The default configuration of the CL-XML processor parses an encoded document source
to produce a CLOS document model instance. This model uses a graph of CLOS instances
to represent the encoded document. The base CLOS model class library implements the
XML Query Data Model and presents an Infoset
compatible programming interface.
This document describes the CLOS implementation for the document model and illustrates
how to use the processor to produce a document model from an encoded document.
Background
Subsequent to the initial specification for XML, several specifications have appeared
to describe the properties and behaviour of the entities which XML is intended to
encode. The common term for these abstract entities is "document models".
One such specification is the W3 XML-DOM. Another
is the W3 XML Query Document Model. Simpler models
have been provided in connection with individual parsers (Electric
XML). More abstract models have been proposed (BeechEA.99)
for analytical purposes and as a basis for processing algorithms, such
as validation.
CL-XML implements a document model derived from the XML
Query Data Model, with an interface which conforms to the XML
Infoset specification.
Program Interface
The XQDM package implements a document instance graph with nodes
and operators which constitute an Infoset compatible document model. This model comprises
two levels of classes. Abstract classes implement
graph relations and common properties. Concrete classes
combine and augment the abstract classes to implement the properties and behaviours
specified by the respective recommendations for individual model classes. An overview
of these classes follows. A complete description is reserved to the implementation
source file. Note that these lists are alphabetical and do not reflect logical
dependencies.
Abstract Classes
The root class for the instance model is abstract-node. Numerous
auxiliary abstract classes introduce properties (e.g. abstract-value-node,
named-node) or declare type-conformance (e.g. elem-child-node,
doc-child-node).
- *-child-node classes indicate that an instance is suitable to appear
  among the children of a parent of the respective type.
- def-*-entity classes factor the behaviour of internal/external
  and character/general/parameter entities.
- {enumerated,trimming,normalized}-attr-node
  classes factor the normalization behavior for attribute classes.
- {doc,elem,elem-property}-node-interface
  classes indicate that a concrete class supports the property accessor interface required
  for XML serialization.
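
The layering of marker classes and concrete classes can be illustrated with a few lines of plain CLOS. The following is only a sketch, not the library's source: it is defined in a hypothetical scratch package (XQDM-SKETCH), the parent and value slots and the acceptable-doc-child-p function are assumptions made for illustration, and only the class names and the superclass list of comment-node are taken from the class listings in this document.

```lisp
;; A scratch package, so that nothing here collides with the real XQDM package.
(defpackage :xqdm-sketch (:use :common-lisp))
(in-package :xqdm-sketch)

;; Abstract level: graph relations and common properties.
;; The slot names are assumptions for illustration only.
(defclass abstract-node ()
  ((parent :initarg :parent :initform nil :accessor parent)))

(defclass abstract-value-node (abstract-node)
  ((value :initarg :value :initform nil :accessor value)))

;; Abstract level: slot-less marker classes which declare type-conformance.
(defclass doc-child-node () ())
(defclass doctype-child-node () ())

;; Concrete level: combine the abstract classes.
(defclass comment-node (abstract-value-node doc-child-node doctype-child-node)
  ())

;; A parent can test whether a candidate child conforms by its marker class.
(defun acceptable-doc-child-p (node)
  (typep node 'doc-child-node))
```

With these definitions, (acceptable-doc-child-p (make-instance 'comment-node :value "note")) is true, while an instance of abstract-value-node alone is rejected, since it does not inherit the doc-child-node marker.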
XQDM:abstract-attr-node (ordinal-node doc-child-node elem-node-interface)  [Abstract Class]
XQDM:abstract-def-node ()  [Abstract Class]
XQDM:abstract-elem-node (ordinal-node doc-child-node elem-node-interface)  [Abstract Class]
XQDM:abstract-node ()  [Abstract Class]
XQDM:abstract-ns-node ()  [Abstract Class]
XQDM:abstract-top-level-def-node ()  [Abstract Class]
XQDM:abstract-value-node (abstract-node)  [Abstract Class]
XQDM:attr-child-node ()  [Abstract Class]
XQDM:attr-node (elem-property-node abstract-attr-node)  [Abstract Class]
XQDM:def-entity (ncnamed-node abstract-top-level-def-node abstract-value-node)  [Abstract Class]
XQDM:def-external-entity (entity-delegate def-entity)  [Abstract Class]
XQDM:def-general-entity (def-entity)  [Abstract Class]
XQDM:def-internal-entity (def-entity)  [Abstract Class]
XQDM:def-parameter-entity (def-entity)  [Abstract Class]
XQDM:doc-child-node ()  [Abstract Class]
XQDM:doc-node-interface ()  [Abstract Class]
XQDM:doctype-child-node ()  [Abstract Class]
XQDM:document-scoped-node ()  [Abstract Class]
XQDM:elem-child-node (document-scoped-node)  [Abstract Class]
XQDM:elem-node-interface ()  [Abstract Class]
XQDM:elem-property-node (unamed-node abstract-value-node typed-node)  [Abstract Class]
XQDM:elem-property-node-interface ()  [Abstract Class]
XQDM:entity-delegate (ncnamed-node abstract-value-node)  [Abstract Class]
XQDM:entity-information-node (ncnamed-node abstract-value-node)  [Abstract Class]
    encoding
    public-id
    system-id
    uri
    version
XQDM:enumerated-attr-node (trimming-node)  [Abstract Class]
XQDM:named-node (abstract-node)  [Abstract Class]
XQDM:ncnamed-node (named-node)  [Abstract Class]
XQDM:normalizing-attr-node (attr-node)  [Abstract Class]
XQDM:number-value (value-node elem-child-node attr-child-node)  [Abstract Class]
XQDM:ordinal-node ()  [Abstract Class]
XQDM:ref-attr-node (ref-elem-property-node abstract-attr-node)  [Abstract Class]
XQDM:ref-elem-node (ref-node elem-node-interface)  [Abstract Class]
XQDM:ref-elem-property-node (ref-node abstract-elem-property-node)  [Abstract Class]
XQDM:ref-entity (ref-node named-node attr-child-node)  [Abstract Class]
XQDM:ref-ns-node (ref-elem-property-node abstract-ns-node)  [Abstract Class]
XQDM:trimming-attr-node (attr-node)  [Abstract Class]
XQDM:typed-node (abstract-node)  [Abstract Class]
XQDM:unamed-node (named-node)  [Abstract Class]
Concrete Classes
The concrete classes include the classes which are essential when modelling a
document entity and a document definition, classes useful when constructing documents,
and classes useful when modelling documents with typed contents.
- doc-node, elem-node, string-attr-node,
  ns-node, comment-node, notation-node, and
  pi-node classes model the essential features of a document entity.
- def-type, def-attr, def-elem-property
  classes model the features of document definitions essential for validation and defaulting
  according to XML-1.0.
- character-data-node, def-{internal,external}-{general,parameter}-entity,
  ref-{character,general,parameter}-entity,
  and conditional-section classes are useful for document construction.
- ref-{attr,elem,ns}-node
  classes serve as references for XML Query operations.
- {entity,entities,id,id-ref,id-refs,notation,string}-attr-node
  classes model attributes typed according to XML-1.0.
- {boolean,decimal,string,...}-{value,attr-node}
  classes model typed contents.
XQDM:boolean-value (value-node elem-child-node attr-child-node)  [Class]
XQDM:binary-value (value-node elem-child-node attr-child-node)  [Class]
XQDM:character-data-node (abstract-value-node doc-child-node)  [Class]
XQDM:comment-node (abstract-value-node doc-child-node doctype-child-node)  [Class]
XQDM:conditional-section (ref-parameter-entity)  [Class]
XQDM:decimal-attr-node (attr-node)  [Class]
XQDM:decimal-value (number-node)  [Class]
XQDM:def-attr (abstract-top-level-def-node unamed-node qname-context-delegate)  [Class]
XQDM:def-elem-property-node (unamed-node abstract-def-node)  [Class]
XQDM:def-external-general-entity (def-external-entity def-general-entity)  [Class]
XQDM:def-external-parameter-entity (def-external-entity def-parameter-entity)  [Class]
XQDM:def-internal-general-entity (def-internal-entity def-general-entity)  [Class]
XQDM:def-internal-parameter-entity (def-internal-entity def-parameter-entity)  [Class]
XQDM:def-notation (abstract-top-level-def-node entity-information-node)  [Class]
XQDM:def-type (abstract-top-level-def-node unamed-node qname-context-delegate abstract-value-node)  [Class]
    value :reader node-validator
    children :reader model
    node-class
    properties
    properties-required
    properties-defaulted
XQDM:doc-node (entity-delegate abstract-node)  [Class]
    attributes
    general-entities
    ids
    notations
    parameter-entities
    root
    standalone
    types
    validate
    version
XQDM:document-type-declaration-information-node (entity-information-node)  [Class]
XQDM:double-attr-node (attr-node)  [Class]
XQDM:double-value (number-node)  [Class]
XQDM:elem-node (unamed-node typed-node abstract-elem-node)  [Class]
XQDM:entity-attr-node (trimming-node)  [Class]
XQDM:entities-attr-node (attr-node)  [Class]
XQDM:entity-value (value-node attr-child-node)  [Class]
XQDM:enumeration-attr-node (enumeration-attr-node string-attr-node)  [Class]
XQDM:ext-subset-node (entity-delegate abstract-node)  [Class]
XQDM:float-value (number-node)  [Class]
XQDM:function-value (named-value-node)  [Class]
XQDM:id-value (value-node attr-child-node)  [Class]
XQDM:id-ref-value (value-node attr-child-node)  [Class]
XQDM:id-attr-node (trimming-attr-node)  [Class]
XQDM:id-ref-attr-node (trimming-attr-node)  [Class]
XQDM:id-refs-attr-node (trimming-attr-node)  [Class]
XQDM:info-item-node (abstract-node)  [Class]
XQDM:nmtoken-attr-node (trimming-attr-node)  [Class]
XQDM:nmtokens-attr-node (attr-node)  [Class]
XQDM:notation-value (value-node attr-child-node)  [Class]
XQDM:notation-attr-node (enumerated-attr-node)  [Class]
XQDM:ns-node (elem-property-node abstract-ns-node)  [Class]
XQDM:pi-node (named-node abstract-value-node doc-child-node doctype-child-node)  [Class]
XQDM:qname-attr-node (attr-node)  [Class]
XQDM:qname-context-delegate ()  [Class]
XQDM:qname-value (named-value-node attr-child-node)  [Class]
    name
    namespace
    prefix
    uri
    value
XQDM:recur-dur-attr-node (attr-node)  [Class]
XQDM:recur-dur-value (value-node elem-child-node attr-child-node)  [Class]
XQDM:ref-character-entity (ref-entity elem-child-node)  [Class]
XQDM:ref-attr-node (ref-elem-property-node abstract-attr-node)  [Class]
XQDM:ref-elem-node (ref-node abstract-elem-node)  [Class]
XQDM:ref-general-entity (ref-entity elem-child-node)  [Class]
XQDM:ref-node (abstract-value-node ordinal-node)  [Class]
XQDM:ref-ns-node (ref-elem-property-node abstract-ns-node)  [Class]
XQDM:ref-parameter-entity (ref-entity doctype-child-node)  [Class]
XQDM:string-attr-node (attr-node)  [Class]
XQDM:string-value (value-node elem-child-node attr-child-node)  [Class]
XQDM:time-attr-node (attr-node)  [Class]
XQDM:time-dur-value (value-node elem-child-node attr-child-node)  [Class]
XQDM:uri-ref-attr-node (attr-node)  [Class]
XQDM:uri-ref-value (value-node elem-child-node attr-child-node)  [Class]
Examples
The release directories "xml:tests;" and "xml:demos;" include
numerous examples of how to use the XML parser. The primary interface to the parser
is the function XMLP:document-parser. This function is specialized for numerous input
sources (strings, streams, vectors) and specifications (http, file, and data urls,
pathnames) to parse the specified source to produce a document instance.
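
As a minimal sketch of such a call, the helper below passes a source to document-parser when CL-XML is loaded and otherwise reports that it is absent, so the form can be evaluated in either situation. The helper name parse-xml-source and the :cl-xml-not-loaded marker are hypothetical; the package name "XML-PARSER", the function name document-parser, and the use of data urls and logical pathnames as sources are taken from the examples in this document.

```lisp
(defun parse-xml-source (source)
  "Apply the CL-XML document parser to SOURCE if CL-XML is loaded.
Returns the document instance, or NIL and :CL-XML-NOT-LOADED otherwise."
  (let* ((package (find-package "XML-PARSER"))
         (parser (and package (find-symbol "DOCUMENT-PARSER" package))))
    (if (and parser (fboundp parser))
        ;; the parser dispatches on the type of the source
        (funcall parser source)
        (values nil :cl-xml-not-loaded))))

;; a data url source
(parse-xml-source "data:,<doc><item>text</item></doc>")

;; a logical pathname source; left as a comment because it requires that
;; the "xml" logical host of the release be defined
;; (parse-xml-source #P"xml:tests;xml;channel.xml")
```

Looking the function up by name at run time is only a device to keep the sketch loadable without the library. In a session where CL-XML is loaded, a direct call to document-parser is the natural form.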
Graphs
An excerpt from the graph example demonstrates how to load the parser and parse
document sources:
xml:demos;graphs;text-xml-graph.lisp

(in-package "XML-PARSER")

;; load the system definition facility
(load "entwicklung@bataille:source:lisp:xml:define-system.lisp")

;; set this pathname appropriately and load the xml system definition
(register-system-definition :xparser
                            "entwicklung@bataille:source:lisp:xml:sysdcl.lisp")
(unless (system-loaded-p :xparser)
  (execute-system-operations :xparser '(:load)))

;; load the graph generator
(load "xml:demos;graphs;xqdm-graph.lisp")

;; parse an encoded document, either from a pathname or from a file-url
(defParameter *channel-dom*
  (document-parser #P"xml:tests;xml;channel.xml"))
(defParameter *channel-dom*
  (document-parser "file://xml/tests/xml/channel.xml"))

;; write out a .dot graph of the document definition
(write-node-graph (find-def-type '||::|Channel| *channel-dom*) "channel.dot")
;; or the document itself
(write-node-graph *channel-dom* "channelD.dot")
Alternative Input Sources
The stream wrapper example illustrates how to provide the parser with an alternative
input source. The parser requires only that a method for stream-reader
be defined for the source class to yield a stream of bytes and a method for stream-element-type
be defined to yield a subtype of either byte or character.
Given these two methods, the parser can generate a decoding reader for the source.
Alternatively, a method for decoding-stream-reader can be specialized
directly on the source class and the encoding keyword iff the encoding is specified
to the parser upon invocation. In the event of a parsing error, a method for stream-position
should be available in order to generate diagnostic information.
xml:tests;tests;text-buffer-wrapper.lisp

;;; -*- Mode: lisp; Syntax: ansi-common-lisp; Base: 10; Package: xml-parser; -*-

#|
 demonstrates how an input source which comprises a buffer sequence could
 be re-presented to the parser as a byte stream.
 the key function is that which the stream-reader method yields.
 the stream-position result is needed only should the parse fail, as
 information for the error message.
 |#

(in-package "XML-PARSER")

(defClass buffer-wrapper (stream)
  ((buffer)
   (start :initarg :start :initform 0)
   (end :initarg :end)
   (reader)
   (position :initform 0)))

(defMethod initialize-instance ((instance buffer-wrapper)
                                &rest initargs
                                &key buffer end stream generator)
  (setf (slot-value instance 'buffer) buffer)
  (with-slots (buffer reader start end position) instance
    (setf reader
          #'(lambda (char)
              (cond ((>= start end)
                     ;; the current buffer is exhausted: ask the generator
                     ;; for the next one and continue reading from it
                     (incf position end)
                     (multiple-value-setq (buffer start end)
                       (funcall generator stream buffer))
                     (when buffer
                       (funcall reader buffer)))
                    ((setf char (aref buffer start))
                     (incf start)
                     char)))))
  (apply #'call-next-method instance
         :end (or end (length buffer))
         initargs))

(defMethod stream-reader ((stream buffer-wrapper))
  (with-slots (reader) stream
    (values reader #\null)))

(defMethod stream-position ((stream buffer-wrapper) &optional new)
  (declare (ignore new))
  (with-slots (position start) stream
    (+ start position)))

(defMethod stream-element-type ((stream buffer-wrapper))
  'character)

(let ((buffers '("<asdf>q" "wer&" "#32" ";ty</asdf>")))
  (document-parser (make-instance 'buffer-wrapper
                     :buffer (pop buffers)
                     :generator #'(lambda (stream buffer)
                                    (declare (ignore stream))
                                    (setf buffer (pop buffers))
                                    (values buffer 0 (length buffer))))))

(write-node * *trace-output*)

:EOF
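
The core of the wrapper is a closure which hands out one character at a time and moves on to the next buffer when the current one is exhausted. That idea can be exercised without the parser. The following is a simplified sketch in plain Common Lisp, not part of the release: the function name make-chunk-reader is hypothetical, and, unlike the reader function in the example above, the closure takes no argument and simply returns NIL at the end of input.

```lisp
(defun make-chunk-reader (chunks)
  "Return a function of no arguments which yields the characters of CHUNKS,
a list of strings, one at a time, and NIL once all chunks are exhausted."
  (let ((buffer (pop chunks))
        (start 0))
    (lambda ()
      (loop
        (cond ((null buffer)
               ;; no further chunks: signal end of input
               (return nil))
              ((< start (length buffer))
               ;; yield the next character of the current chunk
               (return (prog1 (char buffer start)
                         (incf start))))
              (t
               ;; current chunk exhausted (or empty): advance to the next
               (setf buffer (pop chunks)
                     start 0)))))))

;; collect everything the reader yields from the four example buffers
(let ((reader (make-chunk-reader '("<asdf>q" "wer&" "#32" ";ty</asdf>"))))
  (coerce (loop for c = (funcall reader) while c collect c) 'string))
;; => "<asdf>qwer&#32;ty</asdf>"
```

The buffer boundaries fall inside the character reference, which is the point of the test data: the consumer sees a single uninterrupted character sequence regardless of how the source was chunked.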
Parsing Specific Components
The parser may also be invoked so as to parse specific XML components by passing
the appropriate BNF term as the :start-name keyword when calling the
document-parser function. Note that many terms can be parsed only within a
containing context. For example, the ExtSubset term establishes the requisite lexical
context for recognizing the start of a declaration, but declaration terms, such
as ElementDecl, do not. Another issue is that some processing is deferred
until an entire containing term is parsed. For example, qualified names are not resolved
until the entire document type definition is parsed.
xml:tests;test-xml-extsubset.lisp

;; various declarations are parsed by specifying |ExtSubset| as the start term
(document-parser "data:,<!ELEMENT subjectterm (#PCDATA)>"
                 :start-name '|ExtSubset|)

;; in order that the parser perform qname resolution, the entire document
;; definition must be parsed
(inspect
 (document-parser "data:,<!DOCTYPE doc [
<!ELEMENT doc (a:x)* >
<!ATTLIST doc xmlns CDATA 'data:,ns-top'>
<!ELEMENT a:x EMPTY>
<!ATTLIST a:x xmlns:a CDATA 'data:,ns-a'>
]>" :start-name '|DoctypeDecl|))

;; ...
© setf.de 2003