CL-XML: Event-Based Parsing

20030408 (v 0.949)
james anderson





The CL-XML processor can be invoked so as to communicate individual parse events to an invoking application. This behaviour can be instead of or in addition to the default behaviour as a model-based XML processor. An application can implement handlers for these events so as to control both the products of the parser and its resource usage. The mechanism operates in addition to that provided to specialize the implementations of the document model nodes which the processor generates.

This document describes the mechanism for processing parse events and illustrates its use in an event-based RDF parser.


Background

The prevalent interface for event-based XML processing is the Simple API for XML, SAX. SAX serves variously as an autonomous event-based XML parser for Java applications, as the standard event-driver for numerous Java-based parsers (Xerces, JAXP), and as the preprocessor for numerous other XML tools (SAXON, XDK).

The core of the current generation of SAX parsers is the org.xml.sax.ContentHandler interface, which specifies the parsing events reported to the application. While this interface does, in keeping with its name, provide a concise report of the document content, it is too coarse-grained a match for the CL-XML parser to serve as the primary event-based interface. For this reason, two event-based interfaces are provided. The first, lower-level interface permits an application to specify a surrogate, a construction context, to handle an event stream which the parser generates directly in the course of phrase reduction. This permits the application detailed access to all lexical entities and to the process by which the parser constructs the document model. The second, higher-level, SAX-equivalent stream is generated by a special form of construction context, which uses the detailed events to build and generate SAX-equivalent events.



Program Interface

Construction Events

The lower-level construction event interface comprises that subset of the parser's reduction functions which are defined with a context parameter in addition to the parameters for the properties of the respective term which is to be reduced. The interface to these functions is governed by the naming and constitution of those terms in the BNF which denote model nodes and properties. The actual terms are specified by the implementation of the bnfp:atn-constructor-specializer method specialized for the bnfp:atn-edge class which is engaged during parser generation. Note that these functions are distinguished by case-sensitive names. In addition to these functions, several auxiliary functions are specialized in order to provide finer-grained control over the reduction/construction process. These are distinguished by names which begin with construct-. (See "xml;code:xparser:xml-constructors.lisp" for the respective documentation.)
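The case-sensitive naming can be illustrated with a minimal, self-contained sketch. It runs in any Common Lisp with the standard reader settings and does not load CL-XML. The package name, the list-based node, and the helper constructor-name-demo are assumptions made for illustration only; just the vertical-bar notation and the leading context parameter are taken from the interface described above.

```lisp
(defpackage :constructor-name-sketch (:use :common-lisp))
(in-package :constructor-name-sketch)

;; A symbol written between vertical bars keeps its case. The same characters
;; written without bars are upcased by the standard reader and name a
;; different symbol.
(defgeneric |CharData-Constructor| (context character-data)
  (:documentation "Toy stand-in for a reduction function with a context parameter."))

;; Default method: any context yields a (hypothetical) list-based node.
(defmethod |CharData-Constructor| ((context t) (character-data string))
  (list :char-data character-data))

;; Returns the two symbol names and the result of a call, for comparison.
(defun constructor-name-demo ()
  (values (symbol-name '|CharData-Constructor|)
          (symbol-name 'CharData-Constructor)
          (|CharData-Constructor| nil "text")))
```

An application which specializes one of the constructors therefore has to write the function name with the vertical bars, as in xmlp:|CharData-Constructor|.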

XMLP:|AttCharData-Constructor| context att-value name

[Generic Function]

XMLP:|Attribute-Constructor| context att-value name

[Generic Function]


XMLP:|CharData-Constructor| context character-data

[Generic Function]


XMLP:|CDataCharData-Constructor| character-data

[Generic Function]


XMLP:|Comment-Constructor| character-data

[Generic Function]


XMLP:construct-attribute-name context name

[Generic Function]


XMLP:construct-attribute-plist context
     attribute-value name

[Generic Function]


XMLP:construct-construction-context context component

[Generic Function]


XMLP:construct-content-sequence context
     content-sequence

[Generic Function]


XMLP:construct-elem-property-node context
     prototype children

[Generic Function]


XMLP:construct-element-name context
     name attr-plist+ns-node-sequence

[Generic Function]


XMLP:construct-element-node context name

[Generic Function]


XMLP:construct-ns-node context attribute-value name

[Generic Function]


XMLP:construct-string-attr-node context
     attribute-value name

[Generic Function]


XMLP:|Content-Constructor| context
     CDSect CharData Comment Element ParsedReference
     Pi Reference

[Generic Function]


XMLP:|ContentSequence-Constructor| context
     content-sequence

[Generic Function]


XMLP:|Document-Constructor| context
     third-misc-sequence prolog root

[Generic Function]


XMLP:|Element-Constructor| context
     content-sequence end-tag start-tag

[Generic Function]


XMLP:|ExtParsedEnt-Constructor| context
     content-sequence text-decl

[Generic Function]


XMLP:|Pi-Constructor| context literal target

[Generic Function]


XMLP:|PiCharData-Constructor| context character-data

[Generic Function]


XMLP:|STag-Constructor| context
     attr-plist+ns-cons-sequence name

[Generic Function]

An application may avail itself of this interface by specifying a context instance value for the keyword argument :construction-context to the XMLP:document-parser function. Where no value is specified, the document instance is used initially and is supplanted by the respective element instances over their respective extent. The parser itself incorporates methods for these functions specialized accordingly to generate a document model.

Construction Contexts

Use of the construction context interface is demonstrated in the two classes XMLP:null-construction-context and NOX:sax-construction-context.

Should an XMLP:null-construction-context instance be specified as the construction context, the parser produces a NULL result.

XMLP:null-construction-context

[Class]

Where an XMLP:null-construction-context instance is furnished as the :construction-context argument to XMLP:document-parser, the respective specialized constructors generate null values for all reduction results.

In order to effect this, the context specializes the following functions with methods which return nil:

|Attribute-Constructor| |AttCharData-Constructor| |CharData-Constructor| |CDataCharData-Constructor| |Comment-Constructor| construct-construction-context |Document-Constructor| |Element-Constructor| |ExtParsedEnt-Constructor| |Pi-Constructor| |PiCharData-Constructor| |STag-Constructor|
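The effect of such a context can be modelled in a few lines. The following is a self-contained sketch and not the CL-XML implementation: the classes model-context and null-context, the two constructor functions, and the toy reducer reduce-element are all hypothetical stand-ins. Only the :construction-context keyword and the convention that the null methods return nil are taken from the description above.

```lisp
(defpackage :null-context-sketch (:use :common-lisp))
(in-package :null-context-sketch)

;; Hypothetical context classes: one which builds nodes, one which suppresses them.
(defclass model-context () ())
(defclass null-context () ())

(defgeneric chardata-constructor (context character-data))
(defgeneric element-constructor (context name content))

;; Model-building methods return simple list-based nodes.
(defmethod chardata-constructor ((context model-context) (data string))
  data)
(defmethod element-constructor ((context model-context) name content)
  (list* :element name content))

;; Null methods return nil for every reduction.
(defmethod chardata-constructor ((context null-context) data)
  (declare (ignore data))
  nil)
(defmethod element-constructor ((context null-context) name content)
  (declare (ignore name content))
  nil)

;; Toy "parser": reduces a pre-tokenized element by calling the constructors
;; with whatever context the caller supplied.
(defun reduce-element (name strings
                       &key (construction-context (make-instance 'model-context)))
  (element-constructor construction-context name
                       (mapcar #'(lambda (string)
                                   (chardata-constructor construction-context string))
                               strings)))
```

With the default context, (reduce-element "a" '("x" "y")) builds a node; with :construction-context (make-instance 'null-context) the same reduction yields nil. The reduction steps still run in both cases, and only their products differ.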


The NOX:sax-construction-context class implements methods for the low-level construction methods which direct a SAX1-equivalent parse event stream at the context instance's bound consumer property. The class is the basis of the bridge class used to parse RDF documents by driving the WILBUR RDF parser through its NOX:sax-consumer interface. Note that, in comparison to an orthodox SAX interface, this is a hybrid event/tree interface in that various atomic properties and events are accumulated and passed to the event consumer as instances. See below for an example which uses it to parse RDF.

NOX:sax-construction-context

[Class]

Where a sax-construction-context instance is furnished as the :construction-context argument to XMLP:document-parser, the respective specialized constructors direct a SAX1-equivalent parse event stream at the context instance's bound consumer property.

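In order to effect this, the context specializes the following functions. Several of the methods return nil in order to suppress the parser's own result, while others, such as those for |STag-Constructor| and construct-ns-node, return values which the parser passes on to later constructors: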

|CDataCharData-Constructor| |CharData-Constructor| construct-construction-context construct-attribute-plist construct-ns-node |Document-Constructor| |STag-Constructor| |Element-Constructor| |ExtParsedEnt-Constructor| |Pi-Constructor|

Event Consumer Interface

For convenience, the NOX:sax-consumer interface is summarized below. Note that this constitutes a SAX1-equivalent interface with additional support for namespaces.

NOX:char-content
(self NOX:sax-consumer) (char-content string) mode

[Generic Function]

NOX:end-document
(self NOX:sax-consumer) mode

[Generic Function]

NOX:end-element
(self NOX:sax-consumer) (tag open-tag) mode

[Generic Function]

NOX:proc-instruction
(self NOX:sax-consumer) (tag proc-instruction) mode

[Generic Function]

NOX:start-document
(self NOX:sax-consumer) locator

[Generic Function]

NOX:start-element
(self NOX:sax-consumer) (tag open-tag) mode

[Generic Function]
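The shape of a consumer can be sketched independently of NOX. The example below is self-contained and purely illustrative: the recording-consumer class, the simplified open-tag class with a single name slot, and the driver replay-sample-events are assumptions, and the generic functions are local stand-ins which merely mirror the argument order summarized above.

```lisp
(defpackage :consumer-sketch (:use :common-lisp))
(in-package :consumer-sketch)

;; Hypothetical stand-ins for the consumer and tag classes.
(defclass recording-consumer ()
  ((events :initform nil :accessor consumer-events)))
(defclass open-tag ()
  ((name :initarg :name :reader tag-name)))

(defgeneric start-element (self tag mode))
(defgeneric end-element (self tag mode))
(defgeneric char-content (self char-content mode))

;; Each method records the event it receives; the mode argument is unused here.
(defmethod start-element ((self recording-consumer) (tag open-tag) mode)
  (declare (ignore mode))
  (push (list :start (tag-name tag)) (consumer-events self)))

(defmethod end-element ((self recording-consumer) (tag open-tag) mode)
  (declare (ignore mode))
  (push (list :end (tag-name tag)) (consumer-events self)))

(defmethod char-content ((self recording-consumer) (char-content string) mode)
  (declare (ignore mode))
  (push (list :chars char-content) (consumer-events self)))

(defun replay-sample-events ()
  "Send the events for <a>text</a> and return them in document order."
  (let ((consumer (make-instance 'recording-consumer))
        (tag (make-instance 'open-tag :name "a")))
    (start-element consumer tag nil)
    (char-content consumer "text" nil)
    (end-element consumer tag nil)
    (reverse (consumer-events consumer))))
```

Note that the same open-tag instance is passed to both the start and the end event, which corresponds to the end-element signature above.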



Examples

A stream-based RDF parser demonstrates one way to use this event interface. The implementation specializes the WILBUR:rdf-parser class to use a NOX:sax-construction-context, introduced above, as a SAX-equivalent driver to generate its parse events. The source to specialize the RDF parser's event producer class is minimal.

xml:demos;rdf;rdf-inline-parser.lisp
;;; -*- package: WILBUR; Syntax: Common-lisp; Base: 10 -*-

(in-package "WILBUR")

;; an xmlp-based rdf parser drives the wilbur event-based counterpart based on
;; inline parser construction operations

(defClass rdf-xmlp-parser (rdf-parser)
  ()
  (:default-initargs
    :producer (make-instance 'nox::sax-construction-context
                :consumer (make-instance 'rdf-syntax-normalizer))))

;;
;;
;;
;;
;; the top-level parse function

(defGeneric parse-db-from-xmlp-stream (source &rest options)
  (:documentation
   "generate an rdf database from an input source.
    uses an rdf parser specialized to translate xmlp parse events into the required
    SAX-equivalent events.
    includes a somewhat redundant method to map a string source to an URI as the
    rdf parsing interface required that before the xmlp parser itself is called. ")
  (:method ((source t) &rest options)
           (apply #'parse-db-from-stream source (xqdm:uri source)
                  :parser-class 'rdf-xmlp-parser
                  options))
  (:method ((source string) &rest options)
           (cond ((char= (char source 0) #\<)
                  (apply #'parse-db-from-xmlp-stream
                         (make-instance 'vector-input-stream :vector source)
                         options))
                 (t
                  (apply #'parse-db-from-xmlp-stream (xutils:make-uri source)
                         options)))))

:EOF

The implementation for the NOX:sax-construction-context class supplants numerous construction operators by specializing them to operate to the exclusion of the parser's internal methods. The excerpts below demonstrate how it supplants, respectively, operators which the xml parser would use to generate nodes in a document model (XMLP:|CharData-Constructor|, XMLP:|Element-Constructor|, and XMLP:|STag-Constructor|) and one used to manipulate properties in the parsing context (XMLP:construct-ns-node).

xml:demos;sax;sax-construction-context.lisp (excerpted)
;;; -*- package: NOX; Syntax: Common-lisp; Base: 10 -*-

(in-package "NOX")

;;; ...
;;; the xmlp:|CharData-Constructor| method passes the character data through to
;;; the event consumer through its char-content method. it returns nil to indicate
;;; that the xml-parser itself should produce no result for this component.
(defMethod xmlp:|CharData-Constructor|
           ((context sax-construction-context) (data string))
  (setf data (collapse-whitespace data))
  (when (plusp (length data))
    (char-content (sax-producer-consumer context)
                  data
                  (sax-consumer-mode (sax-producer-consumer context)))
    nil))

;;; ...
;;; instead of binding the namespace prefix, as the parser's default method would,
;;; the specialization simply returns the properties. the parser eventually furnishes
;;; them together with attribute properties to the call to xmlp:|STag-Constructor|
(defMethod xmlp:construct-ns-node
           ((context sax-construction-context) attribute-value name
            &optional (colon-position (position #\: name))
            &aux ns-name namespace)
  (setf ns-name (xqdm:value-string attribute-value))
  (unless (stringp ns-name)
    (xqdm:xml-error "namespace name syntax error: ~s: ~s." name attribute-value))
  (when (and colon-position (zerop (length ns-name)))
    (xqdm:xml-error xqdm:|NSC: No Null Namespace Bindings| :name name))
  (setf namespace (xqdm:find-namespace ns-name :if-does-not-exist :create))
  (xmlp:call-with-name-properties
   #'(lambda (&key local-part &allow-other-keys) (cons local-part namespace))
   name :colon-position colon-position :namespace xqdm:*xmlns-namespace*))

;;; ...
;;; the distinction between the specialized method for xmlp:|Element-Constructor|
;;; and that for xmlp:|STag-Constructor| demonstrates how these constructors
;;; interact with the xml parser's internal state. where the element constructor
;;; passes the event through the consumer's end-element method and produces no result,
;;; the xmlp:|STag-Constructor| specialization not only generates a start-element
;;; event, it also returns the resulting event instance, which the xml parser then
;;; collects among the terms in the Element phrase and supplies to the call to
;;; xmlp:|Element-Constructor|
(defMethod xmlp:|Element-Constructor|
           ((context sax-construction-context) (content* t) etag stag)
  (when etag
    (let ((close-tag (make-instance 'close-tag)))
      (setf (tag-counterpart stag) close-tag
            (tag-counterpart close-tag) stag)))
  (end-element (sax-producer-consumer context) stag
               (sax-consumer-mode (sax-producer-consumer context)))
  nil)

(defMethod xmlp:|STag-Constructor|
           ((context sax-construction-context) attr-plist+ns-cons* name)
  (let ((tag (make-instance 'open-tag))
        (namespaces nil)
        (attributes nil))
    (xmlp:call-with-name-properties
     #'(lambda (&key namestring local-part namespace &allow-other-keys)
         (flet ((tag-attribute (&key name att-value)
                  (xmlp:call-with-name-properties
                   #'(lambda (&key local-part namespace &allow-other-keys)
                       (cons (concatenate 'string (xqdm:namespace-name namespace)
                                          local-part)
                             (xqdm:value-string att-value)))
                   name))
                (tag-namespace (name value)
                  (cons (if (string-equal name xqdm:*xmlns-prefix-namestring*)
                          nil
                          name)
                        value)))
           (setf (token-string tag) (if (eq namespace xqdm:*null-namespace*)
                                      local-part
                                      (concatenate 'string
                                                   (xqdm:namespace-name namespace)
                                                   local-part))
                 (tag-original-name tag) namestring)
           (mapcar #'(lambda (attr-plist+ns-cons)
                       (cond ((consp (rest attr-plist+ns-cons)) ;; an attribute
                              (push (apply #'tag-attribute attr-plist+ns-cons)
                                    attributes))
                             (t
                              (push (tag-namespace (first attr-plist+ns-cons)
                                                   (rest attr-plist+ns-cons))
                                    namespaces))))
                   attr-plist+ns-cons*)
           (setf (tag-attributes tag) attributes
                 (tag-namespaces tag) namespaces)))
     name)
    (start-element (sax-producer-consumer context) tag
                   (sax-consumer-mode (sax-producer-consumer context)))
    tag))
;;; ...
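The way the start-tag result reaches the element constructor can be reduced to a small, self-contained model. This is a sketch under stated assumptions rather than the CL-XML reduction machinery: the event-context class, the list-based tag, the *events* variable, and the toy reducer reduce-empty-element are hypothetical, and the constructor functions are local stand-ins with simplified parameter lists.

```lisp
(defpackage :tag-flow-sketch (:use :common-lisp))
(in-package :tag-flow-sketch)

(defvar *events* nil "Events emitted so far, most recent first.")

(defclass event-context () ())

(defgeneric stag-constructor (context name))
(defgeneric element-constructor (context content end-tag start-tag))

;; Emits a start event and returns the tag so that the reducer can hand it on.
(defmethod stag-constructor ((context event-context) name)
  (let ((tag (list :tag name)))
    (push (list :start-element name) *events*)
    tag))

;; Receives the tag produced above, emits the end event, and yields no node.
(defmethod element-constructor ((context event-context) content end-tag start-tag)
  (declare (ignore content end-tag))
  (push (list :end-element (second start-tag)) *events*)
  nil)

;; Toy reducer for one empty element: the start-tag phrase is reduced first and
;; its result becomes an argument when the enclosing element phrase is reduced.
(defun reduce-empty-element (context name)
  (let ((*events* nil))
    (let* ((start-tag (stag-constructor context name))
           (result (element-constructor context nil nil start-tag)))
      (values result (reverse *events*)))))
```

The first value returned by reduce-empty-element is nil, since the element constructor produces no node, while the second value lists the start and end events in the order in which they were emitted.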

As a side note, it is also possible to drive the event-based interface to the RDF parser by generating a parse event stream while traversing a document model. This practice is demonstrated by the WILBUR:rdf-dom-parser implementation, which is analogous to that for NOX:sax-construction-context.



© setf.de 2003