[unrev-II] Editing an arbitrary XML document

From: Eric Armstrong (eric.armstrong@eng.sun.com)
Date: Fri Jul 21 2000 - 19:20:28 PDT

    I hate when I overlook the obvious. Using the solution
    I presented earlier, a DTD (or schema) is *always* necessary
    for editing. I think I've found a better solution, though.
    (See the last section of this long article.)

    The goal is to allow an arbitrary editor to edit an arbitrary
    XML document in such a way that the editor presents a "clean",
    outline-view of the document that is easily edited. To do that,
    it needs to distinguish structural (block structure) elements
    from "inline" (part of the flow of the text) elements.

    XML's inability to distinguish those two very different kinds
    of elements means that outline-oriented editors display
    everything as an outline, including tags like <b> and <i>.
    On the other hand, editors that eschew the outline approach
    require editing stylesheets for every document type in order
    to properly render the information.

    Neither solution is ideal. The all-outlines, all-the-time
    approach produces a very unsatisfactory editing experience.
    But the requirement for adding stylesheets adds to the complexity
    of authoring a document, and makes it harder to edit the same
    document in different editors (even with XSL, the commands used
    to control rendering can vary, so a stylesheet that is valid
    for one editor may not work with another).

    In addition, since XML has no ability to mark a tag as
    "inlinable", it naturally has no concept of "mixed content
    (text plus inlinable tags) followed by structural tags". But
    that is precisely the nature of every structured document in
    existence! In any document, a heading consists of text followed
    by subheadings. Although text will naturally occur within
    subheadings, no text occurs between subheadings.

    But the mixed content model allows text between subelements.
    As a result there is no validation mechanism that lets you
    ensure that text and inlinable elements occur only before
    structural elements, and never between them. Circumventing
    that limitation requires the definition of extra elements.
    So DocBook, for example, defines:
         <SECT1>
            <TITLE>The Section Title

    But note how the dual structural elements <SECT1> and <TITLE>
    conspire to consume both vertical and horizontal space. Of the
    two, the vertical space is more costly to the "outline view"
    of the data. But pushing the text further to the right is costly
    as well.

    The impact of these issues goes well beyond simple aesthetics,
    however. At the moment, the world's data is divided into
    several format categories:
      * easily viewed and edited plain text
      * hard to view (proprietary) structured formats
      * very hard to view binary formats

    Of these formats, plain text is nearly ubiquitous -- because it
    can be easily displayed and edited using any number of tools,
    all of which are easily available.

    Structured formats add useful information, but are harder to
    work with because they require the appropriate editor (say, Word).

    Binary formats, like the data found in a database, are essentially
    unusable unless approached from within the database. Data in that
    format is the least accessible.

    In retrospect, it seems clear that HTML and XML owe much of their
    success to the fact that they are *plain text* markup languages.
    That makes it possible to display and modify data in that format
    using any available text editing tool.

    Using those tools, however, requires you to give up the advantages
    of structuring that XML provides. And, as we have seen, using
    an XML-aware tool puts XML data into the same category as one of
    the proprietary structured formats -- you need special style
    controls to interact with the data effectively.

    However, if those problems can be solved, then an arbitrary XML
    editor could conceivably edit an arbitrary XML file, and do so
    intelligently. The result would be useful, outline-oriented
    editors that take advantage of XML's structure without requiring
    a lot of customizing.

    If that solution *can* be achieved, then XML may well become as
    ubiquitous in the future as plain text is today. It could even
    supplant plain text, in the same way that plain text replaced
    those nice, safe punched cards -- the ones you never had to worry
    about losing if the computer disk crashed.

    But in addition to a desirable ubiquity, the ability to edit an
    arbitrary XML document in semi-intelligent fashion makes it more
    reasonable to design systems that rely on XML for input, and which
    deliver XML as output. Mail systems, bookmark files, and various
    other systems can then afford to make XML central to their
    operation without having to take special measures to make sure
    that users can do the requisite editing.

    Solving the "intelligent editing" problems alluded to earlier
    therefore has a major impact on both the ubiquity and the utility
    of storing data in XML.

    Prior Solution
    The first solution to the problem of editing an arbitrary
    XML document I identified goes like this:

      * If an element uses the mixed-content model
        (where "mixed content" == "text + other elements")
        then assume every element within it is an
        inline element. (In the absence of stylistic
        controls that say, for example, to treat <def> the
        same as <i>, the tags <def> and </def> could be
        rendered as immutable tokens -- selectable but not
        editable.)

        Result: <node>A <b>bold</b> word
        Instead of:

      * If an element does *not* use the mixed-content
        model, and its first subelement *does*, then
        ignore the subelement and display the subelement
        data as though it belonged to the element. (When
        editing, be sure to save changes in the subelement.)

        Result: <SECT1>My Book on Me
                   <SECT2>Where I was Born
        Instead of:
                <SECT1>
                   <TITLE>My Book on Me
                   <SECT2>
                      <TITLE>Where I was Born
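
    The two rules above can be sketched in a few lines of Python.
    This is only an illustration, not a real editor: it assumes the
    set of mixed-content element names is already known (the MIXED
    set below is made up for the example), and the function names
    inline_text and outline are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the mixed-content element names are supplied from elsewhere
# (for instance, from a DTD). These names are illustrative only.
MIXED = {"TITLE", "PARA", "node"}

def inline_text(elem):
    """Flatten a mixed-content element, keeping inline tags as literal tokens."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(f"<{child.tag}>{inline_text(child)}</{child.tag}>")
        parts.append(child.tail or "")
    return "".join(parts)

def outline(elem, depth=0, lines=None):
    """Return the outline view of elem as a list of indented lines."""
    if lines is None:
        lines = []
    indent = "   " * depth
    children = list(elem)
    if elem.tag in MIXED:
        # Rule 1: everything inside a mixed-content element is inline.
        lines.append(f"{indent}<{elem.tag}>{inline_text(elem).strip()}")
        return lines
    if children and children[0].tag in MIXED:
        # Rule 2: show the first subelement's text as if it belonged to
        # the parent, and skip that subelement in the outline.
        lines.append(f"{indent}<{elem.tag}>{inline_text(children[0]).strip()}")
        children = children[1:]
    else:
        lines.append(f"{indent}<{elem.tag}>")
    for child in children:
        outline(child, depth + 1, lines)
    return lines

doc = ET.fromstring(
    "<SECT1><TITLE>My Book on Me</TITLE>"
    "<SECT2><TITLE>Where I was <b>Born</b></TITLE></SECT2></SECT1>"
)
print("\n".join(outline(doc)))
```

    Note that the sketch only produces the display. A real editor
    would also have to write edits back into the hidden subelement,
    as the second rule requires.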

    It's a clever solution. It allows for the most reasonable
    outline-oriented editing of XML data, without requiring
    a lot of intelligence on the part of the editor.

    But it does require *some* intelligence. First, it
    requires a DTD. Otherwise, situations arise in which
    the editor cannot determine what is mixed content.

      [In the absence of a DTD, the editor could try
       inspecting the tree to see if any text exists.
       If text does exist, the answer is clear. But if
       no text exists, it is unclear whether the element
       represents a blank line or a structural block.
       Similarly, when no text follows an element, that
       element could be either the first structural
       element, or the last inline element in the text.]
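
    A rough sketch of that tree-inspection fallback, in Python,
    shows where the ambiguity comes from. The function names are
    hypothetical, and the only thing the code can ever say with
    confidence is "mixed"; everything else comes back "unknown".

```python
import xml.etree.ElementTree as ET

def has_text(s):
    """True if s contains anything other than whitespace."""
    return bool(s and s.strip())

def classify(elem):
    """Guess an element's content model from one instance, with no DTD."""
    # Text directly inside the element, or between/after its children,
    # proves that this instance holds character data.
    if has_text(elem.text) or any(has_text(child.tail) for child in elem):
        return "mixed"
    # No text at all: could be a blank line, a structural block,
    # or a mixed-content element that happens to be empty.
    return "unknown"

doc = ET.fromstring(
    "<doc><p>A <b>bold</b> word</p><p/><sect><title>T</title></sect></doc>"
)
print([classify(e) for e in doc])
```

    The empty <p/> and the structural <sect> are indistinguishable
    to this check, which is exactly the case a DTD would settle.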

    Second, the editor requires the ability to parse a DTD,
    to determine which elements use the mixed content model.
    Unfortunately, DTD parsing is *not* part of the XML 1.0
    standard. (I believe it may be rectified in the next
    version of the standard.)

    What that means is: There is no API which exposes the
    contents of the DTD, so there is no way to easily
    determine if a given element uses the mixed content model,
    or not. Except by parsing the DTD. Or, if a schema was
    used, then the schema must be parsed. (An easier job than
    parsing the DTD, but a different job -- and one that must
    be repeated for each of the schema standards!)
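
    To give a feel for what "parsing the DTD" involves in the
    simplest case, here is a deliberately naive Python sketch. It
    relies on the fact that XML 1.0 declares mixed content as
    (#PCDATA) or (#PCDATA | a | b)*. It is an assumption-laden
    shortcut, not a DTD parser: it ignores parameter entities,
    conditional sections, and comments, and the sample DTD is
    invented for the example (it is not the real DocBook DTD).

```python
import re

# Matches <!ELEMENT name content-model> declarations in plain DTD text.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([^\s>]+)\s+([^>]*)>")

def mixed_content_elements(dtd_text):
    """Return the names of elements declared with a #PCDATA content model."""
    mixed = set()
    for name, model in ELEMENT_DECL.findall(dtd_text):
        # Mixed content always starts with "(#PCDATA".
        if re.match(r"\(\s*#PCDATA\b", model):
            mixed.add(name)
    return mixed

SAMPLE_DTD = """
<!ELEMENT SECT1 (TITLE, PARA*, SECT2*)>
<!ELEMENT TITLE (#PCDATA | b | i)*>
<!ELEMENT PARA  (#PCDATA|b|i)*>
<!ELEMENT b     (#PCDATA)>
<!ELEMENT hr    EMPTY>
"""

print(sorted(mixed_content_elements(SAMPLE_DTD)))
```

    Even this toy version has to be rewritten from scratch for
    each schema language, which is the cost described above.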

    At a minimum, then, it must be possible to identify elements
    that use the mixed-content model to do an adequate job of
    outline-based editing. (Even better would be a clear
    distinction between inline and structural elements, but
    mixed-content will do in a pinch.)

    The ability to edit an arbitrary file therefore depends on
    the ability to parse the DTD or schema flavor of the week.
    That in turn means that the DTD (or equivalent schema) must
    be present.

    Finally, it means that until an API is available that exposes
    the needed information, widespread use of XML editors is
    unlikely, due to the extra complexity imposed by the DTD --
    complexity which can only be offset by defining an extra
    stylesheet for every DTD, purely for editing purposes.

    There are three possible solutions:
      * Standardize on a set of structural elements.
        For example: <BLOCK>, <SECT*>, <DIV>, <NODE>.
        (Not really good for data sets, though, where
         most elements are structural, rather than mixed.)

      * Wait for a version of the XML standard that provides
        an API for DTD/schema access, and require the DTD/schema
        to be present when editing.

      * Standardize on a set of inline elements.
        For example: <b>, <i>, <a>, etc.
        Then add the equivalence operation (<def>==<i>)
        in the editor, so that tags can be redefined as
        "inline" as needed (depending on file extension).

    The last solution probably makes the most sense. It provides
    a simple mechanism for stylistic control (without requiring a
    style sheet); the editor can easily select the equivalence set
    based on a file's extension; and it tells the editor how to
    render the inline data.
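
    As a minimal sketch of how that last option might look inside
    an editor, assuming a standard inline set of <b>, <i>, and <a>:
    the ".dict" extension, the <term> tag, and the function names
    below are all hypothetical, chosen only to show the lookup.

```python
import os

# Assumed standard set of inline elements (<b>, <i>, <a>, etc.).
STANDARD_INLINE = {"b", "i", "a"}

# Hypothetical equivalence tables, keyed by file extension.
# Each entry maps a document-specific tag to the standard inline
# tag it should be rendered as (for example, <def> == <i>).
EQUIVALENCES = {
    ".dict": {"def": "i", "term": "b"},
}

def inline_rendering(filename):
    """Map each tag treated as inline to the standard tag it renders as."""
    ext = os.path.splitext(filename)[1].lower()
    rendering = {tag: tag for tag in STANDARD_INLINE}
    rendering.update(EQUIVALENCES.get(ext, {}))
    return rendering

def is_inline(tag, filename):
    """True if the editor should keep this tag in the text flow."""
    return tag in inline_rendering(filename)

print(inline_rendering("glossary.dict")["def"])
print(is_inline("def", "notes.xml"))
```

    Anything not in the resulting table is treated as structural,
    so the editor needs neither a DTD nor a stylesheet to build
    the outline.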

