I hate when I overlook the obvious. Using the solution
I presented earlier, a DTD (or schema) is *always* necessary
for editing. I think I've found a better solution, though.
(See the last section of this long article.)
The goal is to allow an arbitrary editor to edit an arbitrary
XML document in such a way that the editor presents a "clean",
outline-view of the document that is easily edited. To do that,
it needs to distinguish structural (block structure) elements
from "inline" (part of the flow of the text) elements.
XML's inability to distinguish those two very different kinds
of elements means that outline-oriented editors display
everything as an outline, including tags like <b> and <i>.
On the other hand, editors that eschew the outline approach
require editing stylesheets for every document type in order
to properly render the information.
Neither solution is ideal. The all-outlines, all-the-time
approach produces a very unsatisfactory editing experience.
But the requirement for adding stylesheets adds to the complexity
of authoring a document, and makes it harder to edit the same
document in different editors (even with XSL, the commands used
to control rendering can vary, so a stylesheet that is valid
for one editor may not work with another).
In addition, since XML has no ability to mark a tag as
"inlinable", it naturally has no concept of "mixed content
(text plus inlinable tags) followed by structural tags". But
that is precisely the nature of every structured document in
existence! In any document, a heading consists of text followed
by subheadings. Although text will naturally occur within
subheadings, no text occurs between subheadings.
But the mixed content model allows text between subelements.
As a result there is no validation mechanism that lets you
ensure that text and inlinable elements occur only before
structural elements, and never between them. Circumventing
that limitation requires the definition of extra elements.
So DocBook, for example, defines:
<TITLE>The Section Title
But note how the dual structural elements <SECT1> and <TITLE>
conspire to consume both vertical and horizontal space. Of the
two, the vertical space is more costly to the "outline view"
of the data. But pushing the text further to the right is costly
as well.
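To make the missing constraint concrete, here is a minimal sketch of the check a validator cannot express today: text and inline tags may appear only before the first structural child. The set of inline tags and the function name are my own assumptions for illustration; nothing in XML itself supplies them.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of inline tags; XML itself has no way to declare this.
INLINE_TAGS = {"b", "i", "a"}

def misplaced_content(elem, inline_tags=INLINE_TAGS):
    """Describe text or inline tags found after the first structural child."""
    problems = []
    seen_structural = False
    for child in elem:
        if child.tag in inline_tags:
            if seen_structural:
                problems.append(f"inline <{child.tag}> after a structural element")
        else:
            seen_structural = True
        # child.tail is the text between this child and its next sibling.
        if seen_structural and child.tail and child.tail.strip():
            problems.append(f"text after <{child.tag}>: {child.tail.strip()!r}")
    return problems

good = ET.fromstring(
    "<node>Some <b>bold</b> text<node>child one</node><node>child two</node></node>"
)
bad = ET.fromstring("<node>Intro<node>one</node>stray text<node>two</node></node>")
print(misplaced_content(good))
print(misplaced_content(bad))
```

The first document passes because all of its text comes before the nested nodes; the second is flagged because of the text sitting between two structural children.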
The impact of these issues goes well beyond simple aesthetics,
however. At the moment, the world's data is divided into
several format categories:
* easily viewed and edited plain text
* hard to view (proprietary) structured formats
* very hard to view binary formats
Of these formats, plain text is nearly ubiquitous -- because it
can be easily displayed and edited using any number of tools,
all of which are easily available.
Structured formats add useful information, but are harder to
work with because they require the appropriate editor (say, Word).
Binary formats, like the data found in a database, are essentially
unusable unless approached from within the database. Data in that
format is the least accessible.
In retrospect, it seems clear that HTML and XML owe much of their
success to the fact that they are *plain text* markup languages.
That makes it possible to display and modify data in that format
using any available text editing tool.
Using those tools, however, requires you to give up the advantages
of structuring that XML provides. And, as we have seen, using
an XML-aware tool puts XML data into the same category as
the proprietary structured formats -- you need special style
controls to interact with the data effectively.
However, if those problems can be solved, then an arbitrary XML
editor could conceivably edit an arbitrary XML file, and do so
intelligently. The result would be useful, outline-oriented
editors that take advantage of XML's structure without requiring
a lot of customizing.
If that solution *can* be achieved, then XML may well become as
ubiquitous in the future as plain text is today. It could even
supplant plain text, in the same way that plain text replaced
those nice, safe punched cards -- the ones you never had to worry
about losing if the computer disk crashed.
But in addition to a desirable ubiquity, the ability to edit an
arbitrary XML document in semi-intelligent fashion makes it more
reasonable to design systems that rely on XML for input, and which
deliver XML as output. Mail systems, bookmark files, and various
other systems can then afford to make XML central to their
operation without having to take special measures to make sure
that users can do the requisite editing.
Solving the "intelligent editing" problems alluded to earlier
therefore has a major impact on both the ubiquity and the utility
of storing data in XML.
Prior Solution
The first solution I identified to the problem of editing an
arbitrary XML document goes like this:
* If an element uses the mixed-content model
(where "mixed content" == "text + other elements")
then assume every element within it is an
inline element. (In the absence of stylistic
controls that say, for example, to treat <def> the
same as <i>, the tags <def> & </def> could be
rendered as immutable tokens -- selectable but not
Result: <node>A <b>bold</b> word
Instead of:
* If an element does *not* use the mixed-content
model, and its first subelement *does*, then
ignore the subelement and display the subelement
data as though it belonged to the element. (When
editing, be sure to save changes in the subelement.)
Result: <SECT1>My Book on Me
<SECT2>Where I was Born
Instead of:
<TITLE>My Book on Me
<TITLE>Where I was Born
It's a clever solution. It allows for the most reasonable
outline-oriented editing of XML data, without requiring
a lot of intelligence on the part of the editor.
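A rough sketch of the two rules, for display only. It assumes the set of mixed-content element names is already known (here a hard-coded, hypothetical MIXED set standing in for what the DTD would tell us), and the function names are mine.

```python
import xml.etree.ElementTree as ET

# Assumption: these names would come from the DTD's mixed-content declarations.
MIXED = {"TITLE", "PARA"}

def inline_text(elem):
    """Rule 1: flatten a mixed-content element, keeping its children as inline tags."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(f"<{child.tag}>{inline_text(child)}</{child.tag}>")
        parts.append(child.tail or "")
    return "".join(parts)

def outline(elem, depth=0, mixed=MIXED):
    """Return outline lines; a mixed-content first child is shown on its parent's line (rule 2)."""
    indent = "  " * depth
    if elem.tag in mixed:
        return [f"{indent}<{elem.tag}>{inline_text(elem).strip()}"]
    children = list(elem)
    if children and children[0].tag in mixed:
        lines = [f"{indent}<{elem.tag}>{inline_text(children[0]).strip()}"]
        children = children[1:]
    else:
        lines = [f"{indent}<{elem.tag}>"]
    for child in children:
        lines.extend(outline(child, depth + 1, mixed))
    return lines

doc = ET.fromstring(
    "<SECT1><TITLE>My Book on Me</TITLE>"
    "<SECT2><TITLE>Where I was <b>Born</b></TITLE></SECT2></SECT1>"
)
print("\n".join(outline(doc)))
```

The sketch covers rendering only. A real editor would also have to remember that the displayed heading text lives in the hidden subelement, so that edits are saved back there.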
But it does require *some* intelligence. First, it
requires a DTD. Otherwise, situations arise in which
the editor cannot determine what is mixed content.
[In the absence of a DTD, the editor could try
inspecting the tree to see if any text exists.
If text does exist, the answer is clear. But if
no text exists, it is unclear whether the element
represents a blank line or a structural block.
Similarly, when no text follows an element, that
element could be either the first structural
element, or the last inline element in the text.]
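The DTD-less inspection described in the bracketed note might look like the sketch below. The function names and the "mixed"/"ambiguous" labels are my own; the point is only that the absence of text leaves the question open.

```python
import xml.etree.ElementTree as ET

def has_own_text(elem):
    """True if the element directly holds non-whitespace text (not inside its children)."""
    if elem.text and elem.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in elem)

def guess_kind(elem):
    """Guess from the instance alone: 'mixed' if text is present, otherwise 'ambiguous'."""
    return "mixed" if has_own_text(elem) else "ambiguous"

doc = ET.fromstring("<doc><p>Hello <b>world</b></p><p/><sect><p>x</p></sect></doc>")
for child in doc:
    print(child.tag, guess_kind(child))
```

The empty <p/> and the <sect> both come back as "ambiguous": without a DTD, nothing in the instance says whether either one is a blank line of text or a structural block.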
Second, the editor requires the ability to parse a DTD,
to determine which elements use the mixed content model.
Unfortunately, DTD parsing is *not* part of the XML 1.0
standard. (I believe it may be rectified in the next
version of the standard.)
What that means is: There is no API which exposes the
contents of the DTD, so there is no easy way to
determine whether a given element uses the mixed content
model, except by parsing the DTD. Or, if a schema was
used, then the schema must be parsed. (An easier job than
parsing the DTD, but a different job -- and one that must
be repeated for each of the schema standards!)
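For a sense of what "parsing the DTD" involves at its very crudest, here is a naive sketch that scans element declarations for #PCDATA. It is an assumption-laden shortcut, not a DTD parser: it ignores comments, parameter entities, and conditional sections, all of which a real implementation would need to handle. The function name is hypothetical.

```python
import re

# Naive pattern for <!ELEMENT name contentspec>. It does not understand
# comments, parameter entities, or conditional sections.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([^\s>]+)\s+([^>]+)>")

def mixed_content_elements(dtd_text):
    """Return names of elements whose declared content model includes #PCDATA."""
    return {
        name
        for name, spec in ELEMENT_DECL.findall(dtd_text)
        if "#PCDATA" in spec
    }

dtd = """
<!ELEMENT SECT1 (TITLE, PARA*, SECT2*)>
<!ELEMENT SECT2 (TITLE, PARA*)>
<!ELEMENT TITLE (#PCDATA | b | i)*>
<!ELEMENT PARA (#PCDATA | b | i)*>
<!ELEMENT b (#PCDATA)>
"""
print(sorted(mixed_content_elements(dtd)))
```

Even this toy version shows the cost: it is tied to DTD syntax, so each schema language would need its own equivalent.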
At a minimum, then, it must be possible to identify elements
that use the mixed-content model to do an adequate job of
outline-based editing. (Even better would be a clear
distinction between inline and structural elements, but
mixed-content will do in a pinch.)
The ability to edit an arbitrary file therefore depends on
the ability to parse the DTD or schema flavor of the week.
That in turn means that the DTD (or equivalent schema) must
be present.
Finally, it means that until an API is available that exposes
the needed information, widespread use of XML editors is
unlikely, due to the extra complexity imposed by DTDs --
complexity which can only be offset by defining an extra
stylesheet for every DTD, purely for editing purposes.
There are three possible solutions:
* Standardize on a set of structural elements.
For example: <BLOCK>, <SECT*>, <DIV>, <NODE>.
(Not really good for data sets, though, where
most elements are structural, rather than mixed.)
* Wait for a version of the XML standard that provides
an API for DTD/schema access, and require the DTD/schema
to be present when editing.
* Standardize on a set of inline elements.
For example: <b>, <i>, <a>, etc.
Then add the equivalence operation (<def>==<i>)
in the editor, so that tags can be redefined as
"inline" as needed (depending on file extension).
The last solution probably makes the most sense. It provides
a simple mechanism for stylistic control (without requiring a
style sheet); the editor can easily select the equivalence set
based on a file's extension; and it tells the editor how to
render the inline data.
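As a sketch of how small that mechanism could be: a fixed set of standard inline tags, plus a per-extension table of equivalences. The ".dict" extension, the table contents, and the function name are all hypothetical, chosen only to illustrate the <def>==<i> idea.

```python
import os

# Standard inline tags, as proposed above.
STANDARD_INLINE = {"b", "i", "a"}

# Hypothetical per-extension equivalence sets: custom tag -> standard inline tag.
EQUIVALENCES = {
    ".dict": {"def": "i", "term": "b"},
}

def render_as(tag, filename):
    """Return the standard inline tag used to render `tag`, or None if it is structural."""
    if tag in STANDARD_INLINE:
        return tag
    ext = os.path.splitext(filename)[1].lower()
    return EQUIVALENCES.get(ext, {}).get(tag)

print(render_as("def", "glossary.dict"))
print(render_as("def", "notes.xml"))
print(render_as("SECT1", "glossary.dict"))
```

Any tag that is neither standard nor listed for the file's extension is treated as structural, so the editor needs no DTD to build its outline.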
This archive was generated by hypermail 2b29 : Fri Jul 21 2000 - 19:27:29 PDT