By tomorrow morning (Saturday), this document:
Encoding Source in XML: A Strategic Analysis
will be available on my web site at
http://www.treelight.com/software/encodingSource.html
It's 8500 words long. There is a copy there now,
but I have a couple of corrections and attributions
to add to it yet.
Hopefully, in the near future we'll have a link to it
from the eXtendDE page at
http://eXtenDE.sourceforge.com.
Community email addresses:
Post message: unrev-II@onelist.com
Subscribe: unrev-II-subscribe@onelist.com
Unsubscribe: unrev-II-unsubscribe@onelist.com
List owner: unrev-II-owner@onelist.com
Shortcut URL to this page:
http://www.onelist.com/community/unrev-II
By Eric Armstrong, 23 Jun 2000
(___ words)
Summary
Storing source code in XML provides many benefits that cannot be achieved any other way, including hierarchical structures, literate programming style, links to explanations, choosable display styles, and the elimination of braces, semicolons, and end-comment marks. The advantages are described, and possible strategies for encoding source in XML are evaluated. A "plain" encoding style using one, or at most two element types is found to be most desirable for editing, despite the value of using multiple elements (one per language component) for compiling and other automated processing.
This document looks at the reasons and methods for encoding program source code in XML. It also describes the major issues that must be solved for the encoding to be successful.
This document contains the following sections:
Storing source code using XML data structures has a number of advantages that are difficult to obtain using plain text. Those advantages include:
public class MyClass {
+ public String myFirstMethod() {
+ public String mySecondMethod() {
...
public class MyClass {
+ // Variables
+ public String myFirstMethod() {
  + // Access the database
  + // Calculate the result
  + // Check for validity
  + // Return the result
+ public String mySecondMethod() {
...
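The hierarchical-structure advantage can be sketched concretely. The following Python fragment is hypothetical: the <node>/<content> vocabulary is the one this paper eventually argues for, not an existing standard. It stores the class outline above as nested XML elements and derives the collapsed view directly from the structure:

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: each source line is a <node> whose text lives
# in a <content> child, and nested statements are nested <node>s.
DOC = """<node><content>public class MyClass {</content>
  <node><content>public String myFirstMethod() {</content>
    <node><content>// Access the database</content></node>
    <node><content>// Calculate the result</content></node>
  </node>
  <node><content>public String mySecondMethod() {</content></node>
</node>"""

def collapsed_view(elem):
    """One line per top-level child, marked '+' when it hides children."""
    lines = [elem.find("content").text]
    for child in elem.findall("node"):
        marker = "+ " if child.findall("node") else "  "
        lines.append(marker + child.find("content").text)
    return lines

print("\n".join(collapsed_view(ET.fromstring(DOC))))
```

Because the nesting is explicit in the data, the collapsed display falls out of a simple tree walk; no parsing of braces is required.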
For those reasons, it makes sense to consider an XML encoding. The question is, what format to use for the XML data structures?
The goal is to encode source language documents in XML in such a way that they are:
There are many possible strategies to use for the encoding. Several of them were discussed on the extende.sourceforge.com developer's list. [Note: In the analysis that follows, I have tried to acknowledge the contributors to the discussion. If I have overlooked anyone, send a note to the developer's list, and I will correct it forthwith!]
Those strategies include:
For editing and acceptance by developers, I suspect that it is the plain node (outline) style that will work best. However, there is already an excellent definition for a more language-centric format at http://sds.sourceforge.com. Their common source format (CSF) will be useful for many automated tools. It may be that developers will be won over to this format, and will gravitate toward it, once really good XML editors make their way to market. However, as will be argued below, even with the best of editors, the task of editing will be rendered more difficult by such an encoding strategy. Since the use of XML for source code depends in large part upon its acceptance by developers, I suspect that a more "natural" approach will have the highest rate of acceptance.
The remainder of this paper discusses the short term vs. the long term outlook for encoding source in XML, after which it provides an analysis of the various encoding options. It then lists the salient issues that have to be taken into account when storing and processing source code in XML structures, and ends with a look at the shortcomings of XML for that purpose. (With those shortcomings rectified, the XML-encoding option begins to approach elegance. But those solutions are not anywhere on the horizon.)
The short term presents some interesting complexities that ideally will not exist in the long term. In the short term, we are faced with the fact that source code developers have been using plain text editors and a "plain text" encoding of source code for the last 40 years or more, ever since computer languages were first invented. Developers are, therefore, used to that process. But aside from the normal human resistance to trying something new, developers are understandably wary of storing source code in new forms.
Suppose, for example, that you tried some early development environment that used databases and binary formats to store source code. You might have found that other tools you took for granted (like "grep", a search tool you used to find every file that contained a particular variable) no longer worked. Although it was part of your working style when source was in flat files, you suddenly discovered that you could no longer use that tool. You would have felt like a carpenter whose hammer was missing. Powerless.
Or you might have entrusted your source code to that system, and then found you could not easily share it with others. That prevented them from building on your work. Or, worst of all, you might have found that a power failure corrupted the database and wiped out all your code -- not just the file you were working on at the time. (That example is not too far-fetched. I was using a production email program for over a year, when the database got corrupted. It turned out that the manufacturer had not thought to include any recovery or analysis tools. So every email in that system was lost.)
So, developers have been historically averse to trying new formats, with considerable justification. However, there is a very real possibility that XML will become the ubiquitous data format of the next century. If that occurs, developers will be using XML editors to write documents, send email, and do most other tasks they perform in their working day. If that happens, the resistance to storing source in XML will likely diminish.
After all, it must have felt very strange to the first programmer who typed a program into plain text. It must have seemed much less safe than plugging wires into a board, or punching holes in cards. One of those programs was real. It was solid. You could hold it in your hand, and know it was safe. Putting it into a computer file would have taken a whole lot of trust. Over time, though, experience with plain text editing proved reliable, and the advantages over punched cards were enormous, so the new encoding medium took root. I expect a similar acceptance curve for source code in XML.
This approach was proposed by _____ on the eXtenDE developer's list. In reality, of course, this is more of a UI (user interface) issue than an encoding issue. Any encoding scheme could be treated graphically, so that the system had more of a UML flavor than a source-language flavor. The possibility of using graphics is an interesting one, though, that deserves to be addressed. This seemed like a reasonable place to do so.
Despite the value of using UML diagrams for understanding and communicating the broad strokes of a system design, there are a few factors that seem to imply its unsuitability for large projects. Among those factors are:
- Graphic Complexity
- Graphics work well for visualizing small systems, or a very high-level view of a large system. All of the demos for graphical systems do one or the other. But when systems get large and complex, full graphical treatments break down very quickly. What works best, of course, is the combination of graphics and text -- diagrams of specific subsections, attached to text (in the case of a document) or source code (in the case of a program). XML encoding of source, and the potential for linking and image-inclusion that results, therefore provides the most likely prospect for dealing with complex systems effectively.
- History of Hardware Development
- Hardware development started out as a graphic process. For decades, graphic tools were steadily improved, so that designers could draw their designs and have them translated into silicon. But in the last decade, the trend has been away from graphic designs, and toward languages. The reasons include the inability to visualize 7-layer boards with 3-dimensional interconnections, as well as the ability to easily reuse routines stored in source-language form.
- Lack of Hierarchical Graphics Tools
- While multi-dimensional interconnections appear unavoidable in hardware design, the goal of object-oriented development is the production of more modular systems. In principle, then, "visualization difficulty" need not be a limiting factor. In practice, though, there is a definite lack of good, hierarchical graphic tools at the software designer's disposal.
Typically, when we think of a graphic hierarchy, we think of a tree of graphic objects, with lower level objects connected to parent objects by lines. But in the context of a design, a graphic hierarchy requires nesting. At a high level, you might see 3 or 4 major components connected to each other. "Drilling down" into one of them might then show the subsystems comprising that component.
However, mere "drilling down" is not enough. The outliner equivalent of that is like a directory tree where all you see is the topmost level of directories, and when you "drill down", that view is replaced by the directories it contains. Although the display system is "hierarchical", the loss of surrounding context at each view places too many demands on the viewer, who must keep the interconnections in his or her head in order to relate the current view to other views.
What is needed in such systems is the ability to view a diagram at multiple levels. Just as you can expand or collapse an outline to see one level deep or multiple levels deep (as in a directory tree), you need the ability to display multiple levels of a graphic hierarchy. At the topmost level, then, you might see 3 or 4 major components with "thick pipes" between them. But when you expand that view, you would see the objects inside those components, as well. The interconnections would then show the communication paths between those objects. Those smaller communication paths would also be contained in the larger "thick pipes", so that they were organized and labeled at the higher level as well as at the lower level.
For example, imagine a thick pipe labeled "input". That pipe might contain a path from a text object that goes to a text processor, and one from a scrollbar that goes to a percentage-value processor. The major components in this case might be UI, Processing, and Document, with graphical widgets in the UI component, and various processors in the Processing component, and a database system or file system in the Document component.
To be fair, it must also be recognized that there exist UML tools like Together/J that do "round-trip" engineering -- from UML diagrams to source code, and from source code edits back to UML diagrams. There is also a rabid cadre of engineers that use those tools. However, even though the UML tools show the structure of the system, they do not encode all of its details. Given the present state of the art, it would be simply too difficult to encode every line of every method graphically. And if you did, you probably wouldn't be able to live with the result.
If good, hierarchical graphic tools existed, treating programs graphically might be conceivable. However, it is also likely that the complexity of real life programs would make them too difficult to follow in a graphic layout. Still, there is an important potential for graphics. If the design patterns could be selected from a graphics palette, and then instantiated in one's code, it would simplify development tremendously. And if the code linked back to the explanation of those patterns, it would be all the more understandable.
Another alternative recommended by _______ on the developers list was that of leaving the source code intact and making an XML structure that consists of pointers to the source code. That approach lets developers continue to use their plain text editors, while providing some of the benefits of hierarchical structuring and linking.
That approach is being used, in fact, by current tools that convert source code to HTML documents. After translation, program elements become links. So, when a method is invoked, the method name links to that method. The link can then be traversed to see the code and comments for that method -- in particular, to see the required parameters, along with their definitions and datatypes. Similarly, a variable can link to the place where it was defined, along with the comments that explain what it is for.
That such tools are viewed as highly useful, I think, points to the poverty of the plain text systems that developers are currently using. Those tools are valuable just because they provide the linking capability that is so egregiously missing from plain text. But by the same token, they do not provide the benefits of hierarchical structuring.
Using external pointers in an XML file would provide the same benefits as translating to HTML, with the addition of hierarchical viewing capabilities. That mechanism would, as a result, represent an incremental advantage over current systems. With a really good XML editor (and such editors are becoming increasingly available), the developer would be able to collapse and expand sections, and to browse a more "literate" version of the code. However, that proposal causes serious difficulties with respect to editing.
The first, most obvious disadvantage is that existing XML editors would be useless. Changing the XML would have no effect on the underlying source, so making changes in a normal XML editor would be pointless. That means a custom editor would be required. However, that editor would be doubly complicated. It would not only have to make changes in the XML structures, it would have to replicate those changes in the text version of the document. So, while such a system would improve one's ability to view source code, it would not constitute the "next generation" integrated editor/browser advance that will make it possible to develop code more efficiently.
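Before leaving this alternative, the pointer mechanism itself can be sketched. The following Python fragment invents a minimal <map>/<ref> pointer format -- the element and attribute names are illustrative, not taken from any actual tool -- and resolves the pointers against untouched plain text source:

```python
import xml.etree.ElementTree as ET

# Hypothetical pointer format: the XML holds no source text at all,
# only line-number references into the unmodified plain text file.
POINTER_XML = """
<map src="MyClass.java">
  <ref start="1" end="1" title="class MyClass"/>
  <ref start="2" end="4" title="myFirstMethod"/>
</map>
"""

SOURCE = """public class MyClass {
    public String myFirstMethod() {
        return "first";
    }
}
""".splitlines()

def resolve(ref):
    """Pull the lines a <ref> element points at out of the plain source."""
    start, end = int(ref.get("start")), int(ref.get("end"))
    return SOURCE[start - 1:end]

root = ET.fromstring(POINTER_XML)
for ref in root.findall("ref"):
    print(ref.get("title"), "->", len(resolve(ref)), "line(s)")
```

Note that the resolution is one-way: the XML can display the source, but edits made to the XML view have nowhere to go, which is exactly the editing difficulty discussed above.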
While defining a generic "uberlanguage" that could be translated into Lisp, Smalltalk, Python, or Java is clearly infeasible, if not computationally unsolvable, the folks at the Source Development System (sds.sourceforge.com) have come up with an interesting approach. They have defined a generic DTD (Document Type Definition) for a family of similar procedural languages, including Java, C, and Python.
Around that definition, they are building and/or planning a whole suite of development tools, including compilers, debuggers, syntax checking tools, pretty printers, and documentation generators. However, as valuable as that tool is for automated processing, I suspect that it poses some problems, as well -- mostly with respect to editing.
One problem with that standard (for editing, not for any other purpose) is that it appears to throw away the extra spaces and newlines that add to readability. "Pretty printers" can make the newlines appear in some consistent manner, and it would be possible for a "pretty printing" (style-controllable) editor to do so, as well. However, spaces that were added in order to make variable names and comments on them line up, for example, would disappear.
Note:
The desire to dictate style, for example with respect to spacing, lies in direct contrast to the desire to make the style viewer-controlled, as for example with indentation and line breaks. There is a tension between these two requirements that must be taken into account in the final design of the system. Possibilities for resolving the tension include separating the two concerns (line breaks and indentation controlled by user, extra spaces controlled by author), or creating even more intelligent display options. For example: "line up variables and comments on adjacent lines, when doing so will keep the results on a single-displayable line" and "when wrapping an assignment statement onto multiple lines, indent successive lines so that they start to the right of the assignment symbol (an equals sign, in Java)".
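To make the first of those display options concrete, here is one way the "line up variables and comments" rule might be sketched in Python. The function is hypothetical; a real editor would apply such a rule at display time rather than rewriting the stored text:

```python
def align_comments(lines, marker="//"):
    """Display-rule sketch: line up trailing comments on adjacent lines."""
    split = []
    for line in lines:
        head, _, tail = line.partition(marker)
        split.append((head.rstrip(), tail))
    # Widest code portion among lines that actually carry a comment.
    width = max((len(head) for head, tail in split if tail), default=0)
    out = []
    for head, tail in split:
        if tail:
            out.append(head.ljust(width + 1) + marker + tail)
        else:
            out.append(head)
    return out

lines = [
    "int count = 0; // number of items",
    "String name; // display name",
]
for line in align_comments(lines):
    print(line)
```

The point of the sketch is that alignment becomes a viewer-side computation, so the stored XML need not preserve runs of spaces at all.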
But if we assume that display problems are solvable, or at least livable, the editing problems still remain. The major problems stem from the need to continually specify the element type when adding statements to the program. You could select them from a palette, but continually moving the cursor to get them is going to be a drag. Or you could right click and select from a list, but that is still a lot of cursor movement for every single statement in a program. Alternatively, you might have control-key combinations to select elements. But that makes a lot of control-key combinations to memorize. Besides, isn't it easier to type "if" than to hit "ctrl+I"?
Note:
One interesting solution to this problem is for newly added lines to always default to some generic element, say <node>. That element might then be changed by the editor depending on what the user types. Blank lines, comments, and language elements would be recognized, but mis-typing a language element could be identified immediately. However, here again we are talking about the need for a language-specific editor. Generic XML editors would be of no use.
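That recognition step might be sketched as follows. The keyword list and classification rules here are illustrative only, not a full Java grammar:

```python
# Hypothetical editor rule: every new line starts life as a generic
# <node>; the editor re-tags it from the first token the user types.
JAVA_KEYWORDS = {"if", "else", "for", "while", "try", "catch", "return"}

def classify(line):
    """Guess an element type for a freshly typed line."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("//") or stripped.startswith("/*"):
        return "comment"
    first = stripped.split(None, 1)[0]
    token = first.split("(")[0]  # so "if(x)" still classifies as "if"
    return token if token in JAVA_KEYWORDS else "node"

print(classify("if (x > 0) {"))  # recognized language element
print(classify("x = y + 1;"))    # stays a generic node
```

A real editor would also flag near-misses (say, "esle") immediately, which is the error-catching benefit described above.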
In addition, DTD-directed editors may disallow intermediate invalid states. That makes it more difficult to move things around and insert things in the order you think of them, as opposed to the order the program needs them in. Many a syntax-editor has fallen into disrepute because it did not allow the kinds of invalid states you typically move through when editing a program. You want to find out about them before you finish, but you don't want to break your train of thought to accommodate the editor during the writing process. (A DTD-directed editor that did its syntax checks at the end might solve those problems, but it's not clear how many do, or will, operate in that manner.)
A more serious problem with using a generic specification for editing is that it may allow one to express statements that either cannot be translated into the current language, or cannot be done so efficiently. Significantly, even the CSF format at sds.sourceforge.com expects to receive source code input in plain text files. It does not appear to be intended as an editing format. So, even though an existing Java program or Python program can be nicely expressed in that format, when you turn it around and go the other way, you may run into problems.
When you go from plain source to CSF, the plain source is already a legal program. So, if the CSF format is a union of Java and Python constructs, the result of translating a Java program into CSF would only contain Java constructs. It would therefore translate back nicely. But if you edited a program using that DTD, you might add Python-based constructs to the program. Those constructs might not translate at all (although one hopes that CSF's developers have made sure that they do), or else they may represent a construct which is easily expressed in Python, but which does not map into Java code nicely. The result could be a program that performs inefficiently, or which is much harder to read in plain text form, for someone accustomed to Java idioms.
In summary, CSF appears highly beneficial for automated processing. But the jury is still out on the mechanics of editing. Even if the structural problems can be solved, there is still the matter of usability and programmer acceptance. Over time, it is possible that all of the problems will be solved. For the next five or six years, I suspect that a "plain encoding" that looks more like a standard outliner will have greater appeal when it comes to editing.
Rather than defining a generic language, one might choose elements that have a one-for-one correspondence with structures in a chosen language. For example, the <if> tag would encode Java's if statement, the <catch> tag would encode an exception-handling block, etc. This approach would make it impossible to define programs that were either impossible to translate, or impossible to translate efficiently. However, it would suffer from all of the other problems attendant upon a language-based encoding.
Even if this approach were desirable, however, the existence of the Common Source Format makes it moot. The few benefits that would be derived from a single-language encoding pale beside the benefits to be derived from using the existing standard. Using CSF makes a lot more sense. In addition to saving the time and work necessary to define the vocabulary you need, using CSF makes it possible to utilize any editors or other tools that are built around that standard.
The alternative to using a language-specific or even a generic-language encoding is to use one that is language-neutral. If the document contains only <node> elements, for example, the DTD becomes the picture of simplicity -- at least conceptually. Although it will become more complex as the issues discussed in the next section are addressed, it will still be many times simpler than a language-oriented DTD.
In effect, such an encoding uses XML to replicate the outliner utilities that were somewhat popular in the mid-eighties. But XML adds the capability for links and attributes that the structured encoding needs to interact well with utilities that are driven by plain text. For example, compilers and programs currently produce errors that give line numbers. Eventually, it would be nice to see them converted so they provide XML pointers that could be clicked to go directly to the source. But in the meantime, it will be necessary for the editors to provide "go to line" functions that can be used in place of links. Those line numbers will need to be stored as attributes in the XML structure, or else calculated on the fly in a way that accounts for multiple-line wrapping when the plain text version is generated from the XML.
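A sketch of that line-number bookkeeping follows. The "line" attribute name is hypothetical, and the sketch ignores the complication of multiple-line wrapping mentioned above:

```python
import xml.etree.ElementTree as ET

def number_nodes(elem, counter=None):
    """Stamp each node with the line number it would occupy in the
    generated plain text view (one line per node; wrapping ignored)."""
    if counter is None:
        counter = [0]
    counter[0] += 1
    elem.set("line", str(counter[0]))
    for child in elem.findall("node"):
        number_nodes(child, counter)
    return elem

doc = ET.fromstring(
    "<node><content>class A {</content>"
    "<node><content>int x;</content></node>"
    "<node><content>}</content></node></node>"
)
number_nodes(doc)

# A "go to line" lookup for a compiler error reported on line 2:
hits = [n for n in doc.iter("node") if n.get("line") == "2"]
print(hits[0].find("content").text)
```

With the numbers stored as attributes, mapping a compiler's "error on line 2" back to the right node is a simple lookup.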
Using such a "plain" encoding makes it possible to use standard XML editors on the source. That makes it possible for others to read the code (and add comments, for example), without requiring a custom editor to do it. (An editor that understands line numbers will still be needed to translate the line numbers on compilation and runtime error messages, but that is a fairly trivial hack.)
Such an encoding will also feel the most comfortable to current-day hackers. The editor will already be introducing new hierarchical display and manipulation capabilities that will take some getting used to. Plus, syntactic elements like braces and semicolons will have disappeared. At least the programmer will still be able to type "if" and "else" to enter statements!
So a plain encoding seems to be the most desirable. For Python, it really seems like the way to go. For Java, though, one more issue remains: Is a special element type needed for Javadoc comments? (Javadoc comments start with /** instead of /*. They are processed by the Javadoc program to generate API documentation.) That question will be taken up at the end of the next section, which covers encoding issues.

These are the major issues that must be taken into account when encoding a source language in XML.
As described in the previous section, under the headings of "Preserving spacing" and "Comment styles", XML has two major shortcomings that make the process of source encoding more difficult:
The second problem in particular affects every XML document -- not just source code documents. This section examines those limitations in a bit more detail.
As we saw earlier, the need to handle special characters, as well as to preserve spaces and line breaks, implies the need for a continuous series of CDATA sections throughout the document. Virtually every element will have a CDATA section, so the section-delimiting tags <![CDATA[...]]> will appear over and over again in the XML structure. That will turn the XML structures into something you don't really want to edit by hand, although you could if you needed to.
One possible solution would allow the DTD or schema specification to declare "this element always contains unparsed character data (CDATA)". The parser would then proceed to ignore any and all special characters, and pass on any line breaks, until it saw the exact sequence of characters necessary to terminate that element.
The problem arises, of course, that you may want to discuss "</node>" inside of a node element, without terminating that node. Possible solutions to that problem include escaping the / character, as <//node> or <&slsh;node>. However, since those instances would be exceedingly rare, the problem would not arise very frequently.
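A sketch of how such an "automatic CDATA" scanner might behave, using the hypothetical <//node> escape just described:

```python
def read_cdata(text, tag="node"):
    """Consume raw character data up to the exact </tag> terminator.

    The hypothetical escape <//tag> stands for a literal </tag>
    inside the content, so it does not end the element.
    """
    close, escape = "</%s>" % tag, "<//%s>" % tag
    out, i = [], 0
    while i < len(text):
        if text.startswith(escape, i):
            out.append(close)              # unescape to the literal close tag
            i += len(escape)
        elif text.startswith(close, i):
            return "".join(out), i + len(close)
        else:
            out.append(text[i])
            i += 1
    raise ValueError("unterminated element")

content, end = read_cdata("discussing <//node> is safe</node>trailing")
print(content)
```

Every other character, including "<", "&", and line breaks, passes straight through, which is exactly what source code needs.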
A bigger problem concerns the interaction between the "automatic CDATA" mechanism and the solution to the problem described in the next section. First, let's look at that problem...
As we saw in the section on comment styles, encoding source code in XML requires both <content> and <structure> elements in every node. The reason: There is no other way to make sure that no text occurs in what should otherwise contain only structure tags.
Background:
XML's "mixed content model" allows text and tags to be mixed. That's swell for a paragraph. It means that bold and italic tags can be mixed in with the text. The inverse is also true: It means that text can occur between tags in the file. So the structure can be seen as <b>...</b> ...some text here... <i>...</i> -- that is, as a structure containing two elements that have text between them.
While that arrangement makes perfect sense in a paragraph, it doesn't make any sense in a list. So what would this mean:
<li>...</li>Some text here...<li>...</li>
That text is obviously not part of any list item, so it makes no sense.
In XML, you can't allow any text in an element without allowing it everywhere in that element. If you allow text to occur at the beginning of an element, you have to allow it between any of the elements in that structure. Again for tags like <b>...</b>, that makes sense. But for tags like <li>...</li>, it doesn't. The difference between those two kinds of tags is the difference between content and structure. (In XHTML and DocBook, content tags are defined as inline tags. However, that distinction means nothing to the parser.)
Even without considering content tags at all, we still saw the need for distinguishing content from structure when we considered the "//" comment. The content of the element is the text that comes after it. The structure of that element consists of the programming language statements it contains.
However, in XML, if we were to define a single <node> that could contain both text and other <node> elements, then the "mixed data" specification would allow text to be freely intermixed between the subnodes. And that would not be a legal program! The only way to get around that problem is to introduce a <content> element under <node>. As we have seen, though, that makes editing more difficult. A straightforward display of the XML includes all the textless <node> elements, while a more intelligent display complicates the editor and requires additional style controls.
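Python's ElementTree makes the problem easy to see: text that floats between child elements shows up as each child's .tail. A small sketch, distinguishing legitimate content (text before the first child) from the stray text the mixed content model permits:

```python
import xml.etree.ElementTree as ET

def split_content(elem):
    """Treat text before the first child as the element's content;
    text between children (each child's .tail) is the stray text
    that XML's mixed content model permits but a program cannot contain."""
    content = (elem.text or "").strip()
    stray = [c.tail.strip() for c in elem if c.tail and c.tail.strip()]
    return content, stray

good = ET.fromstring("<node><node>a = 1;</node><node>b = 2;</node></node>")
bad = ET.fromstring("<node><node>a = 1;</node>oops<node>b = 2;</node></node>")
print(split_content(good), split_content(bad))
```

A DTD cannot express that distinction, which is why the extra <content> element becomes necessary.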
The fact is, the problem affects every document, not just source code. Consider this document, for example. A heading consists of a text, followed by substructure elements like <p> or subheadings. Text certainly does not occur between the substructure elements -- only before those elements. (A heading may also contain various "inline" tags like <i> or links. So, to be complete, the concept of content needs to include those tags as well as text.)
DocBook is the SGML standard for defining books, articles, magazines, journals, and most any other kind of document you can name. The SGML (and XML) versions of DocBook faced the same problem, which they solved in the same way. Each section tag (<sect1>, <sect2>, and so on), for example, contains a <title> tag that holds the content of that heading.
But adding an extra tag to solve the dilemma is not, in my opinion, the ideal solution. With more XML editors coming online every day, there is a real chance to turn structured XML data into the ubiquitous data/text format -- something that replaces plain text the same way that plain text replaced punched cards and plug boards. But one of the things that makes plain text so ubiquitous is that it is so easy to view and edit, with any number of tools designed for that purpose.
For XML to achieve the same level of ubiquity, editors and viewing tools have to be as readily available as their plain text cousins. If XML had the capability to declare as part of the DTD or schema that particular tags were inline, or content tags, in a way that a validating parser could verify, then XML might just achieve that ubiquity.
With such a mechanism, the difference between content and structure would be readily discernible. And then any XML editor could display the data and interact with it intelligently -- it would only need to distinguish content elements from structure elements to do the right thing.
Note:
Other alternatives include adding an attribute to each element definition, and adding that attribute to each and every data element in the file -- but that is an awful lot of extra work for something that could easily be specified in the schema. Another alternative is to make all content an "attribute" of an element. The current XML specification does not allow that, however. The next version of XML apparently will, though. The XML that results, of course, may be the ugliest thing yet -- but at least the problem will be solvable in a way that gives any unspecialized editor a chance of doing the right thing. (At the moment, most do not handle attributes nicely. But if a standard attribute like "content" or "text" were defined, perhaps they would do better.)
An interesting problem occurs when we try to solve both the CDATA and Content/Structure problems at the same time. The CDATA solution implies that once <node> is seen, it is only terminated by </node>. But the XML document contains multiple nested <node> elements. Meanwhile, the Content/Structure dilemma implies that the text of a section terminates when the first <node> (or other structure element) is seen.
Taking those two in combination, therefore, implies that the CDATA part of a <node> would have to be terminated by </node> or any structure element defined in the document schema. (In our case, that's just another <node>. But for a general XML solution, those could be structure elements like <h2> or <ol> in an XHTML document, or <size> and <color> in an order-entry document.)
The impact of trying to combine both solutions at one time, then, means that many more tags besides </node> would have to be escaped in order to carry on a discussion about them. And that would lead to many more escapes than the original CDATA solution suggested. It may therefore be unwise to attempt both solutions at the same time. (The need for additional escapes arises regardless of whether the content exists as text under the element, or as an attribute of it.)
Of the two, the most pressing problem is the one that interferes with ubiquitous, intelligent editing of XML documents. That is the need for distinguishing content from structure in some standard way. Given that, the CDATA issue can be lived with for special cases like source code. In most other cases, it's not that big an issue.
The use of XML for encoding source language statements would be highly desirable. Using a "plain" encoding seems to be the most desirable format for editing, with a generic format like CSF coming in a close second -- and possibly (but not necessarily) overtaking it in the long term. Barring improvements in the XML standard itself, the desired structure looks like this:
<node>
  <content><![CDATA[...]]></content>
  ...nodes...
</node>
In DTD parlance, the definition calls for an optional <content> element (or it could be required, but empty) and zero or more <node> elements:
<!ELEMENT NODE (CONTENT?, NODE*)>
<!ELEMENT CONTENT (#PCDATA | %inline;)*>
where %inline; is a parameter entity containing the inline tags defined in the XHTML DTD. (The inline tags won't really be needed as long as Javadoc comments are treated as CDATA sections, but could come in handy later on if more interesting structures are allowed.)
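As a sanity check on that structure, here is a minimal Python sketch that parses a document in the proposed format (with plain text standing in for the CDATA sections, for brevity) and regenerates an indented plain text view of the source:

```python
import xml.etree.ElementTree as ET

# The structure proposed above: an optional <content>, followed by
# zero or more child <node> elements.
DOC = """<node>
  <content>public class MyClass {</content>
  <node><content>int count;</content></node>
  <node>
    <content>public int get() {</content>
    <node><content>return count;</content></node>
  </node>
</node>"""

def emit(elem, depth=0):
    """Regenerate an indented plain text view from the XML structure."""
    lines = []
    content = elem.find("content")
    if content is not None and content.text:
        lines.append("    " * depth + content.text)
    for child in elem.findall("node"):
        lines.extend(emit(child, depth + 1))
    return lines

print("\n".join(emit(ET.fromstring(DOC))))
```

Note that closing braces and semicolons are absent from the stored form; in a full implementation the plain-text generator would reinsert them from the nesting structure.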
This archive was generated by hypermail 2b29 : Fri Jun 23 2000 - 20:26:31 PDT