[unrev-II] Collab Doc Rqmts, v0.6

From: Eric Armstrong (eric.armstrong@eng.sun.com)
Date: Mon May 08 2000 - 15:32:52 PDT

    Requirements for a collaborative-document system.
    (Approx. equiv. to Open HyperDocument System.)

    Version History
    0.6 "Reusable" requirement added after "Hierarchical"
    0.5 "Partitionable" requirement added under "general system rqmts"
    0.4 Use Case Scenarios
    0.3 Formatting, two additions
    0.2 Refinements
    0.1 Initial Version

    This is a lengthy document aimed at adducing the requirements for a
    subset of an eventual Dynamic Knowledge Repository (DKR). The subset
    described is for a collaborative document system, which Doug describes
    as an "Open HyperDocument System" (OHS). The goal of this document is to
    show how such a system fits into a DKR framework, detail its
    requirements, and point to a couple of extensions that move it in the
    direction of a full DKR.

    This document has the following sections:
      * Long-Range Goals
      * Motivation
      * Starting Points
      * General Characteristics
      * Outline of Operational Requirements
      * Summary of Data Structure Requirements
      * Use Case Scenarios
      * Future: Using an Abstract Knowledge Representation

      * v0.2 Thoughts from UnRev-II discussions and other additions
      * v0.1 First Draft

    Long-Range Goals
    A fully functional DKR will need to manage many different kinds of
       * documents
       * abstract knowledge representations
         (and inference engines)
       * predictive models
       * multimedia objects
       * programs of various kinds
         (search engines, simulations, applets)
       * data
         (spreadsheet files, database tables)

    It is likely, too, that different kinds of problems will require
    information to be organized in fundamentally different ways. For
    example, a DKR devoted to the energy problem might have major headings
    for the problem statement, real world data, tactical possibilities,
    strategic alternatives, and predictive models. On the other hand, a DKR
    devoted to building the next-generation DKR might have sections for
    requirements, design, implementation, testing, bug reports, suggestions,
    schedules, and future plans.

    Since the general outline of a DKR seems to depend on the problem domain
    it is targeted for, it seems reasonable to focus attention on the
    elements they have in common.

    This set of requirements will focus on what is perhaps the major common
    feature: Documents -- in particular, Collaborative Documents, and the
    need to interact via email to construct them.

    Other important areas that will need attention include the integration
    of multimedia objects (including animations, simulations, audio, video,
    and the like) as well as the critical functions of abstract knowledge
    representation, inference engines, model-building functions, and the
    integration of other executable programs. But here, we'll focus on
    Collaborative Documents.

    A wide variety of email and forum-based discussions occur on a host of
    topics every day. In each of these discussions, important information
    frequently surfaces, but that information is hard to capture where you
    need it.

    Document production systems, on the other hand, simplify the task of
    creating complex documents but make it hard to gather and integrate
    feedback.

    For example, the DKR discussions have identified several possible
    starting points for such a system. That kind of feedback occurs
    naturally in an email system, as opposed to a document production
    system, but each of the pointers was buried in a separate email. It
    required a lengthy search to gather them together (below), and the list
    may not even be complete!

    To act as a foundation for a DKR, a Collaborative Document System (CDS?)
    needs to combine the best features of:
      * Directory tree / outlining programs
      * Hypertext (links and formatting)
      * XML (inline references and other features)
      * Email systems
      * Forums and Email Archives
      * Document Database
      * Versioning Systems
      * Difference Engines
      * Search Engines

    Starting Points
    In the DKR discussion, we've seen pointers to several possible starting
    points for such a system. Those are contained in the References post, in
    the Bootstrap section. (The many possible starting points listed in the
    post desperately need short synopses and evaluations.)

    General Characteristics
    The lengthy list of starting points, the difficulty of creating it, and
    the rapidity with which it goes out of date, combine to suggest several
    obvious requirements for the system: It needs to be composed of
    information nodes that are hierarchical, mailable, linkable, and
    evaluable (more on those subjects in a moment).

    Each of those requirements leads in turn to other requirements. The
    major requirements are listed here and explained below:

    General Functional Requirements
      * Hierarchical
      * Reusable
      * Revisable
      * Versionable
      * Mailable
      * Multiple-Containment
      * Distributed
      * Administratable
      * Differencable
      * Linkable
      * Categorizable
      * Queryable
      * Evaluable
      * Collaborative
      * Attributive
      * Accelerative

    General Systemic Requirements:
      * Open
      * Extensible
      * Secure

    DKR Requirements
      * Firewalled
      * Didactic (a teaching device)

    The next three sections discuss those requirements in greater detail.
    Following that, there are three shorter sections:

      * Operational Requirements -- Highlights
      * Data Structure Requirements
      * Future: Using an Abstract Knowledge Representation

    General Functional Requirements
    These are the general requirements for how the system must operate, to
    be effective.

    This document, like the list of starting points mentioned earlier, is
    heavily hierarchical in nature -- as are most technical documents. These
    facts further underscore the need for a hierarchical system.

    For example, this email message should exist in outline form. It should
    be easy to add and remove entries to various sections: for example, the
    list of starting points given above.

    However, the hierarchy should function using XML-style "entity
    references" that copy the target contents into the displayed document,
    "inline". That permits multiple references to the same node. The result
    is effectively a lattice of information nodes, where any one view of it
    is hierarchical.

    To be strictly correct, the underlying data structure will be a directed
    graph. In reality, it will be bidirectional, and it will typically turn
    out to contain cycles. Although it would be nice to avoid that, it is
    probably unavoidable.

    The "network" nature of the graph results from the property that allows
    a document-segment (node or tree) to be used in multiple places. In each
    "document" that makes such an access, however, the view is hierarchical.
    The hierarchy is a view of the graph, and a "document" is really a
    structured collection of nodes from the data base.

    Unlike HTML, where references to other documents occur only through links,
    references to other nodes and trees in this system will typically occur
    as "includes". The effect of the inclusions will be to make the material
    appear inline, as though it were part of the original document.
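
    As a rough illustration of "includes", here is a minimal sketch in
    which a node holds the IDs of its children rather than copies, so one
    node can appear in several documents while each rendered view stays
    hierarchical. The names (Node, render, store) are hypothetical, and the
    cycle guard is only one assumed way to cope with the loops mentioned
    above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One information node; `children` holds node IDs, not copies."""
    node_id: str
    text: str
    children: list[str] = field(default_factory=list)


def render(store: dict[str, Node], root_id: str, depth: int = 0,
           path: tuple[str, ...] = ()) -> list[str]:
    """Expand includes inline, giving one hierarchical view of the graph."""
    if root_id in path:  # guard against cycles in the underlying graph
        return ["  " * depth + f"(cycle: {root_id})"]
    node = store[root_id]
    lines = ["  " * depth + node.text]
    for child_id in node.children:
        lines += render(store, child_id, depth + 1, path + (root_id,))
    return lines


# Two "documents" include the same glossary node.
store = {
    "reqs": Node("reqs", "Requirements", ["shared"]),
    "design": Node("design", "Design", ["shared"]),
    "shared": Node("shared", "Glossary"),
}
```

    Rendering "reqs" and "design" each shows the glossary inline, yet only
    one glossary node exists in the store.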

    Although "hard" links to objects will be needed at times, in most cases
    the link to the "Requirements Document" should be a "soft" link -- that
    is, an indirect link that points to the latest version. That means never
    having to worry about looking at an old version of the spec.

    Each node in the hierarchy needs to be versioned, so that previous
    information is available. In addition, the task of displaying
    differences becomes essentially trivial.
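
    The difference between "hard" and "soft" links can be sketched as
    follows, assuming each node simply keeps a list of its versions. The
    class and method names are made up for illustration; nothing here is a
    settled design.

```python
class VersionedNode:
    """Keeps every version of a node's text; version numbers start at 1."""

    def __init__(self, text: str) -> None:
        self._versions = [text]

    def revise(self, text: str) -> int:
        """Store a new version and return its version number."""
        self._versions.append(text)
        return len(self._versions)

    def hard(self, version: int) -> str:
        """A "hard" link: always resolves to the same fixed version."""
        return self._versions[version - 1]

    def soft(self) -> str:
        """A "soft" link: resolves to whatever version is latest right now."""
        return self._versions[-1]


spec = VersionedNode("Nodes are hierarchical.")
spec.revise("Nodes are hierarchical and versioned.")
```

    A reader following the soft link sees the revised text, while a hard
    link to version 1 still returns the original wording.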

    It must be possible to "publish" the whole document or sections of it by
    "posting" it. It must also be possible to create replies for individual
    sections, and then "post" them all at one time.

    At a minimum, every node in the system has two hierarchies descending
    from it. One is a list of content nodes that comprise the hierarchical
    document. The other is a list of reviewer comments. (Some comments will
    be specific to the information in that node, others will be intended as
    general comments for that section of the document.)

    Other sub-element lists may be found to be desirable in the future, so the
    system should be "open-ended" in allowing other sublists to be added,
    identified, and accessed.

    Rather than using a central "repository", the system should employ the
    major strengths of email systems, namely: fast access on local systems
    and the robust nature of the system as a result of having redundant
    copies on many different systems. The system will be more space
    intensive than email systems, but storage costs are dropping
    precipitously, and future technologies paint an even brighter picture.

    To mitigate the short-term need for storage space, it should be possible
    to set individual storage policies. For example, a user will most likely
    not want to keep previous versions of any documents they are not
    personally involved in authoring.

    It must also be possible to add names to the authoring list. Name
    removal should probably be limited to the original author. For those
    cases when the original author is no longer part of the system, it
    should be possible to make a copy of the document and name a new primary
    author.

    When a new version of a document arrives, differences are highlighted.
    Old-version information becomes accessible through links (if saved).
    Differences are always against the last version that was visited. If a
    section of the document was never visited, the most recent version of
    that section is displayed on the first visit. If several iterations have
    taken place since the last visit, the cumulative differences are shown.
    (Again, node-versioning makes this user-friendly feature fairly
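
    One way the "diff against the last visited version" rule might look,
    sketched with Python's standard difflib. ReaderState and visit are
    hypothetical names, and a real system would presumably store this
    state per user and per node.

```python
import difflib


class ReaderState:
    """Remembers, per node, which version this reader last visited."""

    def __init__(self) -> None:
        self.last_seen: dict[str, int] = {}


def visit(state: ReaderState, node_id: str, versions: list[str]) -> list[str]:
    """Return the lines changed since the last visit (empty on first visit)."""
    latest = len(versions) - 1
    seen = state.last_seen.get(node_id)
    state.last_seen[node_id] = latest
    if seen is None:
        return []  # never visited: just show the most recent version
    old = versions[seen].splitlines()
    new = versions[latest].splitlines()
    # ndiff marks removed lines with "- " and added lines with "+ ".
    return [line for line in difflib.ndiff(old, new)
            if line.startswith(("- ", "+ "))]
```

    Because the comparison runs from the last visited version straight to
    the latest one, intermediate iterations collapse into a single
    cumulative difference.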

      Starting Points
      * XMLTreeDiff at IBM Alphaworks (Lars Martin)

    Clearly, support for web links is desirable, as shown by the links to the
    various possible starting points in the References post. [Note: Each of
    those should be evaluated against this requirements list, and used to
    modify these requirements.]

    Indirect links are needed, both to link to a list of related nodes, and
    to link to the latest version of a node.

    It must be possible to categorize nodes (and possibly links). For
    IBIS-style discussions, for example, node types include (at a minimum)
    question, alternative, pro, con, endorsement, and decision.

    For material that is included "in line" in the original document, typing
    implies the ability to choose which kinds of linked-information to
    include. For example, in addition to the current version, one might
    choose to display previous versions and/or all commentary.

    For material that is displayed in separate windows, typing allows the
    secondary windows to automatically display material of a given type.
    (For example, in Rod Welch's "contract alignment" example, the secondary
    window might automatically display the meeting minutes that are linked
    to particular phrases in a contract. Lines might be automatically drawn
    from sections of the minutes to sections of the contract. Other links in
    the documents, however, would be ignored.)

    It should be possible to construct an initial design document using
    queries of the form "give me all design notes corresponding to the
    features we decided to implement in the current version of the
    functional specification."
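
    To make the categorization and query ideas concrete, here is a small
    sketch of typed nodes and one query over them. The six IBIS types come
    from the text above; the FEATURE and DESIGN_NOTE types, the `accepted`
    flag, and the function name are assumptions added only for this
    example.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    QUESTION = "question"
    ALTERNATIVE = "alternative"
    PRO = "pro"
    CON = "con"
    ENDORSEMENT = "endorsement"
    DECISION = "decision"
    FEATURE = "feature"          # hypothetical, not part of the IBIS list
    DESIGN_NOTE = "design note"  # hypothetical, not part of the IBIS list


@dataclass
class TypedNode:
    node_id: str
    node_type: NodeType
    text: str
    links: set[str] = field(default_factory=set)  # IDs of related nodes
    accepted: bool = False  # hypothetical: feature chosen for this version


def design_notes_for_accepted(nodes: list[TypedNode]) -> list[TypedNode]:
    """All design notes linked to a feature that was accepted."""
    accepted = {n.node_id for n in nodes
                if n.node_type is NodeType.FEATURE and n.accepted}
    return [n for n in nodes
            if n.node_type is NodeType.DESIGN_NOTE and n.links & accepted]
```

    The query is just a filter on node type plus a link test, which
    suggests that typed nodes and typed links together cover a good share
    of the "give me all ..." requests.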

    The many possible starting points in the References list highlight the
    need for evaluability. It should be possible, not only to reply with a
    comment on any item in those lists, but also to add an evaluation, much
    as Amazon.com keeps evaluations for books. That feature is arguably
    their greatest contribution to ecommerce, and the DKR should make use of
    it. It should also be possible to order list items using relative
    evaluations. That lets the most promising starting point float to the
    top of the list.

    Not all lists should be ordered by evaluation, however. For example, the
    sequence of requirements has been chosen to provide the most natural
    "bridge" from one to the next. So evaluation-ordering must be an option.

    Ideally, it should also be possible to "weight" an evaluation, perhaps
    by adding a "yay" or "nay" to an existing evaluation.

    When displaying an evaluation, where evaluators can choose a value from
    1..5, it might make sense to display the average, the number of
    evaluations, and the distribution. A distribution like
      10 2 1 2 10
    for example, would show a highly polarized response, even though the
    "average" was 3.

      Starting Points
      * Architecture for Internet searching, categorization, and ranking
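
    The evaluation display described above (average, number of
    evaluations, distribution) can be worked through in a few lines. The
    "share at the extremes" figure is an assumed, very crude polarization
    indicator, not something the requirements call for.

```python
def summarize(distribution: list[int]) -> tuple[float, int, float]:
    """distribution[i] is the number of evaluators who chose value i + 1.

    Returns (average, number of evaluations, share of votes at the extremes).
    """
    count = sum(distribution)
    if count == 0:
        raise ValueError("no evaluations yet")
    total = sum(value * n for value, n in enumerate(distribution, start=1))
    extremes = (distribution[0] + distribution[-1]) / count
    return total / count, count, extremes


average, count, extremes = summarize([10, 2, 1, 2, 10])
print(average, count, extremes)  # 3.0 25 0.8
```

    For the polarized example, the average of 3.0 looks neutral, but 80%
    of the 25 evaluators chose either 1 or 5, which is why the
    distribution needs to be shown alongside the average.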

    The system must increase the ability of multiple people, working
    collaboratively, to generate up-to-date and accurate revisions.

    For any given document, there are several classes of interaction:
      * receive
      * comment
      * suggest
      * author

    The first group consists of people who receive the document and do
    nothing else with it. (Just trying to be complete here.) The second
    group consists of people who send back comments on different sections.
    That feedback will typically be used in future versions.

    The third group consists of people who suggest an alternative wording or
    organization. Those "suggestions" take the form of a modified copy of
    the original. One of the document authors may then agree to use that
    formulation in place of the original, or may simply keep it as

    The fourth group consists of the fully-collaborative authoring group. The
    original author must be able to add other individuals to the document,
    or to subsections of it. (An author registered for a given node has
    authoring privileges throughout the hierarchy anchored at that node.)

    Every information node that is created should be automatically
    attributed to its author. When a new version of a node is created, all
    of the people who sent comments should be contained in a "reviewer"
    list. When a suggestion is accepted, the author of the suggested node
    should go into a "contributor" list in the parent node and be added to
    the "author" list for the current node. It should be possible to
    identify all of the reviewers, contributors, and authors for the whole
    document and for each section of it.

    When new versions of a document are created, material would be included
    by pointing to it, keeping attributions intact. The system must
    accelerate that process. It should be possible to start a new document
    in one of two ways:
      * Copy the original document intact to create a new version
        of it. (Deletes and rearrangements then affect the new
        document, while the original version remains intact.)

      * Create a document and designate it as the "target" so that,
        as you review other documents, selecting parts of it and
        issuing the "copy" command automatically stuffs it into the
        target.

    General Systemic Requirements
    These are requirements for the system as a whole.

    The system must be "open" in the sense that a user is not constrained to
    using a particular editor, email system, or central server. The
    specifications for interaction with the system should be freely
    available, along with a reference implementation to use as a basis. As
    much as possible, conformance with existing standards (XML, XHTML, HTTP,
    email) is desirable. (The tricky decisions, of course, will be between
    required features and standard protocols that don't support them.)

    The server and client systems that implement the DKR must also be fully
    *extensible*. In other words, the same characteristics of hierarchy,
    versioning, and revisability (use of most recent version) that apply to
    the documents must apply to the system itself.

    That extensibility can be accomplished with a "dispatch table" that
    names the class to use for each kind of object that needs to be created.
    In conjunction with open sourcing, that architecture allows a user to
    extend (subclass) an existing class and then use the extended version in
    place of the original. In addition, upgrades can occur dynamically,
    while the system is in operation, while allowing for modular downgrades
    when extensions don't work out.
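
    A minimal sketch of the "dispatch table" idea: a table maps each kind
    of object to the class that builds it, and the factory consults the
    table on every call. Node, AuditedNode, and create are hypothetical
    names chosen for the example.

```python
class Node:
    def describe(self) -> str:
        return "plain node"


class AuditedNode(Node):
    """Hypothetical user extension that subclasses the original class."""

    def describe(self) -> str:
        return "audited " + super().describe()


# The dispatch table names the class to use for each kind of object.
dispatch: dict[str, type] = {"node": Node}


def create(kind: str):
    """Look the class up at call time, so table changes take effect live."""
    return dispatch[kind]()


before = create("node")
dispatch["node"] = AuditedNode  # dynamic upgrade while the system runs
after = create("node")
dispatch["node"] = Node         # modular downgrade if the extension fails
restored = create("node")
```

    Objects created before the swap keep their old class, so an upgrade
    affects only new objects unless existing ones are migrated as well.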

       Starting Points
       * Warner Ornstine's Cords/Plugs/Sockets Architecture

    Security in such a system becomes an issue, unfortunately. The system
    should employ whatever mechanisms exist or can be constructed to help
    prevent trojan horse attacks, back door attacks, and other security
    breaches in an open source system.

    For example, Christine Peterson described Apache's process as having
    something like 45 reviewers, 3 of whom recommend the inclusion and none
    of whom object, before new code is added to the system.

    Email is fundamentally the right interface for such a system, because
    information comes to you, the information is organized into threads,
    and you can edit/reply from within the same application you use to
    view the information. (Email's major weaknesses stem from the fact
    that even though the interface is appropriate, the underlying data
    structures are not. But the hierarchy inherent in the specified
    system will rectify those flaws, eliminating the redundancy inherent
    in email responses and allowing for thread-summaries.)

    However, the factor that makes email central to one's daily activities
    is the wide variety of inputs you receive. Email is inherently "project
    neutral". You get email on every topic under the sun, including personal
    and professional interests. It represents "one stop shopping" for your
    information needs. (The Web, on the other hand, provides nicer
    storefronts, but you have to go visit the store to find what you want.)

    In a sense, the "firewall" requirement is in itself a partition. In an
    organization like the Stanford Research Institute (SRI), for example,
    there is a need to create a project-specific partition, so that
    only other members of the project team ever see that information. On
    the other hand, there is a wide area of shared expertise (computer
    expertise, management expertise, administrative expertise) that can
    be shared among all members of the organization.

    In a similar vein, the "email interface model" implies the need for
    multiple partitions -- one for each project or interest area, for
    example. The degree to which you "cross-fertilize" between the
    partitions should then be up to you.

    Looking Ahead: Some DKR Requirements
    These additional requirements begin to move the system towards a DKR.

    With respect to security, there is also the issue of "firewall"
    capability. The DKR must allow professionals in many different
    organizations to contribute and share knowledge. That knowledge may
    largely be in the form of published papers and the means to locate and
    access them, but it represents a high degree of inter-organizational
    co-operation, at the level of the individual professional.

    The DKR will also be handy for individual projects, though. The
    mechanisms will support collaborative designs and "on demand" education
    as to corporate procedures, for example. But that information must
    remain *inside* the firewall, inaccessible to competitors.

    In the ideal scenario, it will also be possible to "publish" information
    stored in the inner repository at strategic times, rather like
    publishing a technical paper that gives the design of the system. But
    until then, the firewall must remain intact.

    Didactic (DKR)
    Eventually, the system must become a *teaching* tool. It must follow the
    concept of "Education on Demand", intelligently supplying the user with
    the information needed, and educating that user, whatever their initial
    background. (Within reasonable limits.)

    Outline of Operational Requirements
    This is an outline of functional operations for the system:

      * Editing
        --Add, change, delete, move nodes
        --Copy nodes
          ..node alone, current-version subtree, whole subtree
        --Link (indirect, "soft" links, and direct "hard" links)
        --Automatic versioning
        --Automatic attribution

      * Email
          ..Increment version number for future edits
          ..Deliver to group via server
          ..Automatically diff against last visited version of
            each node
          ..Highlight diffs
          .."Go to next unread" feature

      * Attribution
        --New node: author=currUser, lastEditor=currUser
        --Copy node: all lists unchanged
        --Modify node: lastEditor=currUser
        --Copy text: new node created, all lists copied
        --Paste text: Author-list + Contributor list from the
                       clipboard node merge into the contributor
                       list for the current node
          This is a highly imperfect solution to the attribution
          problem. Copying a single word out of a very large node
          stands to create a highly-inaccurate contributor list.
          On the other hand, creating a new node and pasting all
          of the text from the old one would drop attributions.
          A better alternative, if feasible, would be attributions
          attached to every phrase in the node. That requirement
          creates a third category of containment for the node,
          consisting of the text that makes it up. When originally
          created, there would only be one long phrase, and its
          author. When others make changes, the text would be
          broken up into segments. That's the same architecture
          most editors use internally, anyway, but it would require
          storing a lot more information, putting it together to
          display the node, and taking it into account when copying
          and pasting.
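
    The node-level attribution rules listed above (new node, paste text)
    might be sketched like this. The class and function names are
    hypothetical, and treating a paste as a modification that updates
    lastEditor is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedNode:
    text: str
    authors: list[str] = field(default_factory=list)
    contributors: list[str] = field(default_factory=list)
    last_editor: str = ""


def new_node(text: str, user: str) -> AttributedNode:
    """New node: author=currUser, lastEditor=currUser."""
    return AttributedNode(text, authors=[user], last_editor=user)


def paste(target: AttributedNode, clip: AttributedNode, user: str) -> None:
    """Paste text: the clipboard node's author and contributor lists
    merge into the contributor list of the current node."""
    target.text += clip.text
    for name in clip.authors + clip.contributors:
        if name not in target.contributors:
            target.contributors.append(name)
    target.last_editor = user  # assumption: a paste counts as a modification
```

    The weakness noted above is visible here: pasting a single word
    carries over the clipboard node's entire author and contributor lists.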

      * Phantom Nodes
        --Since it is possible to receive comments on nodes that
          have been deleted from the current (not yet published)
          draft, the system must maintain "phantom" nodes that
          can be used to collect such comments.
        --Phantom nodes are invisible until a comment is received.
          Theoretically, they can disappear once the current version
          is posted (since future comments will be on that version).
          In practice, though, there The comments
          themselves are always stored under the original node.
        --As an alternative, the system could operate like the CRIT
          system, where such comments go to the end of the document.

      * Trash Bin
        --Each node needs a trash bin that collects nodes which
          are deleted from under it. Trash bins are never emptied,
          except by explicit action requiring multiple explicit
          confirmations.

      * Distributed Editing Control
        --The comment/version-publishing system means that locks
          are not required for single-author documents. But for
          multiple authors to collaborate, it must be possible to
          prevent editing conflicts.
        --One possibility is to implement distributed locks.
          The major issue there is handling communication outages.
        --An equally viable possibility may be to allow
          simultaneous edits and detect their occurrence
          when a new version is received. The competing
          versions can then be displayed side-by-side
          along with user-selectable merge options.
        --Detection of competing versions may require something
          other than simple version numbers. Or perhaps the
          versionID would consist of the version number combined
          with the ID of the current writer.
        --TrashBin nodes must maintain a pointer to the phantom
          that is left behind after deletes, or to the location
          at which to create such a phantom.
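
    Detection of competing versions, using the "version number combined
    with the ID of the current writer" idea, could be as small as the
    sketch below. VersionID and is_competing are made-up names, and the
    rule assumes every writer increments the number of the version they
    started from.

```python
from typing import NamedTuple


class VersionID(NamedTuple):
    number: int  # increases by one with each edit of the node
    editor: str  # ID of the writer who produced this version


def is_competing(local: VersionID, incoming: VersionID) -> bool:
    """The same version number from different editors means both edited
    the same predecessor, so the versions need a side-by-side merge."""
    return local.number == incoming.number and local.editor != incoming.editor
```

    A plain version number cannot tell the two edits apart, which is why
    the editor ID (or something equivalent) has to be part of the
    identifier.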

      * Version Identification
        --A monotonically increasing version#, combined with the
          ID of the most recent editor *should* be sufficient to
          identify changes in a node. It may be that a timestamp
          works better, though. Even a timestamp will need to be
          combined with the most-recent-editor-ID, though, to
          identify competing versions created by different authors.
          (Although matching a millisecond-timestamp is improbable,
          it is not impossible.)
        --The version number for a node would be the maximum of
          the version numbers for all content subnodes. When
          edited, the new version number would either be a timestamp
          or the parent version# + 1. (All parents would then be
        --TimeStamps probably make more sense, since edits using
          the algorithm above will make the version# "jump around"
          quite a bit.
        --In either case, a more "user-friendly" version number is
          needed for the document as a whole.
        --The system needs to account for a "hierarchy of versions"
          of at least two levels. The first level is for a set of
          documents. (All documents for version 2.0 of the system,
          for example.) The second level is the version of the
          document itself. (Version 3 of the 2.0 Requirements Doc).
          (How deep should it go? Large subsections might have
          versions, as well. Below that?)
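
    The "maximum of the subnode versions" rule can be tried out with a
    small sketch. It assumes each node also carries its own counter (so
    that leaves have a version) and reads "parent version# + 1" as the
    version of the enclosing document plus one. Both readings are guesses,
    and VNode and edit are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class VNode:
    own_version: int = 1
    children: list["VNode"] = field(default_factory=list)

    def version(self) -> int:
        """Maximum over this node and all of its content subnodes."""
        return max([self.own_version] + [c.version() for c in self.children])


def edit(root: VNode, target: VNode) -> None:
    """Give the edited node the enclosing document's version# + 1."""
    target.own_version = root.version() + 1


leaf_a, leaf_b = VNode(), VNode()
root = VNode(children=[leaf_a, leaf_b])
edit(root, leaf_b)  # leaf_b -> 2, so the root reports 2
edit(root, leaf_a)  # leaf_a jumps from 1 straight to 3
```

    The second edit shows the "jumping around" effect: leaf_a goes from
    version 1 to version 3 without ever having been version 2.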

    Data Structure Requirements
    Each node in the system should be able to track the following:
      * Unique identifier (so links always work)
      * List of Content subelements
      * List of Comment subelements
      * List of elements comprising the content-text,
        with attributions (if implemented)
      * Version-identifier for the node
      * Version-identifier for the content sublist
      * Author list
      * Contributor list
      * Reviewer list
      * Last editor
      * Evaluation list
      * Evaluation summary
      * Distributed Lock (unless Competing Versions is chosen)
      * Trash Bin
      * isPhantom identifier
      * pointer to own phantom
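
    As a starting point for discussion, the list above maps fairly
    directly onto a record type. This sketch assumes the Competing
    Versions approach (so no lock field), stores evaluations as plain 1..5
    values with the summary computed on demand, and uses field names that
    are only suggestions.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocNode:
    # Unique identifier (so links always work)
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    content: list["DocNode"] = field(default_factory=list)   # content subelements
    comments: list["DocNode"] = field(default_factory=list)  # comment subelements
    # Content text as (text, author) segments, if phrase attribution is used
    segments: list[tuple[str, str]] = field(default_factory=list)
    version: int = 1          # version identifier for the node
    content_version: int = 1  # version identifier for the content sublist
    authors: list[str] = field(default_factory=list)
    contributors: list[str] = field(default_factory=list)
    reviewers: list[str] = field(default_factory=list)
    last_editor: Optional[str] = None
    evaluations: list[int] = field(default_factory=list)
    trash: list["DocNode"] = field(default_factory=list)     # trash bin
    is_phantom: bool = False
    phantom: Optional["DocNode"] = None  # pointer to own phantom

    def evaluation_summary(self) -> Optional[float]:
        """Average evaluation, or None when nobody has evaluated the node."""
        if not self.evaluations:
            return None
        return sum(self.evaluations) / len(self.evaluations)
```

    Running the use case scenarios against a structure like this should
    show quickly which fields are missing or redundant.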

    Use Cases & Scenarios
    After the initial version of the data/object structures
    has been nailed down, they need to be run through a series
    of use case scenarios, with the data manipulations defined
    for each. The goal of the process will be to refine the
    data structures, looking for weaknesses or necessary
    reorganizations. [Note: Some scenarios may need to be
    tabled as unsuitable for the initial system.]

    General Scenarios
      * Software Development discussions and documents
        --IBIS-style discussions
        --Functional Specs-->Design Specs -+
            +-->User Doc'n Source Code <-+
            +-->FAQ Tests --------+
            +---Bug Reports----------------+

      * Strategic Decisions (combinations)
        --multiple possibilities identified (~= alternatives)
        --proposals consist of combinations of possibilities
        --one proposal selected

      * Build a Product/Feature comparison chart
        --Feature rows, product columns
        --Adding a column suggests a new feature, then
          track the "back-gathering" of data on prev. products

      * Build a Requirements/Technology evaluation chart
        --Requirements rows, Technology columns
        --Must-Have, Nice-To-Have, Optional categories
        --Y/N cells &/or evaluation cells
        --Adding a new technology suggests additional
          "must have" feature

      * Project Management
        --implementation checklists & signups
          (track who signed up to do what)

      * Multiple Software Versions
        --Series of tutorial examples
        --Code branches with common elements

      * IBIS-style discussions
        --Add questions, posit alternatives, evaluate & decide
        --Subsume propositions as alternatives under a question

      * Mathematical/Logical Reasoning
        --Assertions, Negations
        --Implications (a->b)
        --Inferences (a->b + b->c + a => c)

    Specific Use Cases
      * Comments
        --Comment on a node
        --Comment on a structure
      * Suggestions
        --Suggest a text revision
        --Suggest a new node
        --Suggest a new structure
        --Accept/reject a suggestion
      * Reduction
        --Edit a copy
        --integrate comments
          ..fold in and remove, or
          ..reject and remove
        --New version replaces old, and links to it.
      * Competing Versions
        --become "siblings"? -- a parent needed
        --Use IBIS model for resolution?
        --Evaluations, leading to eventual selection

    Future: Using an Abstract Knowledge Representation
    A hierarchical system is created from only two relationships:
       * Containment
       * Ordering

    If progress is made in the pursuit of abstract knowledge
    representations, the whole of the collaborative document system
    may well migrate into a knowledge representation, using those two
    relationships. The document management system would then be a subset of
    a much larger knowledge management repository.

    One wonders what such a system will look like after it begins to be
    extended with thousands of additional relationships.

    It boggles the mind.

    Community email addresses:
      Post message: unrev-II@onelist.com
      Subscribe: unrev-II-subscribe@onelist.com
      Unsubscribe: unrev-II-unsubscribe@onelist.com
      List owner: unrev-II-owner@onelist.com


    This archive was generated by hypermail 2b29 : Mon May 08 2000 - 15:40:22 PDT