
Re: [ba-unrev-talk] Instant Ontologies: The Strength of Weak Links

Thanks, Henry.    (01)

The next move is a research project:
   * Come up to speed on the state of the art in network theory,
      while at the same time beginning construction.
   * Use a Google-style web crawler, since it is already doing
      enough parsing to identify links.
   * Then modify the parser to find lists that match the pattern,
      and put together a list of words to ignore in the headings.
   * Finally, put a human in the loop. The human reviews the results and
      says "good" or "no good", modifying the pattern in the process.
      (Interestingly, a human could by such votes program a neural-net
       machine which would become increasingly good at recognizing
       useful patterns -- although we would never have a clue as to
       what it was doing.)    (02)
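
Steps two and three could be roughed out along these lines. This is only
a sketch in Python, not a worked-out design: the class and function names
are made up for illustration, the ignore lists are placeholders that
human review would have to grow, and the rule "every list item carries
exactly one link" is my reading of the pattern, which may well be too
strict for real pages.

```python
from html.parser import HTMLParser

# Hypothetical ignore lists; real ones would grow out of human review.
IGNORED_HEADINGS = {"resources", "for more information", "further reading"}
STOP_WORDS = {"of", "the", "a", "an", "and", "for"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ConceptListFinder(HTMLParser):
    """Collect (heading, [link targets]) pairs for lists that follow a heading."""

    def __init__(self):
        super().__init__()
        self.candidates = []      # (heading text, list of hrefs)
        self._heading = None      # most recent heading text
        self._in_heading = False
        self._buffer = []         # text pieces of the heading being read
        self._items = None        # one href list per <li> of the current list

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading, self._buffer = True, []
        elif tag in ("ul", "ol") and self._heading:
            self._items = []
        elif tag == "li" and self._items is not None:
            self._items.append([])
        elif tag == "a" and self._items:
            href = dict(attrs).get("href")
            if href:
                self._items[-1].append(href)

    def handle_data(self, data):
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._heading = " ".join("".join(self._buffer).split())
        elif tag in ("ul", "ol") and self._items is not None:
            # Keep the list only if every item carries exactly one link.
            if self._items and all(len(links) == 1 for links in self._items):
                hrefs = [links[0] for links in self._items]
                self.candidates.append((self._heading, hrefs))
            self._items = None
            self._heading = None


def concept_name(heading):
    """Return the major words of a heading, or None if it should be ignored."""
    if heading.lower() in IGNORED_HEADINGS:
        return None
    words = [w for w in heading.split() if w.lower() not in STOP_WORDS]
    # The pattern asks for one or two major words.
    return " ".join(words) if 1 <= len(words) <= 2 else None


def find_concepts(html):
    """Map each candidate concept name to the links listed under it."""
    finder = ConceptListFinder()
    finder.feed(html)
    finder.close()
    found = {}
    for heading, links in finder.candidates:
        name = concept_name(heading)
        if name:
            found[name] = links
    return found
```

Feeding a page to find_concepts() gives back a dictionary of candidate
concept names and the links listed under each. The human in the loop
would then vote on those candidates and adjust the ignore lists. Nested
lists and headings built from images are not handled at all.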

Whether or not anything useful would come of it is hard to say.
Whether it would duplicate Cyc, complement Cyc, use Cyc to
help construct its ontology, or go way beyond Cyc, is another
matter that would have to be determined by experience.    (03)

Regardless of its success or failure, it would definitely make an
interesting paper.    (04)

Henry K van Eyken wrote:    (05)

> This is the kind of thinking that could lead to an important tool in the
> construction of Doug's idea of a continually updated encyclopedia or
> handbook.
> What is the next move?
> Henry
> Eric Armstrong wrote:
> > A few ideas rubbed together the other day, and it occurred
> > to me that a web crawler capable of parsing HTML pages to
> > find links already has enough intelligence to begin constructing
> > a first-cut ontology.
> >
> >   Note:
> >   The mechanism described here may be something like the idea
> >   behind the Teoma search engine (http://www.teoma.com), although
> >   they may well have other mechanisms, in addition to this one.
> >
> > The first thought was that "weak links" predict similarity much
> > better than "strong links". ("Strong links" describes clustered
> > material -- material that is in close proximity, with many
> > individual links between them, as well as links to other pages,
> > all of which link to each other.)
> >
> > In this context, it makes sense to think of a directory hierarchy as
> > "linked". So it's clear that a collection of pages at a company or a
> > college have something in common, but generally such a collection
> > of pages embodies *many* ontological concepts. So strongly
> > linked pages are not that good for identifying concepts.
> >
> > But if two separate clusters have a single connection
> > between them -- a weak link -- then that link implies *some* kind
> > of similarity. That recognition then entails two further problems:
> >    a. Giving a name to the concept that identifies the similarity.
> >    b. Separating reference-type links (and other "non-similar" links)
> >       from links that indicate similarity ("other things of this kind")
> >
> > For example, on a page describing exercises, there could be
> > references to anatomy descriptions, and links to equipment
> > manufacturers, as well as links to similar exercises. Each would
> > be a weak link, but any similarities would be non-obvious.
> >
> > The problem is to identify which links indicate "similarity". But
> > it occurs to me that HTML formatting may well provide enough
> > clues to make some good guesses.
> >
> > Basically, a "weak link" page that gives a list of links is more likely
> > than not to be identifying an ontological concept.
> >
> > The format for such concept references would be:
> >
> >   1. A heading with one or two major words. For example:
> >       --Equipment
> >       --Exercise Equipment
> >       --Exercises
> >       --Authors of Note
> >       --Signs of the Times
> >
> >   2. A short paragraph of introductory text.
> >
> >   3. List items containing short paragraphs, each with one link
> >
> > Of course, there are some lists that would not be useful. For
> > example, JavaWorld articles always end with a "Resources"
> > section. The concept is obviously not "resources", but is
> > rather the subject matter covered in the article.
> >
> > Still, it would be possible to filter out the limited number of
> > such headings ("for more information", "further reading",
> > and the like), the same way that small words like "of" and
> > "the" would be filtered out. What's left, in the context of
> > the web, would be a collection of named ontological
> > concepts that could be reviewed and edited.
> >
> > Of course, at this point the "ontology" would look like a
> > simple list of concepts, with no ordering or structuring.
> > And duplicate concepts with different names would have
> > to be linked, somehow.
> >
> > But it could be a start. Further examination of structural
> > relationships might well lead to connections within the
> > ontology. For example, if the concept of "bicycles" is
> > identified, and a "parts list" on several pages contains
> > a "derailleur" entry, then perhaps it would be possible to
> > identify the "derailleur is part of a bicycle" relationship.
> >
> > Similarly, a book that showed up in the "resources" section
> > of a few pages could lead to "book x is a resource for
> > bicycles".
> >
> > I dunno. It's an interesting possibility -- that with a modicum
> > of semantic knowledge, it might be possible to construct a
> > very sizable ontology from the contents of the web.    (06)
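
For the weak-link idea in the quoted message, a first cut might be as
simple as the sketch below. It leans on a big assumption of mine: that
all pages on one host form a single "cluster", which is a crude stand-in
for the directory-hierarchy notion of strong linking. The function names
and the max_links threshold are illustrative only.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_of(url):
    """Assumption: every page on the same host belongs to one cluster."""
    return urlparse(url).netloc.lower()


def weak_links(links, max_links=1):
    """Return the links that are (nearly) the only tie between two clusters.

    `links` is an iterable of (source_url, target_url) pairs. A pair of
    clusters joined by at most `max_links` links counts as weakly linked.
    """
    between = defaultdict(list)
    for source, target in links:
        a, b = cluster_of(source), cluster_of(target)
        if a != b:
            # Direction is ignored when counting ties between clusters.
            between[frozenset((a, b))].append((source, target))
    return [
        pair
        for ties in between.values()
        if len(ties) <= max_links
        for pair in ties
    ]
```

The pages at the source end of those links would be the ones worth
checking for the heading-plus-list pattern. Links inside a cluster are
skipped entirely, and two hosts that link to each other many times are
treated as strongly linked and dropped.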