Re: [ba-unrev-talk] Instant Ontologies: The Strength of Weak Links
Eric Armstrong wrote: (01)
> Thanks, Henry.
>
> The next move is a research project:
> * Come up to speed on the state of the art in network theory,
> while at the same time beginning construction.
> * Use a google-style web crawler, since it is already doing
> enough parsing to identify links.
> * Then modify the parser to find lists that match the pattern,
> and put together a list of words to ignore in the headings.
> * Finally, put a human in the loop. Human reviews results and
> says "good" or "no good", modifying the pattern in the process.
> (Interestingly, a human could by such votes program a neural-net
> machine which would become increasingly good at recognizing
> useful patterns -- although we would never have a clue as to
> what it was doing.)
>
> Whether or not anything useful would come of it is hard to say.
> Whether it would duplicate Cyc, complement Cyc, use Cyc to
> help construct its ontology, or go way beyond Cyc, is another
> matter that would have to be determined by experience.
>
> Regardless of its success or failure, it would definitely make an
> interesting paper. (02)
Note:
The foregoing unfortunately cannot be construed as a commitment
to actually pursue this line of research. My employer occupies
half my time. My book and what has turned into 8 patentable ideas
for exercise equipment are occupying the other half. Between halves,
I sleep. (03)
> Henry K van Eyken wrote:
>
> > This is the kind of thinking that could lead to an important tool in the
> > construction of Doug's idea of a continually updated encyclopedia or
> > handbook.
> >
> > What is the next move?
> >
> > Henry
> >
> > Eric Armstrong wrote:
> >
> > > A few ideas rubbed together the other day, and it occurred
> > > to me that a web crawler capable of parsing HTML pages to
> > > find links already has enough intelligence to begin constructing
> > > a first-cut ontology.
> > >
> > > Note:
> > > The mechanism described here may be something like the idea
> > > behind the Teoma search engine (http://www.teoma.com), although
> > > they may well have other mechanisms, in addition to this one.
> > >
> > > The first thought was that "weak links" predict similarity much
> > > better than "strong links". ("Strong links" describes clustered
> > > material -- material that is in close proximity, with many
> > > individual links between them, as well as links to other pages,
> > > all of which link to each other.)
> > >
> > > In this context, it makes sense to think of a directory hierarchy as
> > > "linked". So it's clear that a collection of pages at a company or a
> > > college have something in common, but generally such a collection
> > > of pages embodies *many* ontological concepts. So strongly
> > > linked pages are not that good for identifying concepts.
> > >
> > > But if two separate clusters have a single connection
> > > between them -- a weak link -- then that link implies *some* kind
> > > of similarity. That recognition then entails two further problems:
> > > a. Giving a name to the concept that identifies the similarity.
> > > b. Separating reference-type links (and other "non-similar" links)
> > > from links that indicate similarity ("other things of this kind").
> > >
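A rough sketch of how the "single connection between two clusters" idea
might be tested. It swaps in a standard graph notion, the "bridge" (an
edge whose removal disconnects the graph), as a stand-in for a weak link.
The function names, the undirected treatment of links, and the
`min_cluster` threshold are all assumptions of mine, not part of the
original proposal.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Build an undirected adjacency map from (page, page) link pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def reachable(graph, start, skip_edge):
    """Return every page reachable from start without crossing skip_edge."""
    skipped = set(skip_edge)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if {node, nxt} == skipped:
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def weak_links(links, min_cluster=3):
    """Return links that are the only connection between two clusters.

    min_cluster (an assumed threshold) drops links that merely hang a
    stray page off a cluster.
    """
    graph = build_graph(links)
    found = []
    for a, b in sorted({tuple(sorted(edge)) for edge in links}):
        side_a = reachable(graph, a, (a, b))
        if b in side_a:
            continue  # another path exists, so this is not a weak link
        side_b = reachable(graph, b, (a, b))
        if len(side_a) >= min_cluster and len(side_b) >= min_cluster:
            found.append((a, b))
    return found
```

This removes each link in turn and re-checks connectivity, which is
simple to follow but slow; a crawler-sized graph would presumably need a
linear-time bridge-finding algorithm instead.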
> > > For example, on a page describing exercises, there could be
> > > references to anatomy descriptions, and links to equipment
> > > manufacturers, as well as links to similar exercises. Each would
> > > be a weak link, but any similarities would be non-obvious.
> > >
> > > The problem is to identify which links indicate "similarity". But
> > > it occurs to me that HTML formatting may well provide enough
> > > clues to make some good guesses.
> > >
> > > Basically, a "weak link" page that gives a list of links is more likely
> > > than not to be identifying an ontological concept.
> > >
> > > The format for such concept references would be:
> > >
> > > 1. A heading with one or two major words. For example:
> > > --Equipment
> > > --Exercise Equipment
> > > --Exercises
> > > --Authors of Note
> > > --Signs of the Times
> > >
> > > 2. A short paragraph of introductory text.
> > >
> > > 3. List items containing short paragraphs, each with one link
> > >
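The three-part format above could be recognized along these lines, using
Python's standard `html.parser`. This is only a sketch: the class and
function names are hypothetical, it treats any `h1`-`h6` element as a
heading, it tolerates but does not require the introductory paragraph,
and it accepts a list only when every top-level item carries exactly one
link.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ConceptListFinder(HTMLParser):
    """Collect (heading, [hrefs]) pairs for heading-plus-link-list patterns."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._heading = None    # text of the most recent heading
        self._in_heading = False
        self._buf = []
        self._items = None      # one href list per top-level list item
        self._list_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._buf = []
        elif tag in ("ul", "ol"):
            self._list_depth += 1
            if self._list_depth == 1:
                self._items = []
        elif tag == "li" and self._list_depth == 1:
            self._items.append([])
        elif tag == "a" and self._items:
            href = dict(attrs).get("href")
            if href:
                self._items[-1].append(href)

    def handle_data(self, data):
        if self._in_heading:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._in_heading:
            self._heading = " ".join("".join(self._buf).split())
            self._in_heading = False
        elif tag in ("ul", "ol") and self._list_depth > 0:
            self._list_depth -= 1
            if self._list_depth == 0:
                items, self._items = self._items, None
                # Keep the list only if each item has exactly one link.
                if self._heading and items and all(len(i) == 1 for i in items):
                    self.candidates.append(
                        (self._heading, [i[0] for i in items])
                    )
                self._heading = None  # a heading names only the next list


def find_concept_lists(html):
    """Return candidate (heading, [hrefs]) pairs found in an HTML string."""
    finder = ConceptListFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

Real pages are much messier than this assumes (tables used as lists,
headings done with bold text, and so on), so the human-in-the-loop review
would still be doing most of the work.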
> > > Of course, there are some lists that would not be useful. For
> > > example, JavaWorld articles always end with a "Resources"
> > > section. The concept is obviously not "resources", but is
> > > rather the subject matter covered in the article.
> > >
> > > Still, it would be possible to filter out the limited number of
> > > such headings ("for more information", "further reading",
> > > and the like), the same way that small words like "of" and
> > > "the" would be filtered out. What's left, in the context of
> > > the web, would be a collection of named ontological
> > > concepts that could be reviewed and edited.
> > >
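The heading filter might look something like the following. The three
stop headings and the words "of" and "the" come from the text above; the
remaining stop words, the lower-casing, and the name `concept_name` are
assumptions made for the sake of the sketch.

```python
import re

# "of" and "the" are from the text; the rest are assumed additions.
STOP_WORDS = {"of", "the", "a", "an", "and", "for", "to", "in", "on"}

# Generic headings that name a section rather than a concept.
STOP_HEADINGS = {"resources", "for more information", "further reading"}


def concept_name(heading):
    """Reduce a heading to a candidate concept name, or None if generic."""
    normalized = " ".join(re.findall(r"[a-z0-9']+", heading.lower()))
    if normalized in STOP_HEADINGS:
        return None
    words = [w for w in normalized.split() if w not in STOP_WORDS]
    return " ".join(words) or None
```

For example, `concept_name("Authors of Note")` gives `"authors note"`,
while `concept_name("Further Reading")` gives `None`. Both sets would
presumably grow as a reviewer votes results "good" or "no good".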
> > > Of course, at this point the "ontology" would look like a
> > > simple list of concepts, with no ordering or structuring.
> > > And duplicate concepts with different names would have
> > > to be linked, somehow.
> > >
> > > But it could be a start. Further examination of structural
> > > relationships might well lead to connections within the
> > > ontology. For example, if the concept of "bicycles" is
> > > identified, and a "parts list" on several pages contains
> > > a "derailleur" entry, then perhaps it would be possible to
> > > identify the "derailleur is part of a bicycle" relationship.
> > >
> > > Similarly, a book that showed up in the "resources" section
> > > of a few pages could lead to "book x is a resource for
> > > bicycles".
> > >
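One guess at how such relationships could be proposed: count how many
distinct pages about a concept list the same entry under the same kind
of heading. The observation tuple layout, the heading-to-relation table,
and the `min_pages` threshold are all hypothetical.

```python
from collections import Counter

# Assumed mapping from a list heading to the relationship it suggests.
RELATION_BY_HEADING = {
    "parts list": "is part of",
    "resources": "is a resource for",
}


def infer_relations(observations, min_pages=2):
    """Propose (entry, relation, concept) triples seen on enough pages.

    Each observation is a (page, concept, heading, entry) tuple.
    """
    page_counts = Counter()
    seen = set()
    for page, concept, heading, entry in observations:
        relation = RELATION_BY_HEADING.get(heading.strip().lower())
        if relation is None:
            continue
        key = (page, entry.lower(), relation, concept.lower())
        if key in seen:
            continue  # count each page at most once per triple
        seen.add(key)
        page_counts[key[1:]] += 1
    return sorted(t for t, n in page_counts.items() if n >= min_pages)
```

With "derailleur" under a parts list on two bicycle pages, this would
propose `("derailleur", "is part of", "bicycles")`, which a human
reviewer could then accept or reject.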
> > > I dunno. It's an interesting possibility -- that with a modicum
> > > of semantic knowledge, it might be possible to construct a
> > > very sizable ontology from the contents of the web. (04)