
Re: [ba-unrev-talk] Instant Ontologies: The Strength of Weak Links


Eric Armstrong wrote:

> Thanks, Henry.
>
> The next move is a research project:
>    * Come up to speed on the state of the art in network theory,
>       while at the same time beginning construction.
>    * Use a google-style web crawler, since it is already doing
>       enough parsing to identify links.
>    * Then modify the parser to find lists that match the pattern,
>       and put together a list of words to ignore in the headings.
>    * Finally, put a human in the loop. Human reviews results and
>       says "good" or "no good", modifying the pattern in the process.
>       (Interestingly, a human could by such votes program a neural-net
>        machine which would become increasingly good at recognizing
>        useful patterns -- although we would never have a clue as to
>        what it was doing.)
>
> Whether or not anything useful would come of it is hard to say.
> Whether it would duplicate Cyc, complement Cyc, use Cyc to
> help construct its ontology, or go way beyond Cyc, is another
> matter that would have to be determined by experience.
>
> Regardless of its success or failure, it would definitely make an
> interesting paper.

Note:
The foregoing unfortunately cannot be construed as a commitment
to actually pursue this line of research. My employer occupies my
half my time. My book and what has turned into 8 patentable ideas
for exercise equipment are occupying the other half. Between halves,
I sleep.

> Henry K van Eyken wrote:
>
> > This is the kind of thinking that could lead to an important tool in the
> > construction of Doug's idea of a continually updated encyclopedia or
> > handbook.
> >
> > What is the next move?
> >
> > Henry
> >
> > Eric Armstrong wrote:
> >
> > > A few ideas rubbed together the other day, and it occurred
> > > to me that a web crawler capable of parsing HTML pages to
> > > find links already has enough intelligence to begin constructing
> > > a first-cut ontology.
> > >
> > >   Note:
> > >   The mechanism described here may be something like the idea
> > >   behind the Teoma search engine (http://www.teoma.com), although
> > >   they may well have other mechanisms, in addition to this one.
> > >
> > > The first thought was that "weak links" predict similarity much
> > > better than "strong links". ("Strong links" describes clustered
> > > material -- material that is in close proximity, with many
> > > individual links between them, as well as links to other pages,
> > > all of which link to each other.)
> > >
> > > In this context, it makes sense to think of a directory hierarchy as
> > > "linked". So it's clear that a collection of pages at a company or a
> > > college have something in common, but generally such a collection
> > > of pages embodies *many* ontological concepts. So strongly
> > > linked pages are not that good for identifying concepts.
> > >
> > > But if two separate clusters have a single connection
> > > between them -- a weak link -- then that link implies *some* kind
> > > of similarity. That recognition then entails two further problems:
> > >    a. Giving a name to the concept that identifies the similarity.
> > >    b. Separating reference-type links (and other "non-similar" links)
> > >       from links that indicate similarity ("other things of this kind").
> > >
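
A minimal sketch of the weak-link idea above, in Python. It assumes that
each host name counts as one cluster and that a host pair joined by a
single cross-site link is a weak link; the name find_weak_links and the
max_links parameter are hypothetical, not part of any existing crawler.

```python
from collections import Counter
from urllib.parse import urlparse


def find_weak_links(links, max_links=1):
    """Return cross-site links between host pairs joined by few links.

    `links` is an iterable of (source_url, target_url) pairs. Treating
    each host name as one cluster is an assumption of this sketch.
    """
    cross = []
    for source, target in links:
        src_host = urlparse(source).netloc
        dst_host = urlparse(target).netloc
        # Links inside one host belong to a cluster, so skip them.
        if src_host and dst_host and src_host != dst_host:
            cross.append((frozenset((src_host, dst_host)), source, target))
    # Count how many links join each unordered pair of hosts.
    counts = Counter(pair for pair, _, _ in cross)
    return [(s, t) for pair, s, t in cross if counts[pair] <= max_links]
```

A real crawler would probably need a better notion of cluster than the
host name (directory prefixes, for instance), so treat this as a
starting point only.
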
> > > For example, on a page describing exercises, there could be
> > > references to anatomy descriptions, and links to equipment
> > > manufacturers, as well as links to similar exercises. Each would
> > > be a weak link, but any similarities would be non-obvious.
> > >
> > > The problem is to identify which links indicate "similarity". But
> > > it occurs to me that HTML formatting may well provide enough
> > > clues to make some good guesses.
> > >
> > > Basically, a "weak link" page that gives a list of links is more likely
> > > than not to be identifying an ontological concept.
> > >
> > > The format for such concept references would be:
> > >
> > >   1. A heading with one or two major words. For example:
> > >       --Equipment
> > >       --Exercise Equipment
> > >       --Exercises
> > >       --Authors of Note
> > >       --Signs of the Times
> > >
> > >   2. A short paragraph of introductory text.
> > >
> > >   3. List items containing short paragraphs, each with one link
> > >
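
The three-part format above could be recognized roughly as follows. This
is a sketch built on Python's standard html.parser module; the class
name ConceptListFinder and the four-word cap on headings (a crude
stand-in for "one or two major words") are assumptions. It does not
check for the introductory paragraph, and it does not handle nested
lists.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ConceptListFinder(HTMLParser):
    """Collect (heading, [hrefs]) for lists whose items each hold one link."""

    def __init__(self, max_heading_words=4):
        super().__init__()
        self.max_heading_words = max_heading_words
        self.candidates = []
        self._heading = ""        # text of the most recent heading
        self._in_heading = False
        self._items = None        # one link list per <li>, or None outside a list

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._heading = ""
        elif tag in ("ul", "ol"):
            self._items = []
        elif tag == "li" and self._items is not None:
            self._items.append([])
        elif tag == "a" and self._items:
            href = dict(attrs).get("href")
            if href:
                self._items[-1].append(href)

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
        elif tag in ("ul", "ol") and self._items is not None:
            items, self._items = self._items, None
            heading = " ".join(self._heading.split())
            short = 0 < len(heading.split()) <= self.max_heading_words
            # Keep the list only if every item carries exactly one link.
            if short and items and all(len(links) == 1 for links in items):
                self.candidates.append((heading, [links[0] for links in items]))

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
```

Usage: create a ConceptListFinder, call feed() with the page's HTML, and
read the candidates attribute afterwards.
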
> > > Of course, there are some lists that would not be useful. For
> > > example, JavaWorld articles always end with a "Resources"
> > > section. The concept is obviously not "resources", but is
> > > rather the subject matter covered in the article.
> > >
> > > Still, it would be possible to filter out the limited number of
> > > such headings ("for more information", "further reading",
> > > and the like), the same way that small words like "of" and
> > > "the" would be filtered out. What's left, in the context of
> > > the web, would be a collection of named ontological
> > > concepts that could be reviewed and edited.
> > >
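
The filtering step might look something like this in Python. The two
sets are illustrative and far from complete, and concept_name is a
hypothetical helper name.

```python
# Headings that name a page section rather than a concept (illustrative).
STOP_HEADINGS = {"resources", "for more information", "further reading"}
# Small words to drop from the headings that remain (illustrative).
STOP_WORDS = {"a", "an", "and", "of", "the"}


def concept_name(heading):
    """Return the major words of a heading, or None for a generic heading."""
    normalized = " ".join(heading.lower().split())
    if normalized in STOP_HEADINGS:
        return None
    major = [word for word in normalized.split() if word not in STOP_WORDS]
    return " ".join(major) or None
```

With these sets, "Signs of the Times" reduces to "signs times", while
"Further Reading" is dropped altogether.
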
> > > Of course, at this point the "ontology" would look like a
> > > simple list of concepts, with no ordering or structuring.
> > > And duplicate concepts with different names would have
> > > to be linked, somehow.
> > >
> > > But it could be a start. Further examination of structural
> > > relationships might well lead to connections within the
> > > ontology. For example, if the concept of "bicycles" is
> > > identified, and a "parts list" on several pages contains
> > > a "derailleur" entry, then perhaps it would be possible to
> > > identify the "derailleur is part of a bicycle" relationship.
> > >
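
One naive way to propose such "part of" relationships is to count how
many pages about a concept list the same part. The sketch below assumes
the crawler has already reduced each page to a (concept, parts) pair;
infer_part_of and the min_pages threshold are hypothetical.

```python
from collections import defaultdict


def infer_part_of(pages, min_pages=2):
    """Propose (part, concept) pairs that recur across several pages.

    `pages` is an iterable of (concept, parts) pairs, one per page, where
    `parts` lists the entries found in that page's parts list.
    """
    seen = defaultdict(set)   # (part, concept) -> indexes of pages listing it
    for index, (concept, parts) in enumerate(pages):
        for part in parts:
            seen[(part.lower(), concept.lower())].add(index)
    # Keep only pairs supported by at least min_pages distinct pages.
    return sorted(pair for pair, where in seen.items() if len(where) >= min_pages)
```

The output is only a list of candidates for a human reviewer to accept
or reject, not a set of established relationships.
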
> > > Similarly, a book that showed up in the "resources" section
> > > of a few pages could lead to "book x is a resource for
> > > bicycles".
> > >
> > > I dunno. It's an interesting possibility -- that with a modicum
> > > of semantic knowledge, it might be possible to construct a
> > > very sizable ontology from the contents of the web.