Re: [ba-unrev-talk] Instant Ontologies: The Strength of Weak Links
Eric, (01)
While you're jogging, perhaps? :-) (02)
Henry (03)
Eric Armstrong wrote: (04)
> Eric Armstrong wrote:
>
> > Thanks, Henry.
> >
> > The next move is a research project:
> > * Come up to speed on the state of the art in network theory,
> > while at the same time beginning construction.
> > * Use a google-style web crawler, since it is already doing
> > enough parsing to identify links.
> > * Then modify the parser to find lists that match the pattern,
> > and put together a list of words to ignore in the headings.
> > * Finally, put a human in the loop. Human reviews results and
> > says "good" or "no good", modifying the pattern in the process.
> > (Interestingly, a human could by such votes program a neural-net
> > machine which would become increasingly good at recognizing
> > useful patterns -- although we would never have a clue as to
> > what it was doing.)
> >
> > Whether or not anything useful would come of it is hard to say.
> > Whether it would duplicate Cyc, complement Cyc, use Cyc to
> > help construct its ontology, or go way beyond Cyc, is another
> > matter that would have to be determined by experience.
> >
> > Regardless of its success or failure, it would definitely make an
> > interesting paper.
>
> Note:
> The foregoing unfortunately cannot be construed as a commitment
> to actually pursue this line of research. My employer occupies
> half my time. My book and what has turned into 8 patentable ideas
> for exercise equipment are occupying the other half. Between halves,
> I sleep.
>
> > Henry K van Eyken wrote:
> >
> > > This is the kind of thinking that could lead to an important tool in the
> > > construction of Doug's idea of a continually updated encyclopedia or
> > > handbook.
> > >
> > > What is the next move?
> > >
> > > Henry
> > >
> > > Eric Armstrong wrote:
> > >
> > > > A few ideas rubbed together the other day, and it occurred
> > > > to me that a web crawler capable of parsing HTML pages to
> > > > find links already has enough intelligence to begin constructing
> > > > a first-cut ontology.
> > > >
> > > > Note:
> > > > The mechanism described here may be something like the idea
> > > > behind the Teoma search engine (http://www.teoma.com), although
> > > > they may well have other mechanisms, in addition to this one.
> > > >
> > > > The first thought was that "weak links" predict similarity much
> > > > better than "strong links". ("Strong links" describes clustered
> > > > material -- material that is in close proximity, with many
> > > > individual links between them, as well as links to other pages,
> > > > > all of which link to each other.)
> > > >
> > > > In this context, it makes sense to think of a directory hierarchy as
> > > > "linked". So it's clear that a collection of pages at a company or a
> > > > > college has something in common, but generally such a collection
> > > > of pages embodies *many* ontological concepts. So strongly
> > > > linked pages are not that good for identifying concepts.
> > > >
> > > > But if two separate clusters have a single connection
> > > > between them -- a weak link -- then that link implies *some* kind
> > > > of similarity. That recognition then entails two further problems:
> > > > a. Giving a name to the concept that identifies the similarity.
> > > > > b. Separating reference-type links (and other "non-similar" links)
> > > > > from links that indicate similarity ("other things of this kind").
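
One way to make the "single connection between two clusters" idea concrete is to treat it as what graph theory calls a bridge: a link whose removal splits the graph in two. What follows is only a minimal sketch under assumptions that are not in the original post: links are treated as undirected, find_weak_links and min_cluster are hypothetical names, and min_cluster is an arbitrary cutoff so that a lone page hanging off a cluster does not count as a second cluster. It does nothing about problems (a) and (b) above.

```python
from collections import defaultdict


def find_weak_links(edges, min_cluster=3):
    """Return links whose removal splits the pages into two clusters.

    edges is a list of (page, page) pairs, treated as undirected.
    min_cluster is an assumed minimum size for each side of the split.
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    def reachable(start, skip):
        # Depth-first walk that pretends the link `skip` does not exist.
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if {node, nxt} == skip or nxt in seen:
                    continue
                seen.add(nxt)
                stack.append(nxt)
        return seen

    weak = []
    for a, b in edges:
        side_a = reachable(a, {a, b})
        if b in side_a:
            continue  # another path exists, so this is not a lone connection
        side_b = reachable(b, {a, b})
        if len(side_a) >= min_cluster and len(side_b) >= min_cluster:
            weak.append((a, b))
    return weak


edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),  # one tightly linked cluster
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),  # another cluster
    ("a3", "b1"),                              # the only link between them
]
print(find_weak_links(edges))
```

This brute-force version re-walks the graph for every link, which is fine for a toy example but would need a proper bridge-finding algorithm at crawler scale.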
> > > >
> > > > For example, on a page describing exercises, there could be
> > > > references to anatomy descriptions, and links to equipment
> > > > manufacturers, as well as links to similar exercises. Each would
> > > > be a weak link, but any similarities would be non-obvious.
> > > >
> > > > The problem is to identify which links indicate "similarity". But
> > > > it occurs to me that HTML formatting may well provide enough
> > > > clues to make some good guesses.
> > > >
> > > > Basically, a "weak link" page that gives a list of links is more likely
> > > > than not to be identifying an ontological concept.
> > > >
> > > > The format for such concept references would be:
> > > >
> > > > 1. A heading with one or two major words. For example:
> > > > --Equipment
> > > > --Exercise Equipment
> > > > --Exercises
> > > > --Authors of Note
> > > > --Signs of the Times
> > > >
> > > > 2. A short paragraph of introductory text.
> > > >
> > > > 3. List items containing short paragraphs, each with one link
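
The three-part pattern above can be checked roughly with Python's standard html.parser module. This is a sketch, not a tested crawler component: ConceptListParser, find_concept_lists, SMALL_WORDS and the "at least two items" rule are assumptions made for illustration. It checks the heading (part 1) and the one-link-per-item rule (part 3), does not check for the introductory paragraph (part 2), and ignores nested lists.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SMALL_WORDS = {"a", "an", "and", "of", "the"}  # assumed list of minor words


class ConceptListParser(HTMLParser):
    """Collect (heading, [urls]) pairs for lists that fit the pattern."""

    def __init__(self, max_major_words=2):
        super().__init__()
        self.max_major_words = max_major_words
        self.candidates = []
        self._heading = None      # text of the most recent heading
        self._in_heading = False
        self._buf = []            # text fragments of the current heading
        self._items = None        # one list of hrefs per <li>, while in a list

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._buf = []
        elif tag in ("ul", "ol"):
            self._items = []
        elif tag == "li" and self._items is not None:
            self._items.append([])
        elif tag == "a" and self._items:
            href = dict(attrs).get("href")
            if href:
                self._items[-1].append(href)

    def handle_data(self, data):
        if self._in_heading:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
            self._heading = " ".join("".join(self._buf).split())
        elif tag in ("ul", "ol") and self._items is not None:
            items, self._items = self._items, None
            heading, self._heading = self._heading, None
            if heading and self._matches(heading, items):
                self.candidates.append((heading, [links[0] for links in items]))

    def _matches(self, heading, items):
        major = [w for w in heading.lower().split() if w not in SMALL_WORDS]
        return (1 <= len(major) <= self.max_major_words
                and len(items) >= 2
                and all(len(links) == 1 for links in items))


def find_concept_lists(html):
    parser = ConceptListParser()
    parser.feed(html)
    parser.close()
    return parser.candidates


sample = """
<h2>Exercise Equipment</h2>
<p>Some makers of note.</p>
<ul>
<li>Acme makes benches. <a href="http://example.com/acme">Acme</a></li>
<li>Bolt makes racks. <a href="http://example.com/bolt">Bolt</a></li>
</ul>
<h2>Contact</h2>
<ul><li>No link here</li><li><a href="/x">x</a></li></ul>
"""
print(find_concept_lists(sample))
```

Only the first list in the sample should qualify; the second fails because one of its items has no link.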
> > > >
> > > > Of course, there are some lists that would not be useful. For
> > > > example, JavaWorld articles always end with a "Resources"
> > > > section. The concept is obviously not "resources", but is
> > > > rather the subject matter covered in the article.
> > > >
> > > > Still, it would be possible to filter out the limited number of
> > > > such headings ("for more information", "further reading",
> > > > > and the like), the same way that small words like "of" and
> > > > "the" would be filtered out. What's left, in the context of
> > > > the web, would be a collection of named ontological
> > > > concepts that could be reviewed and edited.
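
The filtering step might look something like the following. The two sets are placeholders seeded with the examples from the post plus a few assumed extras; in practice they would grow as a reviewer rejects headings. The name concept_name is hypothetical.

```python
# Headings that name a kind of list rather than a concept (assumed starter set).
IGNORED_HEADINGS = {"resources", "for more information", "further reading"}
# Small words to drop from a heading (assumed starter set).
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def concept_name(heading):
    """Return a normalized concept name, or None if the heading is ignored."""
    text = " ".join(heading.lower().split()).strip(" :.-")
    if text in IGNORED_HEADINGS:
        return None
    words = [w for w in text.split() if w not in SMALL_WORDS]
    return " ".join(words) or None


for h in ["Exercise Equipment", "Signs of the Times", "Further Reading", "Resources:"]:
    print(repr(h), "->", repr(concept_name(h)))
```

Whole-heading matches are checked before small words are removed, so "for more information" is dropped as a heading rather than reduced to "more information".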
> > > >
> > > > Of course, at this point the "ontology" would look like a
> > > > simple list of concepts, with no ordering or structuring.
> > > > And duplicate concepts with different names would have
> > > > to be linked, somehow.
> > > >
> > > > But it could be a start. Further examination of structural
> > > > > relationships might well lead to connections within the
> > > > > ontology. For example, if the concept of "bicycles" is
> > > > > identified, and a "parts list" on several pages contains
> > > > > a "derailleur" entry, then perhaps it would be possible to
> > > > > identify the "derailleur is part of a bicycle" relationship.
> > > >
> > > > Similarly, a book that showed up in the "resources" section
> > > > of a few pages could lead to "book x is a resource for
> > > > bicycles".
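
As a rough illustration of the "shows up on several pages" test: count how many pages list the same entry under the same kind of heading for the same concept, and keep the ones that recur. Everything here is assumed for the sake of the sketch: the (concept, heading, items) input shape, the infer_relations name, the two sentence templates, and the min_pages threshold. It also assumes each tuple comes from a different page.

```python
from collections import Counter

# Hypothetical mapping from a list heading to the relationship it suggests.
TEMPLATES = {
    "parts list": "{item} is part of {concept}",
    "resources": "{item} is a resource for {concept}",
}


def infer_relations(page_lists, min_pages=2):
    """page_lists holds one (concept, heading, items) tuple per page list."""
    counts = Counter()
    for concept, heading, items in page_lists:
        for item in set(items):  # count each entry once per page
            counts[(concept, heading, item)] += 1

    relations = []
    for (concept, heading, item), n in counts.items():
        if n >= min_pages and heading in TEMPLATES:
            relations.append(TEMPLATES[heading].format(item=item, concept=concept))
    return sorted(relations)


page_lists = [
    ("bicycles", "parts list", ["derailleur", "chain"]),
    ("bicycles", "parts list", ["derailleur", "saddle"]),
    ("bicycles", "resources", ["Book X"]),
    ("bicycles", "resources", ["Book X", "Book Y"]),
]
for relation in infer_relations(page_lists):
    print(relation)
```

Entries that appear on only one page ("chain", "saddle", "Book Y") are dropped as too weak a signal; a human reviewer would still need to confirm whatever survives.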
> > > >
> > > > I dunno. It's an interesting possibility -- that with a modicum
> > > > of semantic knowledge, it might be possible to construct a
> > > > very sizable ontology from the contents of the web. (05)