Re: [ba-unrev-talk] Instant Ontologies: The Strength of Weak Links

Eric,

This is the kind of thinking that could lead to an important tool in the
construction of Doug's idea of a continually updated encyclopedia or
handbook.

What is the next move?

Henry

Eric Armstrong wrote:

> A few ideas rubbed together the other day, and it occurred
> to me that a web crawler capable of parsing HTML pages to
> find links already has enough intelligence to begin constructing
> a first-cut ontology.
>   Note:
>   The mechanism described here may be something like the idea
>   behind the Teoma search engine (http://www.teoma.com), although
>   they may well have other mechanisms, in addition to this one.
> The first thought was that "weak links" predict similarity much
> better than "strong links". ("Strong links" describes clustered
> material -- material that is in close proximity, with many
> individual links between them, as well as links to other pages,
> all of which link to each other.)
> In this context, it makes sense to think of a directory hierarchy as
> "linked". So it's clear that a collection of pages at a company or a
> college have something in common, but generally such a collection
> of pages embodies *many* ontological concepts. So strongly
> linked pages are not that good for identifying concepts.
> But if two separate clusters have a single connection
> between them -- a weak link -- then that link implies *some* kind
> of similarity. That recognition then entails two further problems:
>    a. Giving a name to the concept that identifies the similarity.
>    b. Separating reference-type links (and other "non-similar" links)
>       from links that indicate similarity ("other things of this kind").
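
One way to picture the "weak link" test is the minimal Python sketch below. It is only an illustration, not the mechanism Teoma or any real crawler uses. It assumes that a host name stands in for a "cluster", and that a link is "weak" when it is the only link between two hosts; the function names, the max_links parameter, and the example URLs are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    # Assumption: one host name stands in for one cluster of strongly linked pages.
    return urlsplit(url).netloc.lower()


def weak_links(edges, max_links=1):
    """Return the links that are the only connection(s) between two hosts."""
    pair_counts = Counter()
    for src, dst in edges:
        a, b = host(src), host(dst)
        if a != b:
            # Count links per unordered pair of hosts.
            pair_counts[frozenset((a, b))] += 1
    return [
        (src, dst)
        for src, dst in edges
        if host(src) != host(dst)
        and pair_counts[frozenset((host(src), host(dst)))] <= max_links
    ]


# Hypothetical crawl result: (source page, target page) pairs.
edges = [
    ("http://gym.example/a.html", "http://gym.example/b.html"),
    ("http://gym.example/b.html", "http://gym.example/a.html"),
    ("http://gym.example/a.html", "http://bikes.example/parts.html"),
    ("http://shop.example/x.html", "http://gym.example/a.html"),
    ("http://shop.example/y.html", "http://gym.example/b.html"),
]
print(weak_links(edges))
```

Here only the single gym-to-bikes link is reported. The two shop-to-gym links are dropped because that pair of hosts is connected more than once. The sketch says nothing about problems (a) and (b); it only finds candidate links.
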
> For example, on a page describing exercises, there could be
> references to anatomy descriptions, and links to equipment
> manufacturers, as well as links to similar exercises. Each would
> be a weak link, but any similarities would be non-obvious.
> The problem is to identify which links indicate "similarity". But
> it occurs to me that HTML formatting may well provide enough
> clues to make some good guesses.
> Basically, a "weak link" page that gives a list of links is more likely
> than not to be identifying an ontological concept.
> The format for such concept references would be:
>   1. A heading with one or two major words. For example:
>       --Equipment
>       --Exercise Equipment
>       --Exercises
>       --Authors of Note
>       --Signs of the Times
>   2. A short paragraph of introductory text.
>   3. List items containing short paragraphs, each with one link.
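
The heading-plus-list format above could be detected roughly as in the following sketch, which uses Python's standard html.parser module. It is a guess at one workable approach rather than a tested crawler. The class name, the MINOR word set, the min_links threshold, and the sample page are assumptions, and the introductory paragraph (item 2) is not checked at all.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: small words that do not count toward "one or two major words".
MINOR = {"a", "an", "and", "for", "of", "the", "to"}


class ConceptListParser(HTMLParser):
    """Collect links found in list items under the most recent heading."""

    def __init__(self):
        super().__init__()
        self.concepts = {}       # heading text -> list of hrefs
        self._heading = None     # text of the most recent heading
        self._in_heading = False
        self._buf = []
        self._li_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._buf = []
        elif tag == "li":
            self._li_depth += 1
        elif tag == "a" and self._li_depth and self._heading:
            href = dict(attrs).get("href")
            if href:
                self.concepts.setdefault(self._heading, []).append(href)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._in_heading:
            self._in_heading = False
            self._heading = " ".join("".join(self._buf).split())
        elif tag == "li" and self._li_depth:
            self._li_depth -= 1

    def handle_data(self, data):
        if self._in_heading:
            self._buf.append(data)


def candidate_concepts(html, min_links=2):
    parser = ConceptListParser()
    parser.feed(html)
    parser.close()
    result = {}
    for heading, links in parser.concepts.items():
        major = [w for w in heading.lower().split() if w not in MINOR]
        if 1 <= len(major) <= 2 and len(links) >= min_links:
            result[heading] = links
    return result


# Hypothetical sample page.
page = """
<h2>Exercise Equipment</h2>
<p>Makers we like.</p>
<ul>
  <li>Sturdy racks: <a href="http://racks.example/">Racks Inc.</a></li>
  <li>Cheap weights: <a href="http://weights.example/">Weights Co.</a></li>
</ul>
<h2>About this very long and rambling site of ours</h2>
<ul><li><a href="/about.html">About</a></li></ul>
"""
print(candidate_concepts(page))
```

On the sample page, only "Exercise Equipment" survives as a candidate concept: the second heading has too many major words and only one link beneath it.
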
> Of course, there are some lists that would not be useful. For
> example, JavaWorld articles always end with a "Resources"
> section. The concept is obviously not "resources", but is
> rather the subject matter covered in the article.
> Still, it would be possible to filter out the limited number of
> such headings ("for more information", "further reading",
> and the like), the same way that small words like "of" and
> "the" would be filtered out. What's left, in the context of
> the web, would be a collection of named ontological
> concepts that could be reviewed and edited.
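
The filtering step might look something like this sketch. The stop list holds only the three headings named in the message; a real list would presumably be longer and hand-maintained. The names STOP_HEADINGS, normalize, and drop_generic are made up for the example.

```python
import string

# Generic headings named in the message; a real list would need curating.
STOP_HEADINGS = {"resources", "for more information", "further reading"}


def normalize(heading):
    # Lowercase, strip ASCII punctuation, and collapse runs of whitespace.
    cleaned = heading.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def drop_generic(headings):
    """Keep only headings that are not on the generic stop list."""
    return [h for h in headings if normalize(h) not in STOP_HEADINGS]


print(drop_generic(["Exercises", "Resources", "Further Reading:", "Authors of Note"]))
```

What remains is the flat, unordered list of candidate concept names that would then go to a human for review and editing.
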
> Of course, at this point the "ontology" would look like a
> simple list of concepts, with no ordering or structuring.
> And duplicate concepts with different names would have
> to be linked, somehow.
> But it could be a start. Further examination of structural
> relationships might well lead to connections within the
> ontology. For example, if the concept of "bicycles" is
> identified, and a "parts list" on several pages contains
> a "derailleur" entry, then perhaps it would be possible to
> identify the "derailleur is part of a bicycle" relationship.
> Similarly, a book that showed up in the "resources" section
> of a few pages could lead to "book x is a resource for
> bicycles".
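
As a rough illustration of the "part of" and "resource for" idea, the sketch below counts how many distinct pages list an item under a given section heading for a given concept. The tuple layout, the RELATION_BY_SECTION table, and the min_pages threshold are assumptions made for the example; how the observations would actually be extracted from pages is left open.

```python
from collections import defaultdict

# Assumption: which section headings suggest which relationship.
RELATION_BY_SECTION = {
    "parts list": "is part of",
    "resources": "is a resource for",
}


def infer_relations(observations, min_pages=2):
    """observations: iterable of (page, concept, section_heading, item)."""
    pages = defaultdict(set)
    for page, concept, section, item in observations:
        relation = RELATION_BY_SECTION.get(section.lower())
        if relation:
            pages[(item.lower(), relation, concept.lower())].add(page)
    # Keep a relationship only if enough distinct pages support it.
    return sorted(t for t, seen in pages.items() if len(seen) >= min_pages)


# Hypothetical observations gathered from three pages.
observations = [
    ("p1", "bicycles", "Parts List", "derailleur"),
    ("p2", "bicycles", "Parts List", "derailleur"),
    ("p2", "bicycles", "Parts List", "kickstand"),
    ("p1", "bicycles", "Resources", "Book X"),
    ("p3", "bicycles", "Resources", "Book X"),
]
for item, relation, concept in infer_relations(observations):
    print(item, relation, concept)
```

With these observations, "derailleur" and "Book X" each appear on two pages and are kept, while "kickstand" appears on only one and is dropped.
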
> I dunno. It's an interesting possibility -- that with a modicum
> of semantic knowledge, it might be possible to construct a
> very sizable ontology from the contents of the web.