Re: [ba-unrev-talk] Instant Ontologies: The Strength of Weak Links
Eric,
This is the kind of thinking that could lead to an important tool in the
construction of Doug's idea of a continually updated encyclopedia or
handbook.
What is the next move?
Henry
Eric Armstrong wrote:
> A few ideas rubbed together the other day, and it occurred
> to me that a web crawler capable of parsing HTML pages to
> find links already has enough intelligence to begin constructing
> a first-cut ontology.
>
> Note:
> The mechanism described here may be something like the idea
> behind the Teoma search engine (http://www.teoma.com), although
> they may well have other mechanisms, in addition to this one.
>
> The first thought was that "weak links" predict similarity much
> better than "strong links". ("Strong links" describes clustered
> material -- material that is in close proximity, with many
> individual links between the pages, as well as links to other pages,
> all of which link to each other.)
>
> In this context, it makes sense to think of a directory hierarchy as
> "linked". So it's clear that a collection of pages at a company or a
> college has something in common, but generally such a collection
> of pages embodies *many* ontological concepts. So strongly
> linked pages are not that good for identifying concepts.
>
> But if two separate clusters have a single connection
> between them -- a weak link -- then that link implies *some* kind
> of similarity. That recognition then entails two further problems:
> a. Giving a name to the concept that identifies the similarity.
> b. Separating reference-type links (and other "non-similar" links)
> from links that indicate similarity ("other things of this kind").
>
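A rough sketch of the "weak link" idea above, in Python. It assumes a
page's cluster is simply its host name (a site or directory tree being
"linked" in the sense described), and it treats a link as weak when it
is the only connection between two clusters. The names cluster_of and
weak_links are made up for illustration, and a real crawler would need
a better notion of a cluster.

```python
from collections import Counter
from urllib.parse import urlsplit


def cluster_of(url):
    """Assumption: a page's cluster is just its host name."""
    return urlsplit(url).netloc.lower()


def weak_links(links):
    """Return the links that are the only connection between two clusters.

    `links` is an iterable of (source_url, target_url) pairs.
    """
    # Links inside one cluster are "strong" by definition, so drop them.
    cross = [(s, t) for s, t in links if cluster_of(s) != cluster_of(t)]
    # Count connections per unordered pair of clusters.
    counts = Counter(frozenset((cluster_of(s), cluster_of(t))) for s, t in cross)
    return [
        (s, t)
        for s, t in cross
        if counts[frozenset((cluster_of(s), cluster_of(t)))] == 1
    ]
```

This only finds candidate links. It says nothing yet about problems
(a) and (b): naming the concept and weeding out reference-type links.
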
> For example, on a page describing exercises, there could be
> references to anatomy descriptions, and links to equipment
> manufacturers, as well as links to similar exercises. Each would
> be a weak link, but any similarities would be non-obvious.
>
> The problem is to identify which links indicate "similarity". But
> it occurs to me that HTML formatting may well provide enough
> clues to make some good guesses.
>
> Basically, a "weak link" page that gives a list of links is more likely
> than not to be identifying an ontological concept.
>
> The format for such concept references would be:
>
> 1. A heading with one or two major words. For example:
> --Equipment
> --Exercise Equipment
> --Exercises
> --Authors of Note
> --Signs of the Times
>
> 2. A short paragraph of introductory text.
>
> 3. List items containing short paragraphs, each with one link
>
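The format above could be checked along these lines, using Python's
standard html.parser. This is only a sketch under some assumptions: the
class name ConceptListFinder and the SMALL_WORDS set are invented here,
the intro paragraph (item 2) is not checked at all, and pages that omit
closing </li> tags or nest lists would need more care.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SMALL_WORDS = {"of", "the", "a", "an", "and"}  # assumed, not exhaustive


class ConceptListFinder(HTMLParser):
    """Collect (heading, [hrefs]) for headed lists of one-link items."""

    def __init__(self, max_major_words=2):
        super().__init__()
        self.max_major_words = max_major_words
        self.concepts = []       # results: (heading text, list of hrefs)
        self.heading = None      # text of the most recent heading
        self.in_heading = False
        self.list_links = None   # one href per item of the current list
        self.item_links = None   # hrefs seen in the current <li>
        self.list_ok = False     # False once an item breaks the format

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading, self.heading = True, ""
        elif tag in ("ul", "ol"):
            self.list_links, self.list_ok = [], True
        elif tag == "li" and self.list_links is not None:
            self.item_links = []
        elif tag == "a" and self.item_links is not None:
            href = dict(attrs).get("href")
            if href:
                self.item_links.append(href)

    def handle_data(self, data):
        if self.in_heading:
            self.heading += data

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.in_heading:
            self.in_heading = False
            self.heading = " ".join(self.heading.split())
        elif tag == "li" and self.item_links is not None:
            if len(self.item_links) == 1:
                self.list_links.append(self.item_links[0])
            else:
                self.list_ok = False  # zero or several links in one item
            self.item_links = None
        elif tag in ("ul", "ol") and self.list_links is not None:
            major = [w for w in (self.heading or "").split()
                     if w.lower() not in SMALL_WORDS]
            if self.list_ok and self.list_links and 0 < len(major) <= self.max_major_words:
                self.concepts.append((self.heading, self.list_links))
            self.list_links = self.item_links = self.heading = None
```

Feeding a page to finder = ConceptListFinder() with finder.feed(html)
leaves the candidate concepts in finder.concepts.
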
> Of course, there are some lists that would not be useful. For
> example, JavaWorld articles always end with a "Resources"
> section. The concept is obviously not "resources", but is
> rather the subject matter covered in the article.
>
> Still, it would be possible to filter out the limited number of
> such headings ("for more information", "further reading",
> and the like), the same way that small words like "of" and
> "the" would be filtered out. What's left, in the context of
> the web, would be a collection of named ontological
> concepts that could be reviewed and edited.
>
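The filtering step might look something like this in Python. The
GENERIC_HEADINGS set is an assumption: the first three entries come from
the message, while "links" and "see also" are guesses added for
illustration. The function name named_concepts is likewise made up.

```python
GENERIC_HEADINGS = {
    "resources",
    "for more information",
    "further reading",
    "links",      # assumed extra entry
    "see also",   # assumed extra entry
}


def named_concepts(headings):
    """Drop generic headings and exact repeats, keeping first-seen order."""
    seen = set()
    kept = []
    for heading in headings:
        # Compare case-insensitively with whitespace collapsed.
        key = " ".join(heading.lower().split())
        if not key or key in GENERIC_HEADINGS or key in seen:
            continue
        seen.add(key)
        kept.append(heading.strip())
    return kept
```

What comes back is the flat list of named concepts, ready to be
reviewed and edited by hand.
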
> Of course, at this point the "ontology" would look like a
> simple list of concepts, with no ordering or structuring.
> And duplicate concepts with different names would have
> to be linked, somehow.
>
> But it could be a start. Further examination of structural
> relationships might well lead to connections within the
> ontology. For example, if the concept of "bicycles" is
> identified, and a "parts list" on several pages contains
> a "derailleur" entry, then perhaps it would be possible to
> identify the "derailleur is part of a bicycle" relationship.
>
> Similarly, a book that showed up in the "resources" section
> of a few pages could lead to "book x is a resource for
> bicycles".
>
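One way the two examples above could be sketched in Python, assuming
the crawler has already produced observations of the form (page,
concept, list heading, items). The RELATION_BY_HEADING table, the
infer_relations name and the min_pages threshold are all hypothetical;
"several pages" is read here as "at least two".

```python
from collections import defaultdict

# Assumed mapping from a list heading to the relation it suggests.
RELATION_BY_HEADING = {
    "parts list": "is part of",
    "resources": "is a resource for",
}


def infer_relations(observations, min_pages=2):
    """Return (item, relation, concept) triples seen on enough pages."""
    pages = defaultdict(set)
    for page, concept, heading, items in observations:
        relation = RELATION_BY_HEADING.get(heading.strip().lower())
        if relation is None:
            continue  # this heading suggests no relation we know about
        for item in items:
            pages[(item.lower(), relation, concept.lower())].add(page)
    # Keep only triples supported by at least min_pages distinct pages.
    return sorted(t for t, seen_on in pages.items() if len(seen_on) >= min_pages)
```

A triple such as ("derailleur", "is part of", "bicycles") would still
be a guess to be reviewed, not an established fact.
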
> I dunno. It's an interesting possibility -- that with a modicum
> of semantic knowledge, it might be possible to construct a
> very sizable ontology from the contents of the web.