[ba-ohs-talk] Searching vs. Keyword Indexing
At 11:11 AM 4/26/02 +0100, Murray Altheim wrote:
>But a more meta question is: what exactly are the requirements, and
>what are the benefits? Couldn't all this be done server-side on the
>mail list archives, such that if one wanted to browse the archives
>an intelligent search could remove the need for most of the effort?
>I'm still not convinced that people would use it, insofar as it's
>probably almost as much work to figure out an *appropriate* set of
>keywords as it is to type an entire email message. Librarians are
>*experts* at this. I'm not. There's a CMU system I used at NTTC that
>could analyze a text and come up with a set of keywords for it. I'd
>prefer we leave this kind of thing to computers (which are in general
>pretty good at it, especially on longer texts). (01)
Ok, here is why keyword indexing is better (in some cases) than
searching. Take a simple keyword like "software announce" or "new
software", which I want to use as an indicator that a new piece of software
is being pointed out. (02)
Suppose that you know that about a month ago, someone mentioned a piece of
software that reminds you of something that you just looked at. You can't
remember exactly what it was, who mentioned it, or what was the subject of
the post. (03)
How can you search for this product? Should you search for the term
"software"? There are two problems with this. A. There is certainly going
to be a lot of noise, since "software" is a pretty frequent term even when
not discussing a new product. B. Chances are that "software" would not
even have been used in the announcement. (04)
Basically, it is often the case that you are looking for something the name
of which escapes you. Tip of the tongue sort of phenomenon. In this case
searching is useless. But not keywords. If you know that what you saw was
marked as new, or that it was used for collaboration, then you can check
out those keyword categories and see a list of all the relevant posts. I
am certain that this will narrow down your choices much more than searching
would. (05)
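
To make the contrast concrete, here is a rough sketch in Python. Everything in it is an assumption made up for illustration: the three posts, the product names (FooBoard, BarOutliner), the keywords, and the function names are all hypothetical, and a real archive would not be this simple. It just shows a keyword lookup next to a plain substring search on the message body.

```python
from collections import defaultdict

# Hypothetical archive: (post id, keywords chosen by the poster, body text).
# The products and wording are invented for this example.
POSTS = [
    (1, {"new software", "collaboration"},
     "Has anyone tried FooBoard? It does shared workspaces."),
    (2, {"ontology"},
     "Software categories are hard to agree on."),
    (3, {"new software"},
     "BarOutliner shipped today with instant outlining."),
]


def build_keyword_index(posts):
    """Map each poster-assigned keyword to the ids of the posts that carry it."""
    index = defaultdict(list)
    for post_id, keywords, _body in posts:
        for keyword in keywords:
            index[keyword].append(post_id)
    return index


def full_text_search(posts, term):
    """Return ids of posts whose body contains the term (case-insensitive)."""
    term = term.lower()
    return [post_id for post_id, _keywords, body in posts
            if term in body.lower()]


index = build_keyword_index(POSTS)
by_keyword = sorted(index["new software"])      # posts marked as announcements
by_search = full_text_search(POSTS, "software")  # posts that mention the word
```

With this toy data the keyword lookup returns both announcements, while the search for "software" returns only the one post that is not an announcement, because neither announcement happens to use the word.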
========== (06)
And here is another important point in regard to (07)
"it's probably almost as much work to figure out an *appropriate* set of
keywords as it is to type an entire email message. Librarians are *experts*
at this. I'm not." (08)
THAT IS THE WHOLE POINT: IT'S A LOT OF WORK! (09)
It gets you to think! You are not just generating noise by sending
unrelated bits of information to the newsgroup; you have to think and
decide how what you are posting is relevant. You have to say, oh, I saw
something like this in the newsgroup before. That something is similar to
this new thing because both deal with XXX. I think that I'll create a
keyword called XXX so that the next person who comes along can put the
item in the same bucket. (010)
The work that's being done is in defining an ontology. If we can't put
keywords on our posts, then we don't know what we are talking about. You
can't expect meaning to arise from a semantic analysis, or from a librarian. (011)
This is the Bootstrap Institute here, right? The Open-Hyperdocument
System. We are talking about something that does not yet exist. We are
trying to create this system through our discussion of it, and by pointing
out technologies that seem to be leading up to it. A librarian would not
be helpful because a librarian can only put things in existing
categories. The categories that we are talking about might not have been
created yet. (012)
Think of instant outlining, or the Google API. Until recently there were no
such concepts as collaborative transcluding outlines or web-based search
engine interfaces. Maybe these concepts fit into existing categories, and
maybe they don't, but we can be certain that new stuff is going to come
along that will challenge any existing ontology. (013)
Once you have found an interesting bit of information, the work needed to
categorize it is not that great, but the short-term and especially the
long-term benefits of doing so are large. It's kind of like commenting
code. It seems unnecessary when it's fresh, but without comments you cannot
go back much later and resume your work, and it's not very usable by others
either. (014)
--Alex (015)