
Re: [ba-ohs-talk] Keyword Indexing


At 02:28 PM 4/29/02 +0100, you wrote:
Eric Armstrong wrote:

Murray Altheim wrote:

I suggest this:

   [KEYS: word1, word2, word3 ]
Makes sense to me. Especially if not case sensitive.


The big thing is being able to (case insensitively) grep on "[keys:]".
One can't simply use square brackets because they show up all the
time in both program code and prose, eg., [Humbert, 1999].
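A case-insensitive match on the tag could look roughly like the sketch below. The regular expression and the function name find_tagged_keywords are assumptions made for illustration; nothing in this thread fixes the exact syntax beyond "[KEYS: word1, word2, word3 ]".

```python
import re

# Assumed syntax: "[KEYS: a, b, c ]" with the tag matched case-insensitively.
# Plain bracketed text such as "[Humbert, 1999]" has no "keys:" tag and is skipped.
KEYS_RE = re.compile(r"\[\s*keys\s*:\s*([^\]]*)\]", re.IGNORECASE)

def find_tagged_keywords(text):
    """Return the keywords from every [KEYS: ...] block found in text."""
    found = []
    for match in KEYS_RE.finditer(text):
        # Split on commas and drop surrounding whitespace and empty entries.
        found.extend(k.strip() for k in match.group(1).split(",") if k.strip())
    return found
```

The keywords keep their original capitalisation; only the tag itself is matched without regard to case.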

OK, Murray, come on now.  I read the email spec, and I see that separating the headers from the email body may not be trivial enough to do with grep alone (you can grep for the first occurrence of a double newline, but then you have to look at the next line).  But why should we let that stop us?

So the parsing program is going to have to be a bit longer, but so what.  I think that the small bit of effort that it's going to take to write a longer parser is far outweighed by the collective nuisance of having to type "keys:" before every keyword section.

If the first line of a post starts with a "[" then I think that we can just assume that it's the start of a keyword section.  (A check for a closing bracket would make this even more certain.)  I do not recall any emails that have started with an open square bracket.  Maybe there will be a slight bit of noise from some unusual posts, but I think that we can manually delete any keywords generated in this way.
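As a rough sketch of that heuristic, Python's standard email module can do the header/body split, after which only the first non-blank body line needs checking. The function name leading_keyword_section is hypothetical, and requiring the closing bracket on the same line is an assumption; a keyword section that wraps onto a second line would be missed.

```python
import email

def leading_keyword_section(raw_message):
    """Return the text inside a leading [...] line of the body, or None."""
    msg = email.message_from_string(raw_message)
    if msg.is_multipart():
        return None  # this sketch only handles simple single-part posts
    body = msg.get_payload()
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines before the first real line
        # Assumption: the section opens and closes on the same line.
        if stripped.startswith("[") and stripped.endswith("]"):
            return stripped[1:-1].strip()
        return None  # the first real line is not a keyword section
    return None
```

Any false positives from unusual posts would still have to be deleted by hand, as suggested above.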



Alex Shapiro wrote:

*3* KWD FORMAT: We need to agree on some sort of word separation
standard for keywords.  The above thread has contained the following
formats: FooBar, Foo_Bar, Foo-Bar.
Given the lack of a selection interface, ease and speed of typing is the key, so
to speak. Underscore is the hardest to hit. Dash is easier. Capital letters are
easiest, due to long practice and familiarity.


You really can't use underscores in text that may become a hypertext
link, since it's impossible to ascertain the underscores (given that
most hypertext link display styles are already underlined). I don't
think phrases should be altered by adding hyphens or camelcasing them,
since that is actually creating new words and phrases. A comma is
pretty simple to type and is (in English) a _word_ or _phrase_delimiter_.

If you camelcased words, you could never differentiate between the
phrases and say, product names, eg., is "FooBar" a mnemonic for "foo bar",
"Foo Bar", or "FooBar"? I would strongly recommend we not encourage
people to corrupt their choice of key words or phrases, since the search
engines would have to disentangle them later (adding ambiguity and
confusion).

Ok, so comma separated keywords with spaces in the middle sounds good to me.  Both http://www.fury.com and http://www.designweenie.com/thoughts.php (scroll down) have plain text keys with spaces in the middle.  I like the way designweenie has ":" marks to indicate a hierarchy.  I guess that keywords with spaces separated by "." signs would look fine.  Ex:
[search engine.all the web, two words.two more]
Then again this is less readable than
[SearchEngine.AllTheWeb, TwoWords.TwoMore]
...
But if they are underlined!
[search engine.all the web, two words.two more]

Perfect, the keywords will only look weird when initially written, but later when they are turned into hyperlinks, readability will be restored.
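To make the proposed format concrete, here is a minimal sketch of how a section such as [search engine.all the web, two words.two more] might be split: commas separate keywords, "." separates hierarchy levels, and spaces inside a level are kept. The function name parse_keyword_section is made up for this example, and it assumes that "." never appears inside a keyword itself.

```python
def parse_keyword_section(section):
    """Split a bracketed keyword section into a list of hierarchy paths."""
    text = section.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    paths = []
    for entry in text.split(","):
        # Each "." starts a new hierarchy level; collapse runs of whitespace.
        levels = [" ".join(level.split()) for level in entry.split(".")]
        levels = [level for level in levels if level]
        if levels:
            paths.append(levels)
    return paths
```

For the example section this gives two paths, ["search engine", "all the web"] and ["two words", "two more"], each of which could later be rendered as a hyperlink.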



*4.2* FINE GRAINED KEYWORDS:  Besides basic keywords there can also be
fine grained keywords, such as IBIS, Google, Graphs, etc.  My suggestion
is that instead of wasting time arguing about these, we allow any user
to use any keyword.  New keywords will automatically be added to a database.
Yes.
Sounds good to me, too.

The idea is that we should eventually settle on some common keywords by
convention. ...
Look at the SUO list to see that this is a pointless battle. You shouldn't
try to legislate this at all. I think we should have a topic map engine
and some manual labour to provide synonym matching periodically.
Perhaps [KEYS: x=y] could create a synonym? Of course, that only
works if a word isn't likely to have multiple meanings, depending on
context. We're going to need to avoid those anyway, though.


I think this would be done in the map, not the document. Synonyms
should be handled there, and if unwanted merges happen between topics,
one can alter the map to say "these two things are not the same."

Murray

Perfect.  The map is exactly where such manipulation should be done.
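One way to picture handling synonyms in the map rather than in the document is the toy sketch below. It is not a topic map engine; the KeywordMap class and its methods are invented for illustration, and it only undoes a direct merge when two keywords are later declared distinct.

```python
class KeywordMap:
    """Toy synonym table kept outside the posts themselves."""

    def __init__(self):
        self._canonical = {}    # alias -> keyword it was merged into
        self._distinct = set()  # pairs declared "not the same"

    @staticmethod
    def _norm(keyword):
        # Compare case-insensitively and ignore extra whitespace.
        return " ".join(keyword.lower().split())

    def resolve(self, keyword):
        """Follow merges until a keyword with no further alias is reached."""
        key = self._norm(keyword)
        while key in self._canonical:
            key = self._canonical[key]
        return key

    def add_synonym(self, alias, canonical):
        """Merge alias into canonical unless the pair was declared distinct."""
        alias, target = self._norm(alias), self.resolve(canonical)
        if alias == target or frozenset((alias, target)) in self._distinct:
            return False
        self._canonical[alias] = target
        return True

    def declare_distinct(self, a, b):
        """Record that a and b are not the same; undo a direct merge if any."""
        a, b = self._norm(a), self._norm(b)
        self._distinct.add(frozenset((a, b)))
        if self._canonical.get(a) == b:
            del self._canonical[a]
        if self._canonical.get(b) == a:
            del self._canonical[b]
```

Posts keep whatever keywords their authors typed; only the map changes when a synonym is added or an unwanted merge is reversed.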

--Alex


......................................................................
Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK

     In the evening
     The rice leaves in the garden
     Rustle in the autumn wind
     That blows through my reed hut.  -- Minamoto no Tsunenobu