
Re: [ba-ohs-talk] Keyword Indexing


blincoln wrote:
 >Murray Altheim wrote:    (01)

>>since that is actually creating new words and phrases. A comma is    (02)

>>pretty simple to type and is (in English) a _word_ or _phrase_delimiter_
> 
> A friend and I are working on a java-based keyword indexer at the moment and are
> confronting a problem which it seems like must have been solved 10,000
> times already.  It is a necessary requirement that the indexer be capable of
> indexing highly technical words.  Our area involves a lot of chemical notation
> which includes commas _not_ as delimiters but as word-chars.  Like:
> 
> N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.  
> 
> They can get more complicated.  We have not yet found any adequate solution
> but if anyone knows of any keyword parsing / tokenizing rule sets that have
> gone through some iterations in development, let me know.  It seems like 
> the only real chance for tokenizing technical language properly must be
> some sort of dictionary lookup?  What a pain.    (03)


A slight modification to my original proposal for this problem [nid09]
could solve this:    (04)

   [urn:ohs:keys:chem:
     N,N-dimethyltryptamine ;
     3,4-methylpropylamine]    (05)

The idea is that rather than something as simple as "keys:" we would
use a URN to identify the semantics and parsing rules of a specific
scheme. The token delimiter for "urn:ohs:keys:" could be the comma; for
"urn:ohs:keys:chem:" it could be something different, say, the semicolon
(or whatever suits), with each token whitespace-trimmed at both ends.    (06)
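
To make the scheme-dependent parsing concrete, here is a rough sketch in
Python. The delimiter table, the name parse_keys, and the choice of
semicolon for the chem scheme are assumptions for illustration only, not
an agreed OHS syntax.

```python
# Hypothetical table: scheme URN prefix -> token delimiter for that scheme.
DELIMITERS = {
    "urn:ohs:keys:": ",",
    "urn:ohs:keys:chem:": ";",
}


def parse_keys(block):
    """Split a bracketed keyword block into (scheme, tokens)."""
    text = block.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a [scheme: ...] block")
    body = text[1:-1].lstrip()
    # Try the longest prefix first so the chem scheme wins over the base one.
    for scheme in sorted(DELIMITERS, key=len, reverse=True):
        if body.startswith(scheme):
            raw = body[len(scheme):].split(DELIMITERS[scheme])
            # Trim whitespace at both ends and drop empty tokens.
            tokens = [t.strip() for t in raw if t.strip()]
            return scheme, tokens
    raise ValueError("unknown keyword scheme")
```

Because the chem scheme splits only on semicolons, the commas inside
N,N-dimethyltryptamine are left alone as ordinary word characters.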

IMO there's no way to solve this in general without creating a
hierarchical namespace identifier under which each of these schemes
operates. We could default "keys:" as a shortcut for "urn:ohs:keys:"
(or whatever base URN we choose), and in fact default various
non-English equivalents to that same URN. With a topic map behind the
engine it's not going to matter so long as the URNs all point
to the same subject. Absent topic maps, it's still only a table
lookup, not too difficult.    (07)
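
The table lookup might be as small as the following Python sketch. The
non-English labels and the name resolve_scheme are made up for
illustration, and the base URN is only a placeholder until one is chosen.

```python
BASE_URN = "urn:ohs:keys:"  # placeholder; the real base URN is undecided

# Hypothetical shortcut labels that all default to the same base URN.
ALIASES = {
    "keys:": BASE_URN,
    "mots-cles:": BASE_URN,   # illustrative French equivalent
    "schluessel:": BASE_URN,  # illustrative German equivalent
}


def resolve_scheme(label):
    """Map a shortcut label to its scheme URN; pass full URNs through."""
    stripped = label.strip()
    if stripped.lower().startswith("urn:"):
        # Already a URN: return it unchanged rather than guessing at case rules.
        return stripped
    try:
        return ALIASES[stripped.lower()]
    except KeyError:
        raise ValueError("unknown keyword scheme label: " + stripped)
```

Any two labels that resolve to the same URN are then treated as the same
scheme, which is all the indexer needs in the absence of a topic map.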

Murray    (08)

[nid09] http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00180.html#nid09
......................................................................
Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK    (09)

      In the evening
      The rice leaves in the garden
      Rustle in the autumn wind
      That blows through my reed hut.  -- Minamoto no Tsunenobu    (010)