
Re: [ba-ohs-talk] Keyword Indexing


----- Original Message -----
From: blincoln <blincoln@ssesco.com>
To: <ba-ohs-talk@bootstrap.org>
Sent: Tuesday, April 30, 2002 12:13 PM
Subject: Re: [ba-ohs-talk] Keyword Indexing

> >since that is actually creating new words and phrases. A comma is
> >pretty simple to type and is (in English) a _word_ or _phrase_delimiter_.
>
> A friend and I are working on a Java-based keyword indexer at the moment
> and are confronting a problem which it seems like must have been solved
> 10,000 times already.

Chemical indexing has been going on for a while, e.g.
www.garfield.library.upenn.edu/essays/V1p111y1962-73.pdf
For that matter, CML (Chemical Markup Language) *should*
have addressed the issue long ago. Based on a bit of googling,
the problem seems to be an embarrassment of riches -- too
many methods, like
www.iee.org.uk/publish/support/inspec/document/ChemNum/stncni.pdf
Maybe someone at ASIS (www.asis.org) could tell you if there's
now a standard syntax for making chem names retrievable?


> It is a necessary requirement that the indexer be capable of indexing
> highly technical words.  Our area involves a lot of chemical notation
> which includes commas _not_ as delimiters but as word-chars.  Like:
>
> N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
>
> They can get more complicated.  We have not yet found any adequate
> solution but if anyone knows of any keyword parsing / tokenizing rule
> sets that have gone through some iterations in development, let me know.
> It seems like the only real chance for tokenizing technical language
> properly must be some sort of dictionary lookup?  What a pain.
>
> The following characters can be part of a single chemical notation:
> , [ ] ( ) + - (minus and dash)
>
> So far, it looks like we will just lose a large portion of the chemical
> notations to the indexer.  Another idea I've been toying with is
> tokenizing twice and indexing both results.  So I would have a set of
> "always delimiters" which would break words for both sets (space is
> always a delimiter), and "normal delimiters" which would be things like
> the comma, bracket, parentheses.
>
> Create both lists of tokens for a given text block, creating a 'normal
> list' and a 'technical list' (which is not tokenized with the 'normal
> delimiters').  Remove from the 'technical list' any keys that do not
> contain any of the technical delimiters, and then I have a list of
> 'normal keywords' and a list of possible 'technical keywords'.
>
> Sounds horrifyingly slow for the project I'm working on (a keyword indexer
> for a spider), but it's the best I've come up with so far.
>
> Any thoughts or ideas?
>
> bcl
>
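
For what it's worth, the two-list idea quoted above need not cost two
full passes over the text. Below is a minimal Java sketch, not a tested
indexer: the class name TwoListTokenizer, its fields, and the three
delimiter strings are all assumptions made up for illustration, and the
trailing-punctuation trim is an extra step not in the original
description. It splits once on the "always delimiters", then derives
both lists from each chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of the "tokenize twice, index both" idea.
class TwoListTokenizer {
    // Assumed delimiter sets; tune these for the real corpus.
    static final String ALWAYS_DELIMS = " \t\r\n";  // always break tokens
    static final String NORMAL_DELIMS = ",[]()+-";  // may occur inside a chemical name
    static final String TRAILING_PUNCT = ",.;:";    // sentence punctuation to trim (assumption)

    final List<String> normal = new ArrayList<>();     // 'normal keywords'
    final List<String> technical = new ArrayList<>();  // possible 'technical keywords'

    TwoListTokenizer(String text) {
        StringTokenizer chunks = new StringTokenizer(text, ALWAYS_DELIMS);
        while (chunks.hasMoreTokens()) {
            String chunk = trimTrailing(chunks.nextToken());
            int before = normal.size();
            // Normal list: break the chunk on every normal delimiter.
            StringTokenizer parts = new StringTokenizer(chunk, NORMAL_DELIMS);
            while (parts.hasMoreTokens()) {
                normal.add(parts.nextToken());
            }
            // Technical list: keep the whole chunk only if it contains a
            // normal delimiter and produced at least one real token.
            if (normal.size() > before && containsAny(chunk, NORMAL_DELIMS)) {
                technical.add(chunk);
            }
        }
    }

    // Drop sentence punctuation stuck to the end of a chunk.
    static String trimTrailing(String s) {
        int end = s.length();
        while (end > 0 && TRAILING_PUNCT.indexOf(s.charAt(end - 1)) >= 0) {
            end--;
        }
        return s.substring(0, end);
    }

    static boolean containsAny(String s, String chars) {
        for (int i = 0; i < s.length(); i++) {
            if (chars.indexOf(s.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TwoListTokenizer t = new TwoListTokenizer(
            "N,N-dimethyltryptamine and 3,4-methylpropylamine, among others.");
        System.out.println(t.normal);
        System.out.println(t.technical);
    }
}
```

On that sample sentence the technical list holds just the two chemical
names, while the normal list holds their fragments plus the ordinary
words. Ordinary hyphenated words would also land in the technical list,
which is why it can only be a list of candidates; a dictionary lookup
could still be layered on top to prune it.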
--
________________________________
Nicholas Carroll
ncarroll@hastingsresearch.com
Travel: ncarroll1000@yahoo.com
http://www.hastingsresearch.com
________________________________
"The hardest single part of building a software system
is deciding precisely what to build." -- Frederick Brooks