Re: [ba-ohs-talk] Keyword Indexing
I'm quite familiar with this problem. For commas, it is sufficient to simply
parse into words based on whitespace and em-dashes first, and then check for
punctuation only at the ends of words. The much trickier thing is periods...
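
A minimal Java sketch of that approach, offered as an illustration rather
than a tested rule set. The class name, the method name, and the exact set
of edge punctuation are assumptions made for the example, and "--" is
treated as a plain-text stand-in for an em-dash. Periods are deliberately
left out of the stripped set, since they are the harder case.

```java
import java.util.ArrayList;
import java.util.List;

public class EdgePunctuationTokenizer {

    // Punctuation stripped only from the start or end of a word (an assumed set).
    // Brackets and periods are left alone because they can belong to the word.
    private static final String EDGE_PUNCTUATION = ",;:!?\"'";

    /** Splits on whitespace and em-dashes, then trims punctuation at word edges only. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // \u2014 is the em-dash character; "--" is its common plain-text form.
        for (String word : text.split("(?:\\s|\u2014|--)+")) {
            int start = 0;
            int end = word.length();
            while (start < end && EDGE_PUNCTUATION.indexOf(word.charAt(start)) >= 0) {
                start++;
            }
            while (end > start && EDGE_PUNCTUATION.indexOf(word.charAt(end - 1)) >= 0) {
                end--;
            }
            // Commas inside the word (as in N,N-...) are never examined.
            if (start < end) {
                tokens.add(word.substring(start, end));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Try N,N-dimethyltryptamine, or 3,4-methylpropylamine; both survive.";
        for (String token : tokenize(sample)) {
            System.out.println(token);
        }
    }
}
```

With this, the trailing comma after "N,N-dimethyltryptamine," is dropped
while the comma inside the name is kept, and "survive." keeps its period.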
--
Kevin Keck
keck@kecklabs.com
on 2002/04/30 12:13 PM, blincoln at blincoln@ssesco.com wrote:
>
>> since that is actually creating new words and phrases. A comma is
>> pretty simple to type and is (in English) a _word_ or _phrase_delimiter_.
>
> A friend and I are working on a Java-based keyword indexer at the moment and
> are confronting a problem which seems like it must have been solved 10,000
> times already. The indexer must be capable of indexing highly technical
> words. Our area involves a lot of chemical notation, which includes commas
> _not_ as delimiters but as word-chars. Like:
>
> N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
>
> They can get more complicated. We have not yet found any adequate solution
> but if anyone knows of any keyword parsing / tokenizing rule sets that have
> gone through some iterations in development, let me know. It seems like
> the only real chance for tokenizing technical language properly must be
> some sort of dictionary lookup? What a pain.
>
> The following characters can be part of a single chemical notation:
> , [ ] ( ) + - (minus and dash)
>
> So far, it looks like we will just lose a large portion of the chemical
> notations to the indexer. Another idea I've been toying with is tokenizing
> twice and indexing both results. I would have a set of "always delimiters",
> which would break words for both sets (space is always a delimiter), and
> "normal delimiters", which would be things like the comma, brackets, and
> parentheses.
>
> Create both lists of tokens for a given text block: a 'normal list' and a
> 'technical list' (which is not tokenized with the 'normal delimiters').
> Remove from the 'technical list' any keys that do not contain any of the
> technical delimiters, and then I have a list of 'normal keywords' and a list
> of possible 'technical keywords'.
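
A rough Java sketch of the two-list idea described above, under some
assumptions: whitespace is the only "always delimiter", the "normal
delimiters" are the characters , [ ] ( ) + - listed earlier, and trailing
commas, semicolons, and colons are trimmed from each chunk first. The names
TwoPassTokenizer and Result are hypothetical. Both lists are built in one
scan of each whitespace-separated chunk, so the cost may be closer to a
single tokenization than to two separate passes.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class TwoPassTokenizer {

    // Delimiters in ordinary prose, but word characters in chemical names.
    private static final String NORMAL_DELIMITERS = ",[]()+-";
    // Sentence punctuation trimmed from chunk edges (assumed; periods left alone).
    private static final String EDGE_PUNCTUATION = ",;:";

    /** Holds the 'normal keywords' and the possible 'technical keywords'. */
    public static final class Result {
        public final Set<String> normal = new LinkedHashSet<>();
        public final Set<String> technical = new LinkedHashSet<>();
    }

    public static Result tokenize(String text) {
        Result result = new Result();
        // Whitespace breaks words for both lists.
        for (String chunk : text.trim().split("\\s+")) {
            String candidate = trimEdges(chunk);
            if (candidate.isEmpty()) {
                continue;
            }
            boolean hasNormalDelimiter = false;
            StringBuilder word = new StringBuilder();
            // Break the same chunk on the normal delimiters for the normal list.
            for (int i = 0; i < candidate.length(); i++) {
                char c = candidate.charAt(i);
                if (NORMAL_DELIMITERS.indexOf(c) >= 0) {
                    hasNormalDelimiter = true;
                    flush(word, result.normal);
                } else {
                    word.append(c);
                }
            }
            flush(word, result.normal);
            // Keep the unbroken chunk only if it contained such a character.
            if (hasNormalDelimiter) {
                result.technical.add(candidate);
            }
        }
        return result;
    }

    private static void flush(StringBuilder word, Set<String> out) {
        if (word.length() > 0) {
            out.add(word.toString());
            word.setLength(0);
        }
    }

    private static String trimEdges(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && EDGE_PUNCTUATION.indexOf(s.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && EDGE_PUNCTUATION.indexOf(s.charAt(end - 1)) >= 0) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        Result r = tokenize("mix N,N-dimethyltryptamine, then add (water)");
        System.out.println("normal:    " + r.normal);
        System.out.println("technical: " + r.technical);
    }
}
```

Ordinary hyphenated words also land in the technical list, which fits the
"possible technical keywords" wording: the list holds candidates, not
confirmed chemical names.
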
>
> Sounds horrifyingly slow for the project I'm working on (a keyword indexer
> for a spider), but it's the best I've come up with so far.
>
> Any thoughts or ideas?
>
> bcl