Re: [ba-ohs-talk] Keyword Indexing
blincoln wrote: (01)
> A friend and I are working on a java-based keyword indexer at the moment and are
> confronting a problem which it seems like must have been solved 10,000
> times already. It is a necessary requirement that the indexer be capable of
> indexing highly technical words. Our area involves a lot of chemical notation
> which includes commas _not_ as delimiters but as word-chars. Like:
>
> N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
>
> They can get more complicated. We have not yet found any adequate solution
> but if anyone knows of any keyword parsing / tokenizing rule sets that have
> gone through some iterations in development, let me know. It seems like
> the only real chance for tokenizing technical language properly must be
> some sort of dictionary lookup? What a pain. (02)
I've played with the Java regular expression package for a while now, and
may have an interesting idea for you. (03)
The idea would be to define patterns of terms. In the example above,
I see the pattern: digit comma digit hyphen letter+ (04)
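A rough sketch of that one pattern using java.util.regex, just to make it concrete. The class name TermPattern and the helper findSpans are made up for illustration, and the regex covers only the digit,digit-letters shape; a term like N,N-dimethyltryptamine would need a broader leading character class, which is an extension I'm assuming rather than something worked out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TermPattern {
    // digit comma digit hyphen letter+ (hypothetical name, illustration only)
    static final Pattern LOCANT_TERM = Pattern.compile("\\d,\\d-[A-Za-z]+");

    // Returns {start, end} offsets (end exclusive) for every match in the text.
    static List<int[]> findSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = LOCANT_TERM.matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "We tested 3,4-methylpropylamine, then stopped.";
        for (int[] s : findSpans(text)) {
            // prints: 3,4-methylpropylamine
            System.out.println(text.substring(s[0], s[1]));
        }
    }
}
```

Note that the comma following the term in the sample sentence is left out of the match, since the pattern ends on letters; the comma inside the term is kept.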
The idea would be to start building up a collection of start/end locations for
terms in the file. It might have to be inspected in multiple passes, for different
kinds of terms. If a pattern was matched that started and ended inside an
existing term, it could be ignored. If it started xor ended inside an existing
term (that is, it overlapped one edge of it), the term boundaries could be
extended. (05)
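Here is one way the collection of start/end locations might look, as a sketch only. TermSpans, add, and addMatches are names I'm inventing, and the three regexes in main are arbitrary stand-ins for "different kinds of terms", not a proposed rule set. A linear scan per insert is used for simplicity, which is exactly the sort of thing that would hurt the performance profile on large files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TermSpans {
    // Each span is {start, end}, end exclusive; kept sorted and non-overlapping.
    private final List<int[]> spans = new ArrayList<>();

    void add(int start, int end) {
        // A match lying wholly inside an existing term is ignored.
        for (int[] s : spans) {
            if (start >= s[0] && end <= s[1]) {
                return;
            }
        }
        // Otherwise, absorb every term it overlaps and widen the boundaries.
        List<int[]> kept = new ArrayList<>();
        for (int[] s : spans) {
            if (end <= s[0] || start >= s[1]) {
                kept.add(s); // no overlap, keep as-is
            } else {
                start = Math.min(start, s[0]);
                end = Math.max(end, s[1]);
            }
        }
        kept.add(new int[] { start, end });
        kept.sort((a, b) -> Integer.compare(a[0], b[0]));
        spans.clear();
        spans.addAll(kept);
    }

    List<int[]> list() {
        return spans;
    }

    // One pass over the text for one kind of term.
    static void addMatches(TermSpans spans, String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            spans.add(m.start(), m.end());
        }
    }

    public static void main(String[] args) {
        String text = "mix 3,4-methylpropylamine now";
        TermSpans spans = new TermSpans();
        addMatches(spans, "\\d,\\d", text);        // finds "3,4"
        addMatches(spans, "\\d-[A-Za-z]+", text);  // overlaps "3,4", so the term is extended
        addMatches(spans, "[a-z]{10,}", text);     // inside the extended term, so ignored
        for (int[] s : spans.list()) {
            // prints: 3,4-methylpropylamine
            System.out.println(text.substring(s[0], s[1]));
        }
    }
}
```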
Meta term patterns could be applied. For example, the pattern
term hyphen term (06)
would be applied by selecting adjacent terms in the collection. If they were
separated by exactly one character, and that character was a hyphen, the whole
thing would be a term, so the two items in the collection would be combined. (07)
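The "term hyphen term" meta pattern could be sketched like this. It assumes the collection is already a sorted, non-overlapping list of {start, end} offsets (end exclusive); MetaPattern and joinHyphenated are hypothetical names, and the offsets in main are typed in by hand rather than produced by an earlier pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MetaPattern {
    // Combines adjacent terms that are separated by exactly one '-' character.
    // Input spans must be sorted by start and must not overlap.
    static List<int[]> joinHyphenated(String text, List<int[]> spans) {
        List<int[]> out = new ArrayList<>();
        for (int[] s : spans) {
            int[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            boolean oneCharGap = last != null && s[0] - last[1] == 1;
            if (oneCharGap && text.charAt(last[1]) == '-') {
                last[1] = s[1]; // extend the previous term over the hyphen and this term
            } else {
                out.add(new int[] { s[0], s[1] }); // copy, so the input list is untouched
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "N,N-dimethyltryptamine";
        // Assume earlier passes found "N,N" and "dimethyltryptamine" separately.
        List<int[]> spans = Arrays.asList(new int[] { 0, 3 }, new int[] { 4, 22 });
        for (int[] s : joinHyphenated(text, spans)) {
            // prints: N,N-dimethyltryptamine
            System.out.println(text.substring(s[0], s[1]));
        }
    }
}
```

Because each merged term becomes the "previous" term for the next comparison, a chain such as term-term-term collapses into one term in a single pass.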
At that point, all of the terms have been identified, and normal parsing rules
could be applied to the remainder of the text. (08)
That may or may not help, in your case. It's the way I would think about
the problem anyway. (I suspect it would work, but I'd be hard pressed to
defend the performance profile.) (09)
Anyway, it sounds like a heckuva interesting challenge! (010)