Re: [ba-ohs-talk] Keyword Indexing
Well, I've seen quite a few papers that say there aren't any easy
answers to indexing punctuation in English. (01)
In playing with Perl against the bootstrap archives, I came up with the
following partial step in the right direction. (02)
@wurddat = split(/\s+/, $sntnc);  # split a sentence (by whitespace) into an array (03)
foreach $teststr (@wurddat) {     # for each string in the wurddat array... (04)
    @chardat = split(/\B/, $teststr);  # ...split on a non-word boundary (05)
    foreach $char (@chardat) {    # test printout for debugging
        print "$char~~";
    }
    print "\n";
} (06)
I'm not really sure why, but it yields a split of the word that looks
like:
N,N-d~~i~~m~~e~~t~~h~~y~~l~~t~~r~~y~~p~~t~~a~~m~~i~~n~~e~~
where '~~' is just a marker I print so that I can see where the splits
are. Characters immediately adjacent to punctuation don't get split
apart, so 'N,N-d' stays in one piece. (07)
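A likely explanation for the output above: in Perl, \B matches at every
position where \b does not, that is, between two word characters or
between two non-word characters. A letter or digit next to a comma or
hyphen forms a word boundary, so \B cannot match there and split leaves
those characters joined. The following is only a minimal sketch to
reproduce the effect; the helper name split_on_non_boundary is made up
for illustration and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): split a single
# token at every non-word boundary, as split(/\B/, ...) does above.
sub split_on_non_boundary {
    my ($word) = @_;
    # \B matches between \w\w or between \W\W, never between \w and \W,
    # so punctuation stays attached to its neighbouring characters.
    return split /\B/, $word;
}

my @pieces = split_on_non_boundary('N,N-dimethyltryptamine');

# Same separator as the debug printout in the message.
print join('~~', @pieces), "\n";
```

The first piece is 'N,N-d' because every position inside it lies
between a word character and a punctuation character; from 'd' onward
each position lies between two letters, so each letter is split off on
its own.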
I'm now working on an algorithm that picks up that data to create the
words for the index. (08)
HTH,
--
Peter (09)
----- Original Message -----
From: "blincoln" <blincoln@ssesco.com>
To: <ba-ohs-talk@bootstrap.org>
Sent: Tuesday, April 30, 2002 8:13 PM
Subject: Re: [ba-ohs-talk] Keyword Indexing (010)
>
> >since that is actually creating new words and phrases. A comma is
> >pretty simple to type and is (in English) a _word_ or
> >_phrase_delimiter_.
>
> A friend and I are working on a Java-based keyword indexer at the
> moment and are confronting a problem which seems like it must have
> been solved 10,000 times already. It is a requirement that the
> indexer be capable of indexing highly technical words. Our area
> involves a lot of chemical notation which includes commas _not_ as
> delimiters but as word-chars. Like:
>
> N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
>
> They can get more complicated. We have not yet found any adequate
> solution, but if anyone knows of any keyword parsing / tokenizing
> rule sets that have gone through some iterations in development, let
> me know. It seems like the only real chance for tokenizing technical
> language properly must be some sort of dictionary lookup? What a pain.
>
> The following characters can be part of a single chemical notation:
> , [ ] ( ) + - (minus and dash)
>
> So far, it looks like we will just lose a large portion of the
> chemical notations to the indexer. Another idea I've been toying with
> is tokenizing twice and indexing both results. So I would have a set
> of "always delimiters" which would break words for both sets (space
> is always a delimiter), and "normal delimiters" which would be things
> like the comma, bracket, parentheses.
>
> Create both lists of tokens for a given text block, creating a
> 'normal list' and a 'technical list' (which is not tokenized with the
> 'normal delimiters'). Remove from the 'technical list' any keys that
> do not contain any of the technical delimiters, and then I have a
> list of 'normal keywords' and a list of possible 'technical keywords'.
>
> Sounds horrifyingly slow for the project I'm working on (a keyword
> indexer for a spider), but it's the best I've come up with so far.
>
> Any thoughts or ideas?
>
> bcl
>
>
>
> (011)
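
The two-pass idea in the quoted message (tokenize once with every
delimiter, once with whitespace only, then keep only the second-pass
tokens that still contain a delimiter character) could be sketched
roughly as follows. This is an illustration under assumptions, not the
poster's implementation: the names tokenize_twice, $ALWAYS_DELIMS and
$NORMAL_DELIMS are hypothetical, the exact delimiter sets are guesses
based on the examples given (hyphen and plus are left out of the
normal delimiters), and trimming sentence punctuation from the ends of
a token is an extra step the message does not mention.

```perl
use strict;
use warnings;

# Assumed delimiter sets; the message only fixes whitespace as an
# "always delimiter" and gives comma, bracket and parentheses as
# examples of "normal delimiters".
my $ALWAYS_DELIMS = qr/\s+/;
my $NORMAL_DELIMS = qr/[\s,\[\]()]+/;

# Hypothetical helper: returns (\@normal, \@technical).
sub tokenize_twice {
    my ($text) = @_;

    # Pass 1: break on every delimiter to get ordinary keywords.
    my @normal = grep { length } split /$NORMAL_DELIMS/, $text;

    # Pass 2: break on whitespace only, trim sentence punctuation from
    # both ends (an added assumption), and keep only tokens that still
    # contain a normal-delimiter character somewhere inside.
    my @technical;
    for my $tok (split /$ALWAYS_DELIMS/, $text) {
        $tok =~ s/^[,.;:]+//;
        $tok =~ s/[,.;:]+$//;
        push @technical, $tok if $tok =~ /[,\[\]()]/;
    }
    return (\@normal, \@technical);
}

my ($normal, $technical) = tokenize_twice(
    'Compare N,N-dimethyltryptamine, caffeine and 3,4-methylpropylamine'
);
print "normal:    @$normal\n";
print "technical: @$technical\n";
```

Both passes are a single linear scan of the text, so the cost is
roughly twice that of one tokenization; whether that is acceptable for
a spider would need measuring on real pages.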