Re: [ba-ohs-talk] Keyword Indexing
Actually, I've just figured out what \B is doing. (01)
I should point out that it also splits things like &nbsp; into (02)
&n~~b~~s~~p;~~ (03)
But in those cases you should be able to split off the unwanted
punctuation. (04)
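
For example, here is a rough and untested sketch of what I mean: rejoin the
\B fragments and trim the non-word characters off the two ends.

@chardat = split(/\B/, '&nbsp;');   # gives ('&n', 'b', 's', 'p;')
$word = join('', @chardat);         # back to '&nbsp;'
$word =~ s/^\W+//;                  # drop the leading '&'
$word =~ s/\W+$//;                  # drop the trailing ';'
print "$word\n";                    # prints 'nbsp'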
--
Peter (05)
----- Original Message -----
From: "Peter Jones" <ppj@concept67.fsnet.co.uk>
To: <ba-ohs-talk@bootstrap.org>
Sent: Tuesday, April 30, 2002 10:00 PM
Subject: Re: [ba-ohs-talk] Keyword Indexing (06)
> Well, I've seen quite a few papers that say that there aren't any easy
> answers to indexing punctuation in English.
>
> In playing with perl against the bootstrap archives I came up with the
> following partial step in the right direction.
>
> @wurddat = split(/\s+/, $sntnc);        # split (by whitespace) a sentence
>                                         # into an array
>
> foreach $teststr (@wurddat) {           # for each string in the @wurddat array...
>
>     @chardat = split(/\B/, $teststr);   # ...split on a non-word boundary
>
>     foreach $char (@chardat) {          # test printout for dbug
>         print "$char~~";
>     }
>     print "\n";
> }
>
> I have no idea why really, but it yields a split of the word that looks
> like:
> N,N-d~~i~~m~~e~~t~~h~~y~~l~~t~~r~~y~~p~~t~~a~~m~~i~~n~~e~~
> where '~~' is just printout data so that I can see where the splits are.
> The characters that have punctuation immediately between them don't get
> split apart.
>
> I'm now working on an algorithm that picks up that data to create the
> words for the index.
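>
> Something along these lines is what I'm thinking (rough and untested;
> @indexwords is just a made-up name). The \B split shows where punctuation
> is glued to letters, but for building the index words it may be enough to
> trim non-word characters off the two ends of each whitespace token and
> leave the inside alone:
>
> foreach $teststr (@wurddat) {
>     $word = $teststr;
>     $word =~ s/^\W+//;                 # strip leading punctuation
>     $word =~ s/\W+$//;                 # strip trailing punctuation
>     push(@indexwords, $word) if length($word);
>     # internal punctuation survives, e.g. 'N,N-dimethyltryptamine'
> }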
>
> HTH,
> --
> Peter
>
>
>
> ----- Original Message -----
> From: "blincoln" <blincoln@ssesco.com>
> To: <ba-ohs-talk@bootstrap.org>
> Sent: Tuesday, April 30, 2002 8:13 PM
> Subject: Re: [ba-ohs-talk] Keyword Indexing
>
>
> >
> > >since that is actually creating new words and phrases. A comma is
> > >pretty simple to type and is (in English) a _word_ or
> > >_phrase_delimiter_.
> >
> > A friend and I are working on a java-based keyword indexer at the
> > moment, and we are confronting a problem which, it seems, must have been
> > solved 10,000 times already. It is a necessary requirement that the
> > indexer be capable of indexing highly technical words. Our area involves
> > a lot of chemical notation which includes commas _not_ as delimiters but
> > as word-chars. Like:
> >
> > N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
> >
> > They can get more complicated. We have not yet found any adequate
> > solution, but if anyone knows of any keyword parsing / tokenizing rule
> > sets that have gone through some iterations in development, let me know.
> > It seems like the only real chance for tokenizing technical language
> > properly must be some sort of dictionary lookup? What a pain.
> >
> > The following characters can be part of a single chemical notation:
> > , [ ] ( ) + - (minus and dash)
> >
> > So far, it looks like we will just lose a large portion of the chemical
> > notations to the indexer. Another idea I've been toying with is
> > tokenizing twice and indexing both results. So I would have a set of
> > "always delimiters" which would break words for both sets (space is
> > always a delimiter), and "normal delimiters" which would be things like
> > the comma, bracket, and parentheses.
> >
> > Create both lists of tokens for a given text block, producing a 'normal
> > list' and a 'technical list' (which is not tokenized with the 'normal
> > delimiters'). Remove from the 'technical list' any keys that do not
> > contain any of the technical delimiters, and then I have a list of
> > 'normal keywords' and a list of possible 'technical keywords'.
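> >
> > Sketched in perl for brevity (untested; the delimiter sets and the
> > variable names are just made up), the two passes might look roughly
> > like:
> >
> > # pass 1: split on whitespace only (the "always delimiters")
> > @always_toks = split(/\s+/, $text);
> >
> > # pass 2: split on the "normal delimiters" as well
> > @normal_toks = grep { length } map { split(/[\s,\[\]()]+/) } @always_toks;
> >
> > # technical list: whitespace tokens that still contain one of the
> > # technical characters , [ ] ( ) + -
> > @technical = grep { /[,\[\]()+\-]/ } @always_toks;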
> >
> > Sounds horrifyingly slow for the project I'm working on (a keyword
> > indexer for a spider), but it's the best I've come up with so far...
> >
> > any thoughts or ideas?
> >
> > bcl
> >
> >
> >
> >
>
> (07)