
Re: [ba-ohs-talk] Keyword Indexing


Actually, I've just figured out what \B is doing.    (01)

I should point out that it also splits entities like &nbsp; into    (02)

&n~~b~~s~~p;~~    (03)

But in those cases you should be able to split off the unwanted
punctuation.    (04)
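
For what it's worth, the same non-word-boundary split can be reproduced in
Python's re module (an illustrative sketch, not the original Perl; note that
Python's re.split, unlike Perl's split, keeps the empty fields at the string
edges, and zero-width splitting needs Python 3.7 or later):

```python
import re

# Split on non-word boundaries (\B), as the Perl split(/\B/, ...) does.
# Letters glued together by punctuation stay in one piece.
parts = re.split(r'\B', 'N,N-dimethyltryptamine')
print('~~'.join(parts))   # N,N-d stays together, the rest splits per letter

# An HTML entity such as &nbsp; gets chopped up the same way...
entity = re.split(r'\B', '&nbsp;')
print(entity)             # ['', '&n', 'b', 's', 'p;', '']

# ...but the unwanted punctuation can be stripped off the edges first:
word = '&nbsp;'.strip('&;')
print(word)               # nbsp
```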

--
Peter    (05)

----- Original Message -----
From: "Peter Jones" <ppj@concept67.fsnet.co.uk>
To: <ba-ohs-talk@bootstrap.org>
Sent: Tuesday, April 30, 2002 10:00 PM
Subject: Re: [ba-ohs-talk] Keyword Indexing    (06)


> Well, I've seen quite a few papers that say that there aren't any easy
> answers to indexing punctuation in English.
>
> Playing with perl against the bootstrap archives, I came up with the
> following partial step in the right direction.
>
> @wurddat = split( /\s+/, $sntnc );    # split (by whitespace) a sentence into an array
>
> foreach $teststr (@wurddat) {         # for each string in the wurddat array...
>
>     @chardat = split( /\B/, $teststr );    # ...split on a non-word boundary
>
>     foreach $char (@chardat) {        # test printout for dbug
>         print "$char~~";
>     }
>     print "\n";
> }
>
> I have no idea why really, but it yields a split of the word that looks like:
> N,N-d~~i~~m~~e~~t~~h~~y~~l~~t~~r~~y~~p~~t~~a~~m~~i~~n~~e~~
> where '~~' is just printout data so that I can see where the splits are.
> The characters that have punctuation immediately between them don't get
> split apart.
>
> I'm now working on an algorithm that picks up that data to create the
> words for the index.
>
> HTH,
> --
> Peter
>
>
>
> ----- Original Message -----
> From: "blincoln" <blincoln@ssesco.com>
> To: <ba-ohs-talk@bootstrap.org>
> Sent: Tuesday, April 30, 2002 8:13 PM
> Subject: Re: [ba-ohs-talk] Keyword Indexing
>
>
> >
> > >since that is actually creating new words and phrases. A comma is
> > >pretty simple to type and is (in English) a _word_ or
> _phrase_delimiter_.
> >
> > A friend and I are working on a java-based keyword indexer at the
> > moment and are confronting a problem which it seems must have been
> > solved 10,000 times already.  It is a necessary requirement that the
> > indexer be capable of indexing highly technical words.  Our area
> > involves a lot of chemical notation which includes commas _not_ as
> > delimiters but as word-chars.  Like:
> >
> > N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
> >
> > They can get more complicated.  We have not yet found any adequate
> > solution, but if anyone knows of any keyword parsing / tokenizing
> > rule sets that have gone through some iterations in development, let
> > me know.  It seems like the only real chance for tokenizing technical
> > language properly must be some sort of dictionary lookup?  What a pain.
> >
> > The following characters can be part of a single chemical notation:
> > , [ ] ( ) + -  (minus and dash)
> >
> > So far, it looks like we will just lose a large portion of the
> > chemical notations to the indexer.  Another idea I've been toying
> > with is tokenizing twice and indexing both results.  So I would have
> > a set of "always delimiters" which would break words for both sets
> > (space is always a delimiter), and "normal delimiters" which would be
> > things like the comma, bracket, parentheses.
> >
> > Create both lists of tokens for a given text block, creating a
> > 'normal list' and a 'technical list' (which is not tokenized with the
> > 'normal delimiters').  Remove from the 'technical list' any keys that
> > do not contain any of the technical delimiters, and then I have a
> > list of 'normal keywords' and a list of possible 'technical keywords'.
> >
> > Sounds horrifyingly slow for the project I'm working on (a keyword
> > indexer for a spider), but it's the best I've come up with so far.
> >
> > any thoughts or ideas?
> >
> > bcl
> >
> >
> >
> >
>
>    (07)