Re: Great? idea for improving this list (was Re: [ba-ohs-talk] Freezope learning environments)
It's not that different from free text indexing except that the
data connecting words to paragraph ids will be available. (01)
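To be concrete (the names below are invented for illustration), the
word-to-paragraph data would be shaped something like:

    # hypothetical shape of the lexical-locator data: each word maps
    # to the message pages and paragraph nids where it turns up
    my %locators = (
        'ontology' => [ 'msg00042.html#01', 'msg00057.html#04' ],
        'archive'  => [ 'msg00042.html#02' ],
    );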
In case it's of any interest, after a couple of hours' work I now
have a Perl script that will fetch the archives and pull out
all the new (non-reply) text from a message together with
the relevant paragraph nid information for each paragraph.
It will do this for all the messages currently in the web archive. (02)
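For the curious, the guts of it look something like the sketch below.
This is not the real script: the URL is a stand-in, and the patterns
assume the archive index links messages as msgNNNN.html and that each
paragraph ends in a purple number like "(01)" -- tune both to whatever
your mhonarc installation actually emits.

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple qw(get);

    # stand-in base URL; point this at the real archive
    my $base = 'http://www.example.org/lists/ba-ohs-talk/';

    my $list = get($base . 'maillist.html') or die "can't fetch the index\n";
    my @messages = $list =~ /href="(msg\d+\.html)"/gi;

    for my $msg (@messages) {
        my $page = get($base . $msg) or next;
        $page =~ s/<[^>]+>//gs;                    # crude de-HTML-ing
        for my $para (split /\n\s*\n/, $page) {
            next unless $para =~ /\((\d+)\)\s*$/;  # keep purple-numbered paras
            my $nid = $1;
            next if $para =~ /^\s*>/m;             # drop quoted (reply) text
            $para =~ s/\s*\(\d+\)\s*$//;           # strip the purple number
            print "$msg\t$nid\t$para\n";           # message, nid, new text
        }
    }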
All I have to do now is:
a) Grab a suitable list of stopwords off the net to feed the
hashing exclusion.
b) Knock together a few hash data structures to build the index
data (a sketch follows below).
c) Build an output routine to throw this into some neat HTML pages.
And bingo! we'll all have keyword access to the archive, and we'll
all be able to make whatever lovely graphs we feel like making out
of the lexical-locator data. (03)
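For anyone who wants to play along at home, the index build amounts to
something like this. It reads the tab-separated output of the extraction
pass above on stdin; the stopword list here is a stand-in, and the
paragraph anchors in the links assume the mhonarc pages name them that
way, which you'd need to check against your own archive.

    #!/usr/bin/perl -w
    use strict;

    # stand-in stopword list; swap in a real one from the net
    my %stop = map { $_ => 1 } qw(the a an and or of to in is it that);

    my %index;    # word => { "msg#nid" => count }
    while (<STDIN>) {
        chomp;
        my ($msg, $nid, $text) = split /\t/, $_, 3;
        for my $word (map { lc } $text =~ /([A-Za-z']+)/g) {
            next if $stop{$word};                  # hashing exclusion
            $index{$word}{"$msg#$nid"}++;
        }
    }

    # throw the lot into one crude HTML page
    print "<html><body>\n";
    for my $word (sort keys %index) {
        my @links = map { qq(<a href="$_">$_</a>) }
                    sort keys %{ $index{$word} };
        print "<p><b>$word</b>: @links</p>\n";
    }
    print "</body></html>\n";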
Maybe those bits are difficult. Maybe they aren't.
But I'm not quitting yet. (04)
Oh yeah, the end result should generalise to any MHonArc mail output. (05)
Enough talking. I'm busy. (06)
--
Peter (07)
----- Original Message -----
From: <cdent@burningchrome.com>
To: <ba-ohs-talk@bootstrap.org>
Sent: Friday, April 26, 2002 11:29 PM
Subject: Re: Great? idea for improving this list (was Re: [ba-ohs-talk]
Freezope learning environments) (08)
>
> [archive_access.practical]
>
> On Fri, 26 Apr 2002, Peter Jones wrote:
>
> > What I was suggesting was a system that:
> >
> > a) Reads an email and sucks out each word in turn.
> > b) Each new word has a database record created, with the
> > locations where the term occurs stored in another, related table.
> > Leaving aside the issue of polysemy for a moment, the
> > record structure would be something like
> > PK_ID, word_string <--relation--> FK_ID, location(s).
> > c) To improve the scanning process, have a subroutine that
> > discards the stop-words chosen, and clean the database of
> > these.
> > d) Repeat for each mail.
> > e) If a word is re-encountered then only the new location for
> > the word is inserted in the database in the appropriate new tuple.
>
> In what ways are you imagining this being different from a free
> text index of the mail archive that gets reindexed every time a
> new message comes in?
>
> > What you then get is an index for every mail in the archive that
> > contains all the interesting words in all the mails in the archive
> > and the locations in the mails of all those words.
>
> Is it that the list of words indexed is more limited?
>
> > Sophistication could be added in the read-in phase.
> > For example, polysemy might be attacked by some algorithm that
> > makes guesses about the word type based on a grammar.
> > Locations might be narrowed to paragraphs by chunking them
> > beforehand.
> > And so on.
>
> You make this sound easy. After watching the list for a while it
> is clear that we don't have the collective time for this measure
> of complexity. Are we talking about implementing something to
> use now and experiment and develop, or are we talking about an
> ideal eventual system that would work in a variety of capacities?
>
> We can talk the theory (I'd love to) but that stuff has been
> beaten to death here and elsewhere. How do we distinguish between
> the speculative talk and the plans for action?
>
> --
> Chris Dent <cdent@burningchrome.com> http://www.burningchrome.com/~cdent/
> "Mediocrities everywhere--now and to come--I absolve you all! Amen!
> -Salieri, in Peter Shaffer's Amadeus
>
> (09)