
RE: [ba-ohs-talk] Organic Structure to Organize and Retrieve the Record


Jack,    (01)

Perl has direct support (tied hashes) for disk-based hashtables that appear
to the program as in-memory structures.
Berkeley DB (http://www.sleepycat.com) is an open source database system
providing both hash and b-tree structures that can interface to nearly any
language.    (02)
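
For readers who want to try the idea without Perl, here is a minimal sketch
of a disk-based hashtable that looks like an in-memory structure. It uses
Python's standard shelve module rather than Perl or Berkeley DB, so it is an
analogue of the approach described above, not the same API. The function name
roundtrip and the sample keys are hypothetical.

```python
import os
import shelve
import tempfile


def roundtrip(entries):
    """Write entries to a disk-backed table, reopen it, and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "word_index")  # hypothetical file name

        # shelve.open returns a dict-like object stored in a dbm file on disk.
        with shelve.open(path) as table:
            for key, value in entries.items():
                table[key] = value

        # Reopening shows the data was persisted, not just held in memory.
        with shelve.open(path) as table:
            return dict(table)


if __name__ == "__main__":
    print(roundtrip({"hashtable": ["msg42#02"], "index": ["msg42#06"]}))
```

The program only ever uses ordinary dictionary syntax; where the data is
stored is decided by the call that opens the table.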

Thanks,    (03)

Garold (Gary) L. Johnson    (04)

-----Original Message-----
From: owner-ba-ohs-talk@bootstrap.org
[mailto:owner-ba-ohs-talk@bootstrap.org]On Behalf Of Jack Park
Sent: Wednesday, June 05, 2002 7:00 PM
To: ba-ohs-talk@bootstrap.org
Subject: Re: [ba-ohs-talk] Organic Structure to Organize and Retrieve the
Record    (05)

Hashtables?
Take a look at http://dbh.sourceforge.net/
Disk-based hashtables
"A DBH is a convenient way to associate keys composed by characters to data
records. Any kind of digital information can go into the data record, such
as text, graphic information, database structures, you name it. The idea
behind using a DBH is to get rid of what is known as an index file in the
database world. In the DBH world, the index is built into the file format."
"DBH extends the concept of binary trees into a n-tuple dimensional space.
This way the trees it creates are much more like the trees we actually see
in nature, and nature is a wise thing. "    (06)

Jack    (07)

At 06:11 PM 6/5/2002 -0700, you wrote:
>Gary,
>
>This is getting us into a useful subject.
>
>Rod
>
>*************
>
>"Garold (Gary) L. Johnson" wrote:
> >
> > FWIW,
> >
> > By taking sequences of non-noise words to some number, you begin to
> > build a phrase dictionary as well as a word dictionary, and that can
> > prove more useful.
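
One possible reading of the phrase-dictionary idea, sketched in Python. It
assumes a "phrase" is a run of consecutive non-noise words, cut wherever a
noise word appears, and counted up to a maximum length. The stop-word list,
the tokenizing regex, and the name phrase_counts are all hypothetical, and
punctuation is ignored as a phrase boundary.

```python
import re
from collections import Counter

# Hypothetical noise-word list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "is", "and", "in", "it"}


def phrase_counts(text, max_len=3):
    """Count every phrase of 1..max_len consecutive non-noise words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    run = []  # current run of consecutive non-noise words

    # The trailing None is a sentinel that flushes the final run.
    for word in words + [None]:
        if word is None or word in STOPWORDS:
            for n in range(1, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[" ".join(run[i:i + n])] += 1
            run = []
        else:
            run.append(word)
    return counts


if __name__ == "__main__":
    print(phrase_counts("The faceted thesaurus is a topic map", max_len=2))
```

Single words fall out as the length-1 case, so the word dictionary and the
phrase dictionary come from the same pass.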
> >
> > Given this phrase list and the tools to establish relationships within
> > the list, it is possible to develop a faceted thesaurus, which is not
> > too far distant from a topic map.
> > One of Neil Larson's DOS programs does this and he used it to organize
> > huge hypertext systems.
> >
> > Single words are a start, but this extension should be easy to add
> > (more so than the initial work).
> >
> > Thanks,
> >
> > Garold (Gary) L. Johnson
> >
> > -----Original Message-----
> > From: owner-ba-ohs-talk@bootstrap.org
> > [mailto:owner-ba-ohs-talk@bootstrap.org]On Behalf Of Peter Jones
> > Sent: Saturday, April 27, 2002 9:44 AM
> > To: ba-ohs-talk@bootstrap.org
> > Subject: Re: Great? idea for improving this list (was Re: [ba-ohs-talk]
> > Freezope learning environments)
> >
> > It's not that different from free text indexing except that the
> > data connecting words to paragraph ids will be available.
> >
> > In case it's of any interest, after a couple of hours work I now
> > have a perl script that will call the archives and pull out
> > all the new (non-reply) text from a message together with
> > the relevant paragraph nid information for each paragraph.
> > It will do this for all the messages currently in the web archive.
> >
> > All I have to do now is:
> > Grab a suitable list of stopwords off the net to feed the
> > hashing exclusion.
> > Knock together a few hash data structures to
> > build the index data.
> > Build an output routine to throw this into some neat HTML pages
> > and bingo! we'll all have a keyword access to the archive, and
> > secondly we will all be able to make whatever lovely graphs we
> > all feel like making out of the lexical-locator data.
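
The output routine in that plan could look roughly like the Python sketch
below (Python standing in for the Perl script, which is not shown in the
thread). It assumes the index is a mapping from word to a list of
(message URL, paragraph nid) pairs and that paragraphs can be linked as
URL#nid; both the data shape and the anchor format are assumptions, as is
the name render_index_html.

```python
from html import escape


def render_index_html(index):
    """Render a word -> [(url, nid), ...] mapping as a simple HTML page."""
    lines = ["<html><body><h1>Keyword index</h1><dl>"]
    for word in sorted(index):
        lines.append(f"<dt>{escape(word)}</dt>")
        for url, nid in index[word]:
            # Assumed link format: message URL plus a paragraph anchor.
            href = escape(f"{url}#{nid}", quote=True)
            lines.append(f'<dd><a href="{href}">{escape(nid)}</a></dd>')
    lines.append("</dl></body></html>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"hashtable": [("msg00042.html", "nid02")]}
    print(render_index_html(sample))
```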
> >
> > Maybe those bits are difficult. Maybe they aren't.
> > But I'm not quitting yet.
> >
> > Oh yeah, the end result should generalise to any mhonarc mail output.
> >
> > Enough talking. I'm busy.
> >
> > --
> > Peter
> >
> > ----- Original Message -----
> > From: <cdent@burningchrome.com>
> > To: <ba-ohs-talk@bootstrap.org>
> > Sent: Friday, April 26, 2002 11:29 PM
> > Subject: Re: Great? idea for improving this list (was Re: [ba-ohs-talk]
> > Freezope learning environments)
> >
> > >
> > > [archive_access.practical]
> > >
> > > On Fri, 26 Apr 2002, Peter  Jones wrote:
> > >
> > > > What I was suggesting was a system that:
> > > >
> > > > a) Reads an email and sucks out each word in turn.
> > > > b) Each new word has a database record created, and the
> > > > locations of occurrence of the term in another related table.
> > > > Leaving aside the issue of polysemy for a moment, the
> > > > record structure would be something like
> > > > PK_ID, word_string <--relation--> FK_ID, location(s).
> > > > c) To improve the scanning process, have a subroutine that
> > > > discards the stop-words chosen, and clean the database of
> > > > these.
> > > > d) Repeat for each mail.
> > > > e) If a word is re-encountered then only the new location for
> > > > the word is inserted in the database in the appropriate new tuple.
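
Steps (a) through (e) can be sketched in a few lines of Python, with an
in-memory dictionary standing in for the two database tables. This is an
illustration under assumptions, not the system being proposed: the stop-word
list is a placeholder, a "location" is taken to be a (mail id, paragraph
number) pair, paragraphs are split on blank lines, and the name build_index
is hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for step (c).
STOPWORDS = {"a", "an", "the", "of", "to", "is", "and", "in", "on"}


def build_index(mails):
    """Build word -> list of (mail_id, paragraph_no) from (id, body) pairs."""
    index = defaultdict(list)
    for mail_id, body in mails:                       # step (d)
        paragraphs = [p for p in body.split("\n\n") if p.strip()]
        for para_no, para in enumerate(paragraphs, start=1):
            for word in re.findall(r"[a-z]+", para.lower()):   # step (a)
                if word in STOPWORDS:                 # step (c)
                    continue
                location = (mail_id, para_no)
                # Steps (b) and (e): a new word gets a new entry; a word
                # seen before only gains the new location.
                if location not in index[word]:
                    index[word].append(location)
    return dict(index)


if __name__ == "__main__":
    mails = [("m1", "Hash tables on disk.\n\nThe hash index."),
             ("m2", "A disk index")]
    for word, locations in sorted(build_index(mails).items()):
        print(word, locations)
```

Polysemy is left aside here, as in the description: every spelling is one
key regardless of sense.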
> > >
> > > In what ways are you imagining this being different from a free
> > > text index of the mail archive that gets reindexed every time a
> > > new message comes in?
> > >
> > > > What you then get is an index for every mail in the archive that
> > > > contains all the interesting words in all the mails in the archive
> > > > and the locations in the mails of all those words.
> > >
> > > Is it that the list of words indexed is more limited?
> > >
> > > > Sophistication could be added in the read-in phase.
> > > > For example, polysemy might be attacked by some algorithm that
> > > > makes guesses about the word type based on a grammar.
> > > > Locations might be narrowed to paragraphs by chunking them
> > > > beforehand.
> > > > And so on.
> > >
> > > You make this sound easy. After watching the list for a while it
> > > is clear that we don't have the collective time for this measure
> > > of complexity.  Are we talking about implementing something to
> > > use now and experiment and develop, or are we talking about an
> > > ideal eventual system that would work in a variety of capacities?
> > >
> > > We can talk the theory (I'd love to) but that stuff has been
> > > beaten to death here and elsewhere. How do we distinguish between
> > > the speculative talk and the plans for action?
> > >
> > > --
> > > Chris Dent  <cdent@burningchrome.com>  http://www.burningchrome.com/~cdent/
> > > "Mediocrities everywhere--now and to come--I absolve you all! Amen!"
> > >  -Salieri, in Peter Shaffer's Amadeus
> > >
> > >    (08)