
[ba-ohs-talk] BA archive index


FWIW,    (01)

I've just passed my first working prototype indexer over the
November 2001 ba-ohs-talk archive.
It built a 1.3 MB XML file from those 33 mails alone,
and that's after I had already trimmed the amount of text in there a
fair bit, although I could perhaps squeeze it a little more.    (02)
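
One way the per-word text might be squeezed further is the ring-buffer
idea from my earlier message, quoted below: keep only the last few words
seen, and when an indexed word turns up, emit those plus the next few.
What follows is only a minimal sketch, in Python rather than the Perl of
the actual script; the names `snippets`, `before` and `after` are made up
for illustration and are not part of the prototype.

```python
from collections import deque


def snippets(text, target, before=5, after=5):
    """Yield a short context window around each occurrence of `target`.

    A deque with maxlen acts as the ring buffer: it holds at most
    `before` words of preceding context and silently drops older ones.
    """
    words = text.split()
    recent = deque(maxlen=before)   # ring buffer of preceding words
    for i, word in enumerate(words):
        # Compare case-insensitively and ignore surrounding punctuation.
        if word.strip(".,;:!?()'\"").lower() == target.lower():
            following = words[i + 1:i + 1 + after]
            yield " ".join([*recent, word, *following])
        recent.append(word)


if __name__ == "__main__":
    para = ("Then the Cursor object walks the Node graph and only 'picks up' "
            "a node to create a new Cursor if the permissions match.")
    for s in snippets(para, "Cursor", before=3, after=3):
        print(s)
```

Each yielded string could stand in for the full paragraph inside a
`<para>` element, so a word that occurs twice in one paragraph costs two
short windows instead of two copies of the whole text.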

The results are in a (240KB) zip file at
http://www.concept67.fsnet.co.uk/xml/index.htm    (03)

Apart from the preponderant presence of Eugene's sig,
which seems to have survived all my attempts to purge it
thus far, and the odd piece of message header data, the results are
pretty clean (I hope you'll agree).    (04)
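
For what it's worth, here is a rough sketch of how sig and reply-header
cruft might be filtered out before indexing. It assumes that a sig starts
at a line consisting only of `--` (optionally followed by whitespace) and
that reply cruft begins at an `----- Original Message -----` line; a sig
that doesn't follow that convention would still slip through. It is
Python rather than Perl, and `strip_cruft` is a made-up name.

```python
import re

# Assumed conventions, not taken from the actual indexer:
SIG_DELIMITER = re.compile(r"^--\s*$")
REPLY_MARKER = re.compile(r"^-{3,}\s*Original Message\s*-{3,}\s*$")


def strip_cruft(body):
    """Return the part of `body` worth indexing.

    Drops quoted lines (starting with '>') and everything from the first
    sig delimiter or 'Original Message' marker onwards.
    """
    kept = []
    for line in body.splitlines():
        if SIG_DELIMITER.match(line) or REPLY_MARKER.match(line):
            break                       # the rest is sig or quoted mail
        if line.lstrip().startswith(">"):
            continue                    # quoted text from an earlier mail
        kept.append(line)
    return "\n".join(kept).strip()
```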

Mysteries to ponder:
Somehow my little perl script managed to index Eugene's sig under
'coma'.  :-) Maybe my code is smarter than I thought.    (05)

Cheers,
--
Peter    (06)


----- Original Message -----
From: "Peter Jones" <ppj@concept67.fsnet.co.uk>
To: <ba-ohs-talk@bootstrap.org>
Sent: Sunday, April 28, 2002 1:12 PM
Subject: BA Archive Index WAS: Re: Great? idea for improving this list
(was Re: [ba-ohs-talk] Freezope learning environments)    (07)


> Hi,
>
> Determining sequences of non-noise words might be a bit beyond me as
> yet.
> But it also might not be necessary in this application (see below).
> It depends where things go and how far.
>
> So far my prototype produces bulky data sets like this from the test
> data (a very small sample):
> <?xml version="1.0"?>
> <index>
> <word_data>
>    <word>Bringing</word>
>       <urloc>http://this.that.other/#nid03</urloc>
>       <para>Bringing Paths back in...
> [snipped a load of data]
>       </para>
> </word_data>
> <word_data>
>    <word>Cheers</word>
>       <urloc>http://this.that.other/#nid05</urloc>
>       <para>Cheers,
>        Peter &nbsp;&nbsp; </para>
> </word_data>
> <word_data>
>    <word>Cursor</word>
>       <urloc>http://this.that.other/#nid02</urloc>
>       <para>I had to read it about 10 times, but I think I'm getting
> there.
> [snipped a load of data]
> create a new Cursor if the permissions match. &nbsp;&nbsp; </para>
>       <urloc>http://this.that.other/#nid02</urloc>
>       <para>I had to read it about 10 times, but I think I'm getting
> there.
> [snipped a load of data]
> Then the Cursor object walks the Node graph and only 'picks up' a node
> to
> create a new Cursor if the permissions match. &nbsp;&nbsp; </para>
> </word_data>
> ...[snipped lots more data]
> </index>
>
> I will try to implement a ring buffer to cut the amount of paragraph
> data per word down to a couple of lines.
> (Transclusion into the web page would improve matters, but I'll think
> about that later.)
>
> The idea is that you look up a word on the webpage/graph/whatever,
> and listed below the word is a set of links and a snippet of text as
> a hint, so that the user can determine the
> context (and hence the correct meaning, I hope) of the word, for each
> link.
> Clicking on the link takes you to the appropriate mail in the BA
> archive, and from there you can track threads.
> It might be good to have an outline of the thread next to the URL for
> the index entry, though(?).
>
> There's also a lot of cruft relating to the message data inserted by
> the reply-to mechanism that I need to exclude.
> It also might be an idea to exclude folks' names.
>
> --
> Peter
>
>
> ----- Original Message -----
> From: "Garold (Gary) L. Johnson" <dynalt@dynalt.com>
> To: <ba-ohs-talk@bootstrap.org>
> Sent: Saturday, April 27, 2002 11:42 PM
> Subject: RE: Great? idea for improving this list (was Re:
> [ba-ohs-talk] Freezope learning environments)
>
>
> > FWIW,
> >
> > By taking sequences of non-noise words to some number, you begin to
> > build a phrase dictionary as well as a word dictionary, and that can
> > prove more useful.
> >
> > Given this phrase list and the tools to establish relationships
> > within the list, it is possible to develop a faceted thesaurus, which
> > is not too far distant from a topic map.
> > One of Neil Larson's DOS programs does this and he used it to
> > organize huge hypertext systems.
> >
> > Single words are a start, but this extension should be easy to add
> > (more so than the initial work).
> >
> > Thanks,
> >
> > Garold (Gary) L. Johnson
> >
>
>
>    (08)
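
As a rough illustration of the phrase-dictionary suggestion in the quoted
message above: runs of consecutive non-noise words, up to some maximum
length, are collected as phrases alongside the single words. This is a
sketch under assumptions only. The noise-word list is a tiny placeholder,
`phrases` and `max_len` are invented names, and Python stands in for
whatever language the indexer actually uses.

```python
import re
from collections import Counter

# Placeholder noise-word list; a real one would be much longer.
NOISE = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "if", "for"}


def phrases(text, max_len=3):
    """Count every run of 1..max_len consecutive non-noise words."""
    counts = Counter()
    run = []                                  # current unbroken run
    for word in re.findall(r"[A-Za-z']+", text.lower()):
        if word in NOISE:
            run = []                          # a noise word ends the run
            continue
        run.append(word)
        # Record each phrase that ends at this word.
        for n in range(1, min(max_len, len(run)) + 1):
            counts[" ".join(run[-n:])] += 1
    return counts
```

With `max_len=1` this reduces to a plain word count, so the phrase index
is a superset of the single-word one rather than a separate structure.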