
Re: [ba-ohs-talk] BA archive index


I've just run another update that produces one file per letter of the
alphabet, so I now have an index of ba-ohs-talk so far, by letter, on my
hard disk. Looking through the data, it looks like I'll have to edit it
a bit: there are quite a few misspellings ('aaffecting', 'addresswing',
etc.), and some munges need taking out.    (01)
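
For illustration only, the per-letter split described above could look roughly like the sketch below. It is written in Python rather than the Perl the indexer uses, and the function name `split_by_letter` and the `other` bucket are assumptions, not part of the actual script.

```python
from collections import defaultdict

def split_by_letter(words):
    """Group index words by their first letter, case-insensitively.

    Words that do not start with an ASCII letter go into an 'other' bucket
    (an assumption; the real script may treat them differently).
    """
    buckets = defaultdict(list)
    for word in words:
        first = word[:1].lower()
        key = first if "a" <= first <= "z" else "other"
        buckets[key].append(word)
    # Sort each bucket so every per-letter file comes out alphabetical.
    return {key: sorted(group, key=str.lower) for key, group in buckets.items()}

result = split_by_letter(["Cursor", "Bringing", "Cheers", "42nd"])
```

Each key of `result` would then correspond to one output file.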

If anyone wants me to post it on my site without waiting for
me to edit it, shout now.    (02)

In the meantime, I haven't really been following the IP discussion.
If I just put a comment in the files to the effect that all data is
joint copyright of the Bootstrap Alliance and the respective
authors of the emails, will that do?    (03)

--
Peter    (04)


----- Original Message -----
From: "Peter Jones" <ppj@concept67.fsnet.co.uk>
To: <ba-ohs-talk@bootstrap.org>
Sent: Monday, April 29, 2002 2:28 PM
Subject: [ba-ohs-talk] BA archive index    (05)


> FWIW,
>
> I've just passed my first working prototype indexer over the
> November 2001 ba-ohs-talk archive.
> It built a 1.3 MB XML file from those 33 mails alone.
> And that's with my shrinking the amount of text in there a
> fair bit. Although I could squeeze it a little more perhaps.
>
> The results are in a (240KB) zip file at
> http://www.concept67.fsnet.co.uk/xml/index.htm
>
> Apart from the preponderant presence of Eugene's sig,
> which seems to have survived all my attempts to purge it
> thus far, and the odd piece of message header data, the results are
> pretty clean (I hope you'll agree).
>
> Mysteries to ponder:
> Somehow my little perl script managed to index Eugene's sig under
> 'coma'.  :-) Maybe my code is smarter than I thought.
>
> Cheers,
> --
> Peter
>
>
> ----- Original Message -----
> From: "Peter Jones" <ppj@concept67.fsnet.co.uk>
> To: <ba-ohs-talk@bootstrap.org>
> Sent: Sunday, April 28, 2002 1:12 PM
> Subject: BA Archive Index WAS: Re: Great? idea for improving this list
> (was Re: [ba-ohs-talk] Freezope learning environments)
>
>
> > Hi,
> >
> > Determining sequences of non-noise words might be a bit beyond me as
> > yet.
> > But it also might not be necessary in this application (see below).
> > It depends where things go and how far.
> >
> > So far my prototype produces bulky data sets like this from the test
> > data (a very small sample):
> > <?xml version="1.0"?>
> > <index>
> > <word_data>
> >    <word>Bringing</word>
> >       <urloc>http://this.that.other/#nid03</urloc>
> >       <para>Bringing Paths back in...
> > [snipped a load of data]
> >       </para>
> > </word_data>
> > <word_data>
> >    <word>Cheers</word>
> >       <urloc>http://this.that.other/#nid05</urloc>
> >       <para>Cheers,
> >        Peter &nbsp;&nbsp; </para>
> > </word_data>
> > <word_data>
> >    <word>Cursor</word>
> >       <urloc>http://this.that.other/#nid02</urloc>
> >       <para>I had to read it about 10 times, but I think I'm getting
> > there.
> > [snipped a load of data]
> > create a new Cursor if the permissions match. &nbsp;&nbsp; </para>
> >       <urloc>http://this.that.other/#nid02</urloc>
> >       <para>I had to read it about 10 times, but I think I'm getting
> > there.
> > [snipped a load of data]
> > Then the Cursor object walks the Node graph and only 'picks up' a node to
> > create a new Cursor if the permissions match. &nbsp;&nbsp; </para>
> > </word_data>
> > ...[snipped lots more data]
> > </index>
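
As a rough sketch of how data in this shape might be generated, the Python below builds the same element layout (`index`, `word_data`, `word`, `urloc`, `para`) with the standard library's `xml.etree.ElementTree`. The function name `build_index`, the `(url, text)` input format, and the word pattern are all assumptions made for the example; the prototype itself is a Perl script whose internals are not shown in this thread. One side note: `&nbsp;` is not one of XML's predefined entities, so a strict parser would need it declared or written as `&#160;`.

```python
import re
import xml.etree.ElementTree as ET

def build_index(paragraphs):
    """Build an <index> tree from (url, paragraph_text) pairs.

    Element names follow the sample above; everything else is an assumption.
    """
    # Map each distinct word to the (url, text) pairs it occurs in.
    occurrences = {}
    for url, text in paragraphs:
        for word in set(re.findall(r"[A-Za-z]+", text)):
            occurrences.setdefault(word, []).append((url, text))

    root = ET.Element("index")
    for word in sorted(occurrences):
        entry = ET.SubElement(root, "word_data")
        ET.SubElement(entry, "word").text = word
        for url, text in occurrences[word]:
            ET.SubElement(entry, "urloc").text = url
            ET.SubElement(entry, "para").text = text
    return root

sample = [
    ("http://this.that.other/#nid03", "Bringing Paths back in"),
    ("http://this.that.other/#nid05", "Cheers, Peter"),
]
xml_text = ET.tostring(build_index(sample), encoding="unicode")
```

Storing the full paragraph under every word it contains is what makes the output so bulky, which is the problem the ring buffer idea is meant to address.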
> >
> > I will try to implement a ring buffer to cut the amount of paragraph data
> > per word down to a couple of lines.
> > (Transclusion into the web page would improve matters, but I'll think
> > about that later.)
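
One possible reading of the ring buffer idea, sketched in Python: keep only the last few lines seen, and emit them whenever the current line contains the indexed word. A `collections.deque` with `maxlen` set behaves like a ring buffer because the oldest item is dropped automatically. The function name `snippets` and the default window of two lines are assumptions for this example.

```python
from collections import deque

def snippets(lines, word, size=2):
    """Yield up to `size` lines of context ending at each line containing `word`.

    Matching is on whitespace-separated tokens, so a word with punctuation
    attached (e.g. 'Cursor,') would be missed; a real indexer would
    normalise tokens first.
    """
    recent = deque(maxlen=size)  # oldest line drops out automatically
    for line in lines:
        recent.append(line)
        if word in line.split():
            yield " ".join(recent)

lines = [
    "I had to read it about 10 times,",
    "but I think I'm getting there.",
    "Then the Cursor object walks the Node graph",
]
found = list(snippets(lines, "Cursor"))
```

With this approach each `para` entry would hold at most `size` lines instead of the whole paragraph.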
> >
> > The idea is that you look up a word on the webpage/graph/whatever,
> > and listed below the word is a set of links and a snippet of text as a
> > hint, so that the user can determine the context (and hence the correct
> > meaning, I hope) of the word, for each link.
> > Clicking on the link takes you to the appropriate mail in BA archive,
> > and from there you can track threads.
> > It might be good to have an outline of the thread next to the URL for
> > the index entry though(?).
> >
> > There's also a lot of cruft relating to the message data inserted by the
> > reply-to mechanism that I need to exclude.
> > It also might be an idea to exclude folks' names.
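
A minimal sketch of one way the reply cruft might be filtered before indexing, in Python. The patterns are guesses based on the quoting style visible in this thread ("Original Message" separators, ">" prefixes, From/To/Sent/Subject lines), and `strip_reply_cruft` is a hypothetical name. Excluding people's names is not attempted here.

```python
import re

# Patterns are guesses based on the quoting style visible in this thread.
QUOTED = re.compile(r"^\s*>")
SEPARATOR = re.compile(r"^-+\s*Original Message\s*-+$")
HEADER = re.compile(r"^(From|To|Sent|Subject):\s")

def strip_reply_cruft(body):
    """Return only the new text of a message, dropping quoted replies.

    Everything from the first 'Original Message' separator onward is
    discarded, as are '>'-quoted lines and header-like lines before it.
    """
    kept = []
    for line in body.splitlines():
        if SEPARATOR.match(line.strip()):
            break
        if QUOTED.match(line) or HEADER.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

msg = (
    "Hi,\n"
    "> old text\n"
    "New text\n"
    "----- Original Message -----\n"
    'From: "A" <a@example.org>\n'
    "more"
)
clean = strip_reply_cruft(msg)
```

Indexing only the new text of each mail would also avoid the same paragraph being indexed once per reply.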
> >
> > --
> > Peter
> >
> >
> > ----- Original Message -----
> > From: "Garold (Gary) L. Johnson" <dynalt@dynalt.com>
> > To: <ba-ohs-talk@bootstrap.org>
> > Sent: Saturday, April 27, 2002 11:42 PM
> > Subject: RE: Great? idea for improving this list (was Re: [ba-ohs-talk]
> > Freezope learning environments)
> >
> >
> > > FWIW,
> > >
> > > By taking sequences of non-noise words to some number, you begin to
> > > build a phrase dictionary as well as a word dictionary, and that can
> > > prove more useful.
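
To make the phrase dictionary suggestion concrete, here is a small Python sketch that counts runs of consecutive non-noise words up to a chosen length. The stop list `NOISE`, the function name `phrases`, and the limit of three words are all illustrative assumptions; a real noise-word list would be much longer.

```python
import re
from collections import Counter

# A tiny illustrative stop list; a real noise-word list would be much longer.
NOISE = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

def phrases(text, max_len=3):
    """Count runs of 2..max_len consecutive non-noise words."""
    counts = Counter()
    run = []  # current unbroken sequence of non-noise words
    # The trailing None acts as a sentinel that flushes the final run.
    for word in re.findall(r"[a-z']+", text.lower()) + [None]:
        if word is not None and word not in NOISE:
            run.append(word)
            continue
        # A noise word (or end of text) ends the run: emit its n-grams.
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                counts[" ".join(run[i:i + n])] += 1
        run = []
    return counts

counts = phrases("a faceted thesaurus is not too far from a topic map")
```

Phrases never span a noise word, so "thesaurus is" is not counted while "faceted thesaurus" and "topic map" are.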
> > >
> > > Given this phrase list and the tools to establish relationships within
> > > the list, it is possible to develop a faceted thesaurus, which is not
> > > too far distant from a topic map.
> > > One of Neil Larson's DOS programs does this and he used it to organize
> > > huge hypertext systems.
> > >
> > > Single words are a start, but this extension should be easy to add
> > > (more so than the initial work).
> > >
> > > Thanks,
> > >
> > > Garold (Gary) L. Johnson
> > >