
[ba-ohs-talk] Progress on a project to organize email archives



I'm posting this to both ohs and unrev because we were asked to
post to unrev, but it seems directly relevant to recent ohs
conversations. I know many are on both lists, so perhaps people
in the know can choose the appropriate list to keep things clean.    (01)

Alex and Rod's discussion about tagging subject postings with
keywords to help in grouping and identifying messages is a good
cue for inviting people to look at some related work in need of
feedback. Henry and Eugene have been aware of the work and
suggested that we post here.    (02)

Kathryn La Barre and I have been working with the archive of the
unrev-II list to experiment with computational methods for
determining "aboutness" of the subjects, messages and threads. This
"aboutness" could then be used in an iterative process to
generate facets for a faceted access structure to the archive.    (03)

Our work was inspired by conversations on these lists that
happened throughout the winter.    (04)

We are not very far along, but have laid the groundwork for some
interesting research. What we have done thus far is gathered at
the following URL:    (05)

  http://ella.slis.indiana.edu./~klabarre/unrev_firstpage.html    (06)

A general hypothesis is that tools such as latent semantic
analysis, vector space models, traditional concordancing, self
organizing maps (basically the gamut of semantic analysis tools)
may be worthwhile tools for generating meaningful clusters in
the dataset. These clusters could then be used as aids in the
human process of facet analysis to generate the skeleton of a
faceted access structure.    (07)
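As a rough illustration of the vector space idea only (not the
code we are running), here is a minimal sketch that weights terms
by tf-idf and compares messages by cosine similarity. The sample
messages, the crude tokenizer and all function names are made up
for the example; a real run would read messages from the archive.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; deliberately crude."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    return [
        {t: tf * math.log(n / df[t]) for t, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical stand-ins for archive messages.
messages = [
    "faceted classification of the email archive",
    "facets for the archive of email threads",
    "oracle database search is slow on busy hardware",
]
vecs = tfidf_vectors(messages)
sim01 = cosine(vecs[0], vecs[1])  # share several terms
sim02 = cosine(vecs[0], vecs[2])  # share no terms
```

Messages whose vectors sit close together would be candidates for
the same cluster, which a person could then inspect when looking
for facets.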

The novelty of this approach is that instead of taking facets
from the universe and applying them to the archive (the
traditional method of facet analysis), we are attempting to draw
the facets out of the archive.    (08)

This is explained in more detail, and more clearly, at the URL. I'm in
a rush at the moment with an end of the semester crush. I hope
you'll forgive any confusion or weirdness here and just check out
the URL :)    (09)

As with all pursuits, the work that Kathryn and I are doing could
do with some feedback. We're fairly certain that we are on to
something potentially interesting but it needs some air and
evaluation. If you have some comments, of any sort, we welcome
them. Our email addresses are at the URL, or post to the list
(whichever seems most appropriate).    (010)

That (above) is the main part of this posting. It represents the
work we are doing which we think has some long term relevance,
not just for the unrev-II archive but other archives as well.    (011)

For those of you with a need for something to play with, a large
number of the unrev-II messages are now in a database and
accessible in very rough form at    (012)

  http://ella.slis.indiana.edu/~cjdent/unrev/index.cgi    (013)

This tool is the framework for a more complex tool which will
allow for the retrieval of message clusters (based on the cluster
creation described above). Retrieved messages can then be
evaluated and tags added to help in the faceting process.    (014)

Right now you can search in the from, the subject or the body for
simple expressions (a multi-word query is treated as a boolean AND
of its terms, each matched anywhere in the selected field). The
searches are not by word: each term is matched as a text fragment
within Oracle LOBs (which I hate). It is slow (the hardware being
used is too busy) and will need to be adjusted but is good for
experimenting.    (015)
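To make the matching rule concrete, here is a small sketch of the
behaviour described above. It is not the CGI's code: the function
name is made up, and ignoring case is an assumption on my part.

```python
def matches(field_text, query):
    """True if every whitespace-separated fragment of the query
    occurs somewhere in the field (boolean AND of substrings).

    Case-insensitive matching is an assumption for this sketch.
    """
    haystack = field_text.lower()
    return all(frag in haystack for frag in query.lower().split())


subject = "Re: faceted access structures"
hit = matches(subject, "facet access")    # both fragments occur
miss = matches(subject, "facet oracle")   # "oracle" does not occur
partial = matches(subject, "struct")      # fragment, not a whole word
```

Because fragments rather than words are matched, a query like
"facet" also finds "faceted" and "facets".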

Once you have run a search the interface gets a little more
interesting, as there is much linking. Unfortunately, it doesn't
yet identify links in the message text and make them active
hypertext links. Much is in the works but as I've said: the end
of the semester crunch is on and this work is not tied to that
timetable.    (016)

Thanks for reading.    (017)

-- 
Chris Dent  <cdent@burningchrome.com>  http://www.burningchrome.com/~cdent/
"Mediocrities everywhere--now and to come--I absolve you all! Amen!"
 -Salieri, in Peter Shaffer's Amadeus    (018)