[ba-ohs-talk] Greenstone as a HyperScope
I have mentioned Greenstone before. The more I play with it, the more I
tend to think that it is a HyperScope.
Here is what I know about it from playing with it and reading its
It can suck up entire directories (including subdirectories) from your
It can suck up entire web sites (including subdirectories, I think). (03)
What it does:
It reads the files (types include pdf, ps, doc, txt, html, and some gif/jpg
files) and converts them to an intermediate format (gml).
It indexes the gml files.
It also appears to do n-gram and other statistical stuff.
It also appears to have some phrase detection tools.
It claims (I haven't seen it yet) to have a CORBA interface. (04)
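
To make the indexing and n-gram steps above concrete, here is a minimal sketch in Python. It is not Greenstone's code: the function names (tokenize, build_index, ngram_counts) and the sample documents are made up for illustration, and plain strings stand in for the converted gml files.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: word -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def ngram_counts(docs, n=2):
    """Count every run of n consecutive words across all documents."""
    counts = Counter()
    for text in docs.values():
        words = tokenize(text)
        # zip over n shifted copies of the word list yields the n-grams
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


# Hypothetical stand-ins for files already converted to plain text.
docs = {
    "a.txt": "digital library software builds digital library collections",
    "b.html": "a hyperscope views a digital library",
}

index = build_index(docs)
bigrams = ngram_counts(docs, 2)

print(sorted(index["library"]))          # documents that mention "library"
print(bigrams[("digital", "library")])   # how often the bigram occurs
```

Frequent n-grams like ("digital", "library") are one simple basis for phrase detection, though I don't know whether Greenstone does it this way.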
If you want to add file types for it to handle, you just write a small Perl
script to do the job and include that script in your "collection"
configuration file. (05)
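
The general idea of one small converter per file type can be sketched as follows. This is only an illustration in Python, not Greenstone's actual plugin mechanism (which, as noted, uses Perl scripts named in the collection configuration file); the CONVERTERS registry, the converter decorator, and the output dictionaries are all hypothetical.

```python
import os

# Hypothetical registry: file extension -> function that converts raw text
# into a simple intermediate record.
CONVERTERS = {}


def converter(*extensions):
    """Decorator that registers a function as the handler for extensions."""
    def register(func):
        for ext in extensions:
            CONVERTERS[ext.lower()] = func
        return func
    return register


@converter(".txt")
def convert_text(raw):
    return {"type": "text", "body": raw.strip()}


@converter(".csv")
def convert_csv(raw):
    rows = [line.split(",") for line in raw.strip().splitlines()]
    return {"type": "table", "body": rows}


def convert(filename, raw):
    """Pick the handler by file extension (case-insensitive) and run it."""
    ext = os.path.splitext(filename)[1].lower()
    handler = CONVERTERS.get(ext)
    if handler is None:
        raise ValueError(f"no converter registered for {ext!r}")
    return handler(raw)


result = convert("notes.CSV", "a,b\n1,2\n")
print(result)
```

Supporting a new file type then means adding one more decorated function, which is roughly the convenience described above.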
Greenstone and all its internal programs are licensed under the GPL. With a
CORBA interface, we can create a HyperScope interface and just let Greenstone
do all the internal work. (06)
There is another initiative behind Greenstone: data mining in the Greenstone
collections. That's precisely where I hope it will go soon, though Greenstone
appears to be tightly linked to some PhD projects, meaning it might be several
years before the data mining tools are released for us to play with. (07)
I suspect that Greenstone is a great candidate (I've said this before) for
a prototype HyperScope infrastructure. We just need to learn how to use it
and to extend it. (08)