
Re: [ba-ohs-talk] backlink database data


On Sat, 8 Dec 2001, Murray Altheim wrote:    (01)

> I would recommend against any programmatic changes in people's emails.
> There are legal issues to begin with. Some people have legal disclaimers
> in their signatures that are required of them in their jobs, etc.    (02)

By excluding .sigs, I meant excluding them when extracting backlinks, not
removing them from the e-mail.    (03)

> Secondly, because signatures are not consistent, and because code
> segments are included in this type of list and are unpredictable, it's
> likely that any selected algorithm would truncate a message here or
> there. And what's the real value here? A small amount of saving in
> disk space or downloaded content at the expense of modifying the content
> that people send to a list.    (04)

Even with the context of this comment corrected, it still raises an
interesting point.  Would excluding .sigs from processing really help
improve the "quality" of the data in the backlink database?    (05)

I did a quick run-through of ba-unrev-talk and ba-ohs-talk.  Of 31 unique
authors, only six use .sigs at all.  Of those six, four have URLs in their
.sigs -- me, Murray, Lee, and Grant.  Of these four, three of us use two
dashes to separate the .sigs from the body of the message.  And of these
three, only Lee uses "-- ". :-)    (06)

So what would be the effect of excluding anything following /^--\s*$/ from
processing?  At the very least, because I'm a frequent poster, it would
remove a whole lot of useless links to my web site.  From a quick visual
scan, it doesn't look like it would wrongly exclude any data from
processing.    (07)
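
As a rough illustration of the idea, here is a minimal Python sketch.  It
is not the actual extraction code; the function names and the deliberately
naive URL pattern are assumptions made up for this example.  It cuts the
message at the first line matching /^--\s*$/ and collects URLs only from
what remains.

```python
import re
from typing import List

# Signature delimiter as proposed above: a line holding two dashes
# followed only by optional whitespace (matches both "--" and "-- ").
SIG_DELIMITER = re.compile(r"^--\s*$")

# Deliberately simple URL pattern; an assumption for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def strip_signature(body: str) -> str:
    """Return the message body up to, but not including, the first
    line that looks like a signature delimiter."""
    kept = []
    for line in body.splitlines():
        if SIG_DELIMITER.match(line):
            break
        kept.append(line)
    return "\n".join(kept)


def extract_links(body: str) -> List[str]:
    """Collect URLs from the body, ignoring anything in a recognized .sig."""
    return URL_PATTERN.findall(strip_signature(body))


if __name__ == "__main__":
    message = (
        "See http://example.org/page for details.\n"
        "\n"
        "-Eugene\n"
        "\n"
        "-- \n"
        "+=== Eugene Eric Kim ===== http://www.eekim.com/ ===+\n"
    )
    print(extract_links(message))
```

Note that a .sig with no delimiter line passes through untouched, so any
URLs in it would still be extracted.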

However, it would introduce an inconsistency in the data, in that it would
only exclude URLs from .sigs it clearly recognizes as .sigs.  That
inconsistency may be significant enough to confuse people as to how the
extraction algorithm works, and once you have people worrying about the
implementation, you have a usability problem.    (08)

My conclusion from all this?  No harm in trying it; we can always switch
back and reprocess the archives.    (09)

-Eugene    (010)

-- 
+=== Eugene Eric Kim ===== eekim@eekim.com ===== http://www.eekim.com/ ===+
|       "Writer's block is a fancy term made up by whiners so they        |
+=====  can have an excuse to drink alcohol."  --Steve Martin  ===========+    (011)