$Id: README,v 1.2 2003/04/28 16:42:21 karl Exp $

- What?
Another implementation of Paul Graham's algorithm for spam detection:
http://www.paulgraham.com/spam.html.
(I also looked extensively at Eric Raymond's bogofilter source while
 writing this, and stole some ideas from there.)

- Where?
http://www.cs.umb.edu/~karl/kspam and kspam.tar.gz.
Prerequisites: perl, procmail, Unix.


- Why?
Mainly because bogofilter (http://www.tuxedo.org/~esr/bogofilter)
started to think one day that just about everything coming in was spam,
and I couldn't figure out how to debug it -- no logging facilities, and
the code had enough #ifdef's that it wasn't obvious to me what was
actually executing.

bogofilter is optimized for speed.  kspam is very, very, very slow.  I
don't mind this, because I get maybe 300 messages on a busy day -- if it
takes a few seconds to process a message, that's fine by me.  Obviously
kspam would not be suitable for a big shared server with hundreds of users.
(There are lots of other Graham implementations for this case, q.v.)

On the upside, kspam is pretty small (the whole thing is 500 lines of
perl), and it has lots of debugging and logging.  I suppose if it ever
needs to be fast, I could rewrite it in C.


- How?
See INSTALL for a distillation of this.  (But you should read this file
anyway, to understand what's happening -- basically, I'm distributing my
own personal mail setup, and it's highly unlikely you'll want to use it
as-is.)

** Part 1: seeding the word lists.
All probabilistic spam detection algorithms need some input (both spam
and nonspam) to start with.  You don't get spam detection starting with
message #1 -- more like message #10000.  Nonspam is easy: you can just use
all your saved messages from correspondents (it's ok if there are a few
spams sprinkled in).  For spam, I had assiduously saved all my junk mail
for quite a while, carefully removing all real messages from the junk.

Then, I ran the included program kseed to seed the word lists
(zcat spamfiles | kseed --spam; zcat nonspamfiles | kseed --nonspam).
This took many hours, but that was ok with me; I let it run overnight.
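
To give a rough idea of what seeding amounts to, here is a minimal
sketch in Python.  It is not kseed's code (kseed is perl and is not
shown here).  The token characters follow Graham's essay (alphanumerics,
dashes, apostrophes, dollar signs; all-digit tokens ignored); the
lowercasing, the function names and the sample messages are assumptions
made for illustration.

```python
import re
from collections import Counter

# Token characters as described in Graham's essay: alphanumerics,
# dashes, apostrophes and dollar signs.  Everything else separates tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    """Split text into lowercased tokens, ignoring all-digit tokens.

    Lowercasing is an assumption of this sketch.
    """
    return [t.lower() for t in TOKEN_RE.findall(text) if not t.isdigit()]

def seed(messages):
    """Return (token counts, number of messages) for a corpus.

    `messages` is any iterable of message strings (hypothetical input;
    a real seeder would first split a mailbox into messages).
    """
    counts = Counter()
    n_messages = 0
    for msg in messages:
        counts.update(tokenize(msg))
        n_messages += 1
    return counts, n_messages

# Tiny made-up corpora, just to exercise the functions.
spam_counts, n_spam = seed(["Buy cheap pills now", "cheap cheap offer $$$ 2003"])
good_counts, n_good = seed(["Lunch at noon?", "Patch attached, see diff"])
```

The message counts are kept alongside the token counts because
Graham's per-token probability divides by the number of messages in
each corpus.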


** Part 2: classifying incoming mail.
Ok, given some reasonable word lists, we can now run incoming mail
through the algorithm to filter out the spam.  I do this through procmail;
kspam doesn't do mail delivery itself, of course.  It uses
lockfile(1) from procmail for locking against itself.
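
For reference, the scoring described in Graham's essay can be sketched
as follows.  This is an illustration in Python of the essay's formulas,
not kspam's actual code, which may differ in details.  The constants
(doubling good counts, the 5-occurrence minimum, 0.4 for unknown words,
the 0.01/0.99 clamps, the 15 most interesting tokens) are from the
essay; the function names and the choice to score each distinct token
once are assumptions of this sketch.

```python
def token_prob(good, bad, ngood, nbad):
    """Spam probability of one token.

    good/bad: occurrences of the token in the nonspam/spam corpus.
    ngood/nbad: number of messages in each corpus (assumed nonzero).
    """
    g = 2 * good          # good counts are doubled to bias against false positives
    b = bad
    if g + b < 5:
        return 0.4        # too rare to trust: treated like an unknown word
    pg = min(1.0, g / ngood)
    pb = min(1.0, b / nbad)
    return max(0.01, min(0.99, pb / (pg + pb)))

def spam_score(tokens, good_counts, bad_counts, ngood, nbad, keep=15):
    """Combine the `keep` most interesting token probabilities."""
    probs = [token_prob(good_counts.get(t, 0), bad_counts.get(t, 0), ngood, nbad)
             for t in sorted(set(tokens))]   # each distinct token once (assumption)
    # "Interesting" means farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod = 1.0
    inv = 1.0
    for p in probs[:keep]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

# Made-up counts for illustration.
good_counts = {"meeting": 10}
bad_counts = {"viagra": 10}
spammy = spam_score(["viagra"], good_counts, bad_counts, ngood=20, nbad=20)
hammy = spam_score(["meeting"], good_counts, bad_counts, ngood=20, nbad=20)
```

In the essay, a message is treated as spam when the combined score
exceeds 0.9.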

See the included procmailrc for an example (this is my real
.procmailrc).  This does the following:

1) weed out duplicate messages first, using formail -D.

2) don't call kspam directly, instead call a script ~/bin/testspam (also
   included).  This is because kspam is not flexible enough -- sometimes
   I want to just do some greps to make sure some messages are treated
   as nonspam (whitelist) or spam (blacklist).

3) if a message is spam, save it in ~/misc/caughtspam.  I don't want to
   delete anything outright because of possible misclassifications, etc.
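
The idea behind step 2 can be sketched like this.  The real testspam is
a script using greps and is included in the distribution; the Python
below only illustrates the order of decisions.  The patterns, the
function names and the choice to check the whitelist before the
blacklist are all assumptions.

```python
import re

# Hypothetical patterns, for illustration only.
WHITELIST = [re.compile(r"^From: .*@example\.org", re.M | re.I)]
BLACKLIST = [re.compile(r"^Subject: .*\bfree money\b", re.M | re.I)]

def is_spam(message, classify):
    """Decide whether `message` (headers and body as one string) is spam.

    `classify` is any callable returning True for spam; it stands in
    for the statistical filter and is only consulted when neither
    list matches.
    """
    if any(p.search(message) for p in WHITELIST):
        return False      # known correspondent: never spam
    if any(p.search(message) for p in BLACKLIST):
        return True       # known junk: always spam
    return classify(message)

msg = "From: bob@example.org\nSubject: free money\n\nhello"
verdict = is_spam(msg, classify=lambda m: True)   # whitelist wins here
```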


** Part 3: checks and balances.
When a spam gets through the filter, I save it (in a folder
~/mail/spam).  Then nightly I run a cron job (caughtspam, included)
which reclassifies those messages using kspam --SPAM.  An alternative
would be to reclassify the message on the spot (as esr does), but I
prefer the batch approach.
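
What reclassifying might look like, in outline: the sketch below
assumes that --SPAM works like bogofilter's -S option, that is, it
undoes an earlier registration of the message as nonspam and registers
it as spam instead.  That is an assumption; this README does not spell
out what kspam --SPAM does internally.  The function name and sample
data are made up, and the per-corpus message counts (which would move
too) are left out for brevity.

```python
from collections import Counter

def reclassify_as_spam(tokens, good_counts, spam_counts):
    """Move one message's tokens from the nonspam counts to the spam counts.

    Both count tables are Counters and are modified in place.
    """
    for tok, n in Counter(tokens).items():
        good_counts[tok] -= n
        if good_counts[tok] <= 0:
            del good_counts[tok]      # never keep zero or negative counts
        spam_counts[tok] += n

good = Counter({"cheap": 1, "hello": 2})
spam = Counter()
reclassify_as_spam(["cheap", "cheap"], good, spam)
```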

Another nightly activity is to report on the spam that *was* caught by
the filter (checkspam, included).  This is to have a hope of catching
false positives.

All the mail is saved away (in ~/misc/old/caughtspam/YYYYMMDD) just in
case I need to seed a spam generator or something :).

Of course, you should try all this out using some test account first,
not just let it loose on your real mail.

--karl@cs.umb.edu
