[Evolution] Bogofilter server side, instead of SpamAssassin client side, but doing learning client side with Evo's UI

From: Andrew Cowie <andrew operationaldynamics com>
To: Evolution <evolution lists ximian com>
Subject: [Evolution] Bogofilter server side, instead of SpamAssassin client side, but doing learning client side with Evo's UI
Date: Thu, 05 May 2005 13:58:20 +1000

Hey,

Just want to describe an alternative (non-SpamAssassin) spam filtering
& training setup I've been using successfully with Evolution, in case
anyone is interested.

So I use bogofilter ( http://bogofilter.sourceforge.net/ ) as my spam
identifier. I've had good luck with it over the past couple years.
Indeed, I tried switching back to spamassassin once Evo 1.5 got serious
about the built in hooks to fire up spamd and use spamc to talk to it,
but after a while found that I was really unhappy with spamassassin's
performance [at correctly identifying spam].

More importantly, I do my bogofilter-ing server side, inline in the
delivery pipe that my [ISP's] server uses (they run Qmail, and it's easy
to hook into the local delivery process).

So that left me with the problem of wanting to use Evolution's Junk /
Not Junk buttons to train my filter (with the resultant wordlist file
sitting on the client) but wanting to have that wordlist up on the
server for the bogofilter there to work off of.

The solution was threefold: override what Evo does to train, rsync the
word list to the server and do the actual scanning there, but do the
actual "check message headers and sort accordingly" in an Evo client
side rule. In order:

(1)

First, glancing at the code in em-junk-filter.c, I was able to figure
out what calls Evo is making when one presses the Junk or Not Junk
buttons. It composes a command line along the lines of 

        sa-learn --spam --norebuild < MESSAGE_DATA
and
        sa-learn --ham --norebuild < MESSAGE_DATA

So what I did was override the sa-learn file [1]. Since I didn't want to
try and replace a system binary (whether or not spamassassin was
installed) [2], I wrote a tiny wrapper script and stuck it in ~/bin. The
wrapper intercepts the call to sa-learn, and instead calls bogofilter -s
or -n, as appropriate, to learn. I attached my script for anyone
interested. [6]
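
For readers without the attachment, here is a minimal sketch of what
such a wrapper could look like. It is not the attached script. It
assumes Evo passes --spam or --ham (plus --norebuild) and feeds the
message on stdin, as described above; the function name and the
BOGOFILTER override variable are made up for illustration.

```shell
#!/bin/sh
# Sketch of a stand-in for sa-learn that trains bogofilter instead.
# Assumption: the caller passes --spam or --ham plus flags such as
# --norebuild, and supplies the message on standard input.

# Hypothetical override hook; defaults to the bogofilter found on PATH.
BOGOFILTER="${BOGOFILTER:-bogofilter}"

learn_with_bogofilter() {
    mode=""
    for arg in "$@"; do
        case "$arg" in
            --spam) mode="-s" ;;   # bogofilter -s: register as spam
            --ham)  mode="-n" ;;   # bogofilter -n: register as non-spam
            *)      ;;             # ignore sa-learn-only flags
        esac
    done

    if [ -z "$mode" ]; then
        echo "sa-learn wrapper: expected --spam or --ham" >&2
        return 2
    fi

    # bogofilter inherits stdin, so it reads the message piped in.
    "$BOGOFILTER" "$mode"
}

# Dispatch only when called with arguments, as Evo would call sa-learn.
if [ "$#" -gt 0 ]; then
    learn_with_bogofilter "$@"
fi
```

Saved as an executable file named sa-learn in ~/bin, this would be
picked up in place of the real sa-learn whenever ~/bin comes first in
PATH.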

Of course, to ensure that Evo sees my script instead of
/usr/bin/sa-learn, I need to invoke Evolution as

        PATH=~/bin:$PATH /usr/bin/evolution

Which isn't that big of a deal [3].
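
A quick way to check that the override takes effect before launching
Evo is to ask the shell which sa-learn it would resolve with the
modified PATH. This is only a sketch; the helper name is hypothetical.

```shell
# Print the sa-learn that a process started with the given directory
# prepended to PATH would run. Runs in a subshell so the caller's own
# PATH is left untouched.
resolve_sa_learn() {
    ( PATH="$1:$PATH"; command -v sa-learn )
}
```

Running `resolve_sa_learn "$HOME/bin"` should print the path of the
wrapper under ~/bin, provided the wrapper exists there and is
executable.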

(2)

I now have a growing, better trained ~/.bogofilter/wordlist.db on my
client machine. 

But I want to do the actual scanning server side, because it means that
the CPU work of spam checking and preliminary sorting will be done ahead
of time, before I see the messages.

So I simply use rsync to push that file to the server. Nothing more
complicated than

        rsync   --verbose \
                --recursive \
                -e /usr/bin/ssh \
                --partial \
                --progress \
                ~/.bogofilter afcowie@server.mycolo.com:/home/afcowie

On the server, my delivery instruction (a .qmail file) is along the
lines of

        | /var/qmail/bin/preline /home/afcowie/bin/bogofilter -H -e -p \
                | /home/afcowie/bin/maildrop


The -e -p to bogofilter passes messages through regardless (don't want
positives to be bounced right there, tempting as that may be, because we
want to be able to train false positives and false negatives on the
client in Evo with those terrific zippy Junk / Not Junk buttons!)...

... and maildrop (think procmail) has a really great little mail sorting
language, see http://www.courier-mta.org/maildropex.html . So server
side I do preliminary sorting of traffic to folders titled Clients,
Boards, and Lists (just so that if I *am* using webmail, I have a chance
in hell of seeing messages from my customers - also helps downstream
when composing rules for vFolders in Evo). Note that I *don't* railroad
a message marked with X-Spam-Status: Yes off to a ProbableSpam folder or
whatever, because if, in Evo, I find a false positive or negative, I
want to be able to train it using Evo's wonderful UI.

(3)

New messages are fetched by Evolution's IMAP code across four folders.

In combination with NotZed's one liner "apply filters to all IMAP
folders" patch [4], I set up an incoming Filter that looks for
X-Spam-Status: Yes and, if it matches, does "Set Status" as "Junk" (puts
it in the Junk auto-vfolder) & "Set Status" as "Read" (so that it
doesn't clutter my unread counts). [5]
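
Outside Evo, the condition that filter tests can be approximated from
a shell, which is handy for checking what the server actually stamped
on a saved message. A minimal sketch, assuming the message is a file
with LF line endings; the function name is hypothetical.

```shell
# Succeed if the header block of the message file carries
# "X-Spam-Status: Yes". sed stops at the first empty line (the end of
# the headers), so the message body is never searched.
is_flagged_spam() {
    sed '/^$/q' "$1" | grep -qi '^X-Spam-Status:[[:space:]]*Yes'
}
```

For example, `is_flagged_spam message.eml && echo junk` prints "junk"
only when the header is present and says Yes.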

And done!

If I get a wrongly classified message, I use the {Junk | Not Junk}
buttons. Evo moves the message {to Junk meta folder | back to the folder
it came from and should have been in}, and calls sa-learn (which I've
overridden to call bogofilter) to learn from the mistake.

And I periodically push bogofilter's wordlist up to the server via
rsync. [Note I'm not using autolearn server side, because then there
would be a two way sync problem, and there's no reason to, really.]

And it all Just Works (tm)

AfC
Sydney


[1] This is all highly dependent on the exact form of the exec calls in
em-junk-filter.c . If those change, this will need to be tweaked.

[2] In fact, it turns out that the training code attempts to activate
spamd, and if it fails, bails out without doing any training. That's not
very good, because it means in my case I have to have SpamAssassin
installed, just so Evo can start it, just so I can ignore it and do
bayes training. However, I'd say more generally that firing up spamd is
unnecessary if all the user is doing is training (indeed, if they
don't have "filter incoming messages for junk" selected, then that
fire-up-spamd step should never need to happen), but it should still
allow the training cycle to occur.

[3] But it sure would be nice if I could just tell evo what training
program to use. Devs aren't about to write that UI, I know.

[4] No problems with the filter on all folders thing so far!

[5] I know Jeff is going to be working on the IMAP code again sometime
soon. It seems like under POP the messages get passed to the filters
before they show up as unread in a folder; in my IMAP case, I get a blob
of unread messages in INBOX, then half a second later they vanish as they
get Junk classified. Not sure if that's fixable.

[6] Hey, so I just attached a little shell script as an example, but
it's showing up as MIME type application/x-shellscript . I certainly
wouldn't want anyone's client to try and just *run* this script (it's
not like it's a photo which needs a viewer) - I want to deliver it as
text/plain so people can glance at it if they want to. How do I do that?
Hm. Anyway, to work around this and achieve text/plain, I stripped the
#!/bin/sh line.

-- 
Andrew Frederick Cowie

OPERATIONAL DYNAMICS
Operations Consultants and Infrastructure Engineers

http://www.operationaldynamics.com/

Attachment: sa-learn
Description: Text document

Attachment: signature.asc
Description: This is a digitally signed message part
