b8: readme

Index

  1. Description of b8
    1. What is b8?
    2. How does it work?
    3. What do I need for it?
    4. What's different?
    5. And why is it called b8?
  2. Update from prior versions
    1. Update from bayes-php version 0.2.1 or earlier
    2. Update from bayes-php version 0.3 to 0.3.3
  3. Installation
  4. Configuration
    1. Configuration values inside the classes
    2. A note on security
    3. Config files
    4. b8's base configuration
    5. Configuration of the lexer
    6. Configuration of the storage backend
      1. BerkeleyDB
      2. MySQL
        1. MySQL in a multi user environment
      3. SQLite
    7. Configuration of the interface
      1. Basic interface configuration
      2. Configuration of the interface's work storage backend
  5. Using b8
    1. Creating a new database
    2. Using b8 in your scripts
    3. Training b8
    4. Classifying texts
    5. The administration interface
  6. Tips on Operation
  7. References

Description of b8

What is b8?

b8 is a spam filter implemented in PHP (formerly called "bayes-php"). It is intended to keep your weblog or guestbook spam-free. The filter can be used anywhere in your PHP code and tells you if a text is spam or not, using statistical text analysis. See How does it work? for details about this.
To be able to do this, b8 first has to learn some spam and some ham example texts to decide what's good and what's not. If it makes mistakes classifying unknown texts, they can be corrected and b8 learns from what was going wrong, getting better with each example text.

Basically, b8 is a Bayesian spam filter like Bogofilter or SpamBayes, but it is not intended to classify emails. On the other hand, I don't know of a good spam filter (or any spam filter that isn't just example code showing how one could implement a Bayesian spam filter in PHP) that is intended to filter weblog or guestbook entries, and here we have the raison d'être for b8 ;-)
Because of this, the way b8 works is slightly different from most of the Bayesian email spam filters out there. See What's different? if you're interested in the details.

How does it work?

b8 is a naive Bayesian Spam filter, basically using the technique described in Paul Graham's article "A Plan For Spam" [1]. The improvements proposed in Graham's article "Better Bayesian Filtering" [2] and Gary Robinson's article "Spam Detection" [3] have also been considered. See also "A Statistical Approach to the Spam Problem" [4].

b8 cuts the text to classify into pieces, extracting stuff like email addresses, links and HTML tags. For each such token, it calculates a single probability for it being spam, based on what the filter has learned so far. When the token was not seen before, b8 tries to find similar ones using the "degeneration" described in [2] and uses the most relevant value found. If really nothing is found, a default rating is set.
Then, b8 takes the most relevant values (which have a rating far from 0.5, which would mean we don't know what it is) and calculates the probability that the whole text is spam by the inverse chi-square function described in [3].
There are some parameters that can be set which influence the filter's behaviour (see below).
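
The two steps described above (picking the most relevant ratings, then combining them) can be sketched roughly as follows. This is only an illustration and not b8's actual code: the function names selectRelevant(), chi2Q() and combineRatings() are made up for this example, the combining step assumes the inverse chi-square scheme from Robinson's articles, and all ratings are assumed to lie strictly between 0 and 1. The defaults 15 and 0.2 are the documented defaults of useRelevant and minDev.

```php
<?php
# Hypothetical helper: keep the ratings that deviate most from 0.5.
function selectRelevant($ratings, $useRelevant = 15, $minDev = 0.2) {
    $relevant = array();
    foreach ($ratings as $rating) {
        # Skip ratings too close to 0.5 ("we don't know what it is")
        if (abs($rating - 0.5) >= $minDev) {
            $relevant[] = $rating;
        }
    }
    # Sort by distance from 0.5, largest distance first
    usort($relevant, function ($a, $b) {
        $devA = abs($a - 0.5);
        $devB = abs($b - 0.5);
        if ($devA == $devB) {
            return 0;
        }
        return ($devA > $devB) ? -1 : 1;
    });
    return array_slice($relevant, 0, $useRelevant);
}

# Probability that a chi-square distributed value with $df degrees of
# freedom ($df must be even) is larger than $x2.
function chi2Q($x2, $df) {
    $m = $x2 / 2.0;
    $term = exp(-$m);
    $sum = $term;
    for ($i = 1; $i < $df / 2; $i++) {
        $term *= $m / $i;
        $sum += $term;
    }
    return min($sum, 1.0);
}

# Combine single ratings (each strictly between 0 and 1) into one value.
function combineRatings($ratings) {
    $n = count($ratings);
    if ($n == 0) {
        return 0.5; # nothing to go on
    }
    $lnRatings = 0.0;
    $lnInverse = 0.0;
    foreach ($ratings as $p) {
        $lnRatings += log($p);
        $lnInverse += log(1.0 - $p);
    }
    # Close to 1 when the ratings are mostly close to 1
    $spamIndicator = chi2Q(-2.0 * $lnRatings, 2 * $n);
    # Close to 1 when the ratings are mostly close to 0
    $hamIndicator = chi2Q(-2.0 * $lnInverse, 2 * $n);
    return (1.0 + $spamIndicator - $hamIndicator) / 2.0;
}

# Example: ratings near 0.5 are dropped, the rest is combined
$score = combineRatings(selectRelevant(array(0.99, 0.98, 0.55, 0.95, 0.45)));
```

With mostly high ratings, the combined value ends up close to 1; with mostly low ratings, close to 0; with balanced ratings, around 0.5.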

In short: you give b8 a text and get back a value between 0 and 1, saying it's ham when it's near 0 and saying it's spam when it's near 1.

What do I need for it?

Not much ;-) You just need PHP on the server where b8 should be used (b8 works with both PHP 4 and 5) and a suitable storage backend for the wordlists. I strongly recommend using a BerkeleyDB. See below how you can check if you can use it and why. If the server's PHP wasn't compiled with BerkeleyDB support, a MySQL or SQLite table can be used alternatively.

What's different?

b8 is designed to classify weblog or guestbook entries, not emails. So, it uses a slightly different technique than the others use.

My experience was that such spam entries are quite short, sometimes just "123abc" as text and a link to a suspect homepage. Some spam bots don't even distinguish between e. g. the "name" and "text" fields and post their text as the email address, for example. So, b8 just takes one string to classify, making no distinction between "headers" and "text".
The other thing is that most Bayesian filters count a token once, no matter how often it appears in the text (as Graham describes in [1]). b8 does count how often a token appears and takes this into account when learning and classifying. Additionally, the numbers of learned ham and spam texts are saved and used as the basis for calculating the single probabilities. Why? Because a text containing one link (no matter where it points to, just indicated by a "http://" or a "www.") might not be spam, but a text with 20 links in it might be.
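
To make the difference between "seen or not" and "seen how often" concrete, here is a small sketch. It is not b8's lexer: splitTokens() simply splits on whitespace as a stand-in, and all three function names are invented for this example.

```php
<?php
# Stand-in tokenizer for this example only: split on whitespace
function splitTokens($text) {
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

# "Count once" view: every token that appears gets the value 1
function tokenPresence($text) {
    return array_fill_keys(array_unique(splitTokens($text)), 1);
}

# "Count every occurrence" view: token => number of appearances
function tokenCounts($text) {
    return array_count_values(splitTokens($text));
}

$text = "cheap pills www.a.example cheap pills www.b.example cheap";
$presence = tokenPresence($text); # 'cheap' => 1
$counts   = tokenCounts($text);   # 'cheap' => 3
```

In the second view, a text that repeats the same token 20 times looks different from one that contains it once, which is the point made above.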

But this also means that b8 might be good at classifying weblog or guestbook entries – but very likely, it will work quite poorly when used for something else (like classifying emails). But as said above, for this task, there are a lot of very good filters out there.

And why is it called b8?

The initial name for the filter was (damn creative!) "bayes-php". There were two main reasons for looking for another name: 1. "bayes-php" sucks. 2. The PHP License does not allow the name of a script written in PHP to contain "PHP". Read the License FAQ for a reasonable argument about this.

Luckily, Tobias Lang proposed the new name "b8". And these are the reasons why I chose this name:

  1. "bayes-php" is a "b" followed by 8 letters.
  2. "b8" is short and handy. Additionally, there was no program with the name "b8" or "bate".
  3. The English verb "to bate" means "to decrease" – and that's what b8 does: it decreases the number of spam entries in your weblog or guestbook!
  4. "b8" just sounds way cooler than "bayes-php" ;-)

So … have a lot of fun using b8 :-)

Update from prior versions

If this is a new b8 installation, read on at the Installation section!

Update from bayes-php version 0.2.1 or earlier

Please first follow the database update instructions of the bayes-php-0.3 release if you update from a version prior to bayes-php-0.3 and then read the following paragraph about updating from a version <=0.3.3.
As most people don't use such an old version (I hope so!), I haven't included the update scripts from bayes-php-0.3 in the current release anymore. If you have to do some work now: sorry ;-)

Update from bayes-php version 0.3 to 0.3.3

The configuration model of b8 has changed in version 0.4. The config files aren't PHP files anymore, so the easiest way to update is simply to change the new config files according to your needs (see below). You can just keep your database as it is. When you use MySQL as b8's storage backend, please read the following paragraph!

IMPORTANT: MySQL tables have to be updated! MySQL does not handle text primary keys case sensitive by default (… and nobody noticed it ;-), but this is really important for the filter's performance! Please alter your table executing the following query:

ALTER TABLE <table_name> CHANGE token token VARCHAR(255) BINARY

This will make tokens case sensitive. No data will be lost doing this. BerkeleyDB and SQLite users are not affected.

Installation

Installing b8 on your server is quite easy. You just have to copy the files you need to your server. To do this, you could just upload the whole b8 directory to the base directory of your homepage.
A more minimalistic installation would be just to upload what you need. That is the etc directory, the lexer directory, the storage directory with only the storage backend you want to use and the two files b8.php and shared_functions.php.
The complete interface and doc directories aren't needed for just using the filter.

That's it ;-)

If you just want to have a look at b8, you probably don't want to set up the administration interface. If you don't want to use the interface, you neither have to install it nor configure it. If you do want the interface, you will need a relational database (MySQL) to use all of its functions, even if you use BerkeleyDB as b8's storage backend (and you should ;-). Read below about the difference between b8's storage backend and the work storage backend used by the interface.

Anyway, the whole interface stuff is not needed by b8 to do its work, it's just an optional feature.

Configuration

Configuration values inside the classes

All configuration values are stored in the variable config inside the class being configured and can be changed after the class was loaded with e. g. $b8->config['mindev'] = 0.1;.

A note on security

The configuration files of b8 are plain text files that will normally be sent by the web server as-is if they are requested via HTTP. For this reason, one could, in case e. g. MySQL is used as b8's storage backend and no shared connection is used, read user names and passwords in clear text if one knows the URL of the configuration file.
As a consequence (and of course also because of the administration interface, if used) I strongly recommend protecting the whole directory of b8 via .htaccess to prevent unauthorized access.
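
As a rough sketch of such a protection, assuming an Apache web server that honours .htaccess files and HTTP basic authentication, a file like the following could be placed in the b8 directory. The realm name and the path to the password file are placeholders; adapt them to your server.

```
# .htaccess in the b8 directory (sketch, Apache only)
AuthType Basic
AuthName "b8"
# Placeholder path: keep the password file outside the document root
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Other web servers use different mechanisms, so check your server's documentation if you don't use Apache.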

Config files

All config files are found in the etc directory. These are simple text files with the syntax parameter = value. Empty lines are ignored, just as everything behind a # is. Please note that all values in the config files are case-sensitive.
When a config file isn't found or a value isn't set, the default settings are used.
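
To illustrate the syntax, a config_b8 file that simply spells out the documented defaults (with dba as the storage backend) might look like this. The individual parameters are explained below; the comments are only there to show how # works.

```
# etc/config_b8 (example)

# Lexer and storage backend
lexerType = default
databaseType = dba

# Mathematical internals (documented defaults)
useRelevant = 15
minDev = 0.2
robX = 0.5
robS = 0.3
sharpRating = FALSE
```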

If you use Windows and edit these files with notepad.exe, don't be surprised about the missing newlines. It hasn't known how to handle UNIX line breaks since my old DOS times. Perhaps the Vista notepad does by now ;-) Anyway, a configuration file edited with this brilliant piece of software will also work.

The configuration options of the particular files are described below.

b8's base configuration

b8's base configuration file is config_b8. The following values can be set:

lexerType
This defines the class used to transform a passed text into a list of tokens. At the moment, there is just one such class ("default"), so simply leave this set to default.
databaseType

This defines what storage backend should be used to save the wordlist. Three database backends are available: dba, mysql and sqlite.

dba
This is the preferred storage backend. When choosing this option, b8 will use a BerkeleyDB to save the wordlist. This was the initial backend for the filter and remains the most performant. b8's storage model is optimized for this database, as it is really fast and fits perfectly with what the filter needs to do its work. All content is saved in a single file.
If you don't know whether your server's PHP can use a BerkeleyDB, just run the dba_versioncheck.php script from the doc directory on your server. It will list all available DBA handlers. If there are handlers for BerkeleyDB, please use this backend.
mysql
As some webspace hosters don't allow using a BerkeleyDB (but please be sure to check if you could use it!), but most do provide a MySQL server, using a MySQL table for the wordlist is provided as an alternative storage method. As said above, b8 is programmed to use a BerkeleyDB. It doesn't use or need SQL to query the database. So, very likely, this will perform worse, produce a lot of unnecessary overhead (which you probably won't even notice anyway … ;-) and waste computing power. But it will do fine anyway!
sqlite
Laurent Goussard (loranger at free . fr) added a storage backend that uses an SQLite table. In principle, it's the same as with MySQL, but perhaps with less overhead, as this is a fast, small library that stores an SQL database within a single file without running a server.
Please notice that this breaks PHP 4. You have to use PHP 5 for this storage backend.
Please also notice that I don't and can't use this backend – so when there's a bug in it, write an email to Laurent ;-)

The following ones are settings that influence the mathematical internals of the filter. If you want to experiment, feel free to play around with them; but be warned: the wrong settings of these values will result in poor performance or even "short-circuit" the filter.
Leave these values as they are unless you are sure that your changes will result in a better performance!

The "Statistical discussion about b8" [5] shows why the default values are the default ones.

useRelevant
This tells b8 how many tokens should be used when calculating the spamminess of a text. The default setting is 15. This seems to be a quite reasonable value. When using too many tokens, the filter will fail on texts filled with useless stuff or with passages from a newspaper, etc. that are not very spammish.
The tokens counted multiple times (see above) are added in addition to this value. They don't replace other ratings.
minDev
This defines a minimum deviation from 0.5 that a token's rating must have to be considered when calculating the spamminess. Tokens with a rating closer to 0.5 than this value will simply be skipped.
If you don't want to use this feature, set this to 0. Defaults to 0.2. Read [5] before increasing this.
robX
This is Gary Robinson's "x" constant. A completely unknown token will be rated with the value of robX. The default 0.5 seems quite reasonable, as we can't say if a token that also can't be rated by degeneration is good or bad.
If your experience is that you receive much more spam than ham or vice versa, you could change this setting accordingly.
robS
This is Gary Robinson's "s" constant. This is essentially the probability that the robX value is the correct one for an unknown token. It will also shift the probability of rarely seen tokens towards this value, if the token has been in ham and spam so far. The default is 0.3.
See [3] for a closer description of the "s" constant and read [5] for specific information about this constant in b8's algorithms.
sharpRating
If set to TRUE, b8 does a quite harsh rating of tokens that have been only in ham or only in spam (as proposed in [2]): if the token was just in spam or ham less than 10 times, it gets rated with 0.9998 or 0.0002. With more than 10 times, it is rated with 0.9999 or 0.0001.
This has been the (non-changeable) default setting since version 0.2 but isn't anymore, as the enhanced probability calculation proposed in [3] shows significantly better results [5].
If you really want to set this to TRUE and experience better results with it, let me know why ;-)
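
To give an idea of how robX and robS enter a token's rating, here is a sketch of Gary Robinson's formula f(w) = (s * x + n * p(w)) / (s + n) from [3]. It is not copied from b8: the function name robinsonRating() and its parameters are invented for this example, and the way p(w) is derived from relative frequencies is an assumption. At least one learned ham and one learned spam text are assumed.

```php
<?php
# Hypothetical sketch of Robinson's token rating.
# $countHam / $countSpam: how often the token was seen in ham / spam
# $textsHam / $textsSpam: number of learned ham / spam texts (each >= 1)
function robinsonRating($countHam, $countSpam, $textsHam, $textsSpam,
                        $robX = 0.5, $robS = 0.3) {
    $n = $countHam + $countSpam;
    if ($n == 0) {
        # Completely unknown token: rated with robX
        return $robX;
    }
    # Relative frequencies, so unequal numbers of ham and spam texts
    # don't distort the result
    $relHam  = $countHam / $textsHam;
    $relSpam = $countSpam / $textsSpam;
    $p = $relSpam / ($relHam + $relSpam);
    # Rarely seen tokens (small $n) are pulled towards robX
    return ($robS * $robX + $n * $p) / ($robS + $n);
}

$unknown  = robinsonRating(0, 0, 10, 10); # equals robX
$spamOnly = robinsonRating(0, 5, 10, 10); # close to 1, but not 1
```

The larger robS is, the more evidence (n) a token needs before its rating moves away from robX.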

Configuration of the lexer

The lexer class transforms the string passed to b8 into an array of single tokens. Initially, this was a part of the filter itself, but if anybody wants to write a "special" lexer, this can be done quite easily now that it's a separate class.

These are the settings in config_lexer that the default lexer takes:

minSize
This is the minimal length a token has to have. Defaults to 3.
maxSize
This is the maximal length a token can have. Theoretically, there's no limit when using BerkeleyDB or SQLite (255 for MySQL), but it makes no sense to store very long tokens, as they would very likely be nonsense. Defaults to 30, which should be really sufficient.
allowNumbers
Sets whether to accept pure numbers ("123456"). Defaults to FALSE.
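
The effect of these three settings can be pictured with a small filter function. This is only a sketch of the idea and not the default lexer's code; filterTokens() is an invented name, and lengths are measured in bytes here.

```php
<?php
# Hypothetical sketch: drop tokens that the lexer settings would reject
function filterTokens($tokens, $minSize = 3, $maxSize = 30, $allowNumbers = false) {
    $kept = array();
    foreach ($tokens as $token) {
        $length = strlen($token);
        # Too short or too long
        if ($length < $minSize || $length > $maxSize) {
            continue;
        }
        # Pure numbers like "123456" are only kept if allowed
        if (!$allowNumbers && preg_match('/^[0-9]+$/', $token)) {
            continue;
        }
        $kept[] = $token;
    }
    return $kept;
}

$tokens = array('ab', 'abc', '123456', str_repeat('x', 31), 'hello');
$kept = filterTokens($tokens); # 'abc' and 'hello' remain
```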

Configuration of the storage backend

The configuration of b8's storage backend is done in config_storage. One setting is common for all backends:

createDB
This sets whether a new database should be created. Defaults to FALSE. See Creating a new database for how to do this.

The other values to set depend on the database backend used:

BerkeleyDB

When using a BerkeleyDB, the following values can be set:

dbFile
This is the filename of the file which will be used as the database. Defaults to wordlist.db.
This path is relative to the base directory of b8, so if you put b8 in /b8/ on your server and used wordlist.db here, the file would be /b8/wordlist.db (of course in the document root of the server, not the real / root directory). If you wanted to use /wordlist.db, you could put in here ../wordlist.db.
An absolute path name starting with / will be used as-is.
dbVersion
This is the DBA database handler used when connecting to the database. Defaults to db4. When you don't know which one to use, simply run the script doc/dba_versioncheck.php on your server. It will show you all available and suitable handlers.
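
The path rule for dbFile described above can be sketched like this. The function resolveDbPath() is invented for illustration and assumes that relative paths are simply appended to b8's base directory; b8's own code may do this differently.

```php
<?php
# Hypothetical sketch: absolute paths are used as-is, everything else
# is taken relative to b8's base directory
function resolveDbPath($baseDir, $dbFile) {
    if (substr($dbFile, 0, 1) === '/') {
        return $dbFile;
    }
    return rtrim($baseDir, '/') . '/' . $dbFile;
}

$default  = resolveDbPath('/var/www/b8', 'wordlist.db');
$parent   = resolveDbPath('/var/www/b8', '../wordlist.db');
$absolute = resolveDbPath('/var/www/b8', '/data/wordlist.db');
```

The directory /var/www/b8 is just an example location for the b8 installation.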

MySQL

When using a MySQL table, we essentially only need one setting:

tableName
This is the name of the table containing b8's wordlist. Defaults to b8_wordlist.

If you use MySQL as storage backend for b8, your guestbook or weblog will use it, too, I think. So, if you connect to MySQL anyway in the script that uses b8, simply pass the return value of mysql_connect() (which is a MySQL-link resource) to b8 as a parameter. Then, b8 will use the same resource to query MySQL.
See also the example code below.

If you want b8's MySQL storage class to set up its own MySQL connection (e. g. when b8's wordlist is stored in another database or on another server), you can put your access data in the config file:

host
The host to connect to
user
The username to use
pass
The password to use
db
The database to use
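
Put together, a config_storage file for a MySQL backend that sets up its own connection might look like the following. The host, user name, password and database name are placeholders, not defaults.

```
# etc/config_storage (example for the MySQL backend)
createDB = FALSE
tableName = b8_wordlist

# Placeholder access data, replace with your own
host = localhost
user = b8user
pass = secret
db = mydb
```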

I think in most cases, you can simply pass your existing resource link to b8. But please note that b8 does insist on a resource link (no random shots anymore since version 0.4 ;-) – so if no resource was passed, it will try to set up its own connection.

MySQL in a multi user environment

I'm pretty sure you don't want to use the functionality described below when you just set up a b8 installation on your home page.

I got requests that the MySQL storage class should be able to set the table's name dynamically without changing the configuration file, so that one b8 installation can be used by multiple users. In particular, this was necessary to embed b8 in an add-on for the Redaxo CMS.

Ever since version 0.4.3, one can pass either just a MySQL-resource link to b8 when using the MySQL storage backend (as described above), or an array, containing one of or both mysqlRes and tableName as keys. If tableName is passed, the value stored here will be used as MySQL's b8 table instead of the one stored in the config file (and no table with the name from the config file will be necessary to construct the storage class).

An example for creating a new b8 instance and passing both a MySQL-link resource and a table name would be:

$b8 = new b8(array('mysqlRes' => $mysqlRes, 'tableName' => $tableName));

As this is quite a special case for usage of b8, the default behaviour to simply pass a MySQL-link resource to b8 described in the example below works anyway of course.

SQLite

When using SQLite, the following values can be set:

sqliteFile
This is the filename of the file which will be used as the database. Defaults to wordlist.db.
The path given here will be handled in the same way as the dbFile parameter of the BerkeleyDB configuration.
tableName
This is the name of the table containing b8's wordlist. Defaults to b8_wordlist.

As with MySQL, you can also pass an existing SQLite-link resource when creating a new b8 object to use an existing database connection rather than letting b8 make its own.

Configuration of the interface

As said above: if you don't need or want to use the interface, you don't have to configure it (and of course, you also don't have to configure its work storage backend).

If you want a quick setup, just to have a look at b8, or simply don't want the interface, read on at Using b8.

Basic interface configuration

This is done in config_interface. The following values can be set:

workStorage
Since version 0.4, b8 creates a queryable SQL table with the calculated and split data of the real database for the interface. The normal database works fine and performs very well in normal operation, but it can't do SQL (b8 is intended to be used with a BerkeleyDB, as said above :-). But we want SQL here, as otherwise the whole database would have to be parsed each time you run a query like counting all tokens.
In short: we need an additional storage method. At this time, only a MySQL work storage backend exists, so this is set to mysql by default.
shareConnection
The interface has to set up a link to the storage backend that saves the work database. If this is the same one used for b8's storage class, its resource link can be passed on by the interface (e. g. when MySQL is used for both b8's storage class and the interface's work storage class).
As I assume that BerkeleyDB is used for b8, this defaults to FALSE.

Configuration of the interface's work storage backend

This is done in config_storage_work. When using MySQL, the following values can be set:

tableName
The name of the table that will be created for the work database. Defaults to b8_work.
host
The host to connect to
user
The username to use
pass
The password to use
db
The database to use

Using b8

Now that everything is configured, you can start to use b8. A sample script that shows what can be done with the filter exists in example/. The best way to test how all this works is to use this script before using b8 in your own scripts.

Creating a new database

This is quite easy. Just set createDB to TRUE in config_storage and do something with b8 one time. E. g. try to classify or learn a text. The database will be created, but the filter won't do anything unless this flag is removed from the config file.

If you use MySQL, this is really the only thing you have to do. When using BerkeleyDB or SQLite, the directory where the new database will be created has to be writable for the server's user. Very likely, you have to set full write access to that directory to allow this (chmod 0777 <dir>), so that permissions will be drwxrwxrwx afterwards. This can be done e. g. with your FTP or SSH client.

After the database has been created, remove createDB = TRUE from config_storage or set it to FALSE.

Using b8 in your scripts

This is also quite easy :-) Have a look at the example script in example/ on how to do this. You could e. g. put the following code in your scripts:

# Include the b8 code
require_once "{$_SERVER['DOCUMENT_ROOT']}/b8/b8.php";

# Create a new b8 instance
$b8 = new b8;

# Check if everything worked smoothly
if(!$b8->constructed) {
	echo "<b>example:</b> Could not initialize b8. Truncating.";
	exit;
}

The variable constructed indicates whether b8 was set up properly (TRUE) or not (FALSE). All classes used in b8 have the same functionality. Doing it this way is crappy, as PHP 4's object model is … but I want b8 to be compatible with PHP 4, and this was (in my opinion) the only relatively "clean" way to make b8 run with both PHP 4 and PHP 5.
Perhaps the PHP 4 compatibility will be booted out one day and this stuff will be implemented in a less sucky way. So please don't hate me for writing crappy PHP code – I know better, but I'm forced to do it this way ;-)

When using MySQL as b8's storage backend, you will very likely connect to your database before using b8 anyway. Perhaps you will also do that when using SQLite. A sample for passing the existing MySQL-link resource to b8 would be (to be used in an analogous way with SQLite):

$host = "127.0.0.1";
$user = "user";
$pass = "pass";
$db   = "mydb";

$mysqlRes = mysql_connect($host, $user, $pass);

if(!$mysqlRes)
	die("<b>Example:</b> Could not connect to MySQL (" . mysql_error() . ")<br />\n");

mysql_select_db($db) or die("<b>Example:</b> Could not select database \"$db\". Truncating.<br />\n");

# Include the b8 code
require_once "{$_SERVER['DOCUMENT_ROOT']}/b8/b8.php";

# Create a new b8 instance and pass the MySQL-link resource to b8
$b8 = new b8($mysqlRes);

# Check if everything worked smoothly
if(!$b8->constructed) {
	echo "<b>example:</b> Could not initialize b8. Truncating.";
	exit;
}

After b8 has been set up, its functions can be used in an object oriented way. E. g. if you want to use its learn() function (see below) to register a ham text, you would use the following code (assuming the variable $b8 contains the b8 instance):

$text = "This is the text to learn";
$b8->learn($text, "ham");

Training b8

Before b8 can decide whether a text is spam or ham, you have to tell it what you consider as spam or ham. At least one learned spam and one learned ham text is needed to calculate anything.
For doing this, the following functions are provided (and can be nicely run with the example script):

learn($text, $category)
This saves the reference text $text in the category $category. This can be either "ham" or "spam" (case-sensitive!).
unlearn($text, $category)
This deletes the reference text $text from the category $category.
This function only exists to delete a text from a category in which it has been accidentally stored before. When you store a ham text in spam by mistake and you already have a lot of words, this will not have much influence on b8 anyway.
Don't delete a spam text from ham after saving it in spam or vice versa, unless you have accidentally stored it in the wrong category before!!!
This will break the filter after a time (as the counter for saved ham texts will reach 0 one day, although you have ham tokens stored). Anyway, this makes no sense at all, as the text wasn't stored in spam or ham before and can't be removed from it for this reason.
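
The one legitimate use of unlearn(), moving a text that really was stored in the wrong category, could be wrapped like this. The helper correctCategory() is invented for this example; only learn() and unlearn() are b8's own functions, and $b8 is assumed to be a properly constructed b8 instance.

```php
<?php
# Hypothetical helper: fix a text that was learned in the wrong category.
# Only call this if $text really was learned as $wrongCategory before!
function correctCategory($b8, $text, $wrongCategory, $rightCategory) {
    if ($wrongCategory === $rightCategory) {
        return false; # nothing to correct
    }
    # Remove the text from the category it was wrongly stored in ...
    $b8->unlearn($text, $wrongCategory);
    # ... and store it where it belongs
    $b8->learn($text, $rightCategory);
    return true;
}
```

A call like correctCategory($b8, $text, "spam", "ham") would then undo an accidental spam training and register the text as ham instead.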

Classifying texts

This is done by b8's function classify($text). This function takes the text in $text, rates it and returns a float value between 0 and 1.
A value close to 0 is more likely ham and a value close to 1 is more likely spam. What to do with this value is your business ;-) See Tips on Operation below.

The administration interface

b8 comes with a database administration interface. This is found in the interface/ directory.

The following two function groups only concern b8's database; therefore, no relational database connection has to be defined in addition to b8's storage backend. Values set in the config files to configure such a connection won't be used by these functions.

Database backup/recovering
Here, you can make a (storage backend independent) backup of your database. This database dump will be stored in a plain text file. You can recover such a backup later by merging it into your existing database (existing tokens will be updated, non-existing ones will be added) or by emptying the database before recovering. Both variants can be done here.
Database optimization
This will optimize the internal structure of your database and e. g. delete wasted space that may show up after a lot of database transactions. Anyway, it won't hurt to do this, even if the database is already optimized.

The other functions use an SQL work database, which has to be configured. Why we need this is explained above.

Work database creation / b8 database sync
Here, a work database can be created from b8's current wordlist. The table containing it will be wiped and re-filled completely every time a work database is created.
When syncing b8's database with the work database, b8's database will first be emptied and then filled with the work database's data. Have a backup when doing this! All data could be lost!
Database info
This will output some information about the number of learned texts and tokens, etc.
Database edit interface
Here the work database can be changed or queried. This can be done by a nice interface or by direct SQL calls.
Important: This is essentially not necessary. It was written for debugging initially. I think the database won't grow so big so fast that you really have to delete anything from it … but it can be done anyway. b8 logs the date when a token was seen for the last time, so from time to time, one could delete very old tokens that really appeared just once.
Don't change the counts of tokens unless you really know what you are doing. Note that these statistics are really objective. If your girlfriend's name is the highest rated token, it really appeared in all the spam texts, whether you like it or not ;-) So – just let the filter do its work.

Anyway, whatever you do with the work database, it won't affect b8's wordlist, until you sync it with the work database.

Tips on Operation

For practical use, I advise giving the filter all data available. E. g. name, email address, homepage, IP address and of course the actual text should be stored in a variable (e. g. with a \n after each block) and then be classified. The learning should also be done with all data available.
Saving the IP address is probably only meaningful for spam entries, because spammers often use the same IP address multiple times. In principle, you can leave out the IP of ham entries.
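
Building such a combined string could look like the following sketch. The function name buildEntryText() and the field names are invented for this example; the IP address is optional so that it can be left out for ham entries.

```php
<?php
# Hypothetical helper: join all available fields of an entry into one
# string, one block per line, for classify() or learn()
function buildEntryText($name, $email, $homepage, $text, $ip = null) {
    $blocks = array($name, $email, $homepage);
    if ($ip !== null) {
        $blocks[] = $ip;
    }
    $blocks[] = $text;
    return implode("\n", $blocks);
}

$entry = buildEntryText('Alice', 'alice@example.org',
                        'http://example.org', 'Nice site!');
```

The resulting string would then be passed to $b8->classify() or $b8->learn() as a whole.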

You can use b8 e. g. in a guestbook script and let it classify the text before saving it. Everyone has to decide which rating is necessary to classify a text as "spam", but a rating of >= 0.8 seems reasonable to me.
The email filters out there mostly use > 0.9 or even > 0.99; but keep in mind that they have way more data to analyze in most of the cases. A guestbook entry may be quite short, especially when it's spam.
If one expects the spam to be in another language than the ham entries, or the spam entries are normally very short, one could also think about a limit of 0.7.

In my opinion, an autolearn function is very handy. Of the messages classified as spam (rated with more than 0.7), I automatically save those with a rating higher than 0.8 but less than 0.9 as spam. I don't do this with ham messages in an automated way to prevent the filter from saving a false negative as ham and then classifying and learning all the spam as ham when I'm on holidays ;-)
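
The threshold and autolearn decisions described here can be written down as two small functions. Both names are invented for this sketch, and the limits are just the example values from the text above, not settings of b8 itself.

```php
<?php
# Hypothetical helper: treat a rating at or above the threshold as spam
function isSpam($rating, $threshold = 0.8) {
    return $rating >= $threshold;
}

# Hypothetical helper: only ratings inside the given window are
# learned as spam automatically; ham is never learned automatically
function shouldAutolearnSpam($rating, $lower = 0.8, $upper = 0.9) {
    return $rating > $lower && $rating < $upper;
}

# Example with a rating as it could be returned by $b8->classify($text)
$rating = 0.85;
$reject = isSpam($rating);
$learn  = shouldAutolearnSpam($rating);
```

In a guestbook script, one would call $b8->learn($text, "spam") when shouldAutolearnSpam() returns TRUE and leave ham training to a human.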

Anyway, as long as I use b8, I have just a few false negatives (spam messages that were classified as ham) and not one false positive (ham message that was classified as spam) after about 150 learned ham and spam texts.
This results in a sensitivity of about 99 % and a specificity of 100 % for me. I hope you'll get the same good results :-)
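
For reference, these two figures are calculated as shown below. The counts in the example are made up purely to illustrate the arithmetic; they are not the actual numbers behind the results above.

```php
<?php
# Share of spam texts that were recognized as spam
function sensitivity($truePositives, $falseNegatives) {
    return $truePositives / ($truePositives + $falseNegatives);
}

# Share of ham texts that were recognized as ham
function specificity($trueNegatives, $falsePositives) {
    return $trueNegatives / ($trueNegatives + $falsePositives);
}

# Invented counts: 99 spam texts caught, 1 missed, 100 ham texts passed
$sens = sensitivity(99, 1);  # 0.99
$spec = specificity(100, 0); # 1.0
```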

References

  1. A Plan For Spam
  2. Better Bayesian Filtering
  3. Spam Detection
  4. A Statistical Approach to the Spam Problem
  5. Statistical discussion about b8
Tobias Leupold (tobias . leupold at web . de)
http://nasauber.de/