In looking for a way to catch the "new" breed of spam that uses randomly generated words to foil Bayesian filters, I came up with an idea to use a vowel/consonant word ratio.
Basically, a filter would compare the number of vowels in a word with the number of letters in that word to compute a ratio. If the average ratio for a message falls within a certain range, the message is good. Otherwise, it's spam.
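Here is a minimal sketch of that idea in Python. It is not the program used for the tests in this post. The function names and the `low`/`high` bounds are made-up placeholders, and treating only a, e, i, o, u as vowels is an assumption.

```python
VOWELS = set("aeiou")  # assumption: 'y' is counted as a consonant


def vowel_ratio(word):
    """Vowels divided by total letters; returns None if the word has no letters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return None
    vowels = sum(1 for ch in letters if ch in VOWELS)
    return vowels / len(letters)


def message_ratio(text):
    """Average per-word ratio over a whole message, or None if there are no words."""
    ratios = [vowel_ratio(w) for w in text.split()]
    ratios = [r for r in ratios if r is not None]
    if not ratios:
        return None
    return sum(ratios) / len(ratios)


def looks_like_spam(text, low=0.25, high=0.55):
    """Flag a message whose average ratio falls outside [low, high].

    The bounds are illustrative only; real values would have to be tuned.
    """
    ratio = message_ratio(text)
    if ratio is None:
        return False  # nothing to judge
    return not (low <= ratio <= high)
```

A word like "hello" scores 2/5, while a consonant-only jumble like "zxpq" scores 0, which drags the message average out of the accepted range.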
I talked this concept over with Berta last night. I kept thinking that maybe it would be simpler just to run a spell check on the message. The more words that fail, the more likely a message is to be spam.
So I put together a program to test my theories. I discovered some very weird things.
I fed the entire texts of three books from Project Gutenberg into the processor to compute the average vowel/consonant ratio and the standard deviation in this ratio. Here are the results:
Book | v/c | σ(v/c) | time |
---|---|---|---|
Leaves of Grass | .38369 | .16102 | 1100ms |
Tom Sawyer | .38641 | .16278 | 110ms |
Don Quixote | .39704 | .15701 | 69ms |
I think it's weird that they all have numbers within .01 tolerance, but that's a good sign, right?
But the worst bit of news is the time. It takes a long time to process each book. Granted, Leaves of Grass has 122562 words (693kb), but it takes over a second to scan it.
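For anyone who wants to reproduce this kind of measurement, a rough harness could look like the following. It is a sketch, not the original processor: the word pattern, the vowel set, and the use of the population standard deviation are all assumptions, so the numbers it produces will not necessarily match the table.

```python
import re
import statistics
import time

WORD_RE = re.compile(r"[A-Za-z]+")  # assumption: a word is a run of ASCII letters


def vowel_ratio(word):
    """Vowels divided by total letters for a word made only of letters."""
    return sum(ch in "aeiou" for ch in word.lower()) / len(word)


def profile_text(text):
    """Return (mean ratio, standard deviation, elapsed milliseconds).

    Raises statistics.StatisticsError if the text contains no words.
    """
    start = time.perf_counter()
    ratios = [vowel_ratio(w) for w in WORD_RE.findall(text)]
    mean = statistics.mean(ratios)
    sigma = statistics.pstdev(ratios)  # population standard deviation
    elapsed_ms = (time.perf_counter() - start) * 1000
    return mean, sigma, elapsed_ms
```

Reading a Project Gutenberg text file into a string and passing it to `profile_text` would give the three columns of the table for that book.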
Even more interesting is running an actual spell checker over each book and computing the ratio of correctly spelled words to total words:
Book | spelling | time |
---|---|---|
Leaves of Grass | .97 | 720ms |
Tom Sawyer | .98 | 60ms |
Don Quixote | .97 | 40ms |
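One way a score like the "spelling" column could be computed is sketched below. It assumes the dictionary is already loaded into a Python set of lowercase words. The real spell checker behind these numbers is not shown here, so this is only an illustration.

```python
import re

# assumption: words are runs of letters, optionally with one internal apostrophe
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def spelling_ratio(text, dictionary):
    """Fraction of words in `text` that appear in `dictionary` (a set of lowercase words)."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return None
    known = sum(1 for w in words if w in dictionary)
    return known / len(words)
```

With `dictionary = {"the", "cat", "sat"}`, the text "The cat sat zxpq" has three known words out of four.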
Notice that the time to do a spell check (just whether the word is good or bad) on each word is less than the time to calculate the vowel/consonant ratio. This is probably because you can short-circuit a spell check using a prefix tree (trie). If a word "zxpqmnsol" is spell-checked, it shorts out at the second letter because there are no words that start "zx". On the other hand, the v/c ratio has to look at every letter of every word.
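The short-circuit behavior is easy to see with a small prefix tree. This is a generic sketch with made-up names, and whether the spell checker used here actually stores its dictionary this way is a guess.

```python
class TrieNode:
    """One node per prefix; `is_word` marks prefixes that are complete words."""

    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def lookup(root, word):
    """Return (found, letters_examined), stopping at the first dead end."""
    node = root
    for examined, ch in enumerate(word, start=1):
        node = node.children.get(ch)
        if node is None:
            return False, examined  # no dictionary word has this prefix
    return node.is_word, len(word)
```

With a dictionary containing "zebra" and "zoo" but nothing starting with "zx", `lookup(root, "zxpqmnsol")` gives up after examining two letters, while a real word has to be walked to the end.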
The only problem with the spell check is that it takes an additional 40ms to load the dictionary, which isn't shown in the table above. After the dictionary loads, everything is faster than the v/c check, but the nature of the application is that it will run only once for each message, so the dictionary will load each time.
For small messages, this is a huge performance hit. The Don Quixote excerpt, for example, doubles in scan duration, pushing it over the time needed for a v/c ratio check. For big messages, the additional loading time is a drop in the bucket compared to the relative accuracy.
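The arithmetic is simple enough to check with a few lines. The timings are the ones from the two tables above plus the 40ms load time; the helper name is just for illustration.

```python
DICTIONARY_LOAD_MS = 40

# (v/c scan time, spell-check scan time) in milliseconds, from the tables above
TIMINGS = {
    "Leaves of Grass": (1100, 720),
    "Tom Sawyer": (110, 60),
    "Don Quixote": (69, 40),
}


def faster_check(vc_ms, spell_ms, load_ms=DICTIONARY_LOAD_MS):
    """Name the cheaper method once the dictionary load is charged to every run."""
    return "spell check" if spell_ms + load_ms < vc_ms else "v/c ratio"


for book, (vc_ms, spell_ms) in TIMINGS.items():
    total = spell_ms + DICTIONARY_LOAD_MS
    print(f"{book}: spell check {total}ms vs v/c {vc_ms}ms -> {faster_check(vc_ms, spell_ms)}")
```

Don Quixote comes out at 80ms against 69ms, so the v/c check wins there. For the other two books the spell check stays ahead even with the load included (100ms vs 110ms, and 760ms vs 1100ms).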
The question with all of this, though, is how to determine which words are just random letters and which ones were accidentally spelled wrong. You wouldn't want to reject a message as spam just because the person who composed it couldn't spell or couldn't type.
So I added a feature that gets word suggestions for words that aren't in the dictionary. The number of suggestions is inverted so that smaller numbers mean that the word was probably just misspelled, and larger numbers mean that the word was probably just random letters.
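To make the "inverted" rating concrete, here is one possible reading of it. Everything in this sketch is an assumption: it uses Python's `difflib.get_close_matches` as a stand-in suggestion engine, scores a word as 1 / (1 + number of suggestions), and averages the scores of the unknown words. The actual formula behind the ratings in this post may be different.

```python
import difflib


def word_rating(word, dictionary, max_suggestions=5, cutoff=0.8):
    """0.0 for known words; otherwise 1 / (1 + suggestions), so fewer suggestions score higher."""
    if word in dictionary:
        return 0.0
    suggestions = difflib.get_close_matches(
        word, dictionary, n=max_suggestions, cutoff=cutoff
    )
    return 1.0 / (1 + len(suggestions))


def message_rating(words, dictionary):
    """Average rating of the words that are not in the dictionary (0.0 if all are known)."""
    ratings = [word_rating(w, dictionary) for w in words if w not in dictionary]
    if not ratings:
        return 0.0
    return sum(ratings) / len(ratings)
```

With `dictionary = ["hello", "help", "world"]`, the typo "helo" gets one suggestion and a rating of 0.5, while "zxpqmnsol" gets no suggestions and the maximum rating of 1.0. Note that every unknown word is compared against the whole dictionary, which hints at why a suggestion engine is so much slower than a plain lookup.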
Unfortunately, the suggestion engine takes significantly longer to process. In the case of Tom Sawyer, it took almost 5 seconds to return a rating of .279, which should be considered in the good range. Don Quixote, the shortest of the bunch, took just under 7 seconds to return a rating of .460, which is in the acceptable range, but not very good. I waited almost 2 minutes for Leaves of Grass to finish, and got a rating of .33, which is a good rating even if it took entirely too long. This testing method is somewhat impractical.
So I'm not sure what I'll do yet. The spell checker requires another dataset (the dictionary, ~300kb) be installed on the server. I think that if I can avoid cluttering the server with more files, I'll be better off. But it is just one file, and it's measurably faster than the v/c check for large messages. I wonder how many people really send book-sized messages, though.
Maybe I'll implement both ways and let the administrator configure which one is turned on. Yeah, that sounds like a plan for more work, doesn't it?