Followup on Spam Filtering

I figured that several readers are also bloggers in their own right, and might be interested in some information that I’ve gathered about spam and my efforts to block it.

This blog, which is not a terribly popular one, gets a substantial amount of comment spam. For example, here’s the amount of spam that was received for the last few months:

Dec2010: 5,028
Jan2011: 6,544
Feb2011: 4,712
Mar2011: 5,596

Compare that to the 25-30 legitimate comments made monthly, and you see that the ratio is extremely skewed in favor of spam. Since this blog was founded in 2008, 53,881 spams have been received, compared to 854 total legitimate messages.
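For anyone who wants to check the arithmetic, here is a quick sketch in Python that works out the ratio from the figures quoted above. The variable names are just for illustration.

```python
# Monthly spam counts quoted above.
monthly_spam = {"Dec2010": 5028, "Jan2011": 6544, "Feb2011": 4712, "Mar2011": 5596}

# Lifetime totals quoted above (since the blog was founded in 2008).
total_spam = 53_881
total_legit = 854

# How many spams arrive for each legitimate comment, and spam's share of everything.
spam_per_legit = total_spam / total_legit
spam_share = total_spam / (total_spam + total_legit)

print(f"Spam in the four listed months: {sum(monthly_spam.values()):,}")
print(f"Spam per legitimate comment:    {spam_per_legit:.1f}")
print(f"Spam share of all comments:     {spam_share:.1%}")
```

That works out to roughly 63 spams for every legitimate comment.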

Ideally, there would be no comment spam. Since this is not possible, I want to reduce spam by the maximum amount possible, inconvenience users as little as possible, and keep the spam queue in the WordPress administrative interface as empty as I can.

Now, WordPress comes with an outstanding spam filter called Akismet. When activated, all incoming comments are sent to Akismet for a spam/not-spam review. Since the service is centralized, they’re able to accumulate a huge amount of data about spammy and legitimate messages, adapt to changing spam patterns, and do remarkably well (99.96% according to my calculations) at detecting spam and allowing legitimate messages to pass. If it misses spam, or mistakenly flags legitimate mail as spam, I can override the Akismet decision (and that override is sent to Akismet so it can adapt).

Messages flagged as spam by Akismet go into the spam queue for my review. Unfortunately, this means that more than 150 spams a day get shunted there. Reviewing these messages is tedious and time-consuming. What if I could block the spam from even being submitted, thus reducing the amount of spam that I need to wade through?
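The "more than 150 a day" figure can be sanity-checked against the monthly counts listed earlier. This sketch assumes each of the four months was counted in full, which is my assumption rather than something the numbers themselves state.

```python
import calendar

# (year, month) -> spam received, from the monthly figures listed earlier.
monthly_spam = {(2010, 12): 5028, (2011, 1): 6544, (2011, 2): 4712, (2011, 3): 5596}

# Per-month daily rate, assuming each month was counted in full.
for (year, month), count in monthly_spam.items():
    days = calendar.monthrange(year, month)[1]
    print(f"{year}-{month:02d}: about {count / days:.0f} spams per day")

# Overall daily average across the whole period.
total_days = sum(calendar.monthrange(year, month)[1] for year, month in monthly_spam)
daily_average = sum(monthly_spam.values()) / total_days
print(f"Overall: about {daily_average:.0f} spams per day over {total_days} days")
```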

Since all WordPress blogs have the same comments.php file, spammers don’t even need to fill in the normal comments form on the website: they can submit their spam directly to the comments.php file with the appropriate fields already filled in. Of course, since this is all done automatically by software, a slight change to the comments.php file will leave the spambots unable to submit messages. Enter NoSpamNX, a very handy plugin that makes changes that break spambots but don’t affect humans. Specifically, it adds hidden fields to the human-readable comment form and fills them in with randomly-generated text (to keep the spammers from adapting, it changes these random values every 24 hours).

If a comment does not include these hidden fields with that day’s random text, that means that the comment was not submitted through the ordinary human-readable form, and therefore must be spam. One can elect to then mark the message as spam, or simply delete it outright.
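To make the idea concrete, here is a minimal sketch of a daily-rotating hidden field, written in Python. It is not NoSpamNX's actual code (the plugin is PHP and I haven't reproduced its internals). The secret key, the function names, and the use of an HMAC to derive the day's values are all my own assumptions for illustration.

```python
import hashlib
import hmac
from datetime import date

# Hypothetical per-site secret; a real site would generate and store its own.
SECRET_KEY = b"replace-with-a-site-specific-secret"


def daily_field(secret: bytes, day: date) -> tuple[str, str]:
    """Derive a hidden field name and value that both change every day."""
    digest = hmac.new(secret, day.isoformat().encode(), hashlib.sha256).hexdigest()
    return "f" + digest[:12], digest[12:36]


def render_hidden_input(secret: bytes, day: date) -> str:
    """HTML that the human-readable comment form would include."""
    name, value = daily_field(secret, day)
    return f'<input type="hidden" name="{name}" value="{value}">'


def looks_like_form_submission(secret: bytes, day: date, posted: dict[str, str]) -> bool:
    """True only if the posted fields carry that day's hidden name and value."""
    name, expected = daily_field(secret, day)
    supplied = posted.get(name, "")
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

A bot that posts straight to the comment handler never sees the form, so it has no way to know that day's field name or value, and the check fails. A real implementation would probably also accept the previous day's values for a while, so that someone who loads the page just before the rollover isn't rejected.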

This simple plugin has blocked 37,775 spams since I installed it in June 2010. During that same period, a total of 39,113 spams were submitted to my site. This means that NoSpamNX alone would have blocked about 96.6% of spam. Not bad, particularly for something that does not burden legitimate commenters with any additional steps like CAPTCHAs.
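The percentage is easy to reproduce from the two counts above. The remainder is the spam that presumably came in through the ordinary form, hidden fields and all.

```python
# Counts quoted above, both since NoSpamNX was installed in June 2010.
blocked_by_nospamnx = 37_775
total_submitted = 39_113

# Share of spam that failed the hidden-field check.
block_rate = blocked_by_nospamnx / total_submitted

# Spam that carried valid hidden fields, i.e. was not caught by NoSpamNX.
came_through_form = total_submitted - blocked_by_nospamnx

print(f"Blocked by NoSpamNX: {block_rate:.1%}")
print(f"Not caught by NoSpamNX: {came_through_form:,}")
```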

In my particular case, I like contributing spam messages to Akismet since it improves their statistics, so I elected to have NoSpamNX simply mark messages as spam rather than deleting them (the deletion would occur before the messages get submitted to Akismet). Thus, my spam queue had lots of messages for me to review. I needed something more, something that would provide a second opinion to Akismet and NoSpamNX.

In my December 14th post, I mentioned that I was testing out a plugin called Conditional CAPTCHA. This one is particularly useful: it waits for messages to get reviewed by existing spam filters such as Akismet. If Akismet says the message is legitimate, Conditional CAPTCHA does nothing, and the message is posted immediately. However, if the message is flagged as spam, then Conditional CAPTCHA presents a reCAPTCHA. If the CAPTCHA is solved incorrectly or no attempt to solve it is made within 10 minutes, the message is silently deleted and not added to the spam queue. If the CAPTCHA is solved correctly, the message is then placed into the moderation queue (I’m a bit suspicious, as it was marked as spam, so I want to review it prior to it being posted).
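The decision flow described above is simple enough to write down as a small function. This is only a sketch of the behaviour as I've configured it, not the plugin's code. The names `Outcome`, `route_comment`, and `CAPTCHA_TIMEOUT_SECONDS` are made up for the example.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PUBLISH = "post immediately"
    MODERATE = "hold in the moderation queue"
    DISCARD = "silently delete"


# Ten-minute window for answering the CAPTCHA, as described above.
CAPTCHA_TIMEOUT_SECONDS = 10 * 60


def route_comment(
    flagged_as_spam: bool,
    captcha_solved: Optional[bool] = None,
    seconds_to_answer: Optional[float] = None,
) -> Outcome:
    """Decide what happens to a comment after the spam filter has seen it."""
    if not flagged_as_spam:
        # The filter is happy, so the commenter never sees a CAPTCHA.
        return Outcome.PUBLISH

    answered_in_time = (
        seconds_to_answer is not None
        and seconds_to_answer <= CAPTCHA_TIMEOUT_SECONDS
    )
    if captcha_solved and answered_in_time:
        # Solved, but still suspicious, so it waits for manual review.
        return Outcome.MODERATE

    # Wrong answer, or no answer within the window.
    return Outcome.DISCARD
```

Only the middle case ever reaches a human reviewer, which is why the spam queue stays nearly empty.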

Using Conditional CAPTCHA means that the vast majority of legitimate commenters are not inconvenienced by always facing a CAPTCHA. Only comments flagged as spam are presented with such a challenge.

So far, Conditional CAPTCHA has stopped 18,589 spams since it was installed, essentially 100% of the spam submitted to this site. There have been exactly four messages that were flagged as spam and resulted in the CAPTCHA being solved correctly. All of these have been spam, and never made it out of the moderation queue.

In my particular case, NoSpamNX is a bit redundant: I use it simply to keep a measure of how many spammers submit spam directly to the comments.php file versus how many submit comments using the human-readable form.

In conclusion, if you are a WordPress blogger and are inundated with spam, both on your site and in your spam queue, I heartily recommend using both Akismet (which you should already be using) and Conditional CAPTCHA. Doing so should reduce your spam to practically nothing.

If other bloggers out there have some statistics on the spam they receive, what they use to combat it, and how effective those measures are, I would be quite interested in hearing about it.

I Got Nothing

Sorry folks. Nothing much has been happening recently. I haven’t been to the range in months, haven’t taken new shooters out in a while longer, have been about a month behind the times when it comes to gun-related news, have fallen behind in reading other blogs, etc.

I’m alive (at least for now; I’m going to be skiing all next week), excited about having gotten into graduate school, and generally getting along fine.

As an aside, if you haven’t played the video games Mass Effect and Mass Effect 2, you’re missing out. I was a bit skeptical of a third-person shooter/RPG, but I was wrong. They’ve seriously been the most-bang-for-the-buck entertainment that I’ve had in years (since Star Wars: Knights of the Old Republic which, interestingly enough, is made by the same company as Mass Effect). Tons of replay value, too.

“Proprietary” as a positive selling point?

Why is it that companies use the word “proprietary” as a selling point?

For example, a big shipping company has a partnership with the US Postal Service to provide various package-management and expedited-delivery services[1] and, as part of the list of things they claim make their service better, they mention “proprietary software”.

Other companies mention proprietary formulas, methods, etc.

Do people usually think of this as a positive thing? Maybe it’s my background in science (essentially all discoveries go through peer review and are published for all to see) and being a bit of a free software geek, but I don’t see proprietary things as a good thing.

  1. Mostly by moving the package through their own network to the post office closest to the recipient, then handing it over to the post office for last-mile delivery. Why this is better than sending something entirely by UPS/FedEx or entirely through the post office, I don’t know.

“Enterprise-class”, my ass

The university has licensed a particular brand of anti-virus software for all students, faculty, staff, etc. The department I do IT work for (my day job) has a central console that allows administrators to monitor the status of the anti-virus software on all the computers on the network.

I know it well, as I was the one who set it up.

Unfortunately, it’s a piece of crap and is two major versions out of date (the university only got the newer versions a short while ago). It’s also not going to be supported soon, so we had to upgrade it.

Most end-user software seems to handle in-place updates really well. Mozilla Firefox, Windows, even Acrobat Reader update really easily. Certain other software, like Apache, MySQL, and other such things also update reasonably smoothly.

This anti-virus console is not one of those things.

I honestly couldn’t think of something that’s more of a pain in the ass to upgrade.

It turned out to be faster and easier to simply install the newer console on a different server, configure it by hand, and then manually re-install the client software on the 200 or so desktop systems (again, by hand) than it was to try to upgrade the existing console.

The new one’s quite a bit better than the old one, but there’s still no built-in “upgrade in-place” feature, so in a few years someone’s (hopefully I’ll be in grad school by then) going to have to upgrade to the next version. That’ll suck; a lot of the configuration is stored in some unknown way, and is not accessible through the GUI or the configuration files. If even the tiniest thing gets out of whack (which happens on occasion), diagnosing the problem (not to mention fixing it) is a massive pain in the ass.

Compare that to Windows Server Update Services: a simple Group Policy change on the clients[1] and the clients get all their Windows Updates from the WSUS server, which can manage which updates are to be deployed to clients. Quick, simple, and scalable, all through an intuitive GUI.

Say what you will about Microsoft, but they have enterprise-class management down pat. This anti-virus company, though…not so much…

  1. We don’t have an Active Directory, so we can’t push it from a central system, but have to do the changes by hand. There’s a lot of inertia and legacy systems here. Oh well.

Musings on RadioShack

I may be only 28, but I remember when RadioShack was a place of wonder and excitement in the pre-web days. Back then, cellphones had yet to be in widespread use, and one could buy any number of electronic components from employees who were also hobbyists and geeks.

Now, it’s a glorified mall cellphone kiosk with a few token items for hobbyists, but those are tucked away in the back, seemingly out of shame.

As a scientist and a tinkerer, I enjoy getting data on things that I’m working on. As an example, if I were building a solar array that would charge batteries, I’d want to know the present voltage on the batteries in the array (to determine state-of-charge), the current from the panels to the charge controller, and the current from the batteries to the load.

Going with this example, I was in RadioShack yesterday with a friend (she needed a new coin-cell battery for her calculator) and asked if they had panel voltmeters and ammeters (see here for an example) for such a system.

One of the employees thought about it, and said “No, I’m afraid we don’t carry those. Sorry.” Although not the answer I was looking for, he was honest and helpful, which I appreciate.

The other employee said, “Why do you want that? Why not just use one of the multimeters we have here?”, waving at the back of the store.

Me: “I already have three multimeters, and they all max out at 10 amps, and they can only support such currents for 30 seconds with a few minutes to cool down. I’d like something that can handle 20-50 amps indefinitely. Panel meters don’t require batteries, which is a major benefit. Also, I’d like something a bit more elegant to put into a display console.”

Employee: “Why not use one of the clamp-type multimeters we have to measure larger currents?”

Me: “The ones you have here only work on AC, not DC, which is what I’ll be working with.”

Employee: “Why not power your multimeters with a small solar cell or power them from the source and mount them in your console?”

Me, suspecting this conversation has started going downhill: “Because the multimeters are not rated for the currents I’ll need them for, a solar panel would provide intermittent power by not working at night [where knowing the state of charge is important], and the source voltage is very different from what the meter requires, as the meter runs on AA batteries. Panel meters are much more appropriate, and look quite a bit nicer.”

Employee: “Why would you need to deal with such currents at all? The biggest solar panel that RadioShack sells is a 5 watt panel that sits on your car dashboard that keeps your car battery topped off.”

Me: “I have no use for such a panel at all; my project would involve an array of big panels that would charge a battery bank that could power a small house. I’d like a permanently-wired, nice looking console that would have some meters in it so I could know, at a glance, the current state of the battery bank.”

Employee: [blank look]

Me: “Nevermind. Have a nice day.”

I have no problem with an expert (or even an enthusiastic amateur) discussing project requirements with me. Indeed, they may have a better idea of setting up such a system than I, which would be very helpful.

However, I rather dislike it when someone not only makes inappropriate suggestions, but argues about basic design goals (e.g. I want a nice-looking monitoring console, not a kludge of multimeters and wires running everywhere). Yes, I could put some shunts into the circuit and measure high currents safely with a standard multimeter; such a setup would be great for testing and bench work, but not for a final product.
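For the curious, the shunt approach is just Ohm's law: put a small, known resistance in series with the load, measure the tiny voltage drop across it, and divide. Here is a sketch of the arithmetic. The 50 A / 75 mV shunt rating is an assumption chosen for the example, not a part I've specified anywhere.

```python
def shunt_resistance(rated_current_a: float, rated_drop_v: float) -> float:
    """Resistance of a shunt from its rating (e.g. 50 A at 75 mV)."""
    return rated_drop_v / rated_current_a


def current_from_drop(measured_drop_v: float, resistance_ohm: float) -> float:
    """Ohm's law: I = V / R, using the voltage read across the shunt."""
    return measured_drop_v / resistance_ohm


def shunt_dissipation_w(current_a: float, resistance_ohm: float) -> float:
    """Heat the shunt must shed continuously: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


# Assumed example part: a shunt rated for 50 A with a 75 mV drop at full current.
r_shunt = shunt_resistance(rated_current_a=50.0, rated_drop_v=0.075)

# A multimeter on its millivolt range reads 30 mV across the shunt.
amps = current_from_drop(measured_drop_v=0.030, resistance_ohm=r_shunt)

print(f"Shunt resistance: {r_shunt * 1000:.2f} milliohms")
print(f"Current for a 30 mV reading: {amps:.1f} A")
print(f"Dissipation at 50 A: {shunt_dissipation_w(50.0, r_shunt):.2f} W")
```

The multimeter only ever sees millivolts, so its 10 amp limit never comes into play. The shunt itself carries the full current.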

RadioShack certainly isn’t what it used to be.

Fortunately, the internet allows me to order the meters I want for less than $10 each, and have them shipped to me from Thailand in less than a week. I also don’t need to interact with people like this RadioShack employee.

Frustration

Google evidently has two separate account namespaces:

  • Google Accounts
  • Google Apps accounts

Google Accounts can be, but are not necessarily, a Google Mail/Gmail account. One can have a Google Account without having a Gmail account (e.g. [email protected]) and can use such an account for accessing services like Google Reader, Google Docs, etc. I created such an account years ago for my personal email address.

Google Apps accounts are accounts associated with Google Apps, which are separate from regular Google Accounts. Google Apps provides email service for my personal domain.

Unfortunately, this means that both my Google Account and Google Apps account had the same username, which led to considerable confusion.

I’m just now trying to get this all straightened out by only using Google Apps for email and XMPP chat and migrating all my other services (like Google Reader, Google Voice, etc.) to a single Google Account. This is exceedingly frustrating.

Musings on Telephones

Sebastian’s treatise on the drawbacks of telephones struck a nerve with me; I too tend to be rather taciturn, and so prefer communications by email or IM (mostly email, as I like the fact that an immediate response is often not required, so one can think out one’s response a bit more).

However, when I do need to use the telephone, I prefer that it doesn’t suck. Cellphones are mediocre at best, what with the extensive voice compression and signal processing they utilize. Yes, they can be incredibly convenient[1], but the lower quality is a big tradeoff.

Fortunately, my work happens to have really nice Cisco IP phones that have outstanding call quality. It’s rather nice to be able to speak to someone and be mutually intelligible.

I’m neither an audiophile nor a luddite, but it rather annoys me to have audible communications go from “so clear you can hear a pin drop” to “can you hear me now?”[2] in just a few years. If it’s possible to have high-definition TV broadcast over the air, radio signals beamed in from space, and high-quality movies streamed over the internet, is it too much to ask that cellphones provide the same level of quality as landline phones?

I’d love to get a landline phone at home, but landline phone plans are absurdly over-priced. They still charge for long-distance service? What the hell? I can use Skype/Google Talk/SIP to call India and have a crystal-clear audio and video chat all day at no cost[3], yet wireline phones charge per-minute rates to call Phoenix from Tucson? Local phone service from Qwest is about $13/month, with no features (e.g. no caller-ID, no voicemail, etc.), but with the absurd amount of taxes and fees they tack on, it ends up being closer to $30/month. Completely not worth it. I wonder if the phone companies ever consider why they’re losing business to mobile devices?

  1. Though I must say it was quite a relief to be without cellphones for a few weeks whilst on my honeymoon.
  2. Yes, I know they’re marketing slogans and I’m taking the latter out of context. Deal with it.
  3. Or use a VoIP phone for mere pennies a minute.

On Changing Mail Servers

My personal, non-blog-related domain has used Google Apps for email for years. In essence, one gets all the benefits of Google Mail (excellent spam filtering, IMAP/POP/SMTP, huge amount of storage, reliable infrastructure, etc.), but for one’s own domain. Very handy.

One of the advantages of having one’s own domain is that one is not bound to a specific email provider; one can change the back-end provider relatively easily and with essentially no disruption. Over the last 11 years, my personal domain has had probably half a dozen providers handling email, with Google Apps providing service for about the last four years.

While I’ve been quite satisfied with Google Apps[1], I always like to check out alternatives at intervals, much like I do with car insurance.

Fortunately, Google makes moving away from their services extremely easy: it’s trivial to move mail to the new server by IMAP, and a few simple changes to my DNS records now direct mail to the new server. Everything was done with about 5 minutes of work.
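For anyone wondering what "moving mail by IMAP" amounts to, here is a rough sketch using Python's standard `imaplib`. It is not the exact procedure I followed (a mail client or a dedicated sync tool does the same job), and the host names and credentials in the usage comment are placeholders.

```python
import imaplib


def copy_folder(src: imaplib.IMAP4, dst: imaplib.IMAP4, folder: str = "INBOX") -> int:
    """Copy every message in `folder` from src to dst; return how many were copied.

    Both connections must already be logged in, and `folder` must already
    exist on the destination server.
    """
    status, _ = src.select(folder, readonly=True)  # read-only: leave the source untouched
    if status != "OK":
        raise RuntimeError(f"cannot open {folder!r} on the source server")

    status, data = src.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed on the source server")

    copied = 0
    for num in data[0].split():
        status, parts = src.fetch(num, "(RFC822)")
        if status != "OK":
            continue  # skip anything the server refuses to hand over
        raw_message = parts[0][1]  # the full message as bytes
        status, _ = dst.append(folder, None, None, raw_message)
        if status != "OK":
            raise RuntimeError(f"destination rejected message {num!r}")
        copied += 1
    return copied


# Usage sketch (placeholder hosts and credentials):
#   src = imaplib.IMAP4_SSL("imap.old-provider.example")
#   src.login("me@example.com", "old-password")
#   dst = imaplib.IMAP4_SSL("imap.new-provider.example")
#   dst.login("me@example.com", "new-password")
#   print(copy_folder(src, dst, "INBOX"))
```

This simple version does not carry over flags (read/unread, starred) or the original delivery dates, and folder names containing spaces would need quoting. Anything beyond a one-off copy is better left to a proper sync tool.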

There are two quirks with moving away from Google Mail, though.

The first is that Google Mail is primarily web-based, and offers IMAP/POP service as a feature, while the new service is primarily IMAP/POP with webmail as a feature, and so their webmail is pretty basic.

The second is that Google has excellent spam filtering, mostly based on the input of its brazillions of users marking messages as spam or not spam. The filtering takes place on the server side, which keeps spam levels in one’s inbox to a minimum regardless of whether one uses webmail or IMAP/POP. Marking messages as spam or not spam is trivial and totally in-band (click a button on the webmail interface, move the message to an IMAP folder if using a client).

The new provider offers some server-side filtering, but it’s nowhere near as good as Google’s, and using the server-side filtering requires identifying spam or non-spam via out-of-band methods (clicking a link in the email, which opens a browser window), which is a bit tedious. I can do better filtering on the client side, but that means that accessing my email with the webmail interface (which doesn’t have the filtering ability of my mail client) results in a massive amount of spam polluting the inbox.

Slightly frustrating, to say the least.

I’ll give this other provider a few more days to see if their spam filtering can adapt to deal with the onslaught, but for my purposes (mostly webmail, with occasional IMAP use), Google Apps’ service appears to be better. However, in the event that Google turns to the dark side, it’s good to know there are options.

  1. Although there are a few quirks when using IMAP due to the fact that Gmail uses “labels” instead of “folders”, they’re minor and easily adapted to.