I had an online conversation yesterday with an acquaintance of mine. She was alarmed to find that her entire site's content had been republished by some other site!

Apparently, that site had been sucking on her site's RSS feed for quite some time, and managed to download a sizeable chunk of data, which it subsequently republished with its own ads strewn about. And she's not the only one by a long shot.

If you're not aware of this phenomenon, it's generally referred to as "splogging", short for "spam blogging". The idea is usually to re-blog content from other people's blogs so that your splog site ranks for their popular terms.

For example, if I wanted my site to be a popular search result for "student loans", first I would install a blog on my server. I would then use some software to aggregate, say, the Technorati feed for posts tagged with "student loans", which gives me a rich bed of content to start populating my site. Using some dodgy plugins sold by less-than-respectable authors, I can even have WordPress do all of this work for me.

Then, I sprinkle a few links onto the splog that point to my money-making page, and voila! Instant PageRank!

The bottom line for bloggers is that your popular content will be stolen and used to fuel a link farm that profits someone else. How nice. So what do you do to combat it? I have a suggestion or two.

If you're inclined to modify your server's configuration a little bit, there are actually a few things that you can do that are much more efficient than what I'm about to suggest. Check out Val's rant for a list of those things. They usually involve modifying .htaccess, which might be available to you, but which is often tedious to keep on top of unless you're really vigilant (read: staring at your logs all the time).

An easy alternative to messing with your config files is using this new plugin I've written, called AntiLeech.

What does AntiLeech do? AntiLeech does not prevent the splogger bots from accessing your site. No, it does better than that. It produces a fake set of content especially for them that includes links back to your site (and mine, too, ok?) and sends it only to them. When they steal this content, it appears online just like normal, except now you've turned the tables on them. You're actually using the sploggers to promote your own site.

AntiLeech can detect a splogger bot using its User-Agent string (an identifier that some bots send when they are collecting data), or by IP address. You can enter a User-Agent or an IP address into the Options panel of your WordPress blog. When a visitor with a qualifying User-Agent or IP address (any entry that is checked on the options page) visits your site, they will see only the generated content. They will see it in your page layout and in your feeds. Anywhere you're normally outputting content, that's where the fake content will appear to them.

Regular users whose browsers do not match these strings will see your normal content. RSS aggregators should be able to display your content normally, too.
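
The matching described above can be sketched roughly as follows. This is a Python illustration only, not AntiLeech's actual code (the plugin itself runs inside WordPress). The function names, the substring matching rule, the sample User-Agent and the sample IP address are all assumptions made up for the example.

```python
def is_leech(user_agent, ip_address, blocked_agents, blocked_ips):
    """Return True if the visitor matches a blocked User-Agent or IP address."""
    ua = (user_agent or "").lower()
    # Assumption: a case-insensitive substring match against each blocked entry.
    if any(fragment.lower() in ua for fragment in blocked_agents if fragment):
        return True
    return ip_address in blocked_ips


def choose_content(real_post, fake_post, user_agent, ip_address,
                   blocked_agents, blocked_ips):
    """Serve the generated post to matched visitors, the real post to everyone else."""
    if is_leech(user_agent, ip_address, blocked_agents, blocked_ips):
        return fake_post
    return real_post


# Hypothetical block lists, as they might be entered on an options page.
BLOCKED_AGENTS = ["ExampleSplogBot"]
BLOCKED_IPS = {"203.0.113.7"}
```

An ordinary browser or feed reader falls through both checks and gets the real post, which is the behaviour described for regular users.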

AntiLeech also uses a trick to detect when new User-Agents have collected and displayed your posts. You may see a little "AntiLeech" graphic in your feed output. This graphic helps AntiLeech collect User-Agents that you might want to block. AntiLeech will tell you on what page it first saw the User-Agent, if it can, to help you better make the decision to block that User-Agent or not.

You can turn off this option if you don't want the image to appear in your feeds, but then AntiLeech won't be able to detect new User-Agents for you. The image is pretty small and unobtrusive, and doesn't link to anywhere.
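
The post doesn't spell out how the graphic does its detecting, so here is one plausible way a tracking image like this could work, sketched in Python. It is a guess at the general technique, not a description of AntiLeech's internals: the token scheme, the function names, and the example.com URLs are all hypothetical.

```python
import hashlib
from urllib.parse import urlparse

seen_agents = {}  # token -> User-Agent that fetched the feed
sightings = {}    # User-Agent -> first off-site page where the image was loaded


def image_url_for(fetcher_user_agent, base="https://example.com/antileech.png"):
    """Build an image URL tagged with a token that identifies the feed fetcher."""
    token = hashlib.sha1(fetcher_user_agent.encode("utf-8")).hexdigest()[:12]
    seen_agents[token] = fetcher_user_agent
    return f"{base}?t={token}"


def record_image_hit(token, referer, own_host="example.com"):
    """Remember the first foreign page on which a tagged image was displayed."""
    agent = seen_agents.get(token)
    if agent is None or not referer:
        return None  # unknown token, or the browser sent no Referer header
    if urlparse(referer).hostname == own_host:
        return None  # the image was shown on our own site; nothing to report
    sightings.setdefault(agent, referer)  # keep only the first sighting
    return agent
```

Under these assumptions, the feed handed to each fetcher carries a slightly different image URL, so when the image later loads from somebody else's page, the token says which User-Agent collected that copy and the Referer says where it ended up.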

In addition to all of that, AntiLeech will produce a robots.txt output from the User-Agents that you've specified in the options page, assuming you don't already have one. In WordPress 2.1 there is a hook for this already, but this feature of AntiLeech miraculously still works in WordPress 2.0, too!
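
Generating robots.txt from a list of User-Agents is simple enough to sketch. The Python below only illustrates the idea and is not the plugin's code; the function name and the bot names are made up. Keep in mind that robots.txt is advisory, so it only affects bots that choose to honor it.

```python
def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the whole site for each listed User-Agent."""
    lines = []
    for agent in blocked_agents:
        agent = agent.strip()
        if not agent:
            continue  # skip empty entries from the options list
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # a blank line separates records
    return "\n".join(lines)


# Hypothetical User-Agents taken from an options page.
robots_txt = build_robots_txt(["ExampleSplogBot", "AnotherLeech"])
```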

Of course, I haven't had AntiLeech in production very long, so I would like some feedback on your use of it, especially if you find it useful.

Let's get these sploggers!

Ok, now I must get food. Sorry if this reads a little light-headed.

Comments

Comment by Owen on .
valerie: Well, I would recommend to anyone who knows how to use their .htaccess and understands what's going on to do that instead, primarily because it's more efficient - it doesn't hit your database or run parts of WordPress at all, and so the sploggers aren't stealing your CPU time either. But if you aren't into maintaining all of that or don't understand it or are scared to mess with it, then go ahead and install the plugin. It doesn't do exactly what I wrote for you the other day (no Bitacle images) primarily because it's not targeted only at Bitacle, but at any splogger.
Comment by Owen on .
Angsuman: There are three points to be made about your "fundamental problem". Number 1: Bitacle (the site that spawned this issue) does use a User-Agent. Don't ask me why. Number 2: You can also use the plugin to block by IP, which is harder to track down, but the plugin still makes the blocking easier. Number 3: There may be sites or User-Agents that you specifically know of that you want to receive a different version (say, truncated) of your content. This plugin will soon allow you to do just that. Also, I'm going with Fake Rake on the Google ranking issue - Google shouldn't penalize you when some splogger (over whose site you have no control) links to you. This is the sentiment I've received when making this request on Google's search ranking webmaster group.
Jessica: Yes, I do believe Bitacle is a certain kind of splog. They steal content for their own financial gain. If they're not technically a "splog" then they deserve at least as much ire, if not more for being so blatant about it.
Comment by Owen on .
The latest version of the plugin (currently 1.7) allows you to set your own fake content directly from the admin page. Here are the built-in messages, which are rotated through your feeds so that 10 unique fake post contents are always available:
* Is this site attempting to steal $user's content? It seems like it.
* $user never authorized this site to copy this site content, but it looks like they did it anyway.
* The copyright on $user's work doesn't extend to the owner of this site.
* If you want a copy of $user's content, you really ought to ask first.
* If you see advertising on this page, $user - the author of this content - probably isn't seeing any of that revenue.
* Did this site obtain $user's permission to re-publish this content? No!
* Here's just another instance of someone trying to make a buck by taking $user's content without permission.
* Where is justice in the world when this site steals $user's original, copyrighted works?
* Aren't you sick of this site stealing $user's content for their own use?
* But wait, who does this content really belong to? Not this site, but $user, from whom they stole it!
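As an illustration only (this is not the plugin's code), filling in the $user placeholder and cycling through the messages could look something like the Python below. The comment doesn't say how the rotation is actually done, so the offset-and-modulo scheme is an assumption, as are the function name and the shortened message list.

```python
from string import Template

# Three of the built-in messages quoted above; the full list has ten.
MESSAGES = [
    "Is this site attempting to steal $user's content? It seems like it.",
    "If you want a copy of $user's content, you really ought to ask first.",
    "Did this site obtain $user's permission to re-publish this content? No!",
]


def fake_posts(user, count, offset=0):
    """Return `count` fake post bodies, cycling through MESSAGES starting at `offset`."""
    return [
        Template(MESSAGES[(offset + i) % len(MESSAGES)]).substitute(user=user)
        for i in range(count)
    ]


posts = fake_posts("Owen", 4, offset=1)
```

Changing `offset` from one request to the next would be one simple way to make the feed items appear to rotate.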
Comment by Owen on .
Xial: That's just the one that was listed on the Splogging entry on Answers. I can't vouch for their efficacy. I suppose they've got the same problems as the open relay databases.
Patrick: I had considered an export option, but got lazy. Also, remember that the point of the splogger is not always to advertise on the site that has a copy of your content, but to use your content to boost the juice on the other sites that the splog links to.
Jane: I could probably make the plugin do that. You're certainly a lot more gracious than I would be. I will sleep at some point, though, so gimme a little more development time. ;)
Montoya: I think that the Ordered List Feedburner plugin would interfere, since it would redirect everything that's not Feedburner to a Feedburner URL. However, having looked at the code for that plugin, there is no freaking way I would ever run that on my server. From what I can tell, I can reset your site's Feedburner redirect to whatever I want with about 2 minutes of from-scratch coding. In fact, if you want me to write code into this plugin to use it with Feedburner that does what the Ordered List plugin does, I'll make that a priority for the next release. This is a clear example of the need for plugin peer-review. Yikes.
Comment by Ben on .
Thanks for the plugin. It's really good :) I'd be interested to know what the issue is with the Ordered List Feedburner plugin, and would also be interested in something similar being added to AntiLeech so that I can ditch that plugin entirely. I get the impression that the OL Feedburner plugin is widely used, so it may be worth informing the author of the problem, since I doubt he is aware of it.
Comment by Turkay on .
Dear Sir, my question is this: a publisher sets up his own RSS feed himself, and he decides how much of each post can be read via RSS. If I use this feed and publish the content exactly as it was received via RSS (without any changes), and if I add at the end of the content where it was taken from, with a back link, is that illegal? Regards, Turkay
Comment by Owen on .
Turkay: Yes, under most circumstances, that is illegal. Unless the author of the RSS content has granted you permission to re-publish that content, you are violating his copyright on that content. Consider that you would not buy or borrow a book from the library and then produce your own copies of that book for distribution. Regardless of the means of delivery, or the cost of that content, copyright is attributed only to the authors of the content. If you are not granted the right of publication by the copyright holder, you would break the law in publishing it. If you're looking for a less-restrictive arrangement, you might consider publishing content you create using a Creative Commons license, with which you still retain the copyright, but have granted certain republication permissions by default.
Comment by Andy Beard on .
Another good alternative is http://www.anticrawl.com/ which will work on all websites, and only allow Google, MSN and Yahoo in. I haven't looked at the code for a while, but last time I did, it was blocking Alexa, because of the Wayback Machine. Archive.org is looked on by some as the biggest splog site on the internet, though I personally find it fascinating.
I try to be realistic about my content. If you are using a blogging platform for publishing your content, you more than likely are quite happy for people to quote you with a link back to your site, and a trackback. If you publish an RSS feed and ping it out to aggregators, the content is going to get used, and you are effectively asking for it to be used (with reasonable attribution).
Many authority sites are just rehashed RSS feeds from AP, plus a mixture of syndicated articles. Blogs that feature a mixture of 3rd party articles and headlines from RSS feeds are looked on as splogs. The little guys (quite often) are victimised as being sploggers because they can't do it as well as the international press. Having your content picked up on PR5 and PR6 "splogs" doesn't hurt your own search engine ratings any more than having the content appearing on a service like Technorati.
Comment by Owen on .
Andy: That Anticrawl site looks like something right out of the MLM sales handbook! The latest version (1.5) of the plugin now supports FeedBurner redirection (like the Ordered List plugin, but a tad more secure), a choice of output formats (generated posts, truncated original posts, or a custom block of text), and a couple more toggle options for output.
Comment by Plain Jane mom on .
That looks really interesting! I have one question, and it will demonstrate my ignorance of many things. I don't think I want them linking to me. I have no doubt that eventually Google and others will catch on to Bitacle, and at that point I don't want to be penalized by being linked to by them. "Bad neighborhood" and such. Does that make sense, or is that irrelevant in this case? Jane
Comment by Owen on .
There is already a link to the original post in most splogs, so are we all getting dinged by Google just by having our sites stolen? For that matter, are all of the sites that we would normally link to in our stolen posts getting dinged? In any case, if you fear it, I've updated the plugin with an option to disable the links, so that it only outputs the plain text of the generated post with all of the tags stripped from it.
Comment by Patrick Havens on .
Interesting. I had heard mention of this many times, but I hadn't seen it done so blatantly. Usually I expect more indication of where it came from... and there isn't any. Mind, I can't see anyone stealing the stuff I post... But I'm making note of the plugin anyways. Might I suggest an easy way to export and import lists of troublesome bots?
Comment by Patrick Havens on .
As an aside, I went and did some searches to see if I could find any "leech-like" sites, and though Topix had a number... they were more clearly an aggregator... but I also ran across some weird ones:
No ads on the site, but no link to my site either... http://www.discover-about.com/2006/06/22/interesting-links/
The others were like run-together short quotes of me... they show up in my trackbacks, but they have very little to nil content... and it even looks like they remove returns. Was weird...
http://mariesshabalfas.com/mortgage_news/college-loan-refinance (No ads)
http://cash-for-structured-settlement.cashblog.org/9598/ (Does it have ads?)
Comment by Andy Beard on .
>> Andy: That Anticrawl site looks like something right out of the MLM sales handbook! Robert does sell other scripts, but not very often and doesn't bombard you with emails about other products. It is mainly used to script update notification. Anticrawl protects your content with a captcha which appears after viewing a number of pages, unless you are Google Yahoo or MSN. He is certainly not MLM, but many of his products are intended for use by people making a living on the internet.
Comment by Angsuman Chakraborty on .
There is a fundamental problem with this idea. User Agents can be easily faked. It is trivial for any splogger to use, say, an IE or Firefox user agent. Also, as others have pointed out before, getting links from splog sites doesn't sit well with search engines like Google. They consider where you are getting the links from, your neighborhood in essence.
Comment by Fake Rake on .
Google has said that who links to you doesn't have any effect; if it did, anyone could punish you just by linking to you. Now, your site linking to "bad" sites, that's a whole other issue. But like people have said, splogs usually keep a link in to the original site they stole the content from, so you already have links from them. Google isn't going to penalize you for that.
