I had an online conversation yesterday with an acquaintance of mine. She was alarmed to have found that her entire site's content had been republished by some other site!
Apparently, their site had been sucking on her site's RSS file for quite some time, and managed to download a sizeable chunk of data, which they subsequently republished with their own ads strewn about. And she's not the only one by a long shot.
If you're not aware of this phenomenon, it's generally referred to as "Splogging", for "spam blogging". The idea is usually to re-blog content from other people's blogs so that your splog site picks up search weight for their popular terms.
For example, if I wanted my site to be a popular search result for "student loans", first I would install a blog on my server. I would then use some software to aggregate, say, the Technorati feed for posts tagged with "student loans", which gives me a rich bed of content to start populating my site. Using some dodgy plugins sold by less-than-respectable authors, I can even have WordPress do all of this work for me.
Then, I sprinkle a few links onto the splog that point to my money-making page, and voila! Instant PageRank!
The bottom line for bloggers is that your popular content will be stolen and used to fuel a link farm that profits someone else. How nice. So what do you do to combat it? I have a suggestion or two.
If you're inclined to modify your server's configuration a little bit, there are actually a few things that you can do that are much more efficient than what I'm about to suggest. Check out Val's rant for a list of those things. It usually involves modifying .htaccess, which might be available to you, and is often tedious to keep on top of unless you're really vigilant (read: staring at your logs all the time).
An easy alternative to messing with your config files is using this new plugin I've written, called AntiLeech.
What does AntiLeech do? AntiLeech does not prevent the splogger bots from accessing your site. No, it does better than that. It produces a fake set of content especially for them that includes links back to your site (and mine, too, ok?) and sends it only to them. When they steal this content, it appears online just like normal, except now you've turned the tables on them. You're actually using the sploggers to promote your own site.
AntiLeech can detect a splogger bot using its User-Agent string (an identifier that some bots send when they are collecting data), or by IP address. You can enter a User-Agent or an IP address into the Options panel of your WordPress blog. When a visitor whose User-Agent or IP address matches an entry that is checked on the options page visits your site, they will see only the generated content. They will see it in your page layout and in your feeds. Anywhere you're normally outputting content, that's where the fake content will appear to them.
Regular users whose browsers do not match these strings will see your normal content. RSS aggregators should be able to display your content normally, too.
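To make the matching step concrete, here is a minimal sketch in Python. It is not the plugin's own code (AntiLeech is a WordPress plugin), and the names `BlockList` and `choose_content`, the case-insensitive substring match, and the sample User-Agent string are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlockList:
    """Hypothetical stand-in for the entries checked on the options page."""
    user_agents: set = field(default_factory=set)   # User-Agent fragments
    ip_addresses: set = field(default_factory=set)  # exact IP addresses

    def matches(self, user_agent: str, ip: str) -> bool:
        # An exact IP match is enough to flag the visitor.
        if ip in self.ip_addresses:
            return True
        # Assumption: a blocked entry matches if it appears anywhere in the
        # visitor's User-Agent, ignoring case. Empty entries are skipped so
        # that they cannot match every visitor.
        ua = user_agent.lower()
        return any(b and b.lower() in ua for b in self.user_agents)


def choose_content(blocklist: BlockList, user_agent: str, ip: str,
                   real: str, fake: str) -> str:
    """Return the fake content for flagged visitors, the real content otherwise."""
    return fake if blocklist.matches(user_agent, ip) else real
```

With a block list containing the fragment "Bitacle", a request whose User-Agent includes that word would get the generated content, while an ordinary browser would get the real post.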
AntiLeech also uses a trick to detect when new User-Agents have collected and displayed your posts. You may see a little "AntiLeech" graphic in your feed output. This graphic helps AntiLeech collect User-Agents that you might want to block. AntiLeech will tell you on what page it first saw the User-Agent, if it can, to help you better make the decision to block that User-Agent or not.
You can turn off this option if you don't want the image to appear in your feeds, but then AntiLeech won't be able to detect new User-Agents for you. The image is pretty small and unobtrusive, and doesn't link to anywhere.
In addition to all of that, AntiLeech will produce a robots.txt output from the User-Agents that you've specified in the options page, assuming you don't already have one. In WordPress 2.1 there is a hook for this already, but this feature of AntiLeech miraculously still works in WordPress 2.0, too!
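As a rough illustration of the robots.txt output, here is a sketch in Python of how a list of User-Agents could be turned into robots.txt rules. The function name `build_robots_txt` and the choice to disallow the whole site for each listed agent are assumptions, not a description of what the plugin actually emits.

```python
def build_robots_txt(user_agents):
    """Build robots.txt text that asks each listed User-Agent to stay out.

    Assumption: every listed agent is disallowed from the entire site.
    """
    # Trim whitespace, drop empty entries and duplicates, sort for stable output.
    cleaned = sorted({ua.strip() for ua in user_agents if ua.strip()})
    records = [f"User-agent: {ua}\nDisallow: /\n" for ua in cleaned]
    # Records in robots.txt are separated by a blank line.
    return "\n".join(records)


if __name__ == "__main__":
    print(build_robots_txt(["Bitacle"]))
```

Keep in mind that robots.txt is only a request. A bot that ignores it will not be stopped by it, which is why the User-Agent and IP checks still matter.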
Of course, I haven't had AntiLeech in production very long, so I would like some feedback on your use of it, especially if you find it useful.
Let's get these sploggers!
Ok, now I must get food. Sorry if this reads a little light-headed.
Doesn't Google ding sites that are spamvertised? It seems like this would be like shooting yourself in the foot by getting all these spammy links coming back to you.
Yeah, i think i agree with Matt. It might take a while but Google will most probably think you're a spammer too.
wow thanks owen, I'll have to try your script out, and I'll spread the word! Thanks!
That looks really interesting! I have one question, and it will demonstrate my ignorance of many things. I don't think I want them linking to me. I have no doubt that eventually google and others will catch on to bitacle, and at that point I don't want to be penalized by being linked to by them. "Bad neighborhood" and such.
Does that make sense, or is that irrelevant in this case?
Jane
There is already a link to the original post in most splogs, so are we all getting dinged by Google just by having our sites stolen? For that matter, are all of the sites that we would normally link to in our stolen posts getting dinged?
In any case, if you fear it, I've updated the plugin with an option to disable the links, so that it only outputs the plain text of the generated post with all of the tags stripped from it.
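For anyone wondering what "tags stripped" means in practice, here is a small sketch in Python of the general idea. It is only an illustration under my own assumptions (the names `strip_tags` and `_TextOnly` are made up here), not the code in the plugin.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the plain text of an HTML fragment, with links and other tags removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.parts)
```

A generated post such as `Read <a href="http://example.com/">my post</a> now.` would come out as plain text with no link for the splog to republish.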
is there anyway this could work akin to akismet, in that a centralized service could keep track of the splogger list for us?
You might try Splog Spot, which publishes a list of sploggers. It will take some bit of coding to incorporate their API into the detection scheme.
thanks owen,
I have a problem with the plugin :(
in the settings page there's a fatal error:
"Call to undefined function: wp_nonce_field() in /home/gidibao/public_html/wp-content/plugins/antileech.php on line 296"
Can you help me? thx
Have a nice week-end :)
Yes, I can help you! Upgrade WordPress to 2.0.4.
Alternatively, you could avoid this plugin and also expose your site to direct attack via vulnerabilities in the prior versions.
It's really a good idea to upgrade. :)
Thanks Sir!
I will upgrade to 2.0.4 :oops:
Regarding your comment on using SplogSpot, I'd avoid them like the plague, since they're easily poisoned and are mistaking a weblog that has had original content as far back as 2001 for a splog.
That is, they have me listed, and won't remove that listing.
Interesting. I had heard mention of this many times, but I hadn't seen it done so blatantly. Usually I expect more determination where it came from... and there isn't any.
Mind, the stuff I post I can't see stealing... But I'm making note of the plugin anyways.
Might I suggest an easy way to export, and import lists of troublesome bots?
As an aside, I went and did some searches to see if I could find any "leech-like" sites, and though Topix had a number... they were more clearly an aggregator... but I also ran across some weird ones:
No ads on the site, but no link to my site either...
http://www.discover-about.com/2006/06/22/interesting-links/
The others were like run-together short quotes of me... they show up in my trackbacks, but they have very little to nil content... and it even looks like they remove returns. Was weird...
http://mariesshabalfas.com/mortgage_news/college-loan-refinance (No ADs)
http://cash-for-structured-settlement.cashblog.org/9598/ (Does it have Ads?)
Owen,
How about making it an option to just send a short summary to certain sites, like bitacle? I wouldn't care at all if they wanted to republish a link-free, image-free version of the first 200 chars or so of my posts.
Jane
This doesn't sound like it would work with Feedburner and the OL_Feedburner_Redirect plugin. Would it interfere?
Xial: That's just the one that was listed on the Splogging entry on Answers. I can't vouch for their efficacy. I suppose they've got the same problems as the open relay databases.
Patrick: I had considered an export option, but got lazy. Also, remember that the point of the splogger is not always to advertise on the site that has a copy of your content, but to use your content to boost the juice on the other sites that the splog links to.
Jane: I could probably make the plugin do that. You're certainly a lot more gracious than I would be. I will sleep at some point, though, so gimme a little more development time. ;)
Montoya: I think that the Ordered List Feedburner plugin would interfere, since it would redirect everything that's not Feedburner to a Feedburner URL. However, having looked at the code for that plugin, there is no freaking way I would ever run that on my server. From what I can tell, I can reset your site's feedburner redirect to whatever I want with about 2 minutes of from-scratch coding.
In fact, if you want me to write code into this plugin to use it with Feedburner that does what the Ordered List plugin does, I'll make that a priority for the next release. This is a clear example of the need for plugin peer-review. Yikes.
No, Montoya, it works with Feedburner.
Thanks for the plugin, Owen!
Deanna: Just to be clear, it works fine with Feedburner itself, but may not work fine with the Ordered List plugin that I linked (and Montoya mentioned) above.
Ringmaster++
There is a fundamental problem with this idea. User Agents can be easily faked. It is trivial for any splogger to use, say, an IE or Firefox user agent.
Also, as others have pointed out before, getting links from splog sites doesn't sit well with search engines like Google. They consider where you are getting the links from, your neighborhood in essence.
Google has said that who links to you doesn't have any effect; if it did, anyone could punish you just by linking to you. Now, your site linking to "bad" sites, that's a whole other issue. But like people have said, splogs usually keep a link in to the original site they stole the content from, so you already have links from them. Google isn't going to penalize you for that.
Is Bitacle a splog? They have all my content over there. Sorry if this question seems like I should know the answer but I don't.
Angsuman: There are three points to be made about your "fundamental problem". Number 1: Bitacle (the site that spawned this issue) does use a User-Agent. Don't ask me why. Number 2: You can also use the plugin to block by IP, which is harder to track down, but the plugin still makes the blocking easier. Number 3: There may be sites or User-Agents that you specifically know of that you want to receive a different version (say, truncated) of your content. This plugin will soon allow you to do just that.
Also, I'm going with Fake Rake on the Google ranking issue - Google shouldn't penalize you when some splogger (over whose site you have no responsibility) links to you. This is the sentiment I've received when making this request on Google's search ranking webmaster group.
Jessica: Yes, I do believe Bitacle is a certain kind of splog. They steal content for their own financial gain. If they're not technically a "splog" then they deserve at least as much ire, if not more for being so blatant about it.
Owen. You. Are. Awesome. :)
So, me guesses I un-do what we did the other day and install this instead? ;-) ;-)
Thanks Owen. I was really hoping they were somewhat legit. They have all my content over there. I will tryout your plugin tomorrow. Thank you :)
valerie: Well, I would recommend to anyone who knows how to use their .htaccess and understand what's going on to do that instead, primarily because it's more efficient - it doesn't hit your database or run parts of WordPress at all, and so the sploggers aren't stealing your CPU time either.
But if you aren't into maintaining all of that or don't understand it or are scared to mess with it, then go ahead and install the plugin. It doesn't do exactly what I wrote for you the other day (no Bitacle images) primarily because it's not targeted only at Bitacle, but at any splogger.
[...] UPDATE: Check out Owen’s post on this, and if you’re a WordPress user, download his new plugin that will do the same thing I mentioned above for you! [...]
Thanks for the plugin Owen, I really appreciate it. Thanks to you and Valerie, I can now feel a bit relaxed about this whole drama. I hope google cancels their adsense account.
You might want to consider renaming this plugin so as not to be confused with these people who have a dubious range of products around "protecting your content".
"Anti" and "Leech" both had meaning before that site used them as their name. Similarly, I think there's a spiced ham with an interesting name.
In case you really were worried about the plugin's name - no, this plugin is not affiliated with that site.
Thanks for the plugin. It's really good :)
I'd be interested to know what the issue is with the ordered list feedburner plugin, and would also be interested in something similar added to antileech so that I can ditch that plugin entirely. I get the impression that OL feedburner plugin is widely used so it may be worth informing the author of the problem since I doubt he is aware of it.
Dear Sir ,
My Question is ;
A publisher sets up his own RSS feed himself, and he decides how much of the content can be read via RSS.
If I use this feed, and I publish the content exactly as it was received via RSS (without any changes), and I add at the end of the content where it was taken from, with a back link... is that illegal?
Regards,
Turkay
Turkay: Yes, under most circumstances, that is illegal.
Unless the author of the RSS content has granted you permission to re-publish that content, then you are violating his copyright on that content.
Consider that you would not buy or borrow a book from the library and then produce your own copies of that book for distribution. Regardless of the means of delivery, or the cost of that content, copyright is attributed only to the authors of the content. If you are not granted the right of publication by the copyright holder, you would break the law in publishing it.
If you're looking for a less-restrictive law, you might consider publishing content you create using a Creative Commons license, with which you still retain the copyright, but have granted certain republication permissions by default.
Another good alternative is http://www.anticrawl.com/ which will work on all websites, and only allow Google, MSN and Yahoo in.
I haven't looked at the code for a while, but last time I did, it was blocking Alexa, because of the Wayback Machine.
Archive.org is looked on by some as the biggest splog site on the internet, though I personally find it fascinating.
I try to be realistic about my content. If you are using a blogging platform for publishing your content, you more than likely are quite happy for people to quote you with a link back to your site, and a trackback.
If you publish an RSS feed and ping it out to aggregators, the content is going to get used, and you are effectively asking for it to be used (with reasonable attribution).
Many authority sites are just rehashed RSS feeds from AP, plus a mixture of syndicated articles.
Blogs that feature a mixture of 3rd party articles and headlines from RSS feeds are looked on as splogs.
The little guys (quite often) are victimised as being sploggers because they can't do it as well as the international press.
Having your content picked up on PR5 and PR6 "splogs" doesn't hurt your own search engine ratings any more than having the content appearing on a service like Technorati.
[...] If you have been reading the dashboard or Planet Wordpress you have probably noticed Help Defeat the Sploggers with AntiLeech. The problem? Bitacle, supposedly a “RSS reader” has completely ripped off all the content from the Spoken for.org. Then republishing it as their own content, links (except for the title) all lead to other place now to the Bitacle site. Without permission or authorization. Adding ads to the posts and changing the author. There are many blogs and sites that have been copied, including my site blog.fileville.net! Though the content Bitacle has copied only posts since a few months ago with The Vista Experiment, my content which I do not want being copied was and now ad filled just like other blogs. Bitacle’s home also has appeared to have been copied from a site called Netvibes.com. [...]
Andy: That Anticrawl site looks like something right out of the MLM sales handbook!
The latest version (1.5) of the plugin now supports FeedBurner redirection (like the Ordered List plugin, but a tad more secure), a choice of output formats (generated posts, truncated original posts, or a custom block of text), and a couple more toggle options for output.
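The three output formats can be pictured as a simple switch. The sketch below is in Python and is only an illustration: the mode names, the function `render_for_leech`, and the 200-character limit (borrowed from Jane's suggestion above) are assumptions, not the plugin's actual settings.

```python
def render_for_leech(mode: str, original: str, generated: str,
                     custom: str, limit: int = 200) -> str:
    """Pick what a flagged visitor receives, based on the chosen output format."""
    if mode == "generated":
        # A generated fake post replaces the real one entirely.
        return generated
    if mode == "truncated":
        # Only the start of the original post is sent.
        if len(original) <= limit:
            return original
        return original[:limit].rstrip() + "..."
    if mode == "custom":
        # A fixed block of text chosen by the site owner.
        return custom
    raise ValueError(f"unknown output mode: {mode!r}")
```

Raising an error on an unknown mode is a design choice for the sketch. It avoids silently falling back to the full original post, which is the one thing a flagged visitor should never receive.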
Your updated feedburner whatsit is great. This fulfils a request I had from the original plugin as well (the ability to track any feed) thanks a lot!
Thanks for the great plugin. I did have the Feedburner plugin installed, but I deactivated it and added yours. In my opinion, the redirect works great. I have yet to detect any sploggers, but I have noticed in my stats that some ip addresses access way too many pages. Thanks for the relative peace of mind.
>> Andy: That Anticrawl site looks like something right out of the MLM sales handbook!
Robert does sell other scripts, but not very often, and he doesn't bombard you with emails about other products. The mailing list is mainly used for script update notifications.
Anticrawl protects your content with a captcha which appears after viewing a number of pages, unless you are Google Yahoo or MSN.
He is certainly not MLM, but many of his products are intended for use by people making a living on the internet.
[...] Me? I support full feeds. I think there are better solutions to the problems. Feed advertising can solve problems for some who think full feeds is anti-revenue. There are multiple solutions like Feedburner or Pheedo. Splogging can be dealt with approaches like AntiLeech or reporting them or this guide from Lorelle against content theft. [...]
[...] Help Defeat the Sploggers with AntiLeech by Owen Winkler [...]
[...] With the recent release of AntiLeech, an anti splog plugin for WordPress by Owen Winkler, there finally exists a real method of fighting back against content scraping thieves. AntiLeech is a plugin for WordPress that attempts to serve up fake content to known splogs. The plugin identifies splogs by either their User-Agents or IP address (user supplied). From the plugin page: What does AntiLeech do? AntiLeech does not prevent the splogger bots from accessing your site. No, it does better than that. It produces a fake set of content especially for them that includes links back to your site (and mine, too, ok?) and sends it only to them. [...]
Owen, is there any way to find out what fake content looks like or says? I wouldn't mind customizing it.
Thanks,
Roman
The latest version of the plugin (currently 1.7) allows you to set your own fake content directly from the admin page.
Here are the built-in messages, which are rotated through your feeds so that 10 unique fake post contents are always available:
* Is this site attempting to steal $user's content? It seems like it.
* $user never authorized this site to copy this site content, but it looks like they did it anyway.
* The copyright on $user's work doesn't extend to the owner of this site.
* If you want a copy of $user's content, you really ought to ask first.
* If you see advertising on this page, $user - the author of this content - probably isn't seeing any of that revenue.
* Did this site obtain $user's permission to re-publish this content? No!
* Here's just another instance of someone trying to make a buck by taking $user's content without permission.
* Where is justice in the world when this site steals $user's original, copyrighted works?
* Aren't you sick of this site stealing $user's content for their own use?
* But wait, who does this content really belong to? Not this site, but $user, from whom they stole it!
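Here is a rough sketch, in Python, of how messages like these could be filled in and rotated. It is not the plugin's code: the function name `fake_post`, the use of the post's position to pick a message, and the modulo rotation are assumptions, and only three of the ten messages are included to keep it short.

```python
from string import Template

# A subset of the built-in messages listed above.
MESSAGES = [
    Template("Is this site attempting to steal $user's content? It seems like it."),
    Template("If you want a copy of $user's content, you really ought to ask first."),
    Template("Did this site obtain $user's permission to re-publish this content? No!"),
]


def fake_post(position: int, user: str) -> str:
    """Return the fake message for the post at the given position in the feed.

    Assumption: messages are assigned by position, wrapping around when the
    feed has more posts than there are messages.
    """
    template = MESSAGES[position % len(MESSAGES)]
    return template.substitute(user=user)


if __name__ == "__main__":
    for i in range(4):
        print(fake_post(i, "Jane"))
```

With the full list of ten messages, a feed of up to ten posts would show a different message for each post under this scheme.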