Fighting Wiki SPAM

posted 09:57AM Jan 07, 2008 with tags google softwaredevelopment spam transformers wiki by Lars Trieloff

Social Software is software that gets spammed. This applies first and foremost to e-mail, but Wikis and Blogs are also preferred targets of spammers. The following rules should act as a guideline for everyone who designs Wiki software, evaluates Wiki software or needs to configure a Wiki that is under attack by spammers.
  1. Understand the way spammers think and work: The main goal of most wiki spammers is to create link spam that will lead search engine crawlers and algorithms, especially Google's, into giving their or their customers' websites a higher rank for certain keywords. To achieve this goal, they try to create keyword-specific links wherever possible - and this means in your Wiki. To create a large number of links in a short time, they write small programs that know how your Wiki software works and send the correct requests to create new pages or new page revisions. As in the movie "Transformers", your wiki has become a playing field of robot wars: on the one side are the spam-bots, on the other side the Googlebot. To become more familiar with the way Wiki and Blog spammers think, I recommend The Register's "Interview with a link spammer".
  2. Do not be an attractive target: The best way of preventing Wiki spam is not being a target of Wiki spam. Spammers find vulnerable Wikis by searching for pages that have already been spammed by somebody else. A page that is spammed and can be found via a Google search is both vulnerable and attractive, because the spammer knows that Google will see their spam. To avoid being an attractive target, it is important to remove all existing SPAM from the Wiki and to make sure that SPAM is not picked up by Google and other search engines. A mechanism that has been proposed to achieve this goal (and that has been found to be effective) is using the rel="nofollow" attribute in all links that could lead to SPAM. Some wiki software applies this to all outgoing links, some only to outgoing links that are not on a white list of allowed pages, and some only to outgoing links on newly edited pages. The most important rule, however, is: Exclude all archived versions of wiki pages from being indexed. If your archived pages are indexed, the spam will be picked up by the search engines no matter how fast you revert the changes. A good technique to achieve this goal is using the <meta name="robots" content="noindex,nofollow"> tag in the head of all history or archive pages. To learn more about excluding pages from being indexed, take a look at The Web Robots Page and Google's Webmaster Central Blog on using the robots meta tag.
  3. Use your community to fight spam: What is SPAM and what is legitimate content? As good as robots might be at creating SPAM, humans beat them by orders of magnitude at detecting SPAM. As your community profits most from your Wiki, you should invite the community to join your spam fighting efforts. This means regularly observing the "Recent Changes" page, skimming through changes and change descriptions (SPAM robots seldom use change descriptions that fit the usage patterns of your wiki), and reverting spammed pages to a clean revision. By selecting Wiki software that has a "revert" or "rollback to last revision" feature, you are giving your users a powerful weapon in the fight against robots, because spotting the SPAM and clicking the revert link takes a human less time than creating the SPAM takes most robots. If wiki spam is a major nuisance for you, you should engage in the Chongqed community, which is devoted to fighting SPAM in Wikis and retaliating against spammers (although I doubt that retaliation is worth the effort). If you do not have a community that can help you fight SPAM, you should probably disable editing in the Wiki or shut it down completely. Without a community, you will lose interest sooner or later as well, but spammers will continue to find your Wiki an attractive target.
  4. Ban content, not users: Lots of spam fighting techniques involve some way of banning certain requests, based on user agent, time of day, frequency of access, IP address range, etc. Other techniques require registration or use CAPTCHAs. All these techniques have a number of disadvantages. First, they create false positives, e.g. blocking legitimate edits that just happen to use the wrong user agent, time of day or IP address range. Second, some of them, like CAPTCHAs and required registration, raise the barrier to contribution, so many users will not even try to contribute to your Wiki. Finally, they can easily be circumvented by a clever spammer. IP address based blocks in particular can be circumvented by using open proxies, dynamic IP addresses or botnets. The only thing that spammers cannot disguise is their intent to create links with specific targets and keywords in your Wiki. The most effective techniques are therefore based on banning content. This means banning URLs based on regular expression patterns (you do not have to build a database of these patterns yourself, there is an excellent one available at http://blacklist.chongqed.org/), banning text in the Wiki based on regular expression patterns, e.g. for keywords (this will be more difficult if your wiki is devoted to gambling or erectile dysfunction medication), or even limiting the number of URLs posted in one editing step or the URL-to-other-content ratio in the post.
  5. Stay up to date: Staying up to date means keeping up to date with the version of your Wiki software, which might not only fix bugs and add interesting new features, but also introduce new mechanisms to fight SPAM. And staying up to date means keeping up to date with new techniques used by spammers and ways to fight them. Good resources are the C2 Wiki (THE original Wiki) and the Chongqed Wiki.
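
To make rule 4 more concrete, here is a minimal sketch of a content-based check that a Wiki could run before saving an edit. It is only an illustration under stated assumptions: the function name check_edit, the example patterns and both thresholds are hypothetical, and a real Wiki would load its URL patterns from a maintained blacklist rather than hard-coding them.

```python
import re

# Hypothetical patterns for illustration only; a real deployment would load
# a maintained blacklist instead of hard-coding these.
URL_BLACKLIST = [re.compile(p, re.IGNORECASE) for p in (
    r"cheap-pills\.example",
    r"casino[-\w]*\.example",
)]
KEYWORD_BLACKLIST = [re.compile(p, re.IGNORECASE) for p in (
    r"\bonline\s+casino\b",
)]
URL_RE = re.compile(r"https?://[^\s\"'<>\]]+", re.IGNORECASE)

MAX_NEW_URLS = 5     # assumed limit on URLs added in one editing step
MAX_URL_RATIO = 0.5  # assumed limit on the share of characters in added URLs


def check_edit(new_text, old_text=""):
    """Return a list of reasons to reject the edit; an empty list means accept."""
    reasons = []
    # Only look at URLs that this edit adds, not at links already on the page.
    old_urls = set(URL_RE.findall(old_text))
    added = [u for u in URL_RE.findall(new_text) if u not in old_urls]

    for url in added:
        if any(p.search(url) for p in URL_BLACKLIST):
            reasons.append("blacklisted URL: " + url)
    for p in KEYWORD_BLACKLIST:
        if p.search(new_text):
            reasons.append("blacklisted keyword: " + p.pattern)
    if len(added) > MAX_NEW_URLS:
        reasons.append("too many URLs")

    stripped = new_text.strip()
    if stripped and sum(len(u) for u in added) / len(stripped) > MAX_URL_RATIO:
        reasons.append("URL-to-content ratio too high")
    return reasons
```

Because the check looks only at what is posted and not at who posts it, it behaves the same for a request coming through an open proxy or a botnet. Returning the list of reasons instead of a plain yes or no makes it easier to show a legitimate editor why an edit was refused.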

Similar rules apply to other kinds of social software that allow user-generated content, especially blogs and social networks, but depending on your application the motivations and techniques of the spammers might vary.
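
The two mechanisms from rule 2 can be sketched in a few lines as well. The following is a rough illustration, not the way any particular Wiki software does it: the helper names render_link and robots_meta and the white list in TRUSTED_HOSTS are assumptions made up for this example.

```python
from html import escape
from urllib.parse import urlsplit

# Hypothetical white list of hosts whose links may pass on search rank.
TRUSTED_HOSTS = {"wiki.example.org", "example.org"}


def render_link(url, label):
    """Render a link, adding rel="nofollow" unless the target is trusted."""
    parts = urlsplit(url)
    # Links without scheme and host stay inside the wiki and are trusted.
    internal = not parts.scheme and not parts.netloc
    host = (parts.hostname or "").lower()
    trusted = internal or host in TRUSTED_HOSTS
    rel = "" if trusted else ' rel="nofollow"'
    return '<a href="%s"%s>%s</a>' % (escape(url, quote=True), rel, escape(label))


def robots_meta(is_archived_revision):
    """Return the robots meta tag for history or archive pages, else nothing."""
    if is_archived_revision:
        return '<meta name="robots" content="noindex,nofollow">'
    return ""
```

With these assumptions, render_link("http://spam.example/pills", "pills") produces a link carrying rel="nofollow", while a link to /wiki/FrontPage is rendered without it. The output of robots_meta would go into the head of every page that shows an old revision.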