Fighting Wiki SPAM

posted 09:57AM Jan 07, 2008 with tags google softwaredevelopment spam transformers wiki by Lars Trieloff

Social Software is software that gets spammed. This applies first and foremost to e-mail, but Wikis and Blogs are also preferred targets of spammers. The following rules should act as a guideline for everyone who designs Wiki software, evaluates Wiki software or needs to configure a Wiki that is under attack by spammers.
  1. Understand the way spammers think and work: The main goal of most wiki spammers is to create link spam that will lead search engine crawlers and algorithms, especially Google's, into giving their own or their customers' websites a higher rank for certain keywords. To achieve this goal, they try to create keyword-specific links wherever possible, and this means in your Wiki. To create a large number of links in a short time, they write small programs that know how your Wiki software works and send the correct requests to create new pages or new page revisions. As in the movie "Transformers", your wiki has become a playing field of robot wars: on the one side ("destroy") are the spam-bots, on the other side the Googlebot. To become more familiar with the way Wiki and Blog spammers think, I recommend The Register's "Interview with a link spammer".
  2. Do not be an attractive target: The best way of preventing Wiki spam is not being a target of Wiki spam. Spammers find Wikis vulnerable to SPAM attacks by searching search engines for pages that have already been spammed by somebody else. A page that is spammed and can be found via a Google search is both vulnerable and attractive, because the spammer knows that Google will see their spam. In order not to be an attractive target, it is important to remove all existing SPAM from the Wiki and to make sure that SPAM is not going to be picked up by Google and other search engines. A mechanism that has been proposed to achieve this goal (and that has been found to be effective) is using the rel="nofollow" attribute on all links that could lead to SPAM. Some wiki software applies this to all outgoing links, some only to outgoing links that do not match a white list of allowed pages, and some only to outgoing links on newly edited pages. The most important rule, however, is: Exclude all archived versions of wiki pages from being indexed. If your archived pages are being indexed, the spam will be picked up by the search engines, no matter how fast you are to revert the changes. A good technique to achieve this goal is putting the <meta name="robots" content="noindex,nofollow"> tag in the head of all history or archive pages. To learn more about how to exclude pages from being indexed, take a look at The Web Robots Page and Google's Webmaster Central Blog on using the robots meta tag.
  3. Use your community to fight spam: What is SPAM and what is legitimate content? As good as robots might be at creating SPAM, humans beat them by orders of magnitude at detecting SPAM. As your community profits most from your Wiki, you should invite the community to join your spam fighting efforts. This means regularly observing the "Recent Changes" page, skimming through changes and change descriptions (SPAM robots seldom use change descriptions that fit the usage patterns of your wiki), and reverting spammed pages to a clean revision. By selecting a Wiki software that has a "revert" or "rollback to last revision" feature, you are giving your users a powerful weapon in the fight against robots, because they can be faster in spotting the SPAM and clicking the link than most robots. If wiki spam is a major nuisance for you, you should engage in the Chongqed community, which is devoted to fighting SPAM in Wikis and retaliating against spammers (which I doubt is worth the effort). If you do not have a community that can help you fight SPAM, you should probably disable editing in the Wiki or shut it down completely. Without a community, you will lose interest sooner or later as well, but spammers will continue to find your Wiki an attractive target.
  4. Ban content, not users: Lots of spam fighting techniques involve some way of banning certain requests, based on user agents, time of day, frequency of access, IP address range, etc. Other techniques require registration or use CAPTCHAs. All these techniques have a number of disadvantages. First, they create false positives, e.g. blocking legitimate edits that just happen to use the wrong user agent, time of day or IP address range. Second, some of them, like CAPTCHAs and required registration, raise the barrier to contribution, so many users will not even try to contribute to your Wiki. Finally, they can easily be circumvented by a clever spammer. IP address based blocks in particular can be circumvented by using open proxies, dynamic IP addresses or botnets. The only thing that spammers cannot disguise is their intent to create links with specific targets and keywords in your Wiki. The most effective techniques are therefore based on banning content. This means banning URLs based on regular expression patterns (you do not have to build a database of these patterns yourself, there is an excellent one available at http://blacklist.chongqed.org/), banning text in the Wiki based on regular expression patterns, e.g. for keywords (this will be more difficult if your wiki is devoted to gambling or erectile dysfunction medication), or even banning edits based on the number of URLs posted in one editing step or the URL-to-other-content ratio of the post.
  5. Stay up to date: Staying up to date means keeping up to date with the version of your Wiki software, which might not only fix bugs and bring interesting new features, but also introduce new mechanisms to fight SPAM. And staying up to date means keeping up to date with new techniques used by spammers and ways to fight them. Good resources are the C2 Wiki (THE original Wiki) and the Chongqed Wiki.
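
The content-based checks from rule 4 can be sketched in a few lines of Python. This is only a minimal illustration, not the way any particular Wiki software does it: the patterns, the thresholds and the function name check_edit are all made up for this example, and a real installation would load its URL patterns from a maintained blacklist instead of hard-coding them.

```python
import re

# Hypothetical patterns for illustration only; a real Wiki would load a
# maintained blacklist instead of hard-coding two entries.
URL_BLACKLIST = [re.compile(p, re.IGNORECASE) for p in (
    r"cheap-pills\.example",
    r"casino[\w-]*\.example",
)]
KEYWORD_BLACKLIST = [re.compile(p, re.IGNORECASE) for p in (
    r"\bonline\s+poker\b",
)]

URL_PATTERN = re.compile(r"https?://[^\s<>\]]+", re.IGNORECASE)
MAX_NEW_URLS = 5      # assumed limit of new links per edit
MAX_URL_RATIO = 0.5   # assumed limit for URL characters / all characters


def check_edit(old_text, new_text):
    """Return a list of reasons to reject the edit (empty list = accept)."""
    known_urls = set(URL_PATTERN.findall(old_text))
    all_urls = URL_PATTERN.findall(new_text)
    # Only links introduced by this edit are judged.
    new_urls = [u for u in all_urls if u not in known_urls]
    reasons = []

    for url in new_urls:
        if any(p.search(url) for p in URL_BLACKLIST):
            reasons.append("blacklisted URL: " + url)

    for pattern in KEYWORD_BLACKLIST:
        # A keyword counts only if this edit introduced it.
        if pattern.search(new_text) and not pattern.search(old_text):
            reasons.append("blacklisted keyword: " + pattern.pattern)

    if len(new_urls) > MAX_NEW_URLS:
        reasons.append("too many new URLs: %d" % len(new_urls))

    if new_urls and new_text.strip():
        url_chars = sum(len(u) for u in all_urls)
        if url_chars / len(new_text) > MAX_URL_RATIO:
            reasons.append("URL-to-content ratio too high")

    return reasons
```

A Wiki would call check_edit(old_revision_text, submitted_text) before saving and refuse the edit if the returned list is not empty. Because only links and keywords that are new in the edit are judged, a legitimate user can still fix a typo on a page that already contains many links.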

Similar rules apply to other kinds of social software that allow user-generated content, especially blogs and social networks, but depending on your application the motivations and techniques of the spammers might vary.
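
For rule 2 above, here is a small Python sketch of how a Wiki renderer could mark outgoing links and archived pages. It assumes the white list variant described there; the host names in TRUSTED_HOSTS and the function names render_link and robots_meta are invented for this example and do not come from any existing Wiki software.

```python
from html import escape
from urllib.parse import urlparse

# Hypothetical white list of hosts that may be linked without rel="nofollow".
TRUSTED_HOSTS = {"wiki.example.org", "docs.example.org"}


def render_link(url, label):
    """Render a link, adding rel="nofollow" unless the host is trusted."""
    host = (urlparse(url).hostname or "").lower()
    # Links without a host are relative links into the Wiki itself.
    trusted = host == "" or host in TRUSTED_HOSTS
    rel = "" if trusted else ' rel="nofollow"'
    return '<a href="%s"%s>%s</a>' % (escape(url, quote=True), rel, escape(label))


def robots_meta(is_archived_revision):
    """Return the meta tag for the page head; archived revisions are excluded."""
    if is_archived_revision:
        return '<meta name="robots" content="noindex,nofollow">'
    return ""
```

With this in place, render_link("http://spam.example.com/x", "pills") carries rel="nofollow", while a link to wiki.example.org does not, and every history page gets the robots meta tag in its head.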

JCR for Roller

posted 01:28PM Nov 22, 2007 with tags jackrabbit jcr jspwiki microsling opensource roller sling weblog wiki by Lars Trieloff

Dave Johnson's wrap-up of the ApacheCon contains some interesting pieces. Of course he mentions the Shindig proposal, which I regard as one of the most interesting developments in the social networking space, and he has written a longer paragraph on combining Roller (the weblog software) and Jackrabbit (the JCR repository that is the core of Day's CRX).
The idea of using a content management system to store Roller content keeps on coming up. At ApacheCon EU earlier this year, I spent some time talking to Lars Trieloff (who now works for CMS vendor Day Software) about implementing the Roller back-end interfaces using the Java Content Repository (JCR) APIs instead of the Java Persistence API (JPA) that we use now.
My rationale back then was that, to allow true free-form collaboration in Mindquarry, we needed a weblog system. Mindquarry is based on Jackrabbit and I did not want to open another repository backend, so I thought about creating a JCR-based backend for Roller that would easily integrate with Mindquarry.
At this ApacheCon, Noel Bergman brought up the topic a couple of times and pointed out that Day Software, has blog and wiki modules that are both backed by JCR. We could do the same thing: create version of Apache Roller and Apache JSPWiki (incubating) that share the same content repository.
The main advantage is that Roller and JSPWiki are content-centric applications. Every well-designed content-centric application moves sooner or later in the direction of having a separate repository layer. In JSPWiki the repository layer allows you to have different backends, from flat files to RCS to Apache Roller. In Roller there is a domain-specific repository implementation that is called "model", but if you have read my recent posts on microsling, you will note that using Model-View-Controller (MVC) for content-centric applications disguises the content-centric nature of the application, which would need a Content-Behavior-Appearance (CBA) model.
Later, Jukka Zitting (who also works for Day Software), suggested the idea of implementing JPA itself with JCR, thus allowing Roller to store its content in a CMS in a totally transparent fashion. This topic is interesting to me, but I don't fully understand the benefits of backing blogs and wikis with JCR. What new use cases would this support? How do the interesting features of JCR, like versioning for example, bubble up through Roller -- especially if Roller is to support both RDBMS and CMS back-ends?
I had a chat with Jukka yesterday in which he pointed out that implementing JPA based on JCR could be a very cost-effective solution and might be the ideal way to migrate applications stuck with relational or object-relational backends to a content-based backend. Of course, you would lose many of the advanced features that JCR offers, like full-text search, observations and versioning, because you have to maintain backwards compatibility with relational databases.

As always, this is a question of frameworks and the right time to start a project. When the Roller project was started, there was no JCR, no Sling and no practical and standardized way of implementing content-centric applications. With this in mind, it is easy to map some of the features JCR is offering to the needs of a blog application:

  • versioning: keeping all versions of a blog post, allowing incremental writing and backup
  • observations: notifications for new comments
  • workspaces: having a draft and a publish area for posts
  • hierarchy: posts belong to weblogs, comments to posts, etc.
  • export: backup
  • queries: tagging, categories, full-text-search

Disclaimer: I am Day's product manager for collaboration products, namely Blog and Wiki, and I am a long-time user of Roller and JSPWiki. This blog post was written on Roller, using the JSPWiki plugin.

Blogging from Linux Tag: ITerating: Wiki-based Software Guide

posted 11:15AM Jun 02, 2007 with tags doap foaf jena linuxtag opensource rdf semanticweb wiki by Lars Trieloff

I've been invited to moderate the Web 2.0 track of the Linux Tag conference. The first presentation of the day was Semantic Web, Wiki and Mashup: how they can all work together by Nicolas Vandenberghe. In his presentation Nicolas introduced ITerating, a software directory for open source software, proprietary software and software as a service that
  • is editable like a wiki
  • pulls data from RSS feeds published by Sourceforge, Freshmeat, etc.
  • stores everything in a Triple Store powered by JENA
  • outputs data as RDF+FOAF
  • outputs data as RDF+DOAP
  • outputs data as RDF+DublinCore
  • supports reviews
To me it looks like the open source software portal Ugo Cei was looking for, and it is one of the first portals I know of that is built upon semantic web technology.

My Methodology of writing Documentation with Wikis

posted 11:40AM Feb 07, 2007 with tags mindquarry techdoc tips wiki by Lars Trieloff

Wikis are a hot topic in technical documentation. Adrian Sutton has some interesting remarks on why he thinks Wikis and user-contributed documentation do not lead to high-quality documentation, in Creating Great Documentation:
First and foremost, if you're thinking about improving a product's documentation, read Kathy Sierra's How to get users to RTFM. Make sure your documentation covers each of the five types required: Reference Guide, Tutorial, Learning/Understanding, Cookbook/Recipe, Start Here.
Adrian says that Wikis tend to create mostly reference documentation. Additionally, Wikis are not versioned with your product's source code.

I would add that Wikis often lead to topic-oriented authoring. Norman Walsh has some interesting takes on the consequences of topic-oriented authoring: good for reference, bad for tutorial, learning and understanding, bad for start here documentation.

Scott Abel points me to Tom Johnson's Using Wikis as Project Documentation Tools. His main complaints are:

Wiki wysiwig’s are primitive (technical documentation can have some complicated styles, with several levels of lists).
Once all the info is in the wiki, how do I generate a manual or online help? I don’t want to maintain two separate files.
I would agree with these complaints because Wikis are not the best tools for technical writing; they are websites that are easily editable, nothing more and nothing less. But there are reasonable arguments for using Wikis and user-contributed feedback for documentation:
“I’m saying, let the technical writer use a wiki as his or her documentation base,” says Johnson. “Make sure all project members are familiar with the wiki’s location and procedures for editing it. Then, encourage the project team to comment, review, add, edit, and otherwise adjust the documentation through the life of the project. The writer can shape, stylize, make consistent, and organize the content to make it usable. Most likely the writer will write 75% of the content anyway, but it will be more informed and accurate.”
Wikis and user-contributed content are a way of generating feedback and an additional source of information for the technical writer. Another example is the user-contributed comments in the PHP documentation that Adrian points to.

My methodology of creating technical documentation uses Wikis in two places:

  1. I plan and organize the documentation project using an issue tracking system and create tasks for every step in the documentation process
  2. I 'harvest' product development Wikis for information about the software I am documenting. This is the first use of Wikis: as a source of information.
  3. I create a content outline of the planned document in the Wiki and invite other team members to comment and correct the content outline. The Wiki here is a space for distributed brain-storming.
  4. I write the document using DocBook-XML, WYSIWYG-XML-editors and share the in-progress document and illustrations using a version control system. As I am using DocBook and Mindquarry's file sharing, concurrent editing of modular documents is easy.
  5. Reviewers and copy editors use the issue tracking system to create comments and remarks on the documentation.
  6. After releasing the document, the issue tracking system is used to track comments and suggestions for improvements. As the Wiki keeps evolving, I have a good starting point for a second revision of the document.

Developing Documentation with Wikis

posted 11:09PM Feb 03, 2007 with tags docbook mindquarry techdoc wiki by Lars Trieloff

Via Gordon Meyer I found Dan Wood's weblog. Dan is one of the developers of Sandvox, one of the easiest and best-looking ways to publish a web site (I've blogged about Sandvox before), and he describes his technique of using a Wiki for creating a user manual.

The important thing to note here is that the Wiki is not the user manual; it is just the tool for creating it. Most wikis have serious usability problems when they are used as user manuals (no wonder, they are designed to ease the publishing and editing process), an issue Dan mentions. One thing Dan does not mention, but that often occurs in Open Source projects: Wikis are a good excuse for forgetting documentation and delivering bad documentation.

What Dan and his team do is author the manual in the Wiki, then convert it into a proper Mac OS X online help. From my point of view, Wikis are not the optimal tool for authoring technical documentation; there are many specialized tools for this purpose that yield higher productivity. But this does not mean that Wikis do not have their place in a technical documentation process.

Wikis are ideal for drafting documents, creating content outlines and collecting resources before writing technical documentation. When it comes to actually writing documentation, specialized tools like XML-editors for DocBook come into play. In an ideal world you could at this point continue using the Wiki principle of collaborative authoring, and with Mindquarry's combined versioned file sharing, wiki and task management you've got all tools in one package.

Wiki case study from The British Council

posted 11:22AM Jun 27, 2006 with tags business collaboration study wiki by Lars Trieloff

Wikis are an important and popular tool for collaboration in many open source projects. In the article Using Wikis on the Intranet: The British Council Case Study, Maish Nichany shows how the Wiki concept can be applied to other knowledge worker teams outside the open source or software development cultures. The most important points are that
  • Wikis need the right culture. A culture that fosters communication, talking and negotiating. A wiki is a tool for better communication. If there is not already a culture of communication in an organization, Wikis will not establish it.
  • Wikis need concrete Wiki applications, or "A practical, compelling reason to collaborate, to share". Again, a Wiki is a tool for a cause. Without this cause, e.g. a collaborative effort that needs to be optimized, a Wiki is without use.
  • Wikis need a champion who can show the way. Someone will introduce the Wiki and this someone must be the one to encourage and invite her team members to collaborate.

As a side-note: Wikis are just one means of improving communication in teams of information workers.

Webmontag in Berlin (05-22-06)

posted 01:35PM May 23, 2006 with tags berlin blogs collaboration microformats semanticweb ting webmontag wiki by Lars Trieloff

I attended yesterday's Webmontag in Berlin. It was quite interesting, but the interesting parts were not the ones I expected:

Ting and Gobby

Mattis Manzel talked about Ting. A ting is a collaborative editing session that is supported by three tools: A collaborative editor like Gobby, a Voice-over-IP client like Skype or Teamspeak (Mattis said Teamspeak's push-to-talk-feature makes it the best program for tings because it does not distract from writing and disciplines the users) and an extension of MediaWiki that will save the exported document (The extension seems to be Mutante/MoonEdit and was originally designed for the proprietary MoonEdit).

The main idea is that a bunch of people meet at a specified time on a certain server and launch their collaborative editors. The appointment for time and server is made using a wiki page. People start writing and discuss what they are writing by embedding comments into the document and using the VoIP tool. After completion of the ting, which might take from 30 minutes to five hours, the created document is copied into the talk page of the wiki.

From my point of view, collaborative editing is an extremely interesting topic and I see many connections to Wiki software, but I am not sure how the Ting concept could be used for more than geek entertainment.

Structured Blogging

This part was the unexpectedly interesting one. Baju Bitter introduced Structured Blogging, which I had heard about before but had seen as just another way to make blogging even more complicated. After hearing Baju's talk, I've changed my opinion. The basic idea of structured blogging is to define data types for blog entries. For example, a weblog entry can be a review of a book or a movie, the announcement of an event, and many more. The structured blogging initiative provides a definition of blog entry types and relies on the popular microformats concept, which embeds machine-readable data into HTML by using CSS class names. Furthermore it provides plugins for two weblog tools that make creating structured weblog entries easier by providing editors that are suited to certain blog entry data types.
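
To illustrate how class names make a blog entry machine-readable, here is a minimal Python sketch of what an aggregator could do with a review entry. The sample markup is made up and only borrows a few class names from the hReview microformat (hreview, item, fn, rating, summary); the names ClassTextExtractor and extract_fields are invented for this example, and this is neither a complete microformats parser nor the code of the structured blogging plugins.

```python
from html.parser import HTMLParser

# Made-up blog entry fragment marked up with hReview-style class names.
SAMPLE_ENTRY = """
<div class="hreview">
  <span class="item"><span class="fn">Transformers</span></span>
  <span class="rating">4</span>
  <span class="summary">Robots fight robots.</span>
</div>
"""

# Elements without a closing tag; they must not disturb the element stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of elements that carry one of the wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._stack = []  # per open element: matched class name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self._stack.append(match)
        if match is not None:
            self.fields.setdefault(match, "")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to every wanted element that is currently open.
        for name in self._stack:
            if name is not None:
                self.fields[name] += data


def extract_fields(html, wanted):
    """Return a dict mapping each wanted class name found to its text."""
    parser = ClassTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}
```

Calling extract_fields(SAMPLE_ENTRY, ["fn", "rating", "summary"]) gives the title, the rating and the summary of the review as a dictionary, without the aggregator having to guess anything from the surrounding prose.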

The most interesting part of this concept is that there are already aggregators that use this structured blogging content.

  • edgeio finds listings of things you would like to sell in blog entries, think of it as a decentralized ebay (which would need Rapleaf integration, of course)
  • incredibooks is a list of book reviews by children and teenagers.

If you are able to read German, you should also check out Baju's collection of links, his weblog entry on this Webmontag and the German structured blogging website and forum he maintains.

Readers Edition

Finally, Peter Schink of Die Netzzeitung showcased a new Citizen Journalism project: Readers Edition, which will go live soon. Nothing new, but nice web design.

Tips for Wiki Implementors

posted 09:56AM Apr 20, 2006 with tags tips wiki by Lars Trieloff

The weblog a little madness features a post, 10 Things I Hate About Wikis, that discusses the shortcomings of most current Wiki software. Some points mentioned in this post are:
  • Wikis replacing documentation
  • Different Wiki syntax for different wikis
  • No semantic markup
  • Poor Navigation
  • Poor versioning support