The Information Worker's Value Chain

posted 05:47PM Mar 22, 2007 with tags collaboration gtd information tips by Lars Trieloff

Multi-Lingual Full-Text Indexing

posted 11:05AM Jun 02, 2006 with tags filtering index information l10n search by Lars Trieloff

Dare Obasanjo writes about Our Multi-Lingual World and Search Indexes. Indexing documents in many languages means first figuring out which language each document is written in, so that stopwords, which occur often but are not significant to the meaning of the document, can be filtered out.

My idea would be to maintain a stopword list for every supported language and, while indexing, to keep track of every stopword found in the document, whichever language-specific list it comes from. Before deciding which stopwords are really discarded, the indexer finds out which language most of the found stopwords belong to and discards only the stopwords of that language.

With this method you do not have to keep a list of all the words of a language to determine the language of a document; you can reuse the knowledge you already have, namely the language-specific stopword lists.
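
The idea described above could be sketched roughly like this in Python. Everything here is an assumption made for illustration: the tiny `STOPWORDS` lists, the function names `detect_language` and `strip_stopwords`, and the naive whitespace tokenizer are all hypothetical, and a real indexer would use much longer lists and a proper tokenizer.

```python
from collections import Counter

# Tiny illustrative stopword lists (hypothetical); real lists would be far longer.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "in", "to", "a"},
    "de": {"der", "die", "das", "und", "ist", "in", "zu"},
}


def detect_language(tokens, stopwords=STOPWORDS):
    """Return the language whose stopword list matched most tokens, or None."""
    hits = Counter()
    for token in tokens:
        for lang, words in stopwords.items():
            if token in words:
                hits[lang] += 1
    if not hits:
        return None  # no stopword from any list was found
    # On a tie, most_common keeps the language that was counted first.
    return hits.most_common(1)[0][0]


def strip_stopwords(text, stopwords=STOPWORDS):
    """Guess the language from stopword hits, then drop only that language's stopwords."""
    tokens = text.lower().split()  # naive tokenizer, assumed for brevity
    lang = detect_language(tokens, stopwords)
    if lang is None:
        return None, tokens
    kept = [t for t in tokens if t not in stopwords[lang]]
    return lang, kept


if __name__ == "__main__":
    print(strip_stopwords("Der Hund ist in der Stadt und die Katze ist zu Hause"))
    print(strip_stopwords("the cat is in the house"))
```

Note that a word such as "in" appears in both lists; it counts as a hit for both languages but is only discarded once the winning language has been chosen. Very short documents or documents mixing several languages may not give a clear winner, so a real implementation would probably want a minimum hit threshold.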