Multi-Lingual Full-Text Indexing
My idea is to maintain a stopword list for every supported language and, while indexing a document, to keep track of every stopword found and which language-specific list it came from. Before deciding which stopwords to discard, the indexer determines which language accounts for most of the stopwords found and discards only the stopwords of that language.
With this method you do not have to keep a list of all words of a language to determine the language of a document; you can use the knowledge you already have, namely the language-specific stopword lists.
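
A minimal sketch of this idea in Python might look like the following. The tiny stopword lists, the `STOPWORDS` table and the `filter_stopwords` function are all assumptions made up for illustration, and tokenization is assumed to have happened already. Ties between languages are resolved arbitrarily here (the first language to score wins); a real indexer would need a deliberate rule for that case.

```python
from collections import Counter

# Hypothetical, heavily abbreviated stopword lists for illustration only.
# Note that some words ("in", "de") can appear in more than one language.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to", "a"},
    "de": {"der", "die", "das", "und", "ist", "in", "ein"},
    "fr": {"le", "la", "les", "et", "est", "un", "de"},
}


def filter_stopwords(tokens):
    """Guess the document language from stopword hits, then drop only
    the stopwords of that language.

    Returns (language, kept_tokens); language is None when no stopword
    from any list was found, in which case nothing is discarded.
    """
    hits = Counter()
    for token in tokens:
        word = token.lower()
        # Count the hit for every language whose list contains the word.
        for lang, words in STOPWORDS.items():
            if word in words:
                hits[lang] += 1

    if not hits:
        return None, list(tokens)

    # The language with the most stopword hits is taken as the document language.
    language = hits.most_common(1)[0][0]
    stop = STOPWORDS[language]
    kept = [t for t in tokens if t.lower() not in stop]
    return language, kept
```

For example, `filter_stopwords("Der Hund und die Katze ist in das Haus".split())` counts six German hits against one English hit (the shared word "in"), so only the German stopwords are removed. Short documents with few stopwords will make the guess less reliable, since a single shared word can then tip the count.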
I am Product Manager for Collaboration and Digital Asset Management at