Back in 2006, when I was working on my first Sphinx search project, I was playing with stopword files. Stopwords are a small set of very frequent words that you usually don’t want to search for (like “I”, “am”, “the”, etc.). For most Sphinx instances they only waste index space and slow down your search queries, because the engine has to find every occurrence of these unimportant words.
Say you are searching for “when is jane’s birthday”: you are actually looking for documents about “jane’s birthday”, and you don’t really care about the many documents (blog posts, news articles, etc.) that contain only “when” and “is”.
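To make the idea concrete, here is a rough Python illustration of what dropping stopwords does to that query. It is only a sketch: Sphinx does its own filtering internally, and the tiny STOPWORDS set and the strip_stopwords helper are hypothetical names made up for this example, not part of Sphinx.

```python
# Illustration only: a hypothetical, very small stopword set.
STOPWORDS = {"i", "am", "the", "when", "is"}


def strip_stopwords(query, stopwords=STOPWORDS):
    """Return the query terms that are not stopwords, in their original order."""
    return [word for word in query.lower().split() if word not in stopwords]


# Only the meaningful terms of the query are left to search for.
print(strip_stopwords("when is jane's birthday"))
```

With the frequent words gone, the search only has to match the two terms that actually describe what you are looking for.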
Removing those high-frequency words from the search index is usually a smart move, and ages ago I created two sample stopword files which I am still using today.
stopwords.txt contains the 100 most frequent words in my blog post collection, while stopwords-500.txt, as you might expect, contains the 500 most frequent words. They are old, but not yet included in the Sphinx distribution, so I would suggest starting with stopwords.txt and adding it to your Sphinx config file with the stopwords option, as below:
stopwords = /path/to/stopwords.txt
You could also download stopword files using wget:
wget http://astellar.com/downloads/stopwords.txt
wget http://astellar.com/downloads/stopwords-500.txt
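If you would rather build a list from your own documents, the same "top N most frequent words" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the script used to produce the files above: top_words, write_stopwords, the sample texts and the simple regex tokenizer are all hypothetical, and a real list should be built with tokenization that matches your own Sphinx settings.

```python
import re
from collections import Counter


def top_words(texts, n=100):
    """Return the n most frequent lowercase words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Assumption: a word is a run of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(n)]


def write_stopwords(words, path):
    """Write one word per line as UTF-8 plain text (assumed stopword file layout)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(words) + "\n")


# Hypothetical sample data standing in for a real document collection.
posts = ["the cat and the dog", "the bird and the fish"]
print(top_words(posts, 2))
# write_stopwords(top_words(posts, 100), "stopwords.txt")
```

Review the generated list by hand before using it, since a word that is frequent in your collection may still be one your users search for.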
P.S. If you found this article useful, please share it!