Just spent the last week tuning our search engine using the latest version of Lucene (4.3.1). While Lucene works amazingly well right out of the box, to get "Google-like" relevancy for your results, you usually need to devise a custom strategy for indexing and querying the particular content your application has. Here are a few tricks we used for our content (which is English-only, jargon-heavy, and contains many terms used only by a few documents), plus some more basic techniques that just took us a while to figure out:
Use a custom analyzer for English text
Lucene's StandardAnalyzer generally does a good job of tokenizing text into individual words (aka "terms"), and it skips English "stopwords" (like the, a, etc.) by default — but if you have only English text, you can get better results by using the EnglishAnalyzer. Beyond the tokenizing filters that the StandardAnalyzer includes, the EnglishAnalyzer also includes the EnglishPossessiveFilter (for stripping 's from words) and the PorterStemFilter (for chopping off common word suffixes, like removing ming from stemming, etc.).
Because some of our text includes non-English names with letters not in the English alphabet (like é in liberté), and we know our users are going to want to search for those names using just English-alphabet letters, we implemented our own analyzer that adds the ASCIIFoldingFilter on top of the filters in the regular EnglishAnalyzer. This filter converts characters outside the (7-bit) ASCII range to the ASCII characters they most closely resemble; for example, it converts é to e (and © to (c), etc.).
A custom analyzer is easy to implement; this is what ours looks like in Java (the matchVersion and stopwords variables are fields from its Analyzer and StopwordAnalyzerBase superclasses, and TokenStreamComponents is an inner class of Analyzer):
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
public class CustomEnglishAnalyzer extends StopwordAnalyzerBase {
/** Tokens longer than this length are discarded. Defaults to 50 chars. */
public int maxTokenLength = 50;
public CustomEnglishAnalyzer() {
super(Version.LUCENE_43, StandardAnalyzer.STOP_WORDS_SET);
}
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
final Tokenizer source = new StandardTokenizer(matchVersion, reader);
source.setMaxTokenLength(maxTokenLength);
TokenStream pipeline = source;
pipeline = new StandardFilter(matchVersion, pipeline);
pipeline = new EnglishPossessiveFilter(matchVersion, pipeline);
pipeline = new ASCIIFoldingFilter(pipeline);
pipeline = new LowerCaseFilter(matchVersion, pipeline);
pipeline = new StopFilter(matchVersion, pipeline, stopwords);
pipeline = new PorterStemFilter(pipeline);
return new TokenStreamComponents(source, pipeline);
}
}
Note that when you use a custom analyzer for indexing, it's important to use the same (or at least a similar) analyzer for querying (and vice versa). For example, the EnglishAnalyzer will tokenize the phrase it's easily processed into two terms: easili (sic) and process. If you index this text with the EnglishAnalyzer, searching for the terms it's, easily, or processed will find no results — you have to create the query using the same analyzer to make sure the terms for which you're querying are actually easili and process.
You can use Lucene's StandardQueryParser to build an appropriate query for you out of a phrase, using Lucene's fancy querying syntax; or you can simply tokenize the phrase yourself with the following code, and build the query from the resulting terms:
import java.util.ArrayList;
import java.util.List;
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
...
List<String> tokenizePhrase(String phrase) throws IOException {
List<String> tokens = new ArrayList<String>();
TokenStream stream = new EnglishAnalyzer(Version.LUCENE_43).tokenStream(
"someField", new StringReader(phrase));
stream.reset();
while (stream.incrementToken())
tokens.add(stream.getAttribute(CharTermAttribute.class).toString());
stream.end();
stream.close();
return tokens;
}
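If you'd rather let Lucene parse its query syntax for you, a minimal sketch of the StandardQueryParser approach might look like this; the important part is constructing it with the same analyzer used at index time (the "all" default field here is just our example):
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;
...
Query parseQuery(String userInput) throws Exception {
// use the same analyzer that was used to build the index
StandardQueryParser parser = new StandardQueryParser(
new EnglishAnalyzer(Version.LUCENE_43));
// "all" is the field searched when the query doesn't name one
return parser.parse(userInput, "all");
}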
Use a custom scorer
The results you get back from a query and their order are heavily influenced by a number of factors: the text you have in your index, how you've tokenized and stored the text in the different fields of your index, and how you structure the query itself. You can also influence the ordering of results to a lesser degree by using a custom Similarity class when you build your index.
Lucene's default similarity class uses some fancy math to score the terms in its index (see this Lucene Scoring tutorial for a simpler explanation of the scoring algorithm), and you'll probably want to tweak only one or two of those factors. We implemented our own custom Similarity class that completely ignores document length, and provides a bigger boost for infrequently-appearing terms:
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;
public class CustomSimilarity extends DefaultSimilarity {
@Override
public float lengthNorm(FieldInvertState state) {
// simply return the field's configured boost value
// instead of also factoring in the field's length
return state.getBoost();
}
@Override
public float idf(long docFreq, long numDocs) {
// more-heavily weight terms that appear infrequently
return (float) (Math.sqrt(numDocs/(double)(docFreq+1)) + 1.0);
}
}
Once implemented, you can use this CustomSimilarity class when indexing by setting it on the IndexWriterConfig that you use for writing to the index, like this:
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
...
void indexSomething() throws IOException {
EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_43);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
config.setSimilarity(new CustomSimilarity());
FSDirectory directory = FSDirectory.open(new File("my-index"));
IndexWriter writer = new IndexWriter(directory, config);
// ... index something ...
writer.close();
}
Build your own query
Probably the single biggest way we improved our "result relevancy" in the eyes of our users was to build our queries programmatically from a user's query input, rather than asking them to use Lucene's standard query syntax. Our algorithm for generating queries first expands any abbreviations in the query (not using Lucene, just using an in-memory hashtable of our own custom list of abbreviations); then it builds a big query consisting of:
1. the exact query phrase (with a little slop), boosted heavily
2. varying combinations of the terms in the query phrase, boosted according to the number of matching terms
3. individual terms in individual fields (using the boost associated with those fields)
4. individual terms with no boost
This querying strategy complements our indexing strategy, which is to index a few important fields of each document separately (like "name", "keywords", etc) with boost added to those fields at index time; and then to index all the text related to each document in one big fat field (the "all" field) with no boost associated with it. The parts of the query that check for different terms appearing in the same document (#1 and #2 from the list above) rely on the "all" field; whereas the parts of the query that check in which fields the terms appear (#3 and #4) make use of the other, specially-boosted fields.
Doing it this way allows us to instruct Lucene to weight results that contain more matches of different terms (or the exact phrase) more heavily than results that simply match the same term many times; but also to weight matches in important fields (like "name" and "keywords") above matches from the general text of the document.
The actual query-building part of our code looks like this (I've removed the abbreviation-expanding bits for simplicity). The fields argument is the list of custom fields to query; the defaultField argument is the name of the "all" field; and it uses the tokenizePhrase() method from above to split the phrase into individual words:
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
...
Query buildQuery(String phrase, List<String> fields, String defaultField) throws IOException {
List<String> words = tokenizePhrase(phrase);
BooleanQuery q = new BooleanQuery();
// create term combinations if there are multiple words in the query
if (words.size() > 1) {
// exact-phrase query
PhraseQuery phraseQ = new PhraseQuery();
for (int w = 0; w < words.size(); w++)
phraseQ.add(new Term(defaultField, words.get(w)));
phraseQ.setBoost(words.size() * 5);
phraseQ.setSlop(2);
q.add(phraseQ, BooleanClause.Occur.SHOULD);
// 2 out of 4, 3 out of 4, 4 out of 4 (any order), etc
// stop at 7 in case user enters a pathologically long query
int maxRequired = Math.min(words.size(), 7);
for (int minRequired = 2; minRequired <= maxRequired; minRequired++) {
BooleanQuery comboQ = new BooleanQuery();
for (int w = 0; w < words.size(); w++)
comboQ.add(new TermQuery(new Term(defaultField, words.get(w))), BooleanClause.Occur.SHOULD);
comboQ.setBoost(minRequired * 3);
comboQ.setMinimumNumberShouldMatch(minRequired);
q.add(comboQ, BooleanClause.Occur.SHOULD);
}
}
// create an individual term query for each word for each field
for (int w = 0; w < words.size(); w++)
for (int f = 0; f < fields.size(); f++)
q.add(new TermQuery(new Term(fields.get(f), words.get(w))), BooleanClause.Occur.SHOULD);
return q;
}
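For reference, the abbreviation-expanding step that I stripped out of buildQuery() is nothing Lucene-specific; a hypothetical sketch of it, using a plain in-memory map (the entries here are made up), might look like this:
import java.util.HashMap;
import java.util.Map;
...
protected static final Map<String, String> ABBREVIATIONS = new HashMap<String, String>();
static {
// hypothetical entries; the real list is specific to our domain jargon
ABBREVIATIONS.put("dept", "department");
ABBREVIATIONS.put("mgr", "manager");
}
String expandAbbreviations(String phrase) {
StringBuilder expanded = new StringBuilder();
for (String word : phrase.split("\\s+")) {
String replacement = ABBREVIATIONS.get(word.toLowerCase());
expanded.append(replacement != null ? replacement : word).append(" ");
}
return expanded.toString().trim();
}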
Boost important fields when indexing
When we do the document indexing, we set the boost of some of the important fields (like "name" and "keywords", etc), as described above, while dumping all the document's text (including name and keywords) into the "all" field. Following is an example (in which we use our own customized FieldType so that we can configure the field with the IndexOptions that the result highlighter needs, discussed later). The toDocument() method translates a particular type of domain object into a Lucene Document with the appropriate "keywords", "name", "all", etc fields; it would be called by our indexing process (from the indexSomething() method above) for each instance of that domain type in our system, creating a separate Lucene document for each one:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;
...
protected static final FieldType TEXT_FIELD_TYPE = getTextFieldType();
static FieldType getTextFieldType() {
FieldType type = new FieldType();
type.setIndexed(true);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
type.setStored(true);
type.setTokenized(true);
return type;
}
Document toDocument(MyDomainObject domain) {
Document doc = new Document();
Field keywordsField = new Field("keywords", domain.keywords, TEXT_FIELD_TYPE);
keywordsField.setBoost(3f);
doc.add(keywordsField);
Field nameField = new Field("name", domain.name, TEXT_FIELD_TYPE);
nameField.setBoost(2f);
doc.add(nameField);
// ... other fields ...
String all = new StringBuilder().
append(domain.keywords).append("\n").
append(domain.name).append("\n").
append(domain.text).append("\n").
append(domain.moreText).append("\n").
toString();
Field allField = new Field("all", all, TEXT_FIELD_TYPE);
doc.add(allField);
return doc;
}
Filter by date with a NumericRangeQuery
Many of our individual documents are relevant only during a short time period, with the exact start and end dates defined by the document. When we query for anything, we query against a specific day chosen by the user. In our Lucene searches, we implement this with a filter that wraps a pair of NumericRangeQuerys, querying the "startDate" and "endDate" fields. (A more common scenario in other applications might be a single "publishedDate" per document, with users choosing separate start and end dates against which to filter -- in that case, you'd use a single NumericRangeQuery.) We index the "startDate" and "endDate" fields like this, using an integer field of the form 20010203 to represent a date like 2001-02-03 (Feb 3, 2001):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;
...
protected static final FieldType DATE_FIELD_TYPE = new FieldType();
static {
DATE_FIELD_TYPE.setIndexed(true);
DATE_FIELD_TYPE.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
DATE_FIELD_TYPE.setNumericType(FieldType.NumericType.INT);
DATE_FIELD_TYPE.setOmitNorms(true);
DATE_FIELD_TYPE.setStored(true);
}
Document toDocument(MyDomainObject domain) {
Document doc = new Document();
Field startField = new Field("startDate", domain.startDate, DATE_FIELD_TYPE);
doc.add(startField);
Field endField = new Field("endDate", domain.endDate, DATE_FIELD_TYPE);
doc.add(endField);
// ... other fields ...
return doc;
}
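(The startDate and endDate values above are assumed to already be ints in yyyymmdd form; a hypothetical helper for producing them from a java.util.Calendar could be as simple as this:)
import java.util.Calendar;
...
// encode a calendar date as an int like 20010203 for 2001-02-03
static int toDateInt(Calendar cal) {
return cal.get(Calendar.YEAR) * 10000
+ (cal.get(Calendar.MONTH) + 1) * 100 // Calendar months are 0-based
+ cal.get(Calendar.DAY_OF_MONTH);
}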
Then we build a filter like this, caching a separate filter instance per date (dates again represented in integer form like 20010203 to stand for 2001-02-03):
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.QueryWrapperFilter;
...
protected Map<Integer, Filter> cachedFilters =
Collections.synchronizedMap(new HashMap<Integer, Filter>());
Filter getDateFilter(int date) {
Filter filter = cachedFilters.get(date);
if (filter == null) {
BooleanQuery q = new BooleanQuery();
// startDate must be on or before the specified date
q.add(NumericRangeQuery.newIntRange(
"startDate", 0, date, true, true
), BooleanClause.Occur.MUST);
// endDate must be on or after the specified date
// 30000000 represents the distant future (just prior to the year 3000)
q.add(NumericRangeQuery.newIntRange(
"endDate", date, 30000000, true, true
), BooleanClause.Occur.MUST);
filter = new QueryWrapperFilter(q);
cachedFilters.put(date, filter);
}
return filter;
}
Use a SearcherManager for multi-threaded searching
To manage access to the index from multiple searching threads, Lucene provides a simple SearcherManager class. Once the index has been created, you can instantiate it and call its acquire() method to check out an IndexSearcher instance.
We needed to initialize our IndexSearcher instances with our custom Similarity class (discussed above), so we initialized the manager with a custom SearcherFactory, which then allowed us to customize the IndexSearcher initialization process:
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
public class CustomSearcherFactory extends SearcherFactory {
@Override
public IndexSearcher newSearcher(IndexReader r) throws IOException {
IndexSearcher searcher = new IndexSearcher(r);
searcher.setSimilarity(new CustomSimilarity());
return searcher;
}
}
To use it, we create a SearcherManager instance when initializing the index (in the init() method) — note that the index must already exist before creating the SearcherManager; and then acquire and release the IndexSearcher it provides whenever we actually need to run a search on the index (in the search() method):
import java.io.File;
import java.io.IOException;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
...
protected SearcherManager searchManager;
protected void init() throws IOException {
FSDirectory directory = FSDirectory.open(new File("my-index"));
searchManager = new SearcherManager(directory, new CustomSearcherFactory());
}
public TopDocs search(Query query, Filter filter, int maxResults) throws IOException {
IndexSearcher searcher = searchManager.acquire();
try {
return searcher.search(query, filter, maxResults);
} finally {
searchManager.release(searcher);
}
}
After re-indexing, make sure to call maybeRefresh() on the SearcherManager to refresh the managed IndexSearchers with the latest copy of the index. In other words, the indexSomething() method from above would be finished like this:
void indexSomething() throws IOException {
// ... index something ...
writer.close();
searchManager.maybeRefresh();
}
Highlight results with a PostingsHighlighter
The PostingsHighlighter class is the newest implementation of a results highlighter for Lucene (the component that comes up with the fragments of text to display for each result in the search-results UI). It's only been part of Lucene since the 4.1 release, but our experience has been that it selects more-clearly relevant sections of the text than the previous highlighter implementation, the FastVectorHighlighter.
The first step to using a results highlighter is to make sure that you include at index time the data that the highlighter will need at search time. With the FastVectorHighlighter, we used this configuration for a regular indexed field:
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;
...
static FieldType getTextFieldType() {
FieldType type = new FieldType();
type.setIndexed(true);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
type.setStored(true);
type.setStoredTermVectorOffsets(true);
type.setStoredTermVectorPayloads(true);
type.setStoredTermVectorPositions(true);
type.setStoredTermVectors(true);
type.setTokenized(true);
return type;
}
But with the PostingsHighlighter, we found we didn't need to store the term vectors anymore — but we did need to index the term offsets:
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;
...
static FieldType getTextFieldType() {
FieldType type = new FieldType();
type.setIndexed(true);
type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
type.setStored(true);
type.setTokenized(true);
return type;
}
The PostingsHighlighter, by default, selects complete sentences to show. Much of our text, however, isn't in the form of proper sentences (begun with a capital letter and ended with a period and whitespace), so we subclassed the PostingsHighlighter with a class that uses a custom BreakIterator implementation that selects just a few words around each term to display.
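Our word-window BreakIterator is too long to show here, but as a sketch of where the hook lives (assuming Lucene 4.3's protected getBreakIterator() method; a stock line-based BreakIterator stands in for our custom one):
import java.text.BreakIterator;
import java.util.Locale;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
public class CustomHighlighter extends PostingsHighlighter {
@Override
protected BreakIterator getBreakIterator(String field) {
// the default is BreakIterator.getSentenceInstance(Locale.ROOT);
// return a different (or fully custom) BreakIterator to change how
// passage boundaries are chosen for this field
return BreakIterator.getLineInstance(Locale.ROOT);
}
}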
With or without a custom BreakIterator, it's easy to use the PostingsHighlighter. You do need the IndexSearcher and TopDocs instances from the initial search to use it, so you might as well do both the search and the highlighting in the same method, returning the combined results in some intermediate data structure. For example, we use a simple Result class for each individual result, combining one Lucene document object from the search results with the corresponding highlights string from the highlighter in each returned Result:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
...
public class Result {
public Document document;
public String highlights;
public Result(Document document, String highlights) {
this.document = document;
this.highlights = highlights;
}
}
protected PostingsHighlighter highlighter = new PostingsHighlighter();
public List<Result> search(Query query, Filter filter, int maxResults) throws IOException {
IndexSearcher searcher = searchManager.acquire();
try {
TopDocs topDocs = searcher.search(query, filter, maxResults);
// select up to the three best highlights from the "all" field
// of each result, concatenated with ellipses
String[] highlights = highlighter.highlight("all", query, searcher, topDocs, 3);
int length = topDocs.scoreDocs.length;
List<Result> results = new ArrayList<Result>(length);
for (int i = 0; i < length; i++) {
int docId = topDocs.scoreDocs[i].doc;
results.add(new Result(searcher.doc(docId), highlights[i]));
}
return results;
} finally {
searchManager.release(searcher);
}
}
With a tree, index leaves only
Some of our data is in hierarchical form, and we display the search results for that data in tree form. Rather than indexing all the nodes in the tree, however, we just index the leaves, and make sure that each leaf also includes the relevant text from its ancestors.
We also include the necessary info to render the leaf's branch as a separate, non-indexed "hierarchy" field in each leaf. When the leaf is returned as a search result, we build the branch out of that "hierarchy" field, and then merge the branches together to show each leaf in the context of the full tree.
This is the field configuration we use for the non-indexed "hierarchy" field:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo;
...
protected static final FieldType NON_INDEXED_FIELD_TYPE = getNonIndexedFieldType();
static FieldType getNonIndexedFieldType() {
FieldType type = new FieldType();
type.setIndexed(false);
type.setOmitNorms(true);
type.setStored(true);
return type;
}
Document toDocument(MyDomainObject domain) {
Document doc = new Document();
// ... other fields ...
String hierarchy = domain.getHierarchyText();
Field allField = new Field("hierarchy", hierarchy, NON_INDEXED_FIELD_TYPE);
doc.add(allField);
return doc;
}
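The getHierarchyText() method isn't shown above; as a hypothetical sketch, assuming each domain object keeps a reference to its parent node, it could simply list the ancestor names from the root down, one per line:
// hypothetical instance method on MyDomainObject, assuming a parent reference
String getHierarchyText() {
StringBuilder sb = new StringBuilder();
for (MyDomainObject node = this.parent; node != null; node = node.parent)
sb.insert(0, node.name + "\n");
return sb.toString();
}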
Use a SpellChecker for auto-complete suggestions
For auto-complete suggestions in our application's search box, we created a custom search index of common words in our application domain that were at least six letters long, and used Lucene's SpellChecker class to index and search this word list. We skipped words less than six letters long to avoid suggesting simple words when the user has typed in only the first few letters of a word. To build the index, we created a plain text file with one word on each line, and indexed it with the following indexDictionary() method:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Suggestor {
File directory = new File("suggestion-index");
SpellChecker spellChecker = new SpellChecker(FSDirectory.open(directory));
public void indexDictionary(File dictionaryFile) {
PlainTextDictionary dictionary = new PlainTextDictionary(dictionaryFile);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43,
new StandardAnalyzer(Version.LUCENE_43));
spellChecker.indexDictionary(dictionary, config, true);
}
}
To then search it, we used a simple PrefixQuery for (partial) words less than 5 letters long; for longer words we used the SpellChecker's built-in fuzzy-suggestion algorithm (with a 0.2f accuracy to make it even fuzzier than the default). The suggestSimilar() method of our Suggestor class returns a list of up to 10 words appropriate as auto-completions for the partial word specified as its argument. It delegates to the helper prefixSearch() and fuzzySuggest() methods to actually run the search, based on the length of the specified partial word:
protected SearcherManager manager;
protected SearcherManager getSearcherManager() throws IOException {
synchronized (directory) {
if (manager == null)
manager = new SearcherManager(
FSDirectory.open(directory), new SearcherFactory());
return manager;
}
}
public List<String> suggestSimilar(String s) throws IOException {
// search with prefix query if less than 5 chars
// otherwise use spellChecker's built-in fuzzy suggestions
return s.length() < 5 ? prefixSearch(s) : fuzzySuggest(s);
}
protected List<String> prefixSearch(String s) throws IOException {
SearcherManager manager = getSearcherManager();
IndexSearcher searcher = manager.acquire();
try {
// search for the top 10 words starting with s
Term term = new Term("word", s.toLowerCase())
TopDocs topDocs = searcher.search(new PrefixQuery(term), 10);
int length = topDocs.scoreDocs.length;
List<String> results = new ArrayList<String>(length);
for (int i = 0; i < length; i++) {
int docId = topDocs.scoreDocs[i].doc;
results.add((searcher.doc(docId).get("word"));
}
return results;
} finally {
manager.release(searcher);
}
}
protected List<String> fuzzySuggest(String s) throws IOException {
String term = s.toLowerCase();
// search for the 10 most popular words not exactly matching s
String[] similar = spellChecker.suggestSimilar(
term, 10, null, null,
SuggestMode.SUGGEST_MORE_POPULAR, 0.2f);
List<String> results = new ArrayList<String>(Arrays.asList(similar));
// include the queried term if it is itself a recognized word
if (spellChecker.exist(term)) {
if (results.isEmpty())
results.add(term);
else
results.set(results.size() - 1, term);
}
return results;
}