Notes on Using Lucene for IR Experiments


These notes are based on Lucene-5.4.0 (the archive carries all previous versions).

$LUCENEHOME, used throughout the text, refers to the directory where the software is unpacked and placed.

Stemmers

Here are some stemmer implementations in Lucene; the names they are commonly known by are given in parentheses:

PorterStemFilter
KStemFilter (Krovetz stemmer)
SnowballFilter
EnglishMinimalStemFilter (S-Stemmer)

See the package org.apache.lucene.analysis.en for the full list of stemmers.
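
To illustrate how these plug in, each stemmer is just a TokenFilter appended to an analysis chain. A minimal sketch, not code from these notes (WhitespaceTokenizer and LowerCaseFilter live in org.apache.lucene.analysis.core, the stemmers in org.apache.lucene.analysis.en):

Tokenizer source   = new WhitespaceTokenizer();
TokenStream stream = new LowerCaseFilter(source);
stream             = new PorterStemFilter(stream);   // or new KStemFilter(stream), etc.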

Models

The models are listed in the 'models' file. The first column holds the names I make Lucene use to refer to the Java classes listed in the second column; since these strings show up in the run tag, mapping them to shorter strings keeps the tags readable.

bm25L        BM25Similarity
dfrL         DFRSimilarity
defaultL     DefaultSimilarity
lmdirichletL LMDirichletSimilarity
ibL          IBSimilarity
tmpl         TMPL
bm25         BM25
tmple        TMPLe
bm25e        BM25e

bm25L is Lucene's stock implementation, while bm25 is a shorter, cleaner version. The names with the suffix 'e' point to versions that preserve Lucene's technique of packing (encoding) floating-point document length values into a byte (see Modifying Models). Run diff on TMPL.java and TMPLe.java to see what I mean.

A full list of models available in Lucene-5.4.0 is in the package org.apache.lucene.search.similarities.
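
For reference, the 'L' names resolve to stock Lucene classes; BM25Similarity, DefaultSimilarity and LMDirichletSimilarity can be constructed with no arguments, while DFRSimilarity and IBSimilarity need their component models supplied. A minimal sketch of such a mapping (not the actual code that reads the 'models' file):

Map<String, Similarity> models = new HashMap<>();
models.put("bm25L",        new BM25Similarity());
models.put("defaultL",     new DefaultSimilarity());
models.put("lmdirichletL", new LMDirichletSimilarity());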

Modifying Models

The starting point is Lucene's BM25Similarity.java, which was stripped to its bare minimum and transformed into TMPL.java. Here I describe how to craft TMPL.java to implement your own tf-idf scoring formula.

Starting with a Template: TMPL.java

Lucene's documentation attempts to explain its scoring formula: it shows the arithmetic involved in computing the similarity score by color-coding parts of the arithmetic expression and then using the same colors in the text naming the Java methods, mapping the methods to the parts of the formula they implement.

Much of the surrounding text on that page assumes context that most readers will not have. So here I use TMPL.java to explain only the parts of the code that implement a formula of the form:

TF * IDF / LN

where LN is the length-normalization factor.

A shallow cut-away of this formula involves the constant K in this way:

TF * IDF / (K + TF)

The following sections explain where and how the three pieces, TF, IDF, and K, are computed. The rest of the code is best left untouched; most of its functions have been rigged to return 1 so that the chain multiplication does not affect the score computation.

TF

The term-frequency value is available to the object and need not be constructed out of anything. The method TFIDFScorer.score() receives the value in the variable 'tf', which is used to compute the weight 'w' as shown:

public class TFIDFScorer extends SimScorer
{
    public float score(int doc, float tf)
    {
        // norms.get(doc) returns the byte-encoded document length, used to index the look-up table K
        float w = tf / K[(byte)norms.get(doc) & 0xFF] * bw.idf;
        return w;
    }
}

IDF

'N' and 'n', the ingredients of IDF, are initialized here and 'idf' is computed. There are some bells and whistles where 'idf' is accumulated as a series sum, possibly because Lucene views queries and documents as partitioned into 'fields', which necessitates a sum over the fields. My interpretation is that control does not step into the 'else' block, and even if it does, the sum is safe.

public final SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats)
{
    float N, n, idf, adl, dl;

    idf = 1.0f;
    N   = collectionStats.maxDoc();
    adl = collectionStats.sumTotalTermFreq() / N;

    if (termStats.length == 1) {
        n   = termStats[0].docFreq();
        idf = log(N/n);
    }
    else {
        for (final TermStatistics stat : termStats) {
            n    = stat.docFreq();
            idf += log(N/n);
        }
    }

    float K[] = new float[256];
    for (int i = 0; i < K.length; i++) {
        dl   = decodeNorm((byte)i);
        K[i] = 1.0f;    // rigged to 1 in the template; see the K section
    }

    return new TFIDFWeight(collectionStats.field(), idf, adl, K);
}

K

NOTE: There is another version of TMPL.java, named TMPLe.java, that retains the bits of code that encode the K values. The discussion on encoding K that follows should be read alongside TMPLe.java.

As you may have noticed, the computation of K also takes place in computeWeight(), right after IDF. Mathematically, K is some function of 'dl' and 'adl', and for this exposition we represent that computation as:

K = f(dl, adl)

It so happens that Lucene stows away 'dl' for each document before score computation takes place, encoding each of these integers in a single byte; this is of course lossy, and is done for reasons of efficiency. Since the encoded 'dl' is a byte, it has only 256 possible values, and so does f(dl, adl). K is therefore implemented as a 256-element array of floating-point numbers holding all the possible values of f(dl, adl); later, K is used as a look-up table.

encodeNorm() is used elsewhere to pack 'dl' into a byte. In TFIDFScorer.score(), norms.get(doc) retrieves that byte, the encoded 'dl', for a particular doc. decodeNorm() unpacks the byte to a float using another look-up table, NORM[], which is populated with the floating-point representations of the 256 possible byte values. The unpacked byte, now a float, is then used to compute f(dl, adl), which is stored in K[i]. Later, when the normalization factor is needed, the array K is indexed by typecasting the byte value returned by norms.get(doc).

The pieces of code involved in this follow with bits of commentary:

Populate NORM with the 256 possible values and implement the packing and unpacking technique of your choice, making sure the two are inverse operations of each other:

private static final float[] NORM = new float[256];
static {
    for (int i = 0; i < 256; i++) {
        NORM[i] = SmallFloat.byte315ToFloat((byte)i);
    }
}

protected float decodeNorm(byte b)
{
    return NORM[b & 0xFF];
}

protected byte encodeNorm(int dl)
{
    return SmallFloat.floatToByte315(dl);
}
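
To see the lossiness of the packing, a quick round trip through SmallFloat (org.apache.lucene.util.SmallFloat) shows that the recovered 'dl' is only an approximation of the original:

int dl     = 1234;
byte b     = SmallFloat.floatToByte315(dl);   // pack into a single byte
float back = SmallFloat.byte315ToFloat(b);    // unpack; approximately, not exactly, 1234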

Compute the 256 possible values of 'K' by generating the 256 possible values of 'dl':

float K[] = new float[256];
for (int i = 0; i < K.length; i++) {
    dl = decodeNorm((byte)i);
    K[i] = log(dl / adl);
}
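
For a concrete instance of f(dl, adl): if the model were BM25, the same loop would fill the table with BM25's length-normalization factor. A sketch only; k1 and b are the usual BM25 parameters and are not defined in TMPL.java:

float k1 = 1.2f, b = 0.75f;
for (int i = 0; i < K.length; i++) {
    dl   = decodeNorm((byte)i);
    K[i] = k1 * ((1 - b) + b * dl / adl);
}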

Pick the particular value of K corresponding to the byte returned by norms.get(). 'norms' is of type NumericDocValues, which encapsulates a per-document numeric value. This is also the point where the weight is computed; score() is self-explanatory.

public class TFIDFScorer extends SimScorer
{
    private final TFIDFWeight bw;
    private final NumericDocValues norms;
    private final float[] K;

    TFIDFScorer(TFIDFWeight bw, NumericDocValues norms)
    throws IOException
    {
        this.bw    = bw;
        this.K     = bw.K;
        this.norms = norms;
    }

    public float score(int doc, float tf)
    {
        float w = tf / K[(byte)norms.get(doc) & 0xFF] * bw.idf;
        return w;
    }
}

Parsing & Indexing a Corpus

java -cp $LUCENEHOME/lib/* IndexTREC -index [w]
                                     -docs  [x]
                                     -stop  [y]
                                     -stem  [z]

[w] - Directory where the index will be placed.
[x] - Directory containing the document corpus.
[y] - A file containing the stop words.
[z] - The class that implements a particular stemming algorithm.

Code: IndexTREC

IndexTREC.java

Here is a rundown on modifying Lucene's indexing structures to process TREC data. Everything happens with this block of code in main(), after command-line arguments have been parsed and starting points determined:

Directory dir         = FSDirectory.open(Paths.get(indexPath));
TrecAnalyzer analyzer = new TrecAnalyzer(opt);
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer    = new IndexWriter(dir, iwc);
indexDocs(writer, docDir);
writer.close();

Functionality for parsing the corpus is implemented in these functions:

public static class DocVisitor extends SimpleFileVisitor<Path>
static void parseDoc(IndexWriter writer, Path file)
static void indexDocs(final IndexWriter writer, Path path)

A parser, parseDoc(), is plugged into DocVisitor.visitFile(). parseDoc() uses the Jsoup library to parse the mark-up (XML-like) in TREC documents and picks up all the content between any tags that may exist within a <DOC> tag (which is the unit of a TREC document), except the contents of the <DOCNO> tag, which it passes on to Lucene to be used as the unique identifier of the document. indexDocs() uses a DocVisitor object to walk a directory tree and gobble up files. From the programmer's point of view, the parser is the only bit of engineering she gets to do; the rest of the work is tying loose ends and plugging open ports. The block of code in main() is a series of transformations up to the point of handing over to indexDocs(), which sets the ball rolling:

indexDocs(IndexWriter(Directory, IndexWriterConfig(TrecAnalyzer(stop, stem))))
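
Since parseDoc() is where the actual engineering happens, here is a minimal sketch of it using Jsoup; the field names 'docno' and 'contents' and the exact Lucene field types are assumptions rather than a quote of the actual code:

static void parseDoc(IndexWriter writer, Path file) throws IOException
{
    String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    org.jsoup.nodes.Document soup = Jsoup.parse(raw);

    for (Element doc : soup.select("DOC")) {
        String docno = doc.select("DOCNO").text().trim();
        doc.select("DOCNO").remove();          // index everything under <DOC> except DOCNO

        Document d = new Document();
        d.add(new StringField("docno", docno, Field.Store.YES));
        d.add(new TextField("contents", doc.text(), Field.Store.NO));
        writer.addDocument(d);
    }
}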

Code: TrecAnalyzer

TrecAnalyzer.java

The constructor reads a list of stop words into memory and stows away the name of the stemmer object. Then in TokenStreamComponents the term-processing pipeline is set up, as before, in a series of transformations:

TokenStreamComponents(StemmerX(StopFilter(LowerCaseFilter(WhitespaceTokenizer()))))

This is a conceptual view, where I have quoted the actual object names from the code (except StemmerX, which stands for one of several stemmer objects from Lucene's libraries chosen at run-time using reflection), and it is not syntactically correct. Nonetheless it sketches the idea.
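
In actual Lucene code the chain amounts to something like the following inside TrecAnalyzer. A sketch only: the stemmer is hard-wired to PorterStemFilter here instead of being instantiated by reflection, and 'stopwords' is assumed to be the CharArraySet the constructor reads:

protected TokenStreamComponents createComponents(String fieldName)
{
    Tokenizer source   = new WhitespaceTokenizer();
    TokenStream stream = new LowerCaseFilter(source);
    stream             = new StopFilter(stream, stopwords);
    stream             = new PorterStemFilter(stream);
    return new TokenStreamComponents(source, stream);
}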

Retrieval

java -cp $LUCENEHOME/lib/* BatchSearch -index      [v]
                                       -queries    [w]
                                       -similarity [x]
                                       -stop       [y]
                                       -stem       [z]

[v] - Path to the directory that contains the index.
[w] - The file containing TREC queries.
[x] - The class name of Lucene's similarity model.
[y] - A file containing the stop words.
[z] - The class that implements a particular stemming algorithm.

Code: BatchSearch

BatchSearch.java

An implementation of search (retrieval) using a set of TREC queries. Its interface is not unlike IndexTREC's, in that command-line arguments are parsed to determine the starting points and the block of code in main() does everything. Again, other than parsing a TREC query and printing search results in TREC format (TREC runs), most of the work involves tying up loose ends.

IndexReader reader       = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher   = new IndexSearcher(reader);
searcher.setSimilarity(similarity);
TrecAnalyzer analyzer    = new TrecAnalyzer(opt);
SimpleQueryParser parser = new SimpleQueryParser(analyzer, field);

org.jsoup.nodes.Document soup;
String str, qid, txt;

str  = FileUtils.readFileToString(new File(queries));
soup = Jsoup.parse(str);

for (Element elm : soup.select("TOP")) {
    qid = elm.child(0).text().trim();
    txt = elm.child(1).text();
    Query query = parser.parse(txt);
    doBatchSearch(searcher, qid, query, simstr, NUMRET);
}

reader.close();

The view of it as a series of transformations, in sketchy pseudocode, looks like this:

i = IndexSearcher(IndexReader(DirectoryReader(index)))
i.setSimilarity(SimilarityX)
q = SimpleQueryParser(TrecAnalyzer(stop, stem)).parse()

doBatchSearch(i, q)
{
    result = i.search(q, M)
    prettyprint(result)
}

SimilarityX is the name of the similarity class chosen at run-time using reflection. M is the number of documents to retrieve, which is also the length of the search result (the run file).
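
doBatchSearch() itself is little more than a call to IndexSearcher.search() followed by printing the hits in the six-column TREC run format. A sketch, assuming the 'docno' field name used at indexing time:

static void doBatchSearch(IndexSearcher searcher, String qid, Query query,
                          String runtag, int numret) throws IOException
{
    TopDocs results = searcher.search(query, numret);
    ScoreDoc[] hits = results.scoreDocs;

    for (int i = 0; i < hits.length; i++) {
        Document doc = searcher.doc(hits[i].doc);
        // qid  Q0  docno  rank  score  runtag
        System.out.printf("%s Q0 %s %d %f %s%n",
                          qid, doc.get("docno"), i + 1, hits[i].score, runtag);
    }
}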