Notes on Using Terrier for IR Experiments


This is a link to a local copy of Terrier-4.0; all versions can be found on Terrier's download page.

$TERRIERHOME in subsequent text refers to the directory into which the software is unpacked.

Stop-Words

Terrier-4.0 is packaged with a file containing 733 stop-words, which can be found here:

$TERRIERHOME/share/stopword-list.txt

Enabling stop-word removal in Terrier-4.0 most probably makes use of this list.
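To see what the list actually contains before relying on it, a couple of shell helpers can be handy. This is only a sketch: it assumes the file holds one stop-word per line, and count_stopwords and is_stopword are names made up for this note, not Terrier commands.

```shell
# Sketch only: assumes one stop-word per line in the list file.
# Both function names are hypothetical helpers, not part of Terrier.

# count_stopwords FILE: print the number of non-empty lines in FILE.
count_stopwords() {
    grep -c . "$1"
}

# is_stopword FILE WORD: succeed if WORD is exactly one whole line of FILE.
is_stopword() {
    grep -Fxq -- "$2" "$1"
}
```

For example, count_stopwords "$TERRIERHOME/share/stopword-list.txt" prints the number of entries, and is_stopword "$TERRIERHOME/share/stopword-list.txt" the && echo yes tests a single word.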

Stemmers

Here are some popular stemming algorithms. Terrier-4.0 has many more, listed on this page:

http://terrier.org/docs/v4.0/javadoc/org/terrier/terms/Stemmer.html

PorterStemmer
WeakPorterStemmer
EnglishSnowballStemmer
SStemmer

SStemmer is an implementation of the S-Stemmer algorithm described in the paper "How effective is suffixing?" (1991).

The code is a useful guide to modifying stemming algorithms in Terrier-4.0.

$TERRIERHOME/src/core/org/terrier/terms/SStemmer.java
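To get a feel for what the S-Stemmer does before reading the Java, here is a sketch of the three suffix rules as they are usually quoted from that paper. It is not a port of SStemmer.java, which may differ in detail. Two assumptions are mine: only the first rule whose suffix test and exception test both pass is applied, and input is already lower-case. The name s_stem is made up.

```shell
# Sketch of the three S-stemmer suffix rules; NOT a port of SStemmer.java.
# Assumptions: lower-case input, and only the first rule whose suffix
# matches and whose exceptions do not match is applied.
s_stem() {
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    case $1 in
        *eies|*aies) ;;
        *ies) printf '%s\n' "${1%ies}y"; return ;;
    esac
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    case $1 in
        *aes|*ees|*oes) ;;
        *es) printf '%s\n' "${1%es}e"; return ;;
    esac
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    case $1 in
        *us|*ss) ;;
        *s) printf '%s\n' "${1%s}"; return ;;
    esac
    # No rule applied: return the word unchanged.
    printf '%s\n' "$1"
}
```

Calling s_stem ponies, s_stem horses and s_stem cats exercises rules 1, 2 and 3 in turn, while words such as glass and bus hit the rule 3 exceptions and come back unchanged.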

Models

The names of models are in the 'models' file, whose contents are shown below. The first column holds the names I make Terrier use to identify the Java classes listed in the second column. Since the class names would otherwise show up in the run tag, it was necessary to map them to shorter strings for readability. It is also useful to have descriptive strings for the cryptic names used in Terrier. The models file provides the facility to do that kind of mapping.

bm25        BM25
dfic        DFIC
dfiz        DFIZ
dfr_bm25    DFR_BM25
dfree       DFRee
dfreeklim   DFReeKLIM
dlh         DLH
dlh13       DLH13
dph         DPH
dirichletlm DirichletLM
dl          Dl
hiemstra_lm Hiemstra_LM
ifb2        IFB2
inb2        InB2
inl2        InL2
in_expb2    In_expB2
in_expc2    In_expC2
js_kls      Js_KLs
lemurtf_idf LemurTF_IDF
mdl2        MDL2
ml2         ML2
pl2         PL2
tf_idf      TF_IDF
tf          Tf
xsqra_m     XSqrA_M
tmpl        TMPL
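Because the file is just two whitespace-separated columns, a lookup from a script is straightforward. The helper below is a sketch under that assumption; model_class is a hypothetical name and is not part of Terrier or of this mod.

```shell
# model_class FILE NAME: print the class name (column 2) whose short
# name (column 1) equals NAME; exit non-zero when NAME is not listed.
# Sketch only: assumes two whitespace-separated columns per line.
model_class() {
    awk -v name="$2" '
        $1 == name { print $2; found = 1; exit }
        END        { exit !found }
    ' "$1"
}
```

For instance, model_class models dirichletlm looks up the class name paired with the short name dirichletlm in a file called models in the current directory.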

A full list of models available in Terrier-4.0 is to be found in the package org.terrier.matching.models:

http://terrier.org/docs/v4.0/javadoc/org/terrier/matching/models/package-summary.html

Modifying Models

TMPL.java is a no-frills version of a Java class that implements a retrieval model in Terrier-4.0. To implement a variation of BM25, say, BM25SIMPLE, you need to do the following:

Indexing

To index a TREC corpus, execute the Bash script with the following options:

bin/trec_terrier.sh -i \
        -Dcollection.spec=[r] \
        -Dterrier.index.path=[s] \
        -Dstopwords.filename=[t] \
        -Dtermpipelines=[u] \
        -DTrecDocTags.doctag=[v] \
        -DTrecDocTags.idtag=[w] \
        -DTrecDocTags.process=[x] \
        -DTrecDocTags.skip=[y] \
        -DTrecDocTags.casesensitive=[z]

where the configuration parameters are

[r] - A file listing the paths to each file of the corpus. Basically, this file contains the output of "find -L corpus/* -type f".

[s] - Path to the directory where the index will be put.

[t] - A file containing the stop-words.

[u] - s1,s2 (two comma-separated strings). 's1' is the string 'Stopwords', which tells Terrier to do stop-word removal during indexing. 's2' is the Java class name of the stemmer. To switch off either of these two steps, use the string 'NoOp' in place of s1 or s2. For example, 'NoOp,NoOp' means no stopping and no stemming.

[v] - The tag that encloses one TREC document. This is not to be confused with the files listed in the file specified using parameter [r]. Many TREC documents are kept concatenated in each of the files in that list, since it is inefficient to store millions of documents in as many files on disk.

[w] - The tag that encloses the unique identifier of a TREC document. Ground-truth information maps queries to documents using these identifier strings. Queries have unique identifiers too.

[x] - The mark-up tags in documents whose content is considered to be part of Terrier's view of a document. Set this to "TEXT,TITLE,HEAD,HL" to index TREC CD 1-2, or, "TEXT,H3,DOC,TITLE,HEADLINE,TTL" to index TREC CD 4-5. To be able to index all the text within all the children of the 'DOC' tag, which is useful as a one-shot solution to index CD 1-5, leave [x] and [y] (mentioned next) blank. See the references below for sources of this information.

[y] - Comma-separated list of tags to ignore.

[z] - Set to 'false'
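Put together, an indexing run might look like the sketch below. Every path in it is a placeholder, and the choice of PorterStemmer and of DOC and DOCNO as the document and identifier tags are assumptions picked for illustration. The process and skip lists are left blank, which is the index-everything setting described under [x]. Both function names are made up. Passing echo as the first argument prints the command instead of running it.

```shell
# Sketch: build the file list and assemble the indexing command.
# CORPUS, INDEX, SPEC and STOPWORDS are hypothetical paths; adjust them.
CORPUS=${CORPUS:-corpus}
INDEX=${INDEX:-index}
SPEC=${SPEC:-collection.spec}
STOPWORDS=${STOPWORDS:-share/stopword-list.txt}

# make_collection_spec DIR OUT: list every file under DIR, one per line.
make_collection_spec() {
    find -L "$1" -type f > "$2"
}

# index_corpus [PREFIX...]: run the indexing command, optionally prefixed.
# "index_corpus echo" is a dry run that only prints the command line.
index_corpus() {
    "$@" bin/trec_terrier.sh -i \
        "-Dcollection.spec=$SPEC" \
        "-Dterrier.index.path=$INDEX" \
        "-Dstopwords.filename=$STOPWORDS" \
        -Dtermpipelines=Stopwords,PorterStemmer \
        -DTrecDocTags.doctag=DOC \
        -DTrecDocTags.idtag=DOCNO \
        -DTrecDocTags.process= \
        -DTrecDocTags.skip= \
        -DTrecDocTags.casesensitive=false
}
```

A typical session would be make_collection_spec "$CORPUS" "$SPEC", then index_corpus echo to inspect the command, then index_corpus from within $TERRIERHOME to run it.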

Retrieval

To retrieve, run this script:

bin/trec_terrier.sh -r \
                    -q \
                    -c [i] \
                    -Dterrier.index.path=[j] \
                    -Dtrec.topics=[k] \
                    -DTrecQueryTags.doctag=[l] \
                    -DTrecQueryTags.idtag=[m] \
                    -DTrecQueryTags.process=[n] \
                    -DTrecQueryTags.skip=[o] \
                    -DTrecQueryTags.casesensitive=[p] \
                    -Dstopwords.filename=[q] \
                    -Dtermpipelines=[r] \
                    -Dtrec.model=[s] \
                    -Dquerying.postprocesses.controls=[t] \
                    -Dquerying.postprocesses.order=[u] \
                    -Dtrec.qe.model=[v] \
                    -Dexpansion.terms=[w] \
                    -Dexpansion.documents=[x] \
                    -Dtrec.results=[y] \
                    -Dtrec.results.file=[z]

where

[i] - A 'term-frequency normalization constant'. Voodoo! The table shows ranges of values to pick from for different similarity models and for the part of a TREC query that is used: Title (T) or Description (D).

     BM25     PL2  LM-dirichlet  TF_IDF
     -----------------------------------
  T  0.3-0.5  4-7  750-1000      0.3-0.5
  D  0.6-0.8  1-2  1500-2000     0.6-0.8
     -----------------------------------

On the page below, scroll down to the 'Weighting Models and Parameters' section where this constant is mentioned:

http://terrier.org/docs/v4.0/configure_retrieval.html

[j] - Path to the directory that contains the index.

[k] - The file that contains the queries.

[l] - The tag that encloses each query in the file containing queries.

[m] - The tag that contains a query's unique identifier.

[n] - The tags to process. This is ugly. Terrier stipulates that you specify tags in [l] and [m] (see above) and then mention them once again here to tell the system to use their contents.

[o] - The parts of the query to skip, specified by a comma-separated list of the names of the tags that enclose those parts. It so happens that Terrier needs to be told simultaneously what to process and what to skip. Saying 'process this, this and this' does not imply that the others are skipped. Terrier's documentation recommends specifying 'TOP,NUM,TITLE' for the '.process' property and 'DESC,NARR' for the '.skip' property in the situation where you only want to use the 'TITLE' part of a query.

[p] - false

[q] - A file containing the stop-words.

[r] - s1,s2 (two comma-separated strings). This is identical to the parameter [u] used for indexing.

[s] - The name of the retrieval model (i.e. the similarity algorithm).

[t] - "qe:QueryExpansion" To switch on query expansion. Voodoo.

[u] - "QueryExpansion" Some more voodoo.

[v] - The fully qualified Java class name of the query expansion algorithm. For example:

      org.terrier.matching.models.queryexpansion.Bo1

[w] - The number of terms to add to a query to expand it.

[x] - Number of top documents from the first round of search results to use for computing query expansion statistics.

[y] - The directory in which to place the files that contain the search results.

[z] - The name of the file that contains the ranked search results.
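As a concrete illustration, a title-only BM25 run with query expansion might be assembled as in the sketch below. The paths, the results file name, the value 0.4 for [i] (taken from the BM25 title range in the table), the expansion sizes of 10 terms and 3 documents, and the PorterStemmer pipeline are all assumptions chosen for the example. The tag settings follow the title-only recommendation quoted under [o], with TOP and NUM assumed as the enclosing and identifier tags. The function name retrieve is made up; passing echo as the first argument prints the command instead of running it.

```shell
# Sketch: assemble the retrieval command for a title-only BM25 run.
# INDEX, TOPICS, STOPWORDS and RESULTS are hypothetical paths.
INDEX=${INDEX:-index}
TOPICS=${TOPICS:-topics.txt}
STOPWORDS=${STOPWORDS:-share/stopword-list.txt}
RESULTS=${RESULTS:-results}

# retrieve [PREFIX...]: run the retrieval command, optionally prefixed.
# "retrieve echo" is a dry run that only prints the command line.
retrieve() {
    "$@" bin/trec_terrier.sh -r -q -c 0.4 \
        "-Dterrier.index.path=$INDEX" \
        "-Dtrec.topics=$TOPICS" \
        -DTrecQueryTags.doctag=TOP \
        -DTrecQueryTags.idtag=NUM \
        -DTrecQueryTags.process=TOP,NUM,TITLE \
        -DTrecQueryTags.skip=DESC,NARR \
        -DTrecQueryTags.casesensitive=false \
        "-Dstopwords.filename=$STOPWORDS" \
        -Dtermpipelines=Stopwords,PorterStemmer \
        -Dtrec.model=BM25 \
        -Dquerying.postprocesses.controls=qe:QueryExpansion \
        -Dquerying.postprocesses.order=QueryExpansion \
        -Dtrec.qe.model=org.terrier.matching.models.queryexpansion.Bo1 \
        -Dexpansion.terms=10 \
        -Dexpansion.documents=3 \
        "-Dtrec.results=$RESULTS" \
        -Dtrec.results.file=bm25_title.res
}
```

The term pipeline here should match whatever was used at indexing time, since stopping or stemming queries differently from documents makes terms fail to match.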

Configuration Pitfalls

Frequently Asked Questions

Q: Why use Terrier-4.0 when there are newer versions?

A: This exposition is confined to Terrier-4.0. Newer versions of Terrier have switched the build system to Maven and I don't have time to make that shift. Until a major Terrier release swings by, most of what I document here should remain relevant.

Q: Has this mod's version number got anything to do with Terrier's version number?

A: No. I have no plans to keep up with Terrier.