TRECBOX

TRECBOX is a tool that provides an abstraction for specifying the index-retrieve-evaluate pipeline of a typical IR experiment. It drives other search systems over TREC data.

DOWNLOADS

Download the README.txt
Download the code: TRECBOX-1.0
Github repository: TRECBOX
Settings file: settings.txt
Experiment specs: exp-ltr.txt, exp-ttr.txt

TTR

TTR is Terrier-4.0 with some additions and modifications for doing IR experiments on TREC data. It is distributed to augment Terrier with better documentation; see NOTES.txt.

More often than not, search engines like these are used as black boxes in experiments, and the lack of documentation describing the system internals makes it hard to interpret results or debug experiments. The notes collected here are an attempt to look under the hood and help the experimenter become a more informed user of this tool.

DOWNLOADS

Download the README.txt
Download the code: TTR-1.0
Github repository: TTR
Download the NOTES.txt
Download Terrier-4.0

TERRIER-4.0 REFERENCES
  1. Settings for indexing TREC CD 1 & 2.
  2. Settings for indexing TREC CD 4 & 5.
  3. Settings recommended for indexing all text within the DOC tag of a TREC document. See the Javadoc comment block preceding the 'TagSet' class definition.
  4. Stop word list.
  5. Stemmer implementations available.
  6. S-Stemmer implementation.
  7. A vague term frequency normalization constant mentioned in the 'Weighting Models and Parameters' section.
  8. org.terrier.matching.models; Terrier-4.0 model list.

LTR

LTR is a mod of Apache Lucene (5.4.0) for processing TREC data. A handful of classes were extended to implement TF-IDF models and provide the facility to parse TREC test-collections. NOTES.txt is a collection of notes on using Lucene.
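
As a rough illustration of the kind of TF-IDF style model referred to above (the run names in the results include bm25), here is a minimal Python sketch of one common textbook form of the BM25 term weight. It is not taken from the LTR code and may differ from what LTR or Lucene actually compute; the function name, the parameter names, and the default values k1=1.2 and b=0.75 are assumptions made for this example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    Hypothetical helper for illustration only (not part of LTR or Lucene).
    tf          : term frequency in the document
    doc_len     : length of the document in tokens
    avg_doc_len : average document length in the collection
    df          : number of documents containing the term
    num_docs    : number of documents in the collection
    """
    # Inverse document frequency; the "1 +" keeps the value non-negative.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Document length normalization: equals 1 for an average-length document.
    norm = 1 - b + b * doc_len / avg_doc_len
    # Saturating term-frequency component.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


# A document score is the sum of the term scores over the query terms.
score = sum(
    bm25_term_score(tf, doc_len=120, avg_doc_len=100, df=df, num_docs=2250)
    for tf, df in [(3, 40), (1, 500)]
)
```

The constant b controls how strongly long documents are penalized, and k1 controls how quickly repeated occurrences of a term stop adding to the score.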

DOWNLOADS

Download the README.txt
Download the code: LTR-1.0
Github repository: LTR
Download the NOTES.txt
Download Lucene-5.4.0

LUCENE-5.4.0 REFERENCES
  1. org.apache.lucene.analysis.en; list of stemmers.
  2. org.apache.lucene.search.similarities; list of retrieval models.
  3. Lucene's scoring.
  4. NumericDocValues; the object that stores a per-document normalization factor.

Sample Data

TREC test-collection: ap.tgz

TEST-COLLECTION STATS
2250 Documents from the Associated Press (on TREC DISK 3).
20   Queries from TREC-4 (selected from Query IDs 201-250).
167  Relevance judgments.

EVALUATION RESULTS
LTR
---------------------------------    
RUN                        MAP
---------------------------------
DEMO.a.s.bm25.20.D.x       0.4814
DEMO.a.s.bm25L.20.D.x      0.4335
DEMO.a.s.bm25e.20.D.x      0.4766
DEMO.a.s.tmpl.20.D.x       0.2402
DEMO.a.s.tmple.20.D.x      0.2402
---------------------------------

TTR
---------------------------------    
RUN                        MAP
---------------------------------
DEMO.a.s.bm25.20.D.x       0.4728
DEMO.a.s.tf_idf.20.D.x     0.4732
DEMO.a.s.tmpl.20.D.x       0.2141
---------------------------------
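
The MAP column in the tables above is mean average precision. As a reminder of how that number is defined, here is a minimal Python sketch of the standard computation. It is not the evaluation code used by TRECBOX; the function names and the toy run and judgments are made up for this example, and evaluation tools can differ in which queries they average over.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against one set of relevant docs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc occurs.
            precision_sum += hits / rank
    # Relevant docs that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of average precision over judged queries with at least one relevant doc."""
    query_ids = [q for q, rel in qrels.items() if rel]
    if not query_ids:
        return 0.0
    total = sum(average_precision(run.get(q, []), qrels[q]) for q in query_ids)
    return total / len(query_ids)


# Toy data (hypothetical document and query IDs).
toy_run = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
toy_map = mean_average_precision(toy_run, toy_qrels)
```

In this sketch a judged query that is missing from the run counts as an average precision of zero, which is one of several possible conventions.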