txt
This is a C implementation of a text search engine for use in information retrieval experiments.
Addressing the repeatability and reproducibility of experiments was a motivation for writing txt. In my experience, one factor that makes experiments hard to repeat or reproduce is the practice of using a search system as a black box. Abstruse documentation accompanying the software does not help either. Perhaps for some of these reasons, experiments reported in research papers often omit details of the experimental setup.
txt is designed to be a system that is fast, introspective[1] and comprehensible[2]. Hopefully, txt will help students and practitioners design and perform scientifically rigorous experiments, gain an understanding of the inner workings of a search system and establish a strong foundation for learning from experimentation.
[1] In that it can record statistics about its own workings.
[2] It comes with well-written documentation that also explains the design.
Prerequisite
Install the C library libk.
Compile
make
Tokenize Input Text
./raw2t -x -n -c TRECQUERY <q.txt >q.t
./raw2t -x -n -c TREC <d.txt >d.t
where
- TRECQUERY/TREC - Indicates the kind of parser to use.
- -x - Exclude words that are shorter or longer than set length limits.
- -n - Normalize words, that is, reduce them to their stem or lemma.
- -c - Exclude common words that occur in English text with a high frequency.
input
- d.txt - The document set in raw text.
- q.txt - The query set in raw text.
output
- d.t - Documents in tokenized form (a binary file).
- q.t - Query file in tokenized form.
- vocab.txt - The vocabulary of the document set, or, in other words, its set of unique words.
- docid.txt - The docid-to-docid mapping. The input documents carry unique identifiers, which are mapped to the unique identifiers (positive integers) that txt uses internally.
Prettyprint the tokenized file
./t2mem <d.t >d.mem
./t2mem <q.t >q.mem
where d.mem and q.mem are the prettyprinted, human-readable views of d.t and q.t, respectively.
Index & Retrieve
./ii -s q.t <d.t >res.txt
Builds the inverted index from the tokenized corpus and retrieves documents (-s, as in search), taking the tokenized documents and queries as input.
res.txt lists the documents retrieved for each query, along with the similarity score for each query-document pair.
Rank the Search Result
sort -k1,1 -k3,3nr res.txt >rank.txt
rank.txt contains the rows of res.txt sorted by score in descending order within each query.
Convert the Result
awk -f txt2trecrun.awk <rank.txt >run.txt
This step converts the ranked search result into run.txt, a standard format that evaluation tools commonly used in the IR community can read.
Evaluate
trec_eval -q qrel.txt run.txt >eval.txt
trec_eval is a popular tool for evaluating a retrieval run (a search result). Its inputs are the query-document relevance judgments (qrel.txt) and the formatted run file (run.txt). Its output, eval.txt, quantifies the quality of the search result using a large number of metrics.