txt
This is a C implementation of a text search engine for use in information retrieval experiments.
Addressing the repeatability and reproducibility of experiments was a motivation for writing txt. In my experience, one factor that makes experiments hard to repeat or reproduce is the practice of using a search system as a black box. Abstruse documentation accompanying the software does not help either. Perhaps for some of these reasons, experiments reported in research papers often omit details of the experimental setup.
txt is designed to be a system that is fast, introspective[1] and comprehensible[2]. Hopefully, txt will help students and practitioners design and perform scientifically rigorous experiments, gain an understanding of the inner workings of a search system and establish a strong foundation for learning from experimentation.
[1] In that it can record statistics about its own workings.
[2] It comes with well-written documentation that also explains the design.
Prerequisite
Install the C library libk.
Compile
make
Tokenize Input Text
./raw2t -x -n -c TRECQUERY <q.txt >q.t
./raw2t -x -n -c TREC <d.txt >d.t
where
- TRECQUERY/TREC - Indicates the kind of parser to use.
- -x - Exclude words that are shorter or longer than set length limits.
- -n - Normalize words, that is, reduce them to their stem or lemma.
- -c - Exclude common words that occur in English text with a high frequency.
input
- d.txt - The document set in raw text.
- q.txt - The query set in raw text.
output
- d.t - Documents in tokenized form (a binary file).
- q.t - Query file in tokenized form.
- vocab.txt - The vocabulary of the document set, or, in other words, its set of unique words.
- docid.txt - The docid-to-docid mapping. The input documents carry unique identifiers, which are mapped to the unique identifiers (positive integers) that txt uses internally.
Prettyprint the tokenized file
./t2mem <d.t >d.mem
./t2mem <q.t >q.mem
where d.mem and q.mem are the prettyprinted, human-readable views of d.t and q.t, respectively.
Index & Retrieve
./ii -s q.t <d.t >res.txt
Builds the inverted index from the tokenized corpus and retrieves documents (-s, as in search), taking the tokenized documents and queries as input.
res.txt lists the documents retrieved for each query, along with the similarity score for each query-document pair.
Rank the Search Result
sort -k1,1 -k3,3nr res.txt >rank.txt
rank.txt contains the rows of res.txt sorted by score in descending order within each query.
Convert the Result
awk -f txt2trecrun.awk <rank.txt >run.txt
This step converts the ranked search result into run.txt, a standard format that evaluation tools commonly used in the IR community can read.
Evaluate
trec_eval -q qrel.txt run.txt >eval.txt
trec_eval is a popular tool for evaluating a retrieval run (a search result). Its inputs are the query-document relevance judgments (qrel.txt) and the formatted run file (run.txt). Its output, eval.txt, quantifies the quality of the search result using a large number of metrics.