Writings
- Using Negative Information in Search February 2011
- Simple Transliteration for CLIR 2013
- Overview of FIRE 2011 2013
- Forum for Information Retrieval Evaluation November 2013
- A Method for Cross-collection Comparison November 2014
- Terrier Notes 2016
- Lucene Notes 2016
- Black Boxes are Harmful September 2016
- On tf-idf 2016
Using Negative Information in Search (with Sukomal Pal & Mandar Mitra) February 2011
Second International Conference on Emerging Applications of Information Technology, IEEE (February 2011), 53-56.
This was my introduction to the methods of setting up and running information retrieval experiments, and to writing an academic paper. I wrote this after I started working as a programmer at I.S.I. Kolkata's Information Retrieval Laboratory. I would then continue to work with IR experiments for the next six years.
Simple Transliteration for CLIR (with Prasenjit Majumder) 2013
Multilingual Information Access in South Asian Languages. Springer, Berlin, Heidelberg (2013), 241-251.
The acronym 'CLIR' in the title stands for 'Cross-Lingual Information Retrieval'. This paper accompanied a set of IR experiments (search results) submitted to the 'Forum for Information Retrieval Evaluation' (FIRE), a workshop held annually for the IR community in India. The IR Lab at I.S.I. Kolkata, where I worked, was also a co-organizer. One of FIRE's responsibilities is curating data sets in various Indian languages for IR research. At the annual workshop, students and researchers share the experimental results they produced using these data sets.
Overview of FIRE 2011 (with Prasenjit Majumder, Dipasree Pal, Ayan Bandyopadhyay, Mandar Mitra) 2013
Multilingual Information Access in South Asian Languages. Springer, Berlin, Heidelberg (2013), 1-12.
The overview paper, written after the FIRE workshop every year, summarizes the year's participation statistics, research directions and submissions.
Forum for Information Retrieval Evaluation November 2013
Text REtrieval Conference (TREC), National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA. (November 2012)
Poster PDF
In 2012 I moved from India to the U.S.A. to work in NIST's Retrieval Group. Like I.S.I.'s FIRE back in India, NIST's TREC was the annual meeting for IR researchers to share and study text retrieval methodologies. At its poster session I put up this poster to spread the word about FIRE.
A Method for Cross-collection Comparison (with Donna Harman & Ian Soboroff) November 2014
Text REtrieval Conference (TREC), National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA. (November 2014)
Poster PDF
This poster was a preview of a project on studying IR experiments using meta-analysis, a technique of applying inferential statistics. While working on this I took a detour to address the issues with reproducing IR experiments.
Terrier Notes 2016
This is my collection of notes on using Terrier-4.0 for IR experiments. It clarifies some things that are hard to understand from the documentation accompanying the software. Some of the bugs and pitfalls I point out are specific to Terrier-4.0. I wrote this in the summer of 2016, and newer versions of the software have appeared since then. Hopefully those bugs no longer exist. However, I still think these notes will continue to help disambiguate parts of the documentation.
Lucene Notes 2016
Together with Terrier Notes, these notes were written for the Lucene search system I was working with at the time.
Black Boxes are Harmful September 2016
Lucene4IR Workshop, University of Strathclyde, Glasgow, UK (8-9 September 2016). Report on the Lucene4IR Workshop, SIGIR Forum, 50, 2 (December 2016), 58-75.
By this time I was having qualms about the correctness of results produced by the search engines in use in the IR research community. Talking to people, I found that they relied on word-of-mouth heuristics to set up their experiment pipelines. Most of this detail was usually omitted from the papers reporting the experiments. Looking at the code, I found an incorrect implementation of the BM25 term-weighting equation in Terrier-4.0, a search engine used by many researchers. Researchers were also unaware of the optimizations in Apache's Lucene software that changed a document collection's document length distribution. This led me to spend the rest of my time in IR on repeatability and reproducibility.
On tf-idf 2016
Markdown document | Tables: SMART Notation PDF | Okapi BM Variants PDF | BM Constants PDF
I left NIST in August 2017, went back to Kolkata and then moved to Vancouver in February 2018. My time in IR had concluded and I would move on to computer graphics and systems programming, a long-cherished interest that I had not devoted enough time to in the years past. As closure, I collected these notes and software related to the issue of repeatability and reproducibility in IR. In these notes I trace the provenance of term-weighting equations (tf-idf equations in IR parlance) to put to rest, once and for all, any ambiguity about their structure and form. Some of the tables I used in the article are provided as separate PDF documents.