R eval framework

Papers

Cross-collection comparisons

  1. Webber W, Moffat A, Zobel J. Score standardization for inter-collection comparison of retrieval systems. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval 2008 Jul 20 (pp. 51-58). ACM.

Summary

Notation
SxQ is a matrix of scores, with systems on rows and queries on columns.
F = f(SxQ) generates standardization parameters from the matrix.  
g(F, SxQ) standardizes scores in the matrix.

The paper argues that score standardization is useful because it places the scores of systems evaluated on different query sets on a single scale, which makes them comparable. Standardization parameters are derived from an SxQ matrix and applied either to that same matrix or to another matrix with a different S or Q.

The standardized score of a system on a query is its distance from the mean score on that query, measured in standard deviations, where the mean and standard deviation are taken over a set of systems.

So the standardized scores in an SxQ matrix show the relative performance of a set of systems on each query. This is similar to replacing scores with ranks that order the systems for a query, except that standardized scores also retain the relative distances between systems.
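The notation above can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's implementation: the toy scores are hypothetical, and the use of the sample standard deviation (statistics.stdev) rather than the population one is an assumption of this sketch.

```python
from statistics import mean, stdev

def f(matrix):
    """Derive per-query parameters (mean, std) from an SxQ matrix.

    matrix is a list of rows: one row per system, one column per query.
    Assumption: sample standard deviation; a query on which every system
    scores the same would give std 0 and must be handled separately.
    """
    return [(mean(col), stdev(col)) for col in zip(*matrix)]

def g(params, matrix):
    """Standardize each score with the parameters of its query column."""
    return [
        [(score - mu) / sigma for score, (mu, sigma) in zip(row, params)]
        for row in matrix
    ]

# Hypothetical toy scores: 3 systems (rows) x 2 queries (columns).
P = [
    [0.10, 0.60],
    [0.20, 0.70],
    [0.30, 0.80],
]

F = f(P)     # one (mean, std) pair per query
Z = g(F, P)  # standardized scores, same shape as P
```

In this toy matrix the second query is easier than the first for every system, but after standardization both columns are on the same scale, and the ordering of the systems within each query is unchanged.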

The questions to answer are:

What does it mean to transfer parameters from SxQ to S'xQ', S'xQ or SxQ'?
What should the characteristics of the systems in S be?
How many systems should S have?

Experiment data

Corpus: TREC CD45-CR

Queries and systems:

-------------------
T6  T7  T8  R3  R4
-------------------
ST6 ST7 ST8 SR3 SR4
SR4 SR4 SR4 SR4
-------------------

KEY
T6, T7, T8 = Query sets from TREC6, TREC7, TREC8.
    R3, R4 = Query sets from the Robust tracks of TREC 2003 and 2004.
        S* = Set of systems (submitted runs) from corresponding tracks.

Experiments

Pij = An SxQ matrix formed by using the S from row i and the Q from column j.

Fij = f(Pij) is the set of standardization parameters derived from
      Pij.

g(Fij, Pkl) standardizes Pkl using Fij.

The combinations of matrices that parameters are derived from and applied to are:

ij -> ij
ij -> kj i != k
ij -> il j != l

The corresponding comparisons are of three types:

1. Pij <-> g(Fij, Pij)
2. Pij <-> g(Fij, Pkj) i != k (system-sets change)
3. Pij <-> g(Fij, Pil) j != l (query-sets change)

Here the operator <-> represents either of two comparison techniques.

What Section 4 terms the 'robustness' or 'longevity' of standardization parameters is the behavior observed in comparison type 2. That is, it tests whether parameters derived from one set of systems on a query-set can usefully standardize another set of systems on the same query-set.
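As a rough illustration of the type 2 setting, the sketch below derives parameters from one set of systems and applies them to a different set of systems on the same queries, i.e. g(Fij, Pkj). The matrices P_ij and P_kj and their scores are hypothetical, the sample standard deviation is an assumption, and averaging a system's standardized scores over queries is used here only as a simple summary.

```python
from statistics import mean, stdev

def f(matrix):
    """Per-query (mean, std) parameters from an SxQ matrix (rows = systems)."""
    return [(mean(col), stdev(col)) for col in zip(*matrix)]

def g(params, matrix):
    """Standardize each score with the parameters of its query column."""
    return [
        [(score - mu) / sigma for score, (mu, sigma) in zip(row, params)]
        for row in matrix
    ]

# Hypothetical reference systems (row i) on a 2-query set (column j).
P_ij = [
    [0.10, 0.60],
    [0.20, 0.70],
    [0.30, 0.80],
]

# Hypothetical new systems (row k) on the same 2 queries.
P_kj = [
    [0.25, 0.50],
    [0.40, 0.90],
]

F_ij = f(P_ij)               # parameters from the reference systems
transferred = g(F_ij, P_kj)  # g(Fij, Pkj): new systems, borrowed parameters

# One summary number per new system: its mean standardized score.
system_means = [mean(row) for row in transferred]
```

Because the parameters come from the reference systems only, the new systems' standardized scores need not have zero mean or unit variance on a query; how well the borrowed parameters still serve is what this comparison type examines.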

Observations from comparison type 3 are termed 'cross-collection comparisons', the subject of Section 5, because type 3 studies the behavior of one set of systems on two different query-sets.