- Webber W, Moffat A, Zobel J. Score standardization for inter-collection comparison of retrieval systems. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval 2008 Jul 20 (pp. 51-58). ACM.
SxQ is a matrix of scores with systems on the rows and queries on the columns. F = f(SxQ) generates standardization parameters from the matrix. g(F, SxQ) standardizes the scores in the matrix.
The paper argues that score standardization is useful because it places the performance of systems on different query-sets on one scale, so that they can be compared. Standardization parameters are derived from one SxQ matrix and applied either to that matrix itself or to another matrix with a different S or Q.
The standardized score of a system on a query is its distance, in standard-deviation units, from the mean score that the set of systems achieves on that query.
So the standardized scores in an SxQ matrix show the relative performance of a set of systems on each query. Like replacing scores with ranks, this preserves the order of the systems for a query, but it also keeps the relative sizes of the gaps between them.
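The definitions of f and g above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names `fit_params` and `standardize`, the example scores, and the use of the sample standard deviation (`statistics.stdev`) are all assumptions made here.

```python
from statistics import mean, stdev

def fit_params(matrix):
    """f(SxQ): per-query (mean, standard deviation) over the systems.

    `matrix` is a list of rows, one row per system, one column per query.
    """
    columns = zip(*matrix)  # transpose: one tuple of scores per query
    return [(mean(col), stdev(col)) for col in columns]

def standardize(params, matrix):
    """g(F, SxQ): replace each score with its z-score for that query.

    Raises ZeroDivisionError if all systems tie on a query (sd of 0).
    """
    return [
        [(score - mu) / sigma for score, (mu, sigma) in zip(row, params)]
        for row in matrix
    ]

# Hypothetical scores: 3 systems (rows) x 2 queries (columns).
scores = [
    [0.10, 0.60],
    [0.20, 0.70],
    [0.30, 0.80],
]
params = fit_params(scores)
z = standardize(params, scores)
```

The second query is easier than the first in raw terms, yet after standardization each system gets the same value on both queries, because only its position relative to the other systems on that query remains.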
The questions to answer are:
- What does it mean to transfer parameters from SxQ to S'xQ', S'xQ or SxQ'?
- What should the characteristics of the systems in S be?
- How many systems should S have?
Corpus: TREC CD45-CR

Queries and systems:

-------------------------
T6   T7   T8   R3   R4
-------------------------
ST6  ST7  ST8  SR3  SR4
SR4  SR4  SR4  SR4
-------------------------

KEY
T6, T7, T8 = Query sets from TREC6, TREC7, TREC8.
R3, R4 = Query sets from the Robust tracks of TREC 2003 and 2004.
S* = Set of systems (submitted runs) from corresponding tracks.
Pij = An SxQ matrix formed from the system set S in row i and the query set Q in column j.
Fij = f(Pij) is the set of standardization parameters derived from Pij.
g(Fij, Pkl) standardizes Pkl using Fij.
The combinations of the matrix that parameters are derived from and the matrix they are applied to are:
- ij -> ij
- ij -> kj, i != k
- ij -> il, j != l
The corresponding comparisons are of three types:
1. Pij <-> g(Fij, Pij)
2. Pij <-> g(Fij, Pkj), i != k (system-sets change)
3. Pij <-> g(Fij, Pil), j != l (query-sets change)
Here the operator <-> represents either of two comparison techniques:
- Pearson's correlation
- Kendall's Tau for rank-correlation
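The two comparison techniques can be sketched with the standard library alone. The notes do not say exactly which vectors are compared, so this example assumes one value per system (for instance its mean raw score and its mean standardized score). The function names, the example numbers, and the choice of the tau-a variant are assumptions made here; SciPy's `scipy.stats.pearsonr` and `scipy.stats.kendalltau` are library alternatives.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical per-system means: raw scores and standardized scores.
raw_means = [0.21, 0.35, 0.28, 0.40]
std_means = [-0.9, 0.1, 0.4, 0.8]

r = pearson(raw_means, std_means)
tau = kendall_tau(raw_means, std_means)
```

Pearson's correlation is sensitive to how far apart the values are, while Kendall's tau only looks at whether each pair of systems is ordered the same way in both lists. In this example, one swapped pair out of six lowers tau without changing the overall linear trend much.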
Section 4 calls the behavior observed in comparison type 2 the 'robustness' or 'longevity' of standardization parameters. It tests whether parameters derived from one set of systems on a query-set can usefully standardize another set of systems on the same query-set.
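A small sketch of the type 2 setting, where the parameters come from one set of systems and are applied to another set on the same queries. The helper `standardize_with`, the example scores, and the sample standard deviation are assumptions made here. Showing the transferred result next to a self-fitted one is only an illustration of what changes, not the paper's procedure.

```python
from statistics import mean, stdev

def standardize_with(reference, target):
    """Standardize `target` using per-query mean/sd taken from `reference`.

    Both arguments are lists of system rows over the same queries,
    with the queries in the same column order.
    """
    params = [(mean(col), stdev(col)) for col in zip(*reference)]
    return [
        [(score - mu) / sigma for score, (mu, sigma) in zip(row, params)]
        for row in target
    ]

# Hypothetical data: two system sets scored on the same two queries.
reference = [[0.10, 0.60], [0.20, 0.70], [0.30, 0.80]]  # plays the role of Pij
new_runs = [[0.25, 0.50], [0.40, 0.90]]                 # plays the role of Pkj

transferred = standardize_with(reference, new_runs)  # g(Fij, Pkj)
self_fitted = standardize_with(new_runs, new_runs)   # g(Fkj, Pkj)
```

With transferred parameters, the new systems are placed relative to the reference systems, so their standardized scores need not average to zero on a query. When the parameters are fitted on the new systems themselves, they always do.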
Section 5 covers 'cross-collection comparisons', which correspond to comparison type 3, because type 3 studies the behavior of one set of systems on two different query-sets.