# Papers

## Cross-collection comparisons

- Webber W, Moffat A, Zobel J. Score standardization for inter-collection comparison of retrieval systems. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval 2008 Jul 20 (pp. 51-58). ACM.

#### Summary

##### Notation

```
SxQ is a matrix of scores, with systems on rows and queries on columns.
F = f(SxQ) generates standardization parameters from the matrix.
g(F, SxQ) standardizes scores in the matrix.
```

The paper argues that score standardization makes it possible to compare the performance of systems on different query sets by placing their scores on one scale. Standardization parameters are derived from an SxQ matrix and applied either to that matrix itself or to another matrix with a different S or Q.

The standardized score of a system on a query is its distance from the mean score on that query, in units of standard deviation, where the mean and standard deviation are taken over the set of systems.

So, the standardized scores in an SxQ matrix show the relative performance of a set of systems on each query. Like replacing scores with ranks, this preserves the ordering of the systems on a query, but it also retains the size of the differences between them.
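The functions f and g from the notation can be sketched as follows. This is a minimal illustration, not the paper's code: the names `fit_params` and `standardize`, the example scores, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

def fit_params(scores):
    """f(SxQ): per-query (mean, standard deviation) over the systems.

    `scores` is a list of rows, one row per system, one column per query.
    """
    columns = zip(*scores)  # one tuple of system scores per query
    return [(mean(col), pstdev(col)) for col in columns]

def standardize(params, scores):
    """g(F, SxQ): express each score in standard-deviation units."""
    return [
        # A query on which all systems score the same has no spread;
        # mapping it to 0.0 is an assumption made for this sketch.
        [(x - mu) / sd if sd > 0 else 0.0 for x, (mu, sd) in zip(row, params)]
        for row in scores
    ]

# Hypothetical scores: 3 systems (rows) on 2 queries (columns).
P = [[0.10, 0.60],
     [0.20, 0.70],
     [0.30, 0.80]]
F = fit_params(P)
Z = standardize(F, P)
```

In this made-up example the second query is easier for every system, yet both columns of `Z` come out the same, because each score is measured relative to the other systems on the same query.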

The questions to answer are:

What does it mean to transfer parameters from SxQ to S'xQ', S'xQ or SxQ'?

What should the characteristics of the systems in S be?

How many systems should S have?

#### Experiment data

```
Corpus: TREC CD45-CR
Queries and systems:
-------------------
T6 T7 T8 R3 R4
-------------------
ST6 ST7 ST8 SR3 SR4
SR4 SR4 SR4 SR4
-------------------
KEY
T6, T7, T8 = Query sets from TREC6, TREC7, TREC8.
R3, R4 = Query sets from the Robust tracks of TREC 2003 and 2004.
S* = Set of systems (submitted runs) from corresponding tracks.
```

#### Experiments

```
Pij = The SxQ matrix formed from the S in row i and the Q in column j.
Fij = f(Pij) is the set of standardization parameters derived from
Pij.
g(Fij, Pkl) standardizes Pkl using Fij.
```

The combinations of matrices from which parameters are derived and to which they are applied are:

```
ij -> ij
ij -> kj i != k
ij -> il j != l
```

The corresponding comparisons are of three types:

```
1. Pij <-> g(Fij, Pij)
2. Pij <-> g(Fij, Pkj) i != k (system-sets change)
3. Pij <-> g(Fij, Pil) j != l (query-sets change)
```

Here the operator <-> represents either of two comparison techniques:

- Pearson's correlation
- Kendall's Tau for rank-correlation
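Both measures can be computed directly from two lists of scores. The sketch below assumes the compared quantities are one summary score per system; the function names and the example numbers are hypothetical, and the tau shown is the simple tau-a variant, which does not correct for ties.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson's correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(x, y), 2):
        sign = (a1 - a2) * (b1 - b2)
        if sign > 0:
            concordant += 1   # both lists order this pair the same way
        elif sign < 0:
            discordant += 1   # the lists disagree on this pair
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-system scores before and after standardization.
raw = [0.10, 0.20, 0.30, 0.25]
std = [-1.0, 0.1, 0.5, 0.9]
r = pearson(raw, std)
tau = kendall_tau(raw, std)
```

Pearson's correlation is sensitive to the sizes of the score differences, while Kendall's tau only looks at whether pairs of systems are ordered the same way in both lists.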

Section 4 uses the terms 'robustness' and 'longevity' of standardization parameters for the behavior observed in comparison type 2. This type tests whether parameters derived from one set of systems on a query-set can usefully standardize another set of systems on the same query-set.
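A type 2 transfer can be sketched as below: parameters are fitted on one system set and applied to a different system set scored on the same queries. The helper names, the data, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def fit_params(scores):
    """Per-query (mean, standard deviation) over the rows (systems)."""
    return [(mean(col), pstdev(col)) for col in zip(*scores)]

def standardize(params, scores):
    """Standardize each row using externally supplied per-query parameters."""
    return [[(x - mu) / sd for x, (mu, sd) in zip(row, params)]
            for row in scores]

# Hypothetical data: two system sets scored on the same two queries.
P_ij = [[0.10, 0.60], [0.20, 0.70], [0.30, 0.80]]  # reference systems
P_kj = [[0.15, 0.50], [0.35, 0.90]]                # new systems

F_ij = fit_params(P_ij)           # parameters from the reference systems
Z_kj = standardize(F_ij, P_kj)    # applied to the new systems
system_means = [mean(row) for row in Z_kj]
```

Because the parameters come from a different system set, the columns of `Z_kj` need not have zero mean or unit spread; how well such transferred parameters still serve is what this comparison type examines.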

Observations from comparison type 3 concern 'cross-collection comparisons', which are the subject of Section 5. The name applies because type 3 studies the behavior of a set of systems on two different query-sets.