On the Structure and Organization of TREC data
Test Collections
x      query                   qrel     corpus
-----  ----------------------  -------  -------
task   routing        adhoc    adhoc    adhoc
-----  ----------------------  -------  -------
TREC1  1-50 (cd2)     51-100   51-100   cd12
TREC2  51-100 (cd3)   101-150  101-150  cd12
TREC3  101-150 (cd3)  151-200  151-200  cd12
TREC4                 201-250  201-250  cd23
TREC5                 251-300  251-300  cd24
TREC6                 301-350  301-350  cd45
TREC7                 351-400  351-400  cd45-cr
TREC8                 401-450  401-450  cd45-cr
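The adhoc part of the table above can be kept as a small lookup structure, which is handy when a script has to decide which disks to index for a given topic number. This is only a sketch: the names AdhocCollection, ADHOC and collection_for_topic are made up for illustration, and the routing column is left out.

```python
from typing import NamedTuple, Optional


class AdhocCollection(NamedTuple):
    trec: str          # TREC round, e.g. "TREC6"
    first_topic: int   # first adhoc topic number
    last_topic: int    # last adhoc topic number (inclusive)
    corpus: str        # disk combination as written in the table


# Rows copied from the adhoc columns of the table above.
ADHOC = [
    AdhocCollection("TREC1", 51, 100, "cd12"),
    AdhocCollection("TREC2", 101, 150, "cd12"),
    AdhocCollection("TREC3", 151, 200, "cd12"),
    AdhocCollection("TREC4", 201, 250, "cd23"),
    AdhocCollection("TREC5", 251, 300, "cd24"),
    AdhocCollection("TREC6", 301, 350, "cd45"),
    AdhocCollection("TREC7", 351, 400, "cd45-cr"),
    AdhocCollection("TREC8", 401, 450, "cd45-cr"),
]


def collection_for_topic(topic: int) -> Optional[AdhocCollection]:
    """Return the adhoc row covering this topic number, or None."""
    for row in ADHOC:
        if row.first_topic <= topic <= row.last_topic:
            return row
    return None
```

For example, collection_for_topic(420) returns the TREC8 row, while topics 1-50 return None because they were only used for routing.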
Document Corpus
cd1
  wsj   WSJ 1987, 1988, 1989
  fr    FR 1989
  ap    AP 1989
  doe   DOE
  ziff  ZF 1989, 1990
cd2
  wsj   WSJ 1990, 1991, 1992
  fr    FR 1988
  ap    AP 1988
  ziff  ZF 1989, 1990
cd3
  sjm   SJM 1991
  ap    AP 1990
  pat   PT 1983-1991
  ziff  ZF 1991, 1992
cd4
  ft    FT 1991-1994
  cr    CR 1993
  fr    FR 1994
cd5
  fbis  FBIS 1996
  lat   LA 1989, 1990
Document Structure
These three tags seem to be present in every collection: DOC, DOCNO, TEXT.
A title shows up under many different tags: TTL, TITLE, HEADLINE, H3, HT.
Another useful text block: SUMMARY.
Some TEXT sections are strewn with odd comment tags and other tags. In the table below, 'within+' denotes a TEXT section that contains one or more such tags.
TREC document structure table
cd1 cd2 cd3 cd4 cd5
doe
ap HEAD+ HEAD+ HEAD
fr within+ within+ within+
wsj HL HL
ziff TITLE TITLE TITLE
SUMMARY SUMMARY
patents TTL
sjm LEADPARA
SECTION
HEADLINE
cr TTL
ft HEADLINE
fbis H3 (within+)
HT (within+)
la HEADLINE
within+
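A liberal parser along these lines could be sketched as follows. It is a minimal illustration, not a reference implementation: it assumes tags appear in upper case without attributes, that the comment tags are SGML-style `<!-- ... -->` comments, and that regular expressions are good enough for the job. The names parse_docs, field and strip_tags are invented for this example, and TITLE_TAGS only covers the title tags listed in the prose above.

```python
import re
from typing import Dict, Iterator

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)

# Title-like tags named in the notes above; extend as needed.
TITLE_TAGS = ("TTL", "TITLE", "HEADLINE", "H3", "HT")


def field(doc: str, tag: str) -> str:
    """Concatenate the contents of every <tag>...</tag> pair in doc."""
    parts = re.findall(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return " ".join(p.strip() for p in parts)


def strip_tags(text: str) -> str:
    """Drop comments and any remaining tags, then normalize whitespace."""
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)
    return " ".join(text.split())


def parse_docs(raw: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <DOC> with its DOCNO, title and TEXT content."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # not a usable document without an identifier
        titles = [field(body, tag) for tag in TITLE_TAGS]
        yield {
            "docno": docno.group(1),
            "title": strip_tags(" ".join(t for t in titles if t)),
            "text": strip_tags(field(body, "TEXT")),
        }


# A made-up document used only to exercise the sketch.
SAMPLE = (
    "<DOC>\n<DOCNO> X-1 </DOCNO>\n<HEADLINE>\nA title\n</HEADLINE>\n"
    "<TEXT>\n<!-- note -->Some <P>body</P> text.\n</TEXT>\n</DOC>\n"
)
```

Stripping tags inside TEXT rather than rejecting them is what keeps the 'within+' collections usable with the same code path as the clean ones.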
Query Structure
YEAR/TAG head num dom title desc smry narr con fac nat def
1-100 x x x x x x x x x
101-150 x x x x x x x x x x
151-200 x x x x
201-250 x x
251-300 x x x x
301-350 x x x x
351-400 x x x x
401-450 x x x x
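Because the set of fields changes from one topic range to the next, a topic reader is easier to maintain if it collects whatever fields it finds instead of expecting a fixed list. The sketch below assumes a layout that is not shown in these notes: each topic is wrapped in `<top>...</top>`, and each field starts with an opening tag that has no closing tag, so a field runs until the next tag. The function name parse_topics and the sample topic are hypothetical.

```python
import re
from typing import Dict, List

TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
# A field is "<tag>" followed by everything up to the next "<tag>" or the end.
FIELD_RE = re.compile(r"<(\w+)>(.*?)(?=<\w+>|\Z)", re.DOTALL)


def parse_topics(raw: str) -> List[Dict[str, str]]:
    """Return one dict per topic, mapping each tag found to its text."""
    topics = []
    for match in TOP_RE.finditer(raw):
        fields = {}
        for tag, value in FIELD_RE.findall(match.group(1)):
            # Labels such as "Number:" are kept; strip them later if unwanted.
            fields[tag] = " ".join(value.split())
        topics.append(fields)
    return topics


# A made-up topic, not taken from any TREC topic file.
SAMPLE_TOPIC = (
    "<top>\n<num> Number: 999\n<title> example topic\n"
    "<desc> Description:\nAn invented description\nover two lines.\n</top>\n"
)
```

Comparing the keys of each returned dict against the table above is a quick way to confirm that a topic file has the fields expected for its range.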
Empty Documents
Depending on how you configure a search engine's parser (which tag
contents to pick, etc.) documents may end up being empty, having no
usable content. I usually make parsers as liberal as possible so that
everything within a
Even with the most liberal parser, some documents turn out to be truly empty; I have found three so far. Two of these are not literally empty but fall prey to the tokenizer and stemmer:
File              DOCNO
cd1/doe/doe1_096  DOE1-96-1081
cd1/doe/doe2_013  DOE2-13-0573
cd1/doe/doe2_051  DOE2-51-1160
cd1/doe/doe1_096
<DOC>
<DOCNO> DOE1-96-1081 </DOCNO>
<TEXT>
</TEXT>
</DOC>
cd1/doe/doe2_013
<DOC>
<DOCNO> DOE2-13-0573 </DOCNO>
<TEXT>
None.
</TEXT>
</DOC>
cd1/doe/doe2_051
<DOC>
<DOCNO> DOE2-51-1160 </DOCNO>
<TEXT>
None.
</TEXT>
</DOC>
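The difference between a document that is empty on disk and one that only becomes empty after analysis can be checked with a few lines of code. This is a sketch under a stated assumption: the analyzer here lowercases, splits on non-alphanumeric characters, and drops "none" as a stopword. That stopword list is hypothetical; whether "None." survives depends entirely on the tokenizer, stopword list and stemmer actually in use.

```python
import re
from typing import List

# Hypothetical stopword list; a real analyzer may or may not drop "none".
STOPWORDS = {"none"}


def text_of(doc: str) -> str:
    """Return the raw content of the first <TEXT> element, or ''."""
    match = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.DOTALL)
    return match.group(1) if match else ""


def is_raw_empty(doc: str) -> bool:
    """True if TEXT contains nothing but whitespace."""
    return text_of(doc).strip() == ""


def analyze(text: str) -> List[str]:
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def is_empty_after_analysis(doc: str) -> bool:
    """True if no index terms survive the toy analyzer."""
    return not analyze(text_of(doc))


# Two of the documents listed above, reproduced as strings.
DOE1 = "<DOC>\n<DOCNO> DOE1-96-1081 </DOCNO>\n<TEXT>\n</TEXT>\n</DOC>"
DOE2 = "<DOC>\n<DOCNO> DOE2-13-0573 </DOCNO>\n<TEXT>\nNone.\n</TEXT>\n</DOC>"
```

Under these assumptions DOE1-96-1081 is empty in both senses, while DOE2-13-0573 has raw content but yields no index terms.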
Very Long Terms
Documents may have very long terms like this one from document LA072290-0141 in the CD5 LA Times sub-collection:
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch
This is the name of a village in Wales; see the Wikipedia page about Llanfairpwllgwyngyll.
It is therefore recommended that you neither allocate only a small number of bytes for tokens or terms when building parsers, nor mistake such oddities for parser errors.
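One way to act on this advice is to measure term lengths before choosing any limit. The sketch below lists the longest alphabetic terms in a piece of text so they can be inspected by hand. It assumes a simple letters-only tokenizer, and the function name longest_terms is made up for this example.

```python
import re
from typing import List


def longest_terms(text: str, n: int = 3) -> List[str]:
    """Return the n longest distinct alphabetic terms, longest first."""
    terms = set(re.findall(r"[A-Za-z]+", text))
    # Sort by descending length, then alphabetically for a stable order.
    return sorted(terms, key=lambda t: (-len(t), t))[:n]


VILLAGE = "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"
SAMPLE_TEXT = "The village of " + VILLAGE + " is in Wales."
```

Running longest_terms over a whole sub-collection gives a concrete upper bound to size buffers against, and the output makes it easy to tell genuine long words from tokenizer mistakes.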