The University of Virginia Information Retrieval Group
About the Testbeds:
The testbed files describe the decomposition of documents into sites.
Each line in the file associates one document with its site.
The syntax is <document_id><site_id>.
Comments in the file are delimited by <COMMENT> at the beginning of the line.
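The file format described above can be read with a few lines of code. The following is a minimal sketch, not the group's own tooling. It assumes that the two fields on each line are separated by whitespace (the separator is not stated here), and the function name parse_testbed_lines and the sample identifiers are hypothetical.

```python
from collections import defaultdict

# Marker that opens a comment line, as described in the text.
COMMENT_MARKER = "<COMMENT>"


def parse_testbed_lines(lines):
    """Group document ids by site id.

    Assumption: each data line holds exactly two whitespace-separated
    fields, the document id followed by the site id.
    """
    sites = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(COMMENT_MARKER):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"unexpected line format: {line!r}")
        doc_id, site_id = parts
        sites[site_id].append(doc_id)
    return dict(sites)


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    sample = [
        "<COMMENT> sample testbed fragment",
        "DOC-1 SITE-A",
        "DOC-2 SITE-A",
        "DOC-3 SITE-B",
    ]
    for site, docs in parse_testbed_lines(sample).items():
        print(site, len(docs))
```

If the real files use a different separator, only the split call needs to change.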
The testbed files have all been compressed with gzip. On Unix, some
browsers drop the .gz suffix when the files are downloaded.
In that case you must rename the file to restore the .gz suffix so that gunzip will
uncompress it. On Windows, WinZip will properly uncompress the gzipped file.
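As an alternative to renaming the file, the compressed data can be read directly. The sketch below uses Python's standard gzip module, which detects gzip data from the file contents rather than from the file name. The function name read_gzipped_lines is hypothetical, and the ASCII encoding is an assumption about the testbed files.

```python
import gzip


def read_gzipped_lines(path, encoding="ascii"):
    """Return the lines of a gzip-compressed text file.

    The file name does not need to end in .gz.
    Assumption: the file is plain ASCII text; undecodable bytes
    are replaced rather than raising an error.
    """
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle]
```

This works the same way on Unix and Windows, since no external decompression program is involved.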
This testbed uses the data from TREC disks 1, 2, and 3.
DOE (disk 1) is not used at all, and the ZIFF documents
on disk 3 are not used.
This testbed is designed so that each site represents
one month of information from a given source.
There are 236 sites in total, which vary considerably in size.
Some of the PATN sites have < 10 documents, while some
of the AP sites have > 8000 documents.
The different sources represent the following number of
sites: AP - 35, FR - 22, PATN - 92, SJM - 12, WSJ - 51,
ZIFF - 24.
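The per-source counts above can be checked against the stated total of 236 sites. This short sketch only restates the numbers given in the text; the variable names are illustrative.

```python
# Sites per source in the month-based decomposition, as listed above.
sites_per_source = {
    "AP": 35,
    "FR": 22,
    "PATN": 92,
    "SJM": 12,
    "WSJ": 51,
    "ZIFF": 24,
}

total_sites = sum(sites_per_source.values())
largest_source = max(sites_per_source, key=sites_per_source.get)

print(total_sites)     # 236
print(largest_source)  # PATN
```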
This testbed uses the data from TREC disks 1, 2, and 3.
DOE (disk 1) is not used at all, and the ZIFF documents
on disk 3 are not used.
This testbed is designed so that all sites have about
the same number of documents (~3000) and the total number
of sites is the same as our original decomposition (236).
A given site contains documents from only one source, and
is a continuous stream of that source so that the
sequential nature of the source is preserved.
The different sources represent the following number of
sites: AP - 84, FR - 15, PATN - 2, SJM - 31, WSJ - 59,
ZIFF - 45.
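One way to picture this kind of decomposition is to cut each source's document sequence into consecutive, nearly equal-sized pieces. The sketch below illustrates that idea only. It is not necessarily the procedure used to build the testbed, and the function name split_into_sites and the sample identifiers are hypothetical.

```python
def split_into_sites(doc_ids, n_sites):
    """Split an ordered list of document ids into n_sites contiguous chunks.

    Chunk sizes differ by at most one, and the original order is
    preserved both within and across chunks.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    base, extra = divmod(len(doc_ids), n_sites)
    sites = []
    start = 0
    for i in range(n_sites):
        # The first `extra` chunks each take one additional document.
        size = base + (1 if i < extra else 0)
        sites.append(doc_ids[start:start + size])
        start += size
    return sites


if __name__ == "__main__":
    # Made-up document ids standing in for one source's stream.
    docs = [f"DOC-{i}" for i in range(10)]
    for number, site in enumerate(split_into_sites(docs, 3), start=1):
        print(number, site)
```

Because every chunk is a contiguous slice, concatenating the sites in order gives back the original stream, which is the property the text describes as preserving the sequential nature of the source.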