Testbeds

The University of Virginia Information Retrieval Group

About the Testbeds:
The testbed files describe the decomposition of documents into sites.
Each line in the file associates one document with its site.
The syntax is <document_id><site_id>.
Comments in the file are delimited by <COMMENT> at the beginning of the line.
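
As a rough illustration of the format described above, the following Python sketch parses mapping lines into a dictionary. It assumes that the document ID and site ID are separated by whitespace and that comment lines begin with the literal string <COMMENT>. The function name parse_testbed_lines and the sample IDs are hypothetical and are not taken from the testbed files.

```python
COMMENT_MARKER = "<COMMENT>"  # assumed literal prefix of comment lines


def parse_testbed_lines(lines):
    """Map each document ID to its site ID.

    Assumes whitespace-separated "<document_id> <site_id>" pairs.
    """
    doc_to_site = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(COMMENT_MARKER):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"unexpected line format: {raw!r}")
        doc_id, site_id = parts
        doc_to_site[doc_id] = site_id
    return doc_to_site


# Hypothetical sample; real IDs in the testbed files may look different.
sample = [
    "<COMMENT> example mapping",
    "DOC-0001 site_001",
    "DOC-0002 site_001",
    "DOC-0003 site_002",
]
mapping = parse_testbed_lines(sample)
```

If the real files use a different separator, only the split step would need to change.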

The testbed files have all been compressed with gzip. On Unix, some browsers drop the .gz suffix when the files are downloaded. In that case, you must rename the file to include the .gz suffix so that gunzip will uncompress it. On Windows, WinZip will properly uncompress the gzipped file.
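
If it is unclear whether a downloaded file lost its suffix, one option is to check for the gzip magic number (the bytes 0x1f 0x8b) and decompress with Python's gzip module, which does not depend on the file name. This is only a sketch; the helper names is_gzipped and decompress are hypothetical.

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def decompress(src_path, dst_path):
    """Decompress a gzip file to `dst_path`, whatever the source file is named."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```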

Testbed 1: (SYM-236) trec123-236-by_source-by_month

This testbed uses the data from TREC disks 1, 2, and 3.
DOE (disk 1) is not used at all, and the ZIFF documents on disk 3 are not used.
This testbed is designed so that each site represents one month of information from a given source.
There are 236 sites in total, which vary considerably in size. Some of the PATN sites have < 10 documents, while some of the AP sites have > 8000 documents.
The different sources represent the following number of sites: AP - 35, FR - 22, PATN - 92, SJM - 12, WSJ - 51, ZIFF - 24.
GZip version
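
To see how unevenly documents are spread across sites, one could count the documents assigned to each site once the mapping is loaded. The sketch below uses a small hypothetical mapping (the IDs are invented for illustration) and also checks that the per-source site counts listed above add up to 236.

```python
from collections import Counter


def site_sizes(doc_to_site):
    """Count how many documents are assigned to each site."""
    return Counter(doc_to_site.values())


# Hypothetical mapping of document IDs to site IDs.
doc_to_site = {
    "DOC-0001": "site_A",
    "DOC-0002": "site_A",
    "DOC-0003": "site_A",
    "DOC-0004": "site_B",
}
sizes = site_sizes(doc_to_site)
smallest = min(sizes.values())
largest = max(sizes.values())

# Per-source site counts as listed for Testbed 1.
sites_per_source = {"AP": 35, "FR": 22, "PATN": 92, "SJM": 12, "WSJ": 51, "ZIFF": 24}
total_sites = sum(sites_per_source.values())
```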

 

Testbed 2: (UDC-236) trec123-236-eq_doc_counts

This testbed uses the data from TREC disks 1, 2, and 3.
DOE (disk 1) is not used at all, and the ZIFF documents on disk 3 are not used.
This testbed is designed so that all sites have about the same number of documents (~3000) and the total number of sites is the same as in our original decomposition (236).
A given site contains documents from only one source and is a continuous stream of that source, so that the sequential nature of the source is preserved.
The different sources represent the following number of sites: AP - 84, FR - 15, PATN - 2, SJM - 31, WSJ - 59, ZIFF - 45.
GZip version
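
One way to produce a decomposition of this kind, not necessarily the procedure used to build this testbed, is to cut each source's ordered document sequence into contiguous chunks of nearly equal size. The function name split_contiguous and the sample IDs below are hypothetical.

```python
def split_contiguous(doc_ids, num_sites):
    """Split an ordered list into `num_sites` contiguous, near-equal chunks."""
    if num_sites <= 0:
        raise ValueError("num_sites must be positive")
    base, extra = divmod(len(doc_ids), num_sites)
    chunks = []
    start = 0
    for i in range(num_sites):
        # The first `extra` chunks take one additional document.
        size = base + (1 if i < extra else 0)
        chunks.append(doc_ids[start:start + size])
        start += size
    return chunks


# Hypothetical source with 10 documents in their original order.
docs = [f"DOC-{i:04d}" for i in range(10)]
sites = split_contiguous(docs, 3)
```

Because every chunk is a slice of the original sequence, concatenating the chunks in order gives back the source stream unchanged.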

Testbed 3: (UBC-100) trec123-100-bysource-callan99.v2a

A 100-collection testbed created from TREC CDs 1, 2, and 3.
Testbed (offsite link)
