Datasets

 

September 8, 2012

Overview

Here are the datasets used in the experiments of

Xinhua Zhang, Ankan Saha, S. V. N. Vishwanathan

Smoothing Multivariate Performance Measures

Journal of Machine Learning Research (JMLR), 2012

All datasets are provided in two forms: LibSVM and PETSc binary.  Loading the data in PETSc format is an order of magnitude faster.  For consistency with common practice, we also provide all datasets in LibSVM format, which can be converted to the PETSc binary form by this tool.
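For readers who want to inspect the LibSVM files directly, the following is a minimal sketch of a reader for the usual LibSVM sparse text layout, where each line is a label followed by index:value pairs.  It is not the conversion tool mentioned above; the function names parse_libsvm_line and iter_libsvm are hypothetical, and the sketch assumes purely numeric labels and features.

```python
from typing import Dict, Iterable, Iterator, Tuple

Example = Tuple[float, Dict[int, float]]


def parse_libsvm_line(line: str) -> Example:
    """Parse one 'label idx:val idx:val ...' line into (label, features)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = float(tokens[0])
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        # Indices are kept exactly as stored in the file (no re-basing).
        features[int(index)] = float(value)
    return label, features


def iter_libsvm(lines: Iterable[str]) -> Iterator[Example]:
    """Yield one parsed example per non-blank line (assumed numeric data)."""
    for line in lines:
        if line.strip():
            yield parse_libsvm_line(line)
```

Because iter_libsvm accepts any iterable of strings, it can be fed an open file handle, so a large file is read one line at a time rather than loaded whole.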

The data files are hosted on SkyDrive.  A few notes:

  1. If you right-click a folder and choose download, SkyDrive will zip up all files in that folder for downloading.
  2. To download any file larger than 100 MB, you will be asked to log into a Microsoft account.  If you do not have one, simply register.
  3. Some training feature files are larger than 2 GB, which exceeds the SkyDrive file size limit, so they are split into several files (see the suffix in the file names).  After downloading, concatenate the parts by running "cat data.xz.* > data.xz".
  4. You need XZ Utils to decompress the data files.  A few datasets are instead compressed with bzip2 (for a better compression rate) and can be decompressed by "tar -xjf fname.tar.bz2".
Let us know if you run into any problems downloading the files.
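On systems without cat or XZ Utils, the concatenation and decompression steps in notes 3 and 4 can be sketched with the Python standard library.  This is only an illustration: the helper names join_parts and decompress_xz are hypothetical, and the sketch assumes that sorting the part names as plain strings reproduces the intended order of the suffixes.

```python
import glob
import lzma
import shutil
from typing import List


def join_parts(target: str) -> List[str]:
    """Concatenate every file named target.<suffix> into target."""
    # Assumption: plain string order of the suffixes is the correct order.
    parts = sorted(glob.glob(glob.escape(target) + ".*"))
    if not parts:
        raise FileNotFoundError("no parts found for " + target)
    # "wb" truncates any stale copy of target left by an earlier attempt.
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


def decompress_xz(src: str, dst: str) -> None:
    """Stream-decompress an .xz file so it never sits fully in memory."""
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

A typical call would be join_parts("data.xz") followed by decompress_xz("data.xz", "data").  For the few bzip2-compressed archives, the standard tarfile module (opened in "r:bz2" mode) plays the role of "tar -xjf".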

Download


In the last three columns, the numbers in parentheses indicate the compressed file size in MB.

| dataset name | #train (*1000) | #test (*1000) | #feature | training feature file in PETSc | test feature file in PETSc | training+test feature file in LibSVM |
|---|---|---|---|---|---|---|
| adult9 | 33 | 16 | 123 | Site 1 (1) | Site 1 (1) | Site 5 (1) |
| alpha | 400 | 100 | 500 | Site 1 (612) | Site 1 (154) | Site 5 (1,130) |
| astro-ph | 62 | 32 | 99,757 | Site 1 (38) | Site 1 (20) | Site 5 (65) |
| aut-avn | 57 | 14 | 20,707 | Site 1 (23) | Site 1 (6) | Site 5 (34) |
| beta | 400 | 100 | 500 | Site 1 (612) | Site 1 (154) | Site 5 (1,130) |
| covertype | 523 | 58 | 6.27 M | Site 1 (8) | Site 1 (1) | Site 5 (18) |
| delta | 400 | 100 | 500 | Site 1 (649) | Site 1 (162) | Site 5 (1,168) |
| dna | 40,000 | 10,000 | 800 | Site 2 (2,214) | Site 2 (569) | Site 9 (2,953) |
| dna_string | 50,000 | 4,628 | 12.7 M | NA | NA | Site 7 (764) |
| epsilon | 400 | 100 | 2,000 | Site 2 (2,412) | Site 2 (604) | Site 6 (4,371) |
| fd | 501 | 125 | 900 | Site 1 (436) | Site 1 (109) | Site 5 (1,156) |
| gamma | 400 | 100 | 500 | Site 1 (638) | Site 1 (160) | Site 5 (1,161) |
| kdd99 | 4,898 | 311 | 127 | Site 1 (18) | Site 1 (1) | Site 7 (22) |
| kdda | 8,408 | 510 | 20.22 M | Site 1 (124) | Site 1 (9) | Site 7 (308) |
| kddb | 19,264 | 748 | 29.89 M | Site 1 (213) | Site 1 (9) | Site 7 (314) |
| news20 | 16 | 4 | 7.26 M | Site 1 (7) | Site 1 (2) | Site 7 (27) |
| ocr | 2,800 | 700 | 1,156 | Site 4 (3,557) | Site 4 (889) | Site 10, 11 (10,066) |
| real-sim | 58 | 14 | 2.97 M | Site 1 (23) | Site 1 (6) | Site 7 (35) |
| reuters-c11 | 23 | 781 | 1.76 M | Site 1 (14) | Site 1 (452) | Site 7 (492) |
| reuters-ccat | 23 | 781 | 1.76 M | Site 1 (14) | Site 1 (452) | Site 7 (492) |
| web8 | 46 | 14 | 300 | Site 1 (1) | Site 1 (1) | Site 7 (1) |
| webspam-t | 280 | 70 | 16.61 M | Site 1 (1,106) | Site 1 (276) | Site 7 (3,500) |
| webspam-u | 280 | 70 | 254 | Site 1 (63) | Site 1 (17) | Site 7 (90) |
| worm | 821 | 205 | 804 | Site 1 (56) | Site 1 (14) | Site 7 (118) |
| zeta | 400 | 100 | 800.4 M | Site 3 (2,636) | Site 3 (662) | Site 8 (4,400) |

Contacts

Xinhua Zhang | Ankan Saha | SVN Vishwanathan

References

[1]

Xinhua Zhang, Ankan Saha, S. V. N. Vishwanathan

Smoothing Multivariate Performance Measures

Journal of Machine Learning Research (submitted) [PDF]

Uncertainty in Artificial Intelligence (UAI), 2011.  [PDF]

Last modified: 8 Sept, 2012