Here are the datasets used in the experiments of
Xinhua Zhang, Ankan Saha, S. V. N. Vishwanathan
Smoothing Multivariate Performance Measures
Journal of Machine Learning Research (JMLR), 2012
All datasets are provided in two forms: LibSVM and PETSc binary. Loading the data in PETSc format is an order of magnitude faster. To be consistent with common practice, however, we also provide all datasets in LibSVM format, which can be converted to the PETSc binary form by this tool.
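For readers unfamiliar with the LibSVM text format, the following is a minimal sketch of how one line can be parsed, assuming the usual `label index:value index:value ...` layout with integer feature indices. The function names and the sample lines are hypothetical illustrations, not taken from the datasets or from the conversion tool mentioned above, and extensions such as `qid:` fields are not handled.

```python
def parse_libsvm_line(line):
    """Parse one LibSVM-format line into (label, {feature_index: value})."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    # The first token is the label; the rest are sparse index:value pairs.
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def read_libsvm(lines):
    """Parse an iterable of LibSVM-format lines, skipping blank ones."""
    return [parse_libsvm_line(line) for line in lines if line.strip()]


# Hypothetical sample lines, made up for illustration only.
sample = ["+1 3:1 11:0.5 14:1", "", "-1 1:0.25 7:1"]
examples = read_libsvm(sample)
```

Because every line is tokenized and converted from text, loading a large LibSVM file this way is slow, which is consistent with the note above that the binary format loads much faster.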
The data files are hosted on SkyDrive. A few notes:
dataset name | #train (*1000) | #test (*1000) | #feature | training feature file in PETSc | test feature file in PETSc | training+test feature file in LibSVM |
adult9 | 33 | 16 | 123 | Site 1 (1) | Site 1 (1) | Site 5 (1) |
alpha | 400 | 100 | 500 | Site 1 (612) | Site 1 (154) | Site 5 (1,130) |
astro-ph | 62 | 32 | 99,757 | Site 1 (38) | Site 1 (20) | Site 5 (65) |
aut-avn | 57 | 14 | 20,707 | Site 1 (23) | Site 1 (6) | Site 5 (34) |
beta | 400 | 100 | 500 | Site 1 (612) | Site 1 (154) | Site 5 (1,130) |
covertype | 523 | 58 | 6.27 M | Site 1 (8) | Site 1 (1) | Site 5 (18) |
delta | 400 | 100 | 500 | Site 1 (649) | Site 1 (162) | Site 5 (1,168) |
dna | 40,000 | 10,000 | 800 | Site 2 (2,214) | Site 2 (569) | Site 9 (2,953) |
dna_string | 50,000 | 4,628 | 12.7 M | NA | NA | Site 7 (764) |
epsilon | 400 | 100 | 2000 | Site 2 (2,412) | Site 2 (604) | Site 6 (4,371) |
fd | 501 | 125 | 900 | Site 1 (436) | Site 1 (109) | Site 5 (1,156) |
gamma | 400 | 100 | 500 | Site 1 (638) | Site 1 (160) | Site 5 (1,161) |
kdd99 | 4,898 | 311 | 127 | Site 1 (18) | Site 1 (1) | Site 7 (22) |
kdda | 8,408 | 510 | 20.22 M | Site 1 (124) | Site 1 (9) | Site 7 (308) |
kddb | 19,264 | 748 | 29.89 M | Site 1 (213) | Site 1 (9) | Site 7 (314) |
news20 | 16 | 4 | 7.26 M | Site 1 (7) | Site 1 (2) | Site 7 (27) |
ocr | 2,800 | 700 | 1156 | Site 4 (3,557) | Site 4 (889) | Site 10, 11 (10,066) |
real-sim | 58 | 14 | 2.97 M | Site 1 (23) | Site 1 (6) | Site 7 (35) |
reuters-c11 | 23 | 781 | 1.76 M | Site 1 (14) | Site 1 (452) | Site 7 (492) |
reuters-ccat | 23 | 781 | 1.76 M | Site 1 (14) | Site 1 (452) | Site 7 (492) |
web8 | 46 | 14 | 300 | Site 1 (1) | Site 1 (1) | Site 7 (1) |
webspam-t | 280 | 70 | 16.61 M | Site 1 (1,106) | Site 1 (276) | Site 7 (3,500) |
webspam-u | 280 | 70 | 254 | Site 1 (63) | Site 1 (17) | Site 7 (90) |
worm | 821 | 205 | 804 | Site 1 (56) | Site 1 (14) | Site 7 (118) |
zeta | 400 | 100 | 800.4 M | Site 3 (2,636) | Site 3 (662) | Site 8 (4,400) |
[1] Xinhua Zhang, Ankan Saha, S. V. N. Vishwanathan. Smoothing Multivariate Performance Measures. Journal of Machine Learning Research (submitted); Uncertainty in Artificial Intelligence (UAI), 2011.
Last modified: 8 Sept, 2012