SASK

(Suffix Arrays based String Kernels)

Version: 1.1.1

Date: 03-09-2013

Abstract

String kernels which compare the set of all common sub-strings between two given strings have recently been proposed by [1]. Surprisingly, these kernels can be computed in linear time and linear space using suffix trees. Even though, in theory, the suffix tree of an $n$ length string can be stored in $O(n)$ space, in practice, at least $20 n$ bytes are required. Furthermore, string kernel computation involves annotating the suffix tree which consumes an additional $20n$ bytes. Furthermore, suffix trees suffer from poor locality of memory access which deteriorates their performance on large strings. In this paper we describe a new scalable algorithm, based on suffix arrays, which requires only $21n$ bytes of storage to compute string kernels. Furthermore, our algorithm is faster and easier to implement, and has strong locality of memory reference properties. We also show how an extension of our algorithm can be used as a subroutine of a CG solver to train SVMs based on string kernels in sub-quadratic time. We also present preliminary experiments to validate our claims.

Paper

C. H. Teo & S. V. N. Vishwanathan, Fast and Space Efficient String Kernels using Suffix Arrays, ICML, 2006. [pdf]

Software

sask-1.1.1 (NOTE: Please choose the right compilation flag in the Makefile according to the endianness of your machine)
To reproduce the experimental results reported in the paper you can download this particular version of SASK source code and dataset. SASK experiments (NOTE: the suffix sorting library (MSufSort version 2.2) used in this source code does not work properly on big endian machine such as PowerPC! Please run it on little endian one.)

Disclaimer

SASK is licensed under Mozilla Public License version 2.0. The authors are not responsible for any implications from the use of the software.

History

v1.1

Changed license from MPL 1.1 to MPL 2.0.
Removed license header from source code files.

Acknowledgement

We would like to thank M. I. Abouelhoda, M. A. Maniscalco and S. J. Puglisi for fruitful discussions on the Enhanced Suffix Arrays [2] and Suffix Array sorting method, MSufSort [2], during the development of the software.

References

[1]	S. V. N. Vishwanathan & A. J. Smola, Fast Kernels for String and Tree Matching, Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004. [pdf]
[2]	M. I. Abouelhoda, S. Kurtz & E. Ohlebusch, Replacing Suffix Trees with Enhanced Suffix Arrays, Journal of Discrete Algorithms, 2, 53 -- 86, 2004.
[3]	M. A. Maniscalco & S. J. Puglisi, An Efficient, Versatile Approach to Suffix Sorting, Journal of Experimental Algorithmics (JEA), 12, 2007.