(Suffix Arrays based String Kernels)

Version: 1.1.1

Date: 03-09-2013


String kernels which compare the set of all common sub-strings between two given strings have recently been proposed by [1]. Surprisingly, these kernels can be computed in linear time and linear space using suffix trees. Even though, in theory, the suffix tree of an $n$ length string can be stored in $O(n)$ space, in practice, at least $20 n$ bytes are required. Furthermore, string kernel computation involves annotating the suffix tree which consumes an additional $20n$ bytes. Furthermore, suffix trees suffer from poor locality of memory access which deteriorates their performance on large strings. In this paper we describe a new scalable algorithm, based on suffix arrays, which requires only $21n$ bytes of storage to compute string kernels. Furthermore, our algorithm is faster and easier to implement, and has strong locality of memory reference properties. We also show how an extension of our algorithm can be used as a subroutine of a CG solver to train SVMs based on string kernels in sub-quadratic time. We also present preliminary experiments to validate our claims.


C. H. Teo & S. V. N. Vishwanathan, Fast and Space Efficient String Kernels using Suffix Arrays, ICML, 2006. [pdf]



SASK is licensed under Mozilla Public License version 2.0. The authors are not responsible for any implications from the use of the software.




We would like to thank M. I. Abouelhoda, M. A. Maniscalco and S. J. Puglisi for fruitful discussions on the Enhanced Suffix Arrays [2] and Suffix Array sorting method, MSufSort [2], during the development of the software.


[1] S. V. N. Vishwanathan & A. J. Smola, Fast Kernels for String and Tree Matching, Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004. [pdf]
[2] M. I. Abouelhoda, S. Kurtz & E. Ohlebusch, Replacing Suffix Trees with Enhanced Suffix Arrays, Journal of Discrete Algorithms, 2, 53 -- 86, 2004.
[3] M. A. Maniscalco & S. J. Puglisi, An Efficient, Versatile Approach to Suffix Sorting, Journal of Experimental Algorithmics (JEA), 12, 2007.