String kernels which compare the set of all common sub-strings between two given strings have recently been proposed by [1]. Surprisingly, these kernels can be computed in linear time and linear space using suffix trees. Even though, in theory, the suffix tree of an $n$ length string can be stored in $O(n)$ space, in practice, at least $20 n$ bytes are required. Furthermore, string kernel computation involves annotating the suffix tree which consumes an additional $20n$ bytes. Furthermore, suffix trees suffer from poor locality of memory access which deteriorates their performance on large strings. In this paper we describe a new scalable algorithm, based on suffix arrays, which requires only $21n$ bytes of storage to compute string kernels. Furthermore, our algorithm is faster and easier to implement, and has strong locality of memory reference properties. We also show how an extension of our algorithm can be used as a subroutine of a CG solver to train SVMs based on string kernels in sub-quadratic time. We also present preliminary experiments to validate our claims.
[1] | S. V. N. Vishwanathan & A. J. Smola, Fast Kernels for String and Tree Matching, Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004. [pdf] |
[2] | M. I. Abouelhoda, S. Kurtz & E. Ohlebusch, Replacing Suffix Trees with Enhanced Suffix Arrays, Journal of Discrete Algorithms, 2, 53 -- 86, 2004. |
[3] | M. A. Maniscalco & S. J. Puglisi, An Efficient, Versatile Approach to Suffix Sorting, Journal of Experimental Algorithmics (JEA), 12, 2007. |