go back
go back
Volume 14, No. 6
On the String Matching with k Differences in DNA Databases
Abstract
In this paper, we discuss an efficient and effective index mechanism for the string matching with \textit{k} differences, by which we will find all the substrings of a target string $y$ of length $n$ that align with a pattern string $x$ of length $m$ with not more than $k$ insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example, $n$ is often on the order of millions or billions while $m$ is just a hundred or a thousand. The main idea of our method is to transform \textit{y} to a BWT-array as an index, denoted as \textit{BWT}(\textit{y}), and search \textit{x} against it. The time complexity of our method is bounded by O($k\cdot|T|$), where $T$ is a tree structure dynamically generated during a search of \textit{BWT}(\textit{y}). The average value of $|T|$ is bounded by O($|\Sigma|^2k}$), where $\Sigma$ is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies for the cases of $k$ $\leq$ O(log$_{|\Sigma|}n$). The general working process consists of two steps. In the first step, $x$ is decomposed into a series of $l$ small subpatterns, and \textit{BWT}($y$) is utilized to speed up the process to figure out all the occurrences of such subpatterns with $\lfloor$$k$/$l$$\rfloor$ differences. In the second step, all the found occurrences in the first step will be rechecked to see whether they really match $x$, but with $k$ differences. Extensive experiments have been conducted, which show that our method for this problem is promising.
PVLDB is part of the VLDB Endowment Inc.
Privacy Policy