On the String Matching with k Differences in DNA Databases

Authors:

Yangjun Chen (University of Winnipeg), Hoang Hai Nguyen (University of Winnipeg)

Abstract

In this paper, we discuss an efficient and effective index mechanism for the string matching with \textit{k} differences, by which we will find all the substrings of a target string $y$ of length $n$ that align with a pattern string $x$ of length $m$ with not more than $k$ insertions, deletions, and mismatches. A typical application is the searching of a DNA database, where the size of a genome sequence in the database is much larger than that of a pattern. For example, $n$ is often on the order of millions or billions while $m$ is just a hundred or a thousand. The main idea of our method is to transform \textit{y} to a BWT-array as an index, denoted as \textit{BWT}(\textit{y}), and search \textit{x} against it. The time complexity of our method is bounded by O($k\cdot|T|$), where $T$ is a tree structure dynamically generated during a search of \textit{BWT}(\textit{y}). The average value of $|T|$ is bounded by O($|\Sigma|^2k}$), where $\Sigma$ is an alphabet from which we take symbols to make up target and pattern strings. This time complexity is better than previous strategies for the cases of $k$ $\leq$ O(log$_{|\Sigma|}n$). The general working process consists of two steps. In the first step, $x$ is decomposed into a series of $l$ small subpatterns, and \textit{BWT}($y$) is utilized to speed up the process to figure out all the occurrences of such subpatterns with $\lfloor$$k$/$l$$\rfloor$ differences. In the second step, all the found occurrences in the first step will be rechecked to see whether they really match $x$, but with $k$ differences. Extensive experiments have been conducted, which show that our method for this problem is promising.

PVLDB is part of the VLDB Endowment Inc.

Start

Current Submission

All Volumes

Reproducibility

General Information

Volume 14, No. 6

On the String Matching with k Differences in DNA Databases

Abstract