Similarity Detection for Higher-Order Structure of DNA Sequences
Abstract
With the advances in data collection and storage capabilities, large amount of multidimensional dataset, known as higher-order data representation, has been generated on bioinformatics applications recently, especially in DNA sequences recognition. This paper thus proposes a mathematical modeling could be capable of the multidimensional problem of DNA similarity detection with high accuracy and reliability. To this end, the paper covers the central issues of multidimensional DNA gene expression data, including: (1) formulating multidimensional DNA data into higher-order representation; (2) recovering missing values; (3) decomposing high-order DNA data directly from their tensorial representation to extracted useful information for classification. Consequently, an exploring a novel type of third-order microarray expression, termed as gene - sample - time (GST), is presented for biological sample classification. The contributions will be distributed along two main thrusts of effectiveness; including latent modeling setting for imputing missing values based on the High-Order Kalman Filter and feature extraction based on Tensor Discriminative Feature Extraction. The experimental performance on real dataset of DNA sequences corroborates the advantages of the proposed approaches upon those of the matrix-based algorithms and recent tensor-based, discriminant-decomposition, in terms of missing values completion, classification accuracy and computation time.