CLEMENTE, JHOIRENE. FINDING MOTIFS IN PARALLEL USING RANDOM PROJECTION ON GPUS (Under the direction of HENRY N. ADORNA, Ph.D.) Biological motifs are short patterns that occur a significant number of times in a set of DNA sequences. These motifs are transcription factor binding sites that help regulate transcription and therefore gene expression. Detecting these patterns helps in discovering gene function and building regulatory networks. Because mutations may occur at random positions of the genome, the patterns themselves are subject to modification, which makes the problem more challenging. A variant called planted (l, d)-motif finding models the detection of these subtle motif patterns in DNA. However, several existing algorithms fail to recover most planted (l, d)-motifs. To address this problem, a hybrid algorithm was proposed in the literature, which we refer to as FMURP (Finding Motifs using Random Projection). It uses an initialization method called Projection to avoid being trapped in local maxima, which increases the chance of recovering the planted motifs. This algorithm has been shown to achieve high accuracy on both the motif finding and the planted (l, d)-motif finding problems. This research presents a parallel algorithm and an implementation of FMURP on Graphics Processing Units (GPUs) using CUDA. It also details the implementation and the optimizations made on the GPU to minimize space usage. The implementation, called CUDA-FMURP, was tested on randomly generated (l, d)-motif instances and recovered the majority of the planted motifs. CUDA-FMURP also obtains a maximum speedup of 6.8 on a 512-core GPU with compute capability 2.0.
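To make the Projection idea concrete, the following is a minimal sequential sketch in Python, not the CUDA-FMURP implementation described in the abstract. It assumes the usual formulation of random projection for motif finding: choose k of the l positions at random, hash every l-mer by the characters at those positions, and treat buckets that collect many l-mers as candidate starting points for refinement. The function names (`hamming`, `project`, `bucket_lmers`, `enriched`) and the parameters `k` and `threshold` are illustrative assumptions, not names taken from the thesis.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def project(lmer, positions):
    """Keep only the characters of an l-mer at the chosen positions."""
    return "".join(lmer[i] for i in positions)


def bucket_lmers(sequences, l, k, rng):
    """Hash every l-mer of every sequence by a random k-position projection.

    Returns the chosen positions and a dict mapping each projected key to
    the list of (sequence index, start offset) pairs that produced it.
    """
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            key = project(seq[start:start + l], positions)
            buckets[key].append((seq_index, start))
    return positions, dict(buckets)


def enriched(buckets, threshold):
    """Keep buckets holding at least `threshold` l-mers (hypothetical cutoff)."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= threshold}
```

Two occurrences of a planted motif that differ in at most d positions land in the same bucket whenever none of the k sampled positions falls on a mutated position, which is why the bucketing step is typically repeated with fresh random positions. In this sketch, `hamming` is only a helper for checking how far a candidate l-mer is from a proposed motif.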
Cluster Analysis for Identifying Genes Highly Correlated with a Phenotype
Jhoirene Clemente, Jan Michael Yap and Henry Adorna
Jhoirene Clemente. Cluster Analysis for Identifying Genes Highly Correlated with a Phenotype. (Under the direction of Jan Michael Yap) In this research, we perform cluster analysis of gene expression profiles extracted from 33 young breast cancer patients who developed distant metastasis in less than five years. The analysis compares two cluster results. The first is obtained with Pearson’s Correlation, which partitions the set of gene transcripts into those that are a) directly affecting, b) independently affecting, or c) inversely affecting the trait of interest. The second is obtained with our algorithm of choice, the standard K-Means algorithm, under different distance measures (i.e., Euclidean, Squared Euclidean and Manhattan). The analysis includes cluster validation using visualization through vector fusion, and uses the Adjusted Rand Index to compare the K-Means clustering with the one obtained from Pearson’s Correlation. The Adjusted Rand Index showed a low level of agreement between the two cluster results; therefore, K-Means clustering is not a valid substitute for Pearson’s Correlation in identifying significantly correlated gene transcripts. However, a comparison against random clustering, which served as the null model, showed that the identified significantly correlated genes are significantly clustered by K-Means even without the phenotypic value of the samples.
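As an illustration of how agreement between two cluster results can be quantified, here is a small pure-Python sketch of the Adjusted Rand Index using the standard pair-counting formula. It is not the code used in the study; the function name `adjusted_rand_index` and the example labelings are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items.

    1.0 means the partitions are identical up to relabeling; values near 0
    mean agreement is about what random assignment would give.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Contingency counts and the cluster sizes of each partition.
    joint = Counter(zip(labels_a, labels_b))
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)

    # Pairs placed together in both partitions, in A only, and in B only.
    same_both = sum(comb(c, 2) for c in joint.values())
    same_a = sum(comb(c, 2) for c in sizes_a.values())
    same_b = sum(comb(c, 2) for c in sizes_b.values())

    expected = same_a * same_b / total_pairs  # agreement expected by chance
    maximum = (same_a + same_b) / 2           # largest attainable agreement
    if maximum == expected:
        return 1.0  # both partitions are all-in-one or all singletons
    return (same_both - expected) / (maximum - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0 because the two labelings describe the same partition under different label names, while `adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])` evaluates to 4/7 because one cluster has been split.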