Motif Finding on GPUs using Random Projections

CLEMENTE, JHOIRENE. FINDING MOTIFS IN PARALLEL USING RANDOM PROJECTION ON GPUS (Under the direction of HENRY N. ADORNA, Ph.D.) Biological motifs are short patterns that occur a significant number of times in a set of DNA sequences. These motifs are transcription factor binding sites that help regulate transcription and therefore gene expression. Detecting these patterns helps in discovering gene function and in building regulatory networks. Mutations may occur at random positions in the genome, and these patterns are also subject to modification, which makes the problem more challenging. A variant called planted (l, d)-motif finding models the detection of these subtle motif patterns in DNA. However, several existing algorithms fail to recover most planted (l, d)-motifs. To address this problem, a hybrid algorithm was proposed in the literature, which we will refer to as FMURP (Finding Motifs using Random Projection). It uses an initialization method called Projection to avoid being trapped in local maxima, which increases the chance of recovering the planted motifs. This algorithm has been shown to achieve high accuracy on both the motif finding and the planted (l, d)-motif finding problems. This research presents a parallel algorithm and an implementation of FMURP on Graphics Processing Units (GPUs) using CUDA. It also details the implementation and the optimizations made on the GPU to minimize space usage. The implementation, called CUDA-FMURP, was tested on randomly generated (l, d)-motif instances and recovered the majority of the planted motifs. CUDA-FMURP also obtains a maximum speedup of 6.8 on a 512-core GPU with compute capability 2.0.
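
The Projection initialization mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the usual random-projection idea: every l-mer is hashed by the letters at k randomly chosen positions, so that mutated copies of a planted motif tend to land in the same bucket. The function name, the parameter k, the seed, and the sample sequences are assumptions made for illustration. This is not the CUDA-FMURP code.

```python
import random
from collections import defaultdict


def project_buckets(sequences, l, k, seed=0):
    """Hash every l-mer by the letters found at k random positions.

    Returns the chosen positions and a dict mapping each projected key
    to the list of (sequence index, start offset) pairs that produced it.
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))  # k distinct positions in [0, l)
    buckets = defaultdict(list)
    for s_idx, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((s_idx, start))
    return positions, buckets


if __name__ == "__main__":
    # Hypothetical toy input, not data from the thesis.
    seqs = ["ACGTACGTTA", "TTACGAACGT", "GGACGTACCA"]
    positions, buckets = project_buckets(seqs, l=6, k=3)
    key, members = max(buckets.items(), key=lambda kv: len(kv[1]))
    print("positions:", positions)
    print("largest bucket:", key, members)
```

Buckets that collect many l-mers are natural starting points for a local refinement step, since l-mers that agree on the sampled positions are more likely to be occurrences of the same motif.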


Analysis of Gene Expression Data

University of the Philippines Manila (UPM) invited us to talk about our research in Bioinformatics. I will discuss our research on visualization of yeast gene expression data.

Review on the paper entitled “A Fast File System for UNIX”

This is a review of the paper “A Fast File System for UNIX” by Marshall Kirk McKusick et al. of the Computer Systems Research Group at the University of California, Berkeley, published in 1984.


The paper discusses a reimplementation of the UNIX file system that adapts the system to a wide range of peripheral and processor characteristics. The new implementation was measured to provide file access rates up to ten times faster than the traditional UNIX file system. The paper also discusses several other improvements to the file system, such as advisory locks on files, extension of the name space across file systems, the ability to use long file names, and administrative control of resource usage.


The two major contributions of this paper are modifications to the file system organization.


  1. The first modification concerns storage utilization. The study increases the block size so that more of a file can be transferred in a single disk transaction, which greatly increases throughput. However, a UNIX file system is composed mostly of small files, so a large block size increases the amount of wasted space. To resolve this issue, a single block is further partitioned into one or more addressable fragments. Since these fragments are addressable, multiple small files can reside in one data block.


  2. The second modification concerns file system parameterization. This modification is essential for performing optimal, configuration-dependent block allocation. Each file system is parameterized and adapted to the type of disk on which it is placed. The parameters used are the speed of the processor, the hardware support for mass storage transfers, and the characteristics of the mass storage devices.
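
The trade-off described in the first modification can be made concrete with a small arithmetic sketch in Python. It assumes a simplified allocation model: a file occupies whole blocks, except that its tail may be stored in fragments when fragments are enabled. The 4096-byte block, 1024-byte fragment, and the list of file sizes are illustrative assumptions, not measurements from the paper.

```python
def allocated_bytes(file_size, block_size, frag_size=None):
    """Disk bytes consumed by one file under a simplified model.

    Without fragments the tail is rounded up to a whole block.
    With fragments the tail is rounded up to a whole fragment.
    """
    if frag_size is None:
        frag_size = block_size
    full_blocks, tail = divmod(file_size, block_size)
    tail_units = -(-tail // frag_size)  # ceiling division
    return full_blocks * block_size + tail_units * frag_size


def wasted_fraction(file_sizes, block_size, frag_size=None):
    """Fraction of allocated space that holds no file data."""
    used = sum(file_sizes)
    allocated = sum(allocated_bytes(s, block_size, frag_size) for s in file_sizes)
    return (allocated - used) / allocated


if __name__ == "__main__":
    # Hypothetical mix of mostly small files.
    sizes = [200, 900, 1500, 3000, 5000, 12000]
    print("4096-byte blocks only:      ", round(wasted_fraction(sizes, 4096), 3))
    print("4096-byte blocks, 1024 frags:", round(wasted_fraction(sizes, 4096, 1024), 3))
```

Under this model, large blocks alone leave much of the last block of each small file empty, while fragments recover most of that space without giving up the large transfer unit for big files.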


Although the paper significantly improves data transfer, the time to read and write a file remains close to the reading rate of the older file system. The writing rate of the new file system is 50% slower than that of the older file system because the kernel has to do twice as many disk allocations per second.


Review on the paper entitled “A History and Evaluation of System R”

The paper “A History and Evaluation of System R”, written by Donald Chamberlin et al. of the IBM Research Lab, was published in July 1981. It discusses an experimental database system called System R, which the authors use to demonstrate the usability advantages of the relational data model. It also discusses the lessons learned from the development of System R about the design of relational database systems.

The two major contributions of this paper are the following.

1. The relational database system implementation (System R), which provides a high-level, non-navigational user interface that can support different types of rapidly changing database environments with concurrent users and can return to a consistent state after a failure.

2. The three-phase program that the researchers followed to develop a fully operational database system. The first phase was the development of SQL, a high-level data sublanguage that System R compiles into machine-level code.
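
To illustrate what a high-level, non-navigational interface means in practice, here is a minimal Python sketch. Python's built-in sqlite3 module stands in for System R, and the tables, columns, and rows are hypothetical. The query states which rows are wanted and leaves the access path to the database system, instead of following pointers from record to record.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE emp (
            id INTEGER PRIMARY KEY,
            name TEXT,
            dept_id INTEGER REFERENCES dept(id)
        );
        INSERT INTO dept VALUES (1, 'Research'), (2, 'Sales');
        INSERT INTO emp VALUES (1, 'Ana', 1), (2, 'Ben', 2), (3, 'Cara', 1);
        """
    )
    return conn


def employees_in_department(conn, dept_name):
    """Declarative query: no explicit traversal of links between records."""
    rows = conn.execute(
        "SELECT e.name FROM emp AS e "
        "JOIN dept AS d ON e.dept_id = d.id "
        "WHERE d.name = ? ORDER BY e.name",
        (dept_name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(employees_in_department(conn, "Research"))
```

In a navigational system the program itself would locate the department record and then walk its chain of employee records, so a change in the storage layout could require changing the program.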

Based on the experiments done in phase 2 of the System R project, the performance of the relational database system is not yet equal to that of navigational systems, in which paths and pointers are used to navigate among data nodes. However, the study hopes for wider use of the system in the years ahead, since a relational database is more likely to adapt to a broad spectrum of unanticipated applications.