Cluster Analysis validated through Adjusted Rand Index

Cluster Analysis for Identifying Genes Highly Correlated with a Phenotype

Jhoirene Clemente, Jan Michael Yap and Henry Adorna


Jhoirene Clemente. Cluster Analysis for Identifying Genes Highly Correlated with a Phenotype. (Under the direction of Jan Michael Yap.) In this research, we perform cluster analysis of gene expression profiles extracted from 33 young breast cancer patients who developed distant metastasis in less than five years. The analysis compares two cluster results. The first is produced by Pearson’s Correlation, which partitions the set of gene transcripts into those that are a) directly affecting, b) independently affecting, or c) inversely affecting our trait of interest. The second is produced by our algorithm of choice, the standard K Means algorithm, under different distance measures (Euclidean, Squared Euclidean and Manhattan). The analysis includes cluster validation using visualization through vector fusion, and uses the Adjusted Rand Index to compare the clustering made using K Means with the one made using Pearson’s Correlation. The Adjusted Rand Index showed a low level of agreement between the two cluster results; therefore, K Means Clustering is not a valid substitute for Pearson’s Correlation in identifying significantly correlated gene transcripts. However, when compared with the result of random clustering, which served as the null model, the analysis showed that K Means Clustering, performed without the phenotypic value of the samples, produces a significant clustering of the identified significantly correlated genes.
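To make the validation step concrete, the following is a minimal sketch of how an Adjusted Rand Index between two partitions of the same items can be computed. It is not the authors' code: the function name `adjusted_rand_index` and the toy label lists are assumptions made for illustration, and the labels are not data from the study. The index is 1 when the two partitions agree exactly (up to renaming of clusters) and close to 0 when agreement is no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration; label values may be any hashables.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency table: how many items fall in cluster i of A and j of B.
    contingency = Counter(zip(labels_a, labels_b))
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)

    # Pairs of items placed together in both, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in contingency.values())
    pairs_a = sum(comb(c, 2) for c in sizes_a.values())
    pairs_b = sum(comb(c, 2) for c in sizes_b.values())

    # Expected number of co-clustered pairs under random labeling.
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_both - expected) / (max_index - expected)


# Toy labels only (assumed for illustration, not taken from the study).
correlation_groups = ["direct", "direct", "inverse", "inverse"]
kmeans_same = [1, 1, 0, 0]    # same grouping, different cluster names
kmeans_other = [0, 1, 0, 1]   # splits every correlation group

print(adjusted_rand_index(correlation_groups, kmeans_same))
print(adjusted_rand_index(correlation_groups, kmeans_other))
```

In this sketch the first comparison yields perfect agreement because the index ignores cluster names, while the second yields a negative value, meaning agreement below what chance would give. A low value of this kind is what the abstract reports for K Means against the Pearson’s Correlation partition.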

