Detection of cancer signatures by integrating different features from omics data combining multi-block and path modelling methods


Juan R. Gonzalez, Mikel Esnaola, Alejandro Caceres, Carles Hernandez-Ferrer, Marcos Lopez-Sanchez, Luis A. Perez-Jurado

Integrative and Computational Biology Joint Symposium


The main goal of personalized medicine is to use medical models to customize health care by clustering individuals according to particular profiles. ‘Omic’ data have played a major role in these profiling efforts, but the data must be integrated before such clustering methods can be applied. Data integration faces the challenge of handling the huge dimensionality of multi-modal data while retaining as much as possible of the information in each data type. Several techniques have been applied to reduce the dimension of the predictor space in ‘omic’ data, but none of them guarantees that the results are interpretable, or even that the latent variables make biological sense. To address this limitation, we propose to reduce the dimensionality by extracting the relevant features from each ‘omic’ dataset. For instance, genomic data can provide information about gene-sets, point mutations, copy number variants, inversions and mosaic events, while RNA-seq data can inform about global transcriptomic expression and alternative splicing events. Once the data are summarized, we integrate them with a novel approach called regularized generalized canonical correlation. This method combines datasets in two ways. First, it provides a consensus between tables (e.g. similar to PCA for more than two tables). Second, it incorporates information about how the datasets are linked (e.g. data at different time-points, causal relationships, …). The outputs are then used to: 1) describe profiles of individuals with similar phenotypes to define “disease signatures”, and 2) identify individuals with similar patterns, which may help to stratify the population or describe new phenotypes. Initial results include analyses of data from The Cancer Genome Atlas (TCGA) project, including genomic, epigenomic, transcriptomic and clinical information, to study the determinants of cancer prognosis.
Our approach finds biomarkers that correlate with complex combinations of different features obtained from ‘omic’ data, improving the prediction of individuals’ prognosis.
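
The two ways of combining datasets described in the abstract (a consensus between tables, or links between specific tables) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified variant of regularized generalized canonical correlation in which the shrinkage parameter is fixed to 1 for every block (so covariances, not correlations, are maximized) and the identity ("Horst") scheme is used. The function names, the `design` matrix layout and the synthetic toy blocks are all hypothetical and chosen only for illustration.

```python
import math
import random


def center(block):
    """Column-center a block given as a list of rows (individuals)."""
    n = len(block)
    means = [sum(col) / n for col in zip(*block)]
    return [[v - m for v, m in zip(row, means)] for row in block]


def normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def project(block, weights):
    """Block component: one score per individual."""
    return [sum(x * w for x, w in zip(row, weights)) for row in block]


def covariance(u, v):
    """Sample covariance of two already-centered score vectors."""
    return sum(a * b for a, b in zip(u, v)) / (len(u) - 1)


def rgcca_first_component(blocks, design, n_iter=200, tol=1e-10, seed=0):
    """First component per block; assumes shrinkage fixed to 1 and Horst scheme.

    design[j][k] = 1 if blocks j and k are linked, 0 otherwise (symmetric).
    """
    rng = random.Random(seed)
    blocks = [center(b) for b in blocks]
    n_blocks = len(blocks)
    weights = [normalize([rng.gauss(0, 1) for _ in b[0]]) for b in blocks]
    scores = [project(b, w) for b, w in zip(blocks, weights)]
    previous = -math.inf
    for _ in range(n_iter):
        for j, block in enumerate(blocks):
            # Inner component: sum of the scores of the blocks linked to block j.
            inner = [sum(design[j][k] * scores[k][i] for k in range(n_blocks))
                     for i in range(len(block))]
            # Outer weights: proportional to X_j^T z_j, rescaled to unit norm.
            outer = [sum(row[p] * z for row, z in zip(block, inner))
                     for p in range(len(block[0]))]
            weights[j] = normalize(outer)
            scores[j] = project(block, weights[j])
        # Criterion: sum of covariances between linked block components.
        criterion = sum(design[j][k] * covariance(scores[j], scores[k])
                        for j in range(n_blocks) for k in range(n_blocks))
        if abs(criterion - previous) < tol:
            break
        previous = criterion
    return weights, scores


def correlation(u, v):
    return covariance(u, v) / math.sqrt(covariance(u, u) * covariance(v, v))


# Synthetic toy data (not TCGA): three blocks driven by one shared latent factor.
toy_rng = random.Random(1)
latent = [toy_rng.gauss(0, 1) for _ in range(60)]


def toy_block(loadings):
    return [[t * l + 0.3 * toy_rng.gauss(0, 1) for l in loadings] for t in latent]


blocks = [toy_block([1.0, -1.0, 0.5, 0.0]),
          toy_block([0.8, 0.8, 0.0]),
          toy_block([1.2, -0.4])]

# Consensus design: every block is linked to every other block.
consensus = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
weights, scores = rgcca_first_component(blocks, consensus)
print(correlation(scores[0], scores[1]))
```

In this sketch, replacing `consensus` with a sparser matrix such as `[[0, 1, 0], [1, 0, 1], [0, 1, 0]]` would encode a path in which only neighbouring tables are linked, which is one way to express the second mode of combination (for example, consecutive time-points).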

