Semi-automated framework for the analytical use of gene-centric data with biological ontologies
MetadataShow full item record
Motivation Translational bioinformatics(TBI) has been defined as ‘the development and application of informatics methods that connect molecular entities to clinical entities’ , which has emerged as a systems theory approach to bridge the huge wealth of biomedical data into clinical actions using a combination of innovations and resources across the entire spectrum of biomedical informatics approaches . The challenge for TBI is the availability of both comprehensive knowledge based on genes and the corresponding tools that allow their analysis and exploitation. Traditionally, biological researchers usually study one or only a few genes at a time, but in recent years high throughput technologies such as gene expression microarrays, protein mass-spectrometry and next-generation DNA and RNA sequencing have emerged that allow the simultaneous measurement of changes on a genome-wide scale. These technologies usually result in large lists of interesting genes, but meaningful biological interpretation remains a major challenge. Over the last decade, enrichment analysis has become standard practice in the analysis of such gene lists, enabling systematic assessment of the likelihood of differential representation of defined groups of genes compared to suitably annotated background knowledge. The success of such analyses are highly dependent on the availability and quality of the gene annotation data. For many years, genes were annotated by different experts using inconsistent, non-standard terminologies. Large amounts of variation and duplication in these unstructured annotation sets, made them unsuitable for principled quantitative analysis. More recently, a lot of effort has been put into the development and use of structured, domain specific vocabularies to annotate genes. The Gene Ontology is one of the most successful examples of this where genes are annotated with terms from three main clades; biological process, molecular function and cellular component. However, there are many other established and emerging ontologies to aid biological data interpretation, but are rarely used. For the same reason, many bioinformatic tools only support analysis analysis using the Gene Ontology. The lack of annotation coverage and the support for them in existing analytical tools to aid biological interpretation of data has become a major limitation to their utility and uptake. Thus, automatic approaches are needed to facilitate the transformation of unstructured data to unlock the potential of all ontologies, with corresponding bioinformatics tools to support their interpretation. Approaches In this thesis, firstly, similar to the approach in [3,4], I propose a series of computational approaches implemented in a new tool OntoSuite-Miner to address the ontology based gene association data integration challenge. This approach uses NLP based text mining methods for ontology based biomedical text mining. What differentiates my approach from other approaches is that I integrate two of the most wildly used NLP modules into the framework, not only increasing the confidence of the text mining results, but also providing an annotation score for each mapping, based on the number of pieces of evidence in the literature and the number of NLP modules that agreed with the mapping. Since heterogeneous data is important in understanding human disease, the approach was designed to be generic, thus the ontology based annotation generation can be applied to different sources and can be repeated with different ontologies. Secondly, in respect of the second challenge proposed by TBI, to increase the statistical power of the annotation enrichment analysis, I propose OntoSuite-Analytics, which integrates a collection of enrichment analysis methods into a unified open-source software package named topOnto, in the statistical programming language R. The package supports enrichment analysis across multiple ontologies with a set of implemented statistical/topological algorithms, allowing the comparison of enrichment results across multiple ontologies and between different algorithms. Results The methodologies described above were implemented and a Human Disease Ontology (HDO) based gene annotation database was generated by mining three publicly available database, OMIM, GeneRIF and Ensembl variation. With the availability of the HDO annotation and the corresponding ontology enrichment analysis tools in topOnto, I profiled 277 gene classes with human diseases and generated ‘disease environments’ for 1310 human diseases. The exploration of the disease profiles and disease environment provides an overview of known disease knowledge and provides new insights into disease mechanisms. The integration of multiple ontologies into a disease context demonstrates how ‘orthogonal’ ontologies can lead to biological insight that would have been missed by more traditional single ontology analysis.