LBGI: the Laboratory of Integrative Bioinformatics and Genomics (Laboratoire de BioInformatique et Génomique Intégratives)
- Head: Olivier Poch
- LBGI members (25 people: 10 permanent staff, 8 fixed-term contract staff and 7 PhD students)
- LBGI is part of the ICube Laboratory (CNRS UMR 7357)
- LBGI organisation chart
The aim of our research group is to develop validated computational methods for high-throughput biology in order to study biological systems ranging from protein families to relational systems such as "hyperstructures" (macromolecular complexes, organelles, viruses, ...) or biological networks (metabolosome, transcriptional, developmental or disease-related networks, ...).
Two strategies are developed in our laboratory to address these objectives:
- bioinformatics, for the development of original algorithms and of an integrated platform.
- bioanalysis, dedicated to the in-depth analysis of specific biological systems in order to identify potential therapeutic targets of interest and to improve algorithms and computational strategies. In this context, the functional genomics of cancer or of human diseases provides invaluable systems, combining the availability of numerous genomic and functional data from human patients with the existence of specific mutations in animal models and the corresponding transcriptomic data.
Evolutionary Histories of the HUman Proteome
The goal of our project is the definition of a complete set of the evolutionary histories (cascade of phylogenetic events) for the human proteome and their genome-scale analysis.
Presentation in English
The aim of our research group is to develop validated high-throughput computational biology methods to study the behaviour of biological systems ranging from protein families to relational systems such as "hyperstructures" (macromolecular complexes, organelles, viruses…) or biological networks (metabolic, transcriptional, interactomic as well as developmental or disease-related networks…). To tackle these objectives, two complementary strategies are developed in our laboratory:
- a bioinformatics approach is used to develop original algorithms and to construct integrated platforms and relational databases
- a bioanalysis approach is dedicated to the in-depth analysis of specialized biological systems, both to identify interesting biological targets and to validate and refine efficient algorithms or computational strategies.
In this context, functional and comparative genomics of cancer or human illness represent invaluable experimental systems, combining the availability of numerous genomic and functional data from human patients and the existence of specific mutations in animal models with the respective transcriptomic data available.
Bioinformatics: development of software and databases
Following the development of the PipeAlign cascade of programs, aimed at the automated construction and evaluation of high-quality, reliable, hierarchized Multiple Alignments of Complete Sequences (MACS), new algorithms and approaches have been developed to understand and exploit the relationships between protein sequence, structure, function and evolution. This includes the design of the Multiple Alignment Ontology (MAO), dedicated to the formalisation of conservation and evolutionary information, and the creation of an integrated Information Management System (MACSIMS) based on the data model embodied in MAO. MACSIMS facilitates knowledge extraction, comparison, evaluation and validation, as well as presentation of the most pertinent information to the biologist. It also allows the definition of the presence/absence status and of the hierarchical relationships between and within sequence subfamilies at the complete sequence, domain or single residue levels. MAO and MACSIMS have been central to the development of numerous algorithms and/or web servers that exploit efficient and automated phylogenetic inference in the fields of sequence validation (vALId), automated update of functional family-specific multiple alignments (DbW), sequence annotation (GOAnno), comparative genomics (ARPAnno), 3D modelling (MAGOS) and promoter analysis (PromAn). All these developments are integrated through the GScope bioinformatics platform, developed in-house and optimized for the automatic treatment and exploitation of large-scale datasets.
In addition, the phylogenetic inference reasoning provided by MACSIMS has been exploited in the new version of our multiple alignment benchmark, BAliBASE3, which provides high quality, manually refined, reference alignments based on 3D structural superpositions and includes new, more challenging representative test cases that cover most of the protein fold space and that represent the real problems encountered when aligning large sets of complex sequences.
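To make the idea of benchmarking against reference alignments concrete, here is a minimal sketch of a sum-of-pairs style score, a measure commonly used with alignment benchmarks: the fraction of residue pairs aligned in the reference that are also aligned in the alignment under test. The function names and the toy sequences are illustrative assumptions, not part of BAliBASE or of its scoring program.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    alignment: list of equal-length gapped strings ('-' marks a gap).
    A residue is identified by (sequence index, ungapped position).
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    counters = [0] * len(alignment)  # next ungapped position per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for seq_index, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        # 'present' is ordered by sequence index, so pairs are canonical
        pairs.update(combinations(present, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

# Hypothetical toy data: the test alignment misplaces one residue pair.
reference = ["ACGT", "ACGT"]
test = ["ACG-T", "AC-GT"]
print(sum_of_pairs_score(test, reference))  # 0.75
```

A score of 1.0 means every pair aligned in the reference is recovered; real benchmark scoring typically restricts the comparison to reliably aligned core regions, which this sketch does not attempt.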
In the fields of analysis and exploitation of functional genomics data, we are continuing our efforts aimed at high-quality, automated and valid treatment and analysis of transcriptomics and CGH (Chromosomal Comparative Genomic Hybridization) data. This involves:
- The development of a novel portable method for Affymetrix data-filtering (Flush, in collaboration with Y Pawitan at the Karolinska Institute) using the original raw data, which includes information about the homogeneity of all individual probes in the analysis process and is completely independent of the method (RMA, MAS5, dChip, etc.) used for probe-set summarization,
- The development of a statistical model to discriminate CGH outliers that might indicate microevents,
- The development of the SET (Similarity Enhancing Transformation) method for analysis of multivariate data, such as transcriptomics data. SET simplifies large matrices by minimizing a mean square objective function allowing meta-analysis of microarray data from diverse origins.
- Finally, in order to provide a solution for autonomous routine statistical analysis in high-throughput projects, such as transfected cell arrays, transcription profiling or CGH experiments, we have developed RReportGenerator. RReportGenerator provides a simple and user-friendly graphical user interface (GUI) for running routine statistical analyses on the R platform, via predefined analysis scenarios, locally and independently. All results (text, figures and tables) are automatically assembled into report files and can be complemented by additional files ensuring compatibility with external applications (spreadsheet software, web browser...).
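The statistical model used for CGH outliers is not described here. Purely as a generic illustration of what flagging outlying probes can look like, the sketch below uses a robust z-score based on the median and the median absolute deviation (MAD). The function name, the cutoff of 3.5 and the log-ratio values are assumptions made for the example, not the method developed in the laboratory.

```python
from statistics import median

def flag_outliers(log_ratios, threshold=3.5):
    """Return the indices of values with a robust z-score above the threshold.

    The robust z-score is |x - median| / (1.4826 * MAD); the constant
    makes the MAD comparable to a standard deviation for normal data.
    """
    center = median(log_ratios)
    mad = median([abs(x - center) for x in log_ratios])
    if mad == 0:
        return []  # no spread to measure against
    scale = 1.4826 * mad
    return [i for i, x in enumerate(log_ratios)
            if abs(x - center) / scale > threshold]

# Hypothetical log-ratios for eight probes; one clearly stands apart.
values = [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, 1.20, 0.00]
print(flag_outliers(values))  # [6]
```

Median-based statistics are used in the sketch because a single extreme probe would inflate a mean and standard deviation and could mask itself.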
In the field of database developments, we have designed a multi-level strategy encompassing various aspects of the problems encountered in the development of modern biomedical federative and relational databases.
- The BIRD (Biological Integration and Retrieval of Data) system allows the semi-automated creation and auto-configuration of a relational database in the framework of an original object-relational architecture. BIRD can simultaneously host heterogeneous data (flat files, images, plots, databases…) by providing a limited number of product mapping rules that allow fast and dynamic retrieval of the information. A generic configurable data model has been designed that allows the simultaneous integration of most of the major biological sequence, genomics, transcriptomics and ontology resources. BIRD is driven by a high-level language and query engine, based on SQL and a full-text engine, allowing the biologist to quickly extract knowledge without programming. The hosted data can be accessed by the community through a Web interface, a Java API or the BIRD-QL query engine exposed as an HTTP service.
- RetinoBase is a microarray database, analysis and visualization system allowing powerful queries to retrieve information about gene expression in the retina. Data obtained from private or publicly available databases or repositories are automatically curated, treated, analyzed and clustered by different optimized scenarios encompassing public or home-developed algorithms and software. Currently, RetinoBase contains datasets from 27 different experiments performed in 4 different model systems (human, mouse, rat, zebrafish), processed with 3 different normalization programs and up to 3 distinct clustering methods.
- The EVI-Genoret Database is a federative relational database aimed at providing an infrastructure and templates for the management, storage, mining and integration of any data or knowledge resulting from the functional genomics of the retina in development, health and disease. As part of the European Integrated Project EVI-GENORET, the EVI-Genoret Database involves a large variety of data, heterogeneous in nature, format and informational content, provided by distinct experts including clinicians, geneticists, molecular biologists, computer scientists… To tackle this problem, the database has been designed around 3 main axes of hierarchized data organization and treatment: the Genes-related data, which encompass any information or knowledge that can be directly or indirectly related to a gene (mutation, expression, localization…); the Biological Pictures, which concern patient-related and clinical data as well as biological features linked to the retina in the framework of development or disease (eye fundus images, electroretinograph…); and the Standards and Protocols, which describe how given data were obtained, information that is crucial for quality testing, data validation and the future establishment and diffusion of de facto standards.
The bioanalysis axis is characterized by the use of information and bioinformatics tools in the framework of specific biological and biomedical studies, notably human disease. In this context, various important results have been obtained, notably in the understanding of retinal disease through the characterization of the RdCVF (Rod-Derived Cone Viability Factors) gene targets, which are involved in the trophic dependence existing between the rods and cones in the retina. In the context of coordinated analysis of functional genomics data from transcriptomics and proteomics origins, we have characterised various gene targets involved in human disease (Bardet-Biedl Syndrome, prostate cancer, head and neck carcinoma…). In the framework of the analysis of specific informational protein families involved in the regulation of gene transcription, a major insight has been obtained into the mode of action of nuclear receptors through the identification of an intramolecular communication pathway involving specific differentially conserved residues. Two distinct conservation patterns have been identified that partition the nuclear receptors into two classes exhibiting distinct oligomerization behaviour. This finding paves the way for an in-depth understanding of the cascade of interconnected reactions and regulation involving specific ligand and promoter recognition, oligomerization and transcriptional activation. Finally, following our developments in the field of data quality improvement and validation, we have successfully applied an original strategy for the detection of Interrupted CoDing Sequences (ICDS) in prokaryotic genomes, showing that numerous sequence errors are present in the sequence databases and implying that complementary biocomputing approaches are necessary to efficiently predict and annotate the gene information produced in the post-genomic era.
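The notion of differentially conserved residues can be illustrated with a small sketch: given an alignment split into two subfamilies, report the columns that are well conserved within each subfamily but carry a different residue in each. This is a simplified illustration under stated assumptions (the function names, the 0.9 conservation cutoff and the toy sequences are all hypothetical); it is not the analysis carried out on the nuclear receptors.

```python
from collections import Counter

def dominant_residue(column, min_fraction):
    """Return the most common residue if it reaches min_fraction, else None."""
    residue, count = Counter(column).most_common(1)[0]
    if residue == "-" or count / len(column) < min_fraction:
        return None
    return residue

def differentially_conserved(group_a, group_b, min_fraction=0.9):
    """List (column, residue_a, residue_b) for columns conserved within
    each group but with a different dominant residue in each group."""
    hits = []
    for col in range(len(group_a[0])):
        res_a = dominant_residue([seq[col] for seq in group_a], min_fraction)
        res_b = dominant_residue([seq[col] for seq in group_b], min_fraction)
        if res_a and res_b and res_a != res_b:
            hits.append((col, res_a, res_b))
    return hits

# Hypothetical aligned fragments for two subfamilies.
class_one = ["MKRLE", "MKRLD", "MKRLE"]
class_two = ["MKELE", "MKELS", "MKELA"]
print(differentially_conserved(class_one, class_two))  # [(2, 'R', 'E')]
```

Columns 0, 1 and 3 are conserved across both groups and therefore not reported, and column 4 is too variable to count as conserved in either group.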
Bioinformatics concerns most if not all projects presently developed within the IGBMC and the Génopole. From genes to drugs, the aim is to develop or adapt the necessary tools, maintain or develop databases and provide the human skill and experience, integrating biology as well as bioinformatics competences for the development of functional biology and genomics with a strong emphasis on structural biology and genomics.
The bioinformatics projects can be divided into three main centres of interest:
- the development of original software, tools and protocols for the real time maintenance, analysis, visualisation and organisation of genomic data, as well as specialised databases and tools for the exploitation of functional genomics data (in particular, originating from proteomics and DNA chips),
- the development of federative relational databases in the field of biomedical research,
- the development of functional and structural biology and genomics through contributions to various projects (identification of new targets for genomics, functional characterisation, annotation, …).
This aspect also naturally includes training of scientists and the maintenance of bioinformatics services open to the national and international scientific community.
- Following our developments concerning the introduction of phylogenetic inference reasoning in modern biology, we will implement new versions of the MAO and MACSIMS systems suitable for interactomics data integration and exploitation as well as for the identification, formalisation and exploitation of the genetic events that contribute to protein evolution. These developments will be tested and validated in the framework of national or international projects aimed at the analysis of the Muscular Interactome, the analysis of vertebrate evolution or the characterisation and analysis of the complete set of mammalian transcription factors. Efforts concerning the automated creation of high-quality hierarchized MACS (Multiple Alignments of Complete Sequences) will involve the creation of an expert system that defines optimized multiple alignment scenarios depending on the biological application of the MACS, evaluates the strengths and weaknesses of various algorithms and integrates specific sequence features, conservation patterns or phylogenetic distribution… These developments will be realized in the framework of our GScope bioinformatics platform, which ensures unification and interoperability.
- The development of federative relational databases will involve: firstly, the improvement of the BIRD system deployment through the integration of new data and databases, notably from interactomics and human genetics; and secondly, the creation of new federative databases based on the architecture developed in the EVI-Genoret database project and dedicated to the analysis of the role of the actin cytoskeleton in the Epithelium to Mesenchyme Transition process and to the study of the Muscular Interactome. These developments will be performed in the framework of European integrated projects. In the EVI-Genoret database, special attention will be paid to the development of new tools aimed at the automated annotation and integration of patient and clinical data, thus ensuring efficient and simplified interconnection and querying of image or textual data.
- Complementary to our involvement in the identification and analysis of functional genomics data resulting from various human disease projects (retinal diseases, Bardet-Biedl Syndrome, muscular diseases, cancers), we have initiated an original project dedicated to the analysis of the cDNA and genome of Alvinella pompejana, a thermotolerant metazoan. This project aims to understand the mechanisms involved in adaptation to extreme conditions, notably temperature stress. This will involve not only the annotation, through our optimized computational tools, of thousands of proteins originating from an annelid, which represents a poorly studied phylum, but also the deployment of an original strategy to take advantage of the sequence, structure, function and evolution information resulting from the Alvinella project for the understanding of protein and genome evolution.