Line 1: |
Line 1: |
− | The BIRD System (BIRD, [http://decrypthon.u-strasbg.fr/birdweb/]; Biological Integration and Retrieval Data) was designed by Hoan Nguyen at the LBGI laboratory (Poch team) of IGBMC [http://www-igbmc.u-strasbg.fr], Strasbourg. | + | The BIRD System (Biological Integration and Retrieval Data) was designed by Hoan Nguyen at the LBGI laboratory (Poch team) of IGBMC [http://www-igbmc.u-strasbg.fr], Strasbourg. |
| ==What is the BIRD System== | | ==What is the BIRD System== |
− | ===Scientific Context===
| |
− | Since 2000, thanks to the availability of the human genome and the rapid progress of biotechnologies and information technologies, numerous large biomedical datasets have been generated. As a consequence, modern biomedical information corresponds to a high volume of heterogeneous data that is increasing exponentially (Statistics NCBI) and perhaps more importantly, that covers very different data types, including patient data (from phenotypic, environmental or behavioral origins), gene data (including genome environment, gene expression status, enzymatic activity, gene product modification…) and the processes, protocols or treatments used to generate the information. In this context, systemic approaches are now being developed to analyze and compare this huge amount of information, in order to identify genes and to predict their functions in the cascade of events and networks involved for example, in the emergence of a disease. This requires the development of dynamic and powerful systems to store, assemble, integrate and process very large datasets from different sources. Recently, the Decrypthon initiative (Decrypthon) has been instigated (resulting from a collaboration between AFM/CNRS/IBM) firstly to develop a computing grid that connects hundreds of processors installed in various data-processing centres at French universities and secondly, to facilitate access to the data for the scientific biological community. In the framework of the Decrypthon initiative, several biomedical projects are in progress requiring on the one hand, a large computational capacity and on the other hand, the deployment in the grid environment of a data integration system able to handle automatically large volumes of heterogeneous data and to quickly process complex queries and versioning management.
| |
− |
| |
| ===BIRD System Overview=== | | ===BIRD System Overview=== |
− | The BIRD System (Nguyen et al., CORIA 2008, Hermes Edition) was designed to manage large collections of biological data ([[Bird_Databases_List]]) and to perform intensive computation and simulation. BIRD has inherited some of the ideology of the Saada project [http://amwdb.u-strasbg.fr/saada/article.php3?id_article=32]. A generic configurable data model has been designed and allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). The integration rules allow the easy creation of a database according to semantic topics and real requirements. | + | The BIRD System was designed to manage large collections of biological data ([[Bird_Databases_List]]) and to perform intensive computation and simulation. BIRD has inherited some of the ideology of the Saada project [http://amwdb.u-strasbg.fr/saada/article.php3?id_article=32]. A generic configurable data model has been designed and allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). The integration rules allow the easy creation of a database according to semantic topics and real requirements. |
| BIRD is driven by a high level query engine (BIRD-QL), based on SQL and a full text engine allowing the biologist to quickly extract knowledge without programming. Thanks to such an engine, the system is capable of generating sub-databases in accordance with the real requirements of a given project. | | BIRD is driven by a high level query engine (BIRD-QL), based on SQL and a full text engine allowing the biologist to quickly extract knowledge without programming. Thanks to such an engine, the system is capable of generating sub-databases in accordance with the real requirements of a given project. |
| | | |
Line 15: |
Line 12: |
| | | |
| | | |
− | The first goal of the BIRD System is the implementation of the Décrypthon Data Center [http://decrypthon.u-strasbg.fr/birdweb/] [http://decrypthon-1.ens-lyon.fr:8080/birdweb] in the framework of the Décrypthon Programme (AFM/CNRS/IBM) [http://www.decrypthon.fr] | + | The first goal of the BIRD System is the implementation of the Décrypthon Data Center in the framework of the Décrypthon Programme (AFM/CNRS/IBM) [http://www.decrypthon.fr] |
| | | |
| ==[[BIRDQL]] Biological Query Language == | | ==[[BIRDQL]] Biological Query Language == |
Line 28: |
Line 25: |
| see more [[BIRDQL]] | | see more [[BIRDQL]] |
| | | |
− | ==[[BIRD Data Access Protocol]]s==
| |
− | Several protocols are available see more [[BIRD Data Access Protocol]]
| |
− |
| |
− | ==BIRD KDD-Knowledge Discovery ==
| |
− |
| |
− | BIRD databases are compatible with DB2 Intelligent Miner.
| |
− |
| |
− |
| |
− | ===Theories and Functionalities===
| |
− |
| |
− | KDD Steps
| |
− | [[Image:kddstep.jpg]]
| |
− |
| |
− |
| |
− | KDD Technique & Algorithm
| |
− | [[Image:algo3.jpg]]
| |
− |
| |
− | KDD Data Model & View
| |
− | [[Image:modelview.jpg]]
| |
− |
| |
− |
| |
− | ====Association rule learning====
| |
− | a.'''What Is Association Rule Mining?'''
| |
− |
| |
− | Describing association relationships among the attributes in the set of relevant data
| |
− |
| |
− | Frequent pattern mining: find all frequent patterns in a database
| |
− |
| |
− | Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
| |
− |
| |
− | Frequent pattern mining: finding regularities in data
| |
− |
| |
− | +What products were often purchased together? Beer and diapers?!
| |
− |
| |
− | +What are the subsequent purchases after buying a product (e.g. a car)?
| |
− |
| |
− | +Can we automatically profile a patient or a gene?
| |
− |
| |
− | Example in BIRD-QL
| |
− |
| |
− | [[Image:birdqlrules.jpg]]
| |
− |
| |
− | b.'''Basic'''
| |
− |
| |
− | Rule Definition
| |
− |
| |
− | Body ==> Consequent [ Support , Confidence ]
| |
− | (IF <> THEN <>)
| |
− | Body: represents the examined data.
| |
− | Consequent: represents a discovered property for the examined data.
| |
− | Support: represents the percentage of the records satisfying both the body and the consequent.
| |
− | Confidence: represents the ratio of the number of records satisfying both the body and the
| |
− | consequent to the number of records satisfying the body.
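As a minimal sketch of the rule measures defined above, the following Python snippet computes support and confidence over a toy list of transactions. The function names and the transaction data are illustrative assumptions; they are not part of BIRD or of DB2 Intelligent Miner. Support is taken here as the fraction of records that satisfy both the body and the consequent.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               body: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Support of (body and consequent) divided by the support of the body."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both = frozenset(body) | frozenset(consequent)
    return support(transactions, both) / body_support


# Hypothetical toy transactions (market-basket style).
TOY_BASKETS = [
    frozenset({"beer", "diapers", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"beer", "bread"}),
    frozenset({"diapers", "milk"}),
]

if __name__ == "__main__":
    # Rule: beer ==> diapers
    print(support(TOY_BASKETS, {"beer", "diapers"}))
    print(confidence(TOY_BASKETS, {"beer"}, {"diapers"}))
```

With these four baskets, the rule "beer ==> diapers" holds in two of the four records, and in two of the three records that contain beer.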
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | Itemset: a set of items
| |
− |
| |
− | =>E.g., acm={a, c, m}
| |
− |
| |
− | Support of itemsets
| |
− |
| |
− | =>Sup(acm)=3
| |
− |
| |
− | Given min_sup=3, acm is a frequent pattern
| |
− |
| |
− | Frequent pattern mining: find all frequent patterns in a database
| |
− |
| |
− |
| |
− | [[Image:rulesbasic.jpg]]
| |
− |
| |
− |
| |
− |
| |
− | c.'''Apriori Algorithm'''
| |
− |
| |
− |
| |
− | Ck: Candidate itemset of size k
| |
− | Lk : frequent itemset of size k
| |
− |
| |
− | L1 = {frequent items};
| |
− | for (k = 1; Lk != ∅; k++) do
| |
− | Ck+1 = candidates generated from Lk;
| |
− | for each transaction t in database do increment the count of all candidates in Ck+1 that are
| |
− | contained in t
| |
− | Lk+1 = candidates in Ck+1 with min_support
| |
− | return ∪k Lk; (union of all Lk)
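The pseudocode above can be sketched in Python as follows. This is a simplified, unoptimized illustration, not the implementation used by BIRD or DB2 Intelligent Miner: the function name, the use of an absolute count for min_support, and the toy transactions (chosen so that Sup(acm) = 3) are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: Iterable[Iterable[str]],
            min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent itemset with its absolute support count."""
    rows: List[FrozenSet[str]] = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts: Dict[FrozenSet[str], int] = {}
    for row in rows:
        for item in row:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}

    frequent = dict(current)
    k = 1
    while current:  # loop until Lk is empty
        # C(k+1): join pairs of frequent k-itemsets, then prune candidates
        # that have an infrequent k-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # Count each candidate contained in each transaction.
        counts = {c: 0 for c in candidates}
        for row in rows:
            for cand in candidates:
                if cand <= row:
                    counts[cand] += 1

        # L(k+1): candidates that reach the minimum support.
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)  # accumulate the union of all Lk
        k += 1
    return frequent


# Hypothetical toy database; each string is one transaction of single-letter items.
TOY_TRANSACTIONS = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]

if __name__ == "__main__":
    result = apriori(TOY_TRANSACTIONS, min_support=3)
    for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print("".join(sorted(itemset)), count)
```

The candidate counting step scans every transaction for every candidate, which is fine for a toy example but is exactly the cost that production miners work to reduce.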
| |
− |
| |
− | [[Image:Apriori.jpg]]
| |
− |
| |
− | ====Kohonen's feature maps====
| |
− | A Kohonen self-organizing feature map (K-map) draws on an analogy with biological neural
| |
− | structures, where the placement of neurons is orderly and reflects the structure of external (sensed)
| |
− | stimuli (e.g. in auditory and visual pathways).
| |
− |
| |
− | A K-map learns, when continuous-valued input vectors are presented to it, without specifying the
| |
− | desired output. The weights of connections can adjust to regularities in the input. A large number of
| |
− | examples is needed.
| |
− |
| |
− | A K-map mimics learning in biological neural structures well. It is used in speech recognizers.
| |
− |
| |
− | This is a flat (two-dimensional) structure with connections between neighbors and connections
| |
− | from each input node to all its output nodes.
| |
− |
| |
− | It learns clusters of input vectors without any help from a teacher. It also preserves closeness (topology).
| |
− |
| |
− | '''Learning in K-maps'''
| |
− |
| |
− | 1. Initialize weights to small random numbers and set initial radius of neighborhood of nodes.
| |
− |
| |
− | 2. Get an input x1, …, xn.
| |
− |
| |
− | 3. Compute distance dj to each output node:
| |
− | <math>d_j = \sum_{i=1}^{n} (x_i - w_{ij})^2</math>
| |
− |
| |
− | 4. Select output node s with minimal distance ds.
| |
− |
| |
− | 5. Update weights for the node s and all nodes in its neighborhood:
| |
− | <math>w_{ij}' = w_{ij} + h\,(x_i - w_{ij})</math>, where h < 1 is a gain that decreases over time.
| |
− |
| |
− | Repeat steps 2 - 5.
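The learning steps above can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the K-map implementation used by BIRD or DB2 Intelligent Miner: the function names, the linear decay of the gain h and of the neighborhood radius, and the toy input vectors are all hypothetical choices.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]


def best_matching_unit(weights: Dict[Node, List[float]], x: Sequence[float]) -> Node:
    """Steps 3 and 4: return the node with minimal squared distance to x."""
    return min(
        weights,
        key=lambda node: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[node])),
    )


def train_som(data: Sequence[Sequence[float]], rows: int, cols: int,
              epochs: int = 50, h0: float = 0.5, seed: int = 0) -> Dict[Node, List[float]]:
    """Train a rows x cols map; returns the weight vector of each grid node."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: small random weights and an initial neighborhood radius.
    weights = {(r, c): [rng.uniform(0.0, 0.1) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2

    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs      # assumed linear schedule
        h = h0 * decay                    # gain h < 1, decreasing in time
        radius = radius0 * decay          # shrinking neighborhood
        for x in data:                    # Step 2: present an input vector
            winner = best_matching_unit(weights, x)
            # Step 5: move the winner and its grid neighbors toward x.
            for node, w in weights.items():
                if math.dist(node, winner) <= radius:
                    for i in range(dim):
                        w[i] += h * (x[i] - w[i])
    return weights


# Hypothetical toy input: two well separated groups of 2-D vectors.
TOY_VECTORS = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.9)]

if __name__ == "__main__":
    som = train_som(TOY_VECTORS, rows=1, cols=2)
    for vector in TOY_VECTORS:
        print(vector, "->", best_matching_unit(som, vector))
```

No labels are passed to the training function, which reflects the point made above: the map groups input vectors without a teacher.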
| |
− |
| |
− | ===DB2 Intelligent Miner (API)===
| |
− |
| |
− | Data flow of the mining procedure (FindDeviations ex.)
| |
− | [[Image:kdd_model.jpg]]
| |
− | Finding deviations
| |
− |
| |
− | Finding groups with similar characteristics (ClusterTable procedure)
| |
− |
| |
− | You can find groups with similar characteristics using the ClusterTable procedure.
| |
− | When to do it:
| |
− | The database might contain patient data including demographic data, for example gender, age,
| |
− | profession and family status. The information might also include the income or the socio-demographic group of the customer.
| |
− |
| |
− |
| |
− | Finding relationships (FindRules procedure) You can find relationships in your data using the FindRules procedure.
| |
− |
| |
− |
| |
− | Predicting future behavior (PredictColumn procedure)
| |
− |
| |
− | In the tables or views of your database (transcriptomic or clinical data), there might
| |
− | be one column that you are particularly interested in. In the clinical data, you can find
| |
− | relations between symptoms and diseases. With this information, you can predict the potential diseases of new patients
| |
− |
| |
− | Finding most important fields (FindMostImpFields procedure)
| |
− |
| |
− | You can find the most important fields using the FindMostImpFields procedure.
| |
− |
| |
− |
| |
− | Example in BIRD-QL
| |
− |
| |
− | [[Image:deviation.jpg]]
| |
− |
| |
− |
| |
− | ==[[MAP Semantic]]==
| |
− | [[Image:Carte.jpg]]
| |
− |
| |
− | The [[BIRD]] data warehouse will be equipped with various tools aimed at visualizing in a semantic manner the large volume of data it contains. Typically, clustering tools or self-organizing maps can be produced to visualize “land maps” representing the distribution of genes and their various annotations in the warehouse (protein families, organism, motif composition, 3D structure, genetic disease, etc.). This visualisation will be exploited to generate semantic networks that will contribute to the construction of the semantic framework of the project. In particular it should be helpful for guiding the subsequent relational data mining step.
| |
− |
| |
− | This project (image above) is under development at IGBMC (Nicolas Wicker & Hoan Nguyen , Jeremy Trouslard, Julien Cadet...)
| |
− |
| |
− | ==[[Decrypthon Data Center]]==
| |
− |
| |
− |
| |
− |
| |
− | ===Overview===
| |
− |
| |
− | [[Image:ddc_idea.jpg]]
| |
− |
| |
− | The BIRD System represents the core of the Décrypthon Data Center.
| |
− | Sharing of large scale biological data for applications (Macsims, MS2PH, Magos, Ordalie..)
| |
− | Running on the Décrypthon Grid.
| |
− | Management of generated data (results) on the Grid
| |
− | Sharing of data and services for the scientific community
| |
− | http://decrypthon.u-strasbg.fr/birdweb/
| |
− |
| |
− | ==MACSIMS uses the BIRDQL engine==
| |
− | MACSIMS:Multiple Alignment of Complete Sequences Information Management System (Thompson et al, 2006). MACSIMS provides a unique environment for the analysis of all the information related to a given protein family, facilitating knowledge extraction and the presentation of the most pertinent information to the biologist.
| |
− |
| |
− | Macsims uses a direct connection to the Bird database
| |
− |
| |
− | ==GPS uses the BIRDQL engine==
| |
− | http://gps.nucleic.fr
| |
− |
| |
− | ==Gscope uses BIRD==
| |
− | Gscope can now establish a direct connection with the Bird system
| |
− |
| |
− |
| |
− | * proc '''BirdFromQueryText''' {Texte {OutFile ""} {BirdUrl ""}}
| |
− | * proc '''BirdFromQueryFile''' {Fichier {OutFile ""} {BirdUrl ""}}
| |
− |
| |
− | In addition, BIRD can integrate information files from a Gscope project. The user can then query the files directly either by http or by Gscope, or even better, using the command '''BirdGscopeSearch'''
| |
− |
| |
− | ==[[BIRD Development]]==
| |
− | see more [[BIRD Development]]
| |
− |
| |
− | ==[[BIRD KDE or ILBLab]]==
| |
− |
| |
− | ILPLab is an inductive logic programming (http://www.doc.ic.ac.uk/~shm/ilp.html) laboratory [[ILPLab]]
| |
− |
| |
− | ==Publications==
| |
− |
| |
− |
| |
− |
| |
− | 1. Nguyen H., Berthommier G., Friedrich A., Poidevin L. ,Ripp R. , Moulinier L. and Poch O. Introduction du nouveau centre de données biomédicales Décrypthon, CORIA 2008, Hermes Edition. See PDF, [http://bird.u-strasbg.fr:8080/bird/temp/BIRDFinalCoria08.pdf]
| |
− |
| |
− | 2. Nguyen N-H.*, Wicker N.*., Kieffer D, Poch O. (2010). “A new projection method for biological semantic map generation.” J. Biomedical Science and Engineering, 2010, 3, 13-19., [http://www.scirp.org/Journal/Abstract.aspx?paperID=1130&JournalID=30].
| |
− | '*' These authors contributed equally to this work
| |
− |
| |
− | 3. Friedrich A.*, Garnier N.*, Gagnière N., Nguyen H., Albou LP., Biancalana V., Bettler E., Deléage G., Lecompte O., Muller J., Moras D., Mandel JL., Toursel T., Moulinier L., Poch O.
| |
− | SM2PH-db[http://decrypthon.igbmc.fr/sm2ph/cgi-bin/home]: an interactive system for the integrated analysis of phenotypic consequences of missense mutations in proteins involved in human genetic diseases.
| |
− | Hum Mutat. 2009 Nov 17. (PMID: 19921752)[http://www3.interscience.wiley.com/journal/122684513/abstract].
| |
− | '*' These authors contributed equally to this work
| |
− |
| |
− | 4. Analyse de données transcriptomiques: Modélisation floue de profils d'expression différentielle et analyse fonctionnelle.
| |
− | Benabderrahmane S., Devignes M.-D., Smaïl-Tabbone M., Poch O., Napoli A., Nguyen N.-H N., Raffelsberger W.
| |
− | Actes du XXVIIième congrès Informatique des Organisations et Systèmes d'information et de décision - INFORSID 2009, France (2009) [inria-00394530 − version 1]
| |
− |
| |
− | 5. Nguyen H., Michel L., Motch C. (2006). “Building an Astronomical Database with Saada”, Astronomical Data Analysis Software and Systems XV, Madrid, Spain, Astronomical Society of the Pacific, ASP Conference Series, vol. 351.
| |
− |
| |
− |
| |
− | 6. Discovering knowledge hidden in mutation data using Inductive Logic Programming, is preparing for...
| |
− | (Tien-Dao Luu, Ngoc-Hoan Nguyen, Anne Friedrich, Jean Muller, Luc Moulinier and Olivier Poch)
| |
− |
| |
− |
| |
− | 7. N. BARD , R. BOLZE, E. CARON, F. DESPREZ. M. HEYMANN, A. FRIEDRICH, L. MOULINIER, N.H. NGUYEN, O. POCH, T. TOURSEL. "Décrypthon Grid Resources Dedicated to Neuromuscular Disorders" (2010). Studies in Health Technology and Informatics published by IOS Press. (paper accepted, in preparation).
| |
− | All authors contributed equally to this work.
| |
− |
| |
− | 8. "Conception of the BIRD System" is preparing for .....
| |
− |
| |
− |
| |
− | 9. "BIRDQL-A new Biological Query Language " is preparing for....
| |
− |
| |
− | ==Powerpoint Presentations of BIRD System & SM2PH& DDC ==
| |
− |
| |
− | 1. BIRD System presentation (Decrypthon Meeting, ENS-Lyon, 11 May 2007). See ppt, [http://bird.u-strasbg.fr:8080/bird/temp/DECRYPTHON_BIRD_IBM.ppt]
| |
− |
| |
− | 2. Semantic Map and BIRD System (poster, APBC, Beijing 09). See ppt, [http://bird.u-strasbg.fr:8080/bird/temp/SemanticMapPekin09.ppt]
| |
− |
| |
− | 3. BIRD System presentation to IBM Watson Lab (online demo, Strasbourg, Mar 2009). See ppt, [http://bird.u-strasbg.fr:8080/bird/temp/BIRDSystemDemo_IBM.ppt]
| |
− |
| |
− |
| |
− |
| |
− | 4. Decrypthon : From “omics” grid-computing facilities towards medical bioinformatics , See ppt, [http://bird.u-strasbg.fr:8080/bird/temp/BIRD_SM2PH_080409.ppt]
| |
− |
| |
− | ==Contact==
| |
− | Nguyen Ngoc Hoan,PhD
| |
− | IGBMC Strasbourg
| |
− | 1 rue Laurent Fries
| |
− | BP 10142
| |
− | 67404 Illkirch CEDEX / France
| |
− | Mail:[mailto:nguyen@igbmc.fr nguyen@igbmc.fr]
| |
− | Tel: 0033 388653302
| |
− | --[[User:Nguyen|Nguyen]] 15:07, 16 February 2008 (CET)---
| |
| | | |
− | ==FAQ?==
| |
| | | |
| [[Category:Bird_project]] | | [[Category:Bird_project]] |
The BIRD System (Biological Integration and Retrieval Data) was designed by Hoan Nguyen at the LBGI laboratory (Poch team) of IGBMC[1], Strasbourg.
What is the BIRD System
BIRD System Overview
The BIRD System was designed to manage large collections of biological data (Bird_Databases_List) and to perform intensive computation and simulation. BIRD has inherited some of the ideology of the Saada project [2]. A generic configurable data model has been designed and allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). The integration rules allow the easy creation of a database according to semantic topics and real requirements.
BIRD is driven by a high level query engine (BIRD-QL), based on SQL and a full text engine allowing the biologist to quickly extract knowledge without programming. Thanks to such an engine, the system is capable of generating sub-databases in accordance with the real requirements of a given project.
The hosted data can be accessed by the community using various methods such as a Web interface, an HTTP service, a Java API or a BIRD-QL engine query.
The BIRD System is developed in Java and uses IBM DB2 as the data server, together with WebSphere Federation Server for virtual databases. The web application is hosted either by a Tomcat server or by a WebSphere Application Server.
The BIRD System is not only a data retrieval tool; it also provides a platform for Knowledge Discovery in Biological Databases, or an inductive database. We use the IBM Intelligent Miner (association rules, classification, etc.) to develop the data mining model. The user can then use BIRD-QL to mine pertinent information or to analyze the relational patterns based on the descriptive patterns available in the BIRD-QL engine.
The first goal of the BIRD System is the implementation of the Décrypthon Data Center in the framework of the Décrypthon Programme (AFM/CNRS/IBM) [3]
BIRDQL Biological Query Language
The heterogeneous data integrated in the BIRD System are represented by several relational tables. The exploitation of these data by SQL queries is not obvious and can only be performed by expert developers or computer scientists.
In this context, building complex queries with SQL involves the use of joins (technical term) to select data in multiple tables. This complexity can be hidden by HTML forms, but many types of queries cannot be specified with HTML forms.
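To illustrate the kind of join that the paragraph above refers to, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for DB2 purely for convenience, and the table names, column names and sample rows are hypothetical; they do not reflect the actual BIRD schema.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with three small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT NOT NULL);
        CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE gene_disease (gene_id INTEGER, disease_id INTEGER);

        INSERT INTO gene VALUES (1, 'GENE_A'), (2, 'GENE_B'), (3, 'GENE_C');
        INSERT INTO disease VALUES (1, 'disease_x'), (2, 'disease_y');
        INSERT INTO gene_disease VALUES (1, 1), (2, 1), (3, 2);
        """
    )
    return conn


def genes_for_disease(conn: sqlite3.Connection, disease_name: str) -> List[str]:
    """Answer 'which genes are linked to this disease?' with two joins."""
    query = """
        SELECT g.symbol
        FROM gene AS g
        JOIN gene_disease AS gd ON gd.gene_id = g.id
        JOIN disease AS d ON d.id = gd.disease_id
        WHERE d.name = ?
        ORDER BY g.symbol
    """
    return [row[0] for row in conn.execute(query, (disease_name,))]


if __name__ == "__main__":
    connection = build_demo_db()
    print(genes_for_disease(connection, "disease_x"))
```

Even this simple question requires the author of the query to know three table names and the keys that link them, which is the knowledge a non-specialist user typically lacks.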
We have therefore developed our own query language, BIRDQL, a new biological query language that allows the biologist or clinician to create data retrieval protocols without requiring exhaustive knowledge of the data sources and their architecture. BIRDQL makes it possible for biologists to easily express queries and to extract knowledge using classical constraints and scientific functions (StructuralDistance, SequencePattern, AssociationRule...).
BIRDQL is not a mathematically complete language but instead is an idiom that is adapted to the GUI and is human readable enough to be modified by hand.
see more BIRDQL