Difference between revisions of "BIRD"

 
(140 intermediate revisions by 3 users not shown)
Line 1: Line 1:
BIRD System (BIRD): Biological Integration and Retrieval Data was designed by Hoan Nguyen at LBGI laboratory (POCH Team) of IGBMC[http://www-igbmc.u-strasbg.fr] Strasbourg  
+
BIRD System : Biological Integration and Retrieval Data was designed by Hoan Nguyen at LBGI laboratory (POCH Team) of IGBMC[http://www-igbmc.u-strasbg.fr] Strasbourg  
==What is BIRD System==
+
==What is the BIRD System==
===Scientific Context===
 
Since 2000, thanks to the availability of the human genome and the rapid progress of biotechnologies and information technologies, numerous large biomedical datasets have been generated. Thus, modern biomedical information corresponds to a high volume of heterogeneous data that doubles in size every year (Statistics NCBI) and that covers very different data types, including patient data (from phenotypic, environmental or behavioral origins), gene data (including genome environment, gene expression status, enzymatic activity, gene product modification…) and the processes, protocols or treatments used to generate the information. In this context, systemic approaches are now being developed to analyze and compare this huge amount of information, in order to identify genes and to predict their functions in the cascade of events and networks involved for example, in the emergence of a disease. This requires the development of dynamic and powerful systems to store, assemble, integrate and process very large datasets from different sources. Recently, the Decrypthon initiative (Decrypthon), resulting from a collaboration between AFM/CNRS/IBM, has been instigated, firstly to develop a computing grid that connects hundreds of processors installed in various data-processing centres of French universities and, secondly to provide a facilitated access to the data for the scientific biological community. In the framework of the Decrypthon initiative, several biomedical projects are in progress requiring on the one hand, a strong computational capacity and on the other hand, the deployment in the grid environment of a data integration system able to manage automatically large volumes of heterogeneous data and to quickly process complex queries and versioning management.
 
 
 
 
===BIRD System Overview===
 
===BIRD System Overview===
BIRD System (Nguyen et al, CORIA 2008, Hermes Edition) was designed to manage large collections of biological data and to intensive computation and simulation. BIRD heritages somes main idea of Saada project[http://amwdb.u-strasbg.fr/saada/article.php3?id_article=32]. A generic configurable data model has been designed and allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). The integration rules allow the easy creation of the database according to semantic topics and real requirements.  
+
The BIRD System was designed to manage large collections of biological data ([[Bird_Databases_List]]) and to perform intensive computation and simulation. BIRD has inherited some of the ideas of the Saada project [http://amwdb.u-strasbg.fr/saada/article.php3?id_article=32]. A generic configurable data model has been designed and allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). The integration rules allow the easy creation of a database according to semantic topics and real requirements.
BIRD is driven with a high level query engine, based on SQL and a full text engine allowing the biologist to quickly extract knowledge without programming. Thanks to such an engine, the system is capable to generate the sub-bank of data in accordance with the real requirement.  
+
BIRD is driven by a high level query engine (BIRD-QL), based on SQL and a full text engine allowing the biologist to quickly extract knowledge without programming. Thanks to such an engine, the system is capable of generating sub-databases in accordance with the real requirements of a given project.  
  
 
The hosted data can be accessed by the community using various methods such as a Web interface, Http Service, an API Java or a BIRD-QL Engine Query.  
 
The hosted data can be accessed by the community using various methods such as a Web interface, Http Service, an API Java or a BIRD-QL Engine Query.  
  
BIRD System is developed with the Java technology. BIRD System uses IBM DB2 for data server; Websphere Federtion Server for virtual databases. The web application is hosted by a Tomcat Server or by a WebSphere Application Server.  
+
The BIRD System is developed using the Java technology and uses the IBM DB2 as the data server, as well as the Websphere Federation Server for virtual databases. The web application is hosted either by a Tomcat Server or by a WebSphere Application Server.  
The BIRD System is not only a data retrieval system but also a platform for Knowledge Discovery in Biological Databases. We use the IBM Intelligent Miner (association rules, classification, ...) to develop the data mining model.
 
 
 
 
 
The main goal of the BIRD System is the implementation of the Décrypthon Data Center [http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do] [http://decrypthon-1.ens-lyon.fr:9080/BirdSystem/HomePage.do] in the framework of the Décrypthon Programme (AFM/CNRS/IBM) [http://www.decrypthon.fr]
 
 
 
==DATABASES List ==
 
GENBANK, EST, WGS, REFSEQ, PDB, UNIPROT, UCSC, INTERPRO, GO, TAXONOMY, MACSIM, EVI-GENORET (local user), STRING (local user), UMD Data (local user), ...
 
 
 
==BIRDQL Biological Query Language ==
 
===BIRDQL in a few words===

The heterogeneous data integrated in the BIRD System are represented by several relational tables. Exploiting these data with SQL queries is not straightforward, except for developers and computer scientists.

Building SQL queries in this context is difficult because it requires joins (a technical term) to select data from multiple tables. This complexity can be hidden behind HTML forms, but many queries cannot be expressed with HTML forms.

We therefore propose our own query language (BIRDQL), a new biological query language that allows the biologist or clinician to create data retrieval protocols without exhaustive knowledge of the data sources and their architecture. The BIRD System is driven by this high-level query engine, which lets biologists easily express queries and extract knowledge using classical constraints and scientific functions (StructuralDistance, SequencePattern, AssociationRule...).

BIRDQL is not a mathematically complete language but rather an idiom adapted to the GUI, human-readable enough to be modified by hand.
 
 
 
===BIRDQL Grammar ===
 
 
 
ID  <list of id/ac/query_id > DB  <bank names>
 
 
 
WH  Field[http://d1.crihan.fr:8080/bird/bsearch?service=metadata&db=all] Contains (kw1 & kw2) | kw_n
 
 
 
WH  PATTERN <function SequencePattern() >
 
 
 
WH  PATTERN <function DiagonalMolecule()>
 
 
 
WH  PATTERN <function InteractionProtein()>
 
 
 
WH  PATTERN <function AssociationRule()>
 
 
 
FD  <Field[http://d1.crihan.fr:8080/bird/bsearch?service=metadata&db=all] out>
 
 
 
LM  <n>
 
 
 
FM  Fasta/Flat/Xml/CSV/Simple/Object/OID
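The clause keywords of the grammar (ID/DB, WH, FD, LM, FM) can be assembled programmatically. The sketch below is only an illustration: the class BirdQlQuery and its methods are hypothetical helpers, not part of the BIRD distribution, and it covers only the "contains" form of WH.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of BIRD): assembles BIRD-QL text clause by clause.
final class BirdQlQuery {
    private final List<String> clauses = new ArrayList<>();

    // ID <ids> DB <bank names>; "*" stands for all entries of the banks.
    BirdQlQuery(String ids, String... banks) {
        clauses.add("ID " + ids + " DB " + String.join(", ", banks));
    }

    // WH <field> contains <expression>; the expression is passed through unchanged.
    BirdQlQuery where(String field, String expression) {
        clauses.add("WH " + field + " contains " + expression);
        return this;
    }

    // FD <output fields>
    BirdQlQuery fields(String... names) {
        clauses.add("FD " + String.join(",", names));
        return this;
    }

    // LM <n>
    BirdQlQuery limit(int n) {
        clauses.add("LM " + n);
        return this;
    }

    // FM <output format>
    BirdQlQuery format(String fm) {
        clauses.add("FM " + fm);
        return this;
    }

    // One clause per line, in the order the clauses were added.
    String toText() {
        return String.join("\n", clauses);
    }

    public static void main(String[] args) {
        String q = new BirdQlQuery("*", "UNIPROT")
                .where("DE", "\"synthetase\" & \"tyrosyl\"")
                .where("OX", "382")
                .fields("AC", "ID", "DE", "OX", "SQ")
                .format("FASTA")
                .toText();
        System.out.println(q);
    }
}
```

The builder does no validation of field or bank names; it only fixes the clause layout, so the resulting text can still be edited by hand.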
 
 
 
===BIRDQL example===
 
The examples below show how to use the BIRD-QL syntax.
 
 
 
'''Example 1''': simple query, search and fasta format generation
 
 
 
 
 
ID * DB UNIPROT
 
 
 
WH DE contains "synthetase" & "tyrosyl"
 
 
 
WH OX contains 382
 
 
 
FD AC, ID,DE,OX,SQ
 
 
 
FM FASTA
 
 
 
 
 
Result
 
 
 
 
 
>Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) (Tyrosine--tRNA ligase) (TyrRS). | 382
 
MSEFKSDFLHTLSERGFIHQTSDDAGLDQLFRTETVTAYIGFDPTAASLHAGGLIQIMMLHWLQATGHRPISLMGGGTGMVGDPSFKDEARQLMTPETI...
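In the result above, the FASTA header appears to carry the requested FD fields (AC, ID, DE, OX) in order, separated by " | ". Assuming that layout holds and that the separator never occurs inside a field value, a header can be split back into its fields as sketched below. The class BirdFastaHeader is a hypothetical helper, not part of the BIRD API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of BIRD): splits a BIRD FASTA header into its fields.
final class BirdFastaHeader {

    // Assumes the header is ">" followed by field values separated by "|".
    static List<String> parse(String headerLine) {
        String line = headerLine.trim();
        if (!line.startsWith(">")) {
            throw new IllegalArgumentException("not a FASTA header: " + headerLine);
        }
        // Split on "|" and drop the whitespace around each separator.
        return Arrays.asList(line.substring(1).trim().split("\\s*\\|\\s*"));
    }

    public static void main(String[] args) {
        List<String> f = parse(">Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) "
                + "(Tyrosine--tRNA ligase) (TyrRS). | 382");
        System.out.println("AC=" + f.get(0) + " ID=" + f.get(1) + " OX=" + f.get(3));
    }
}
```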
 
 
 
'''Example 2''': complex query
 
 
 
ID * DB GENBANK, REFSEQ
 
 
 
WH OC Contains "Eukaryote"
 
 
 
WH DR Contains "GO"
 
 
 
WH GENE contains "GF100027"
 
 
 
FM FASTA
 
 
 
The query above searches GenBank and RefSeq for eukaryotic sequences containing the GF100027 gene with a cross-reference to Gene Ontology.
 
 
 
 
 
 
 
 
'''Example 3''': mining in EST
 
 
 
ID * DB GBEST
 
 
 
WH TISSUE_TYPE contains "retina"
 
 
 
WH DEV_STAGE contains "adult"
 
 
 
LM 100
 
 
 
FD AC,DE,OX,OC,tissue_type,dev_stage,chr
 
 
 
FM FLAT
 
 
 
'''Example 4''': Mining in EST
 
 
 
ID CJ133635,CJ133593,CJ133659 DB GBEST
 
 
 
WH DE contains "AMINOTRANSFERASE"
 
 
 
WH OC contains  "Eukaryota" & not "Metazoa"
 
 
 
WH TISSUE_TYPE contains "retina"
 
 
 
FD AC,DE,OX,OC,tissue_type,dev_stage,chr
 
 
 
FM FLAT
 
 
 
 
 
'''Example 5''': Mining in EST
 
 
 
ID * DB GBEST
 
 
 
WH TISSUE_TYPE contains "colon"
 
 
 
WH DEV_STAGE contains "adult"
 
 
 
LM 100
 
 
 
FD AC,DE,OX,OC,tissue_type,dev_stage,chr,os
 
 
 
FM FLAT
 
 
 
 
 
'''Example 6''': Mining In PDB
 
 
 
ID * DB PDB
 
 
 
WH DE contains "ERYTHRINA CORALLODENDRON LECTIN IN COMPLEX"
 
 
 
WH OS contains "ERYTHRINA CORALLODENDRON"
 
 
 
WH RESO contains 1.90
 
 
 
LM 10
 
 
 
FM FASTA
 
 
 
 
 
'''Example 7''': running SQL Native
 
 
 
ID * DB STRING
 
 
 
WH SQLNATIVE select * from items.proteins
 
 
 
Limit 100
 
 
 
FM CSV
 
 
 
 
 
 
 
 
 
 
==Data Access Protocols==
 
===Data Browsing at Décrypthon Data Center===
 
Database content can be browsed from the BIRD System web interface:
 
Node ENS-Lyon: [http://decrypthon-1.ens-lyon.fr:9080/BirdSystem/HomePage.do]
 
Node IGBMC: [http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do]
 
 
 
===Data Selection by BIRD-QL Service===
 
Data can also be selected with BIRD-QL queries, which expert users can modify by hand. Three query services are available:
 
      1. curl -F upload=@your_bird.ql 'http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql'
 
 
 
 
 
      2. http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=your_birdql
 
      Example:  http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=
 
                ID * DB Uniprot--WH DE contains "histone"--LM 10--FD AC,DE--FM FLAT
 
                http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=
 
                ID * DB Uniprot--WH DE contains "Helianthinin"--LM 10--FM FASTA
 
 
 
 
 
      3. BIRD-QL Editor (in pres).
 
Users can run this engine for intensive computation; download [birdql cmd].
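In the GET form of the service (method 2), the examples join the BIRD-QL clauses with "--" and pass them in the query parameter. The sketch below builds such a URL. The class BirdQlUrl is a hypothetical helper, and the form-style URL encoding applied to the query (spaces as "+", quotes as %22) is an assumption: the page does not state which encoding the server expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of BIRD): builds the GET URL of the BIRD-QL service.
final class BirdQlUrl {
    static final String SERVICE =
            "http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=";

    // Joins the clauses with "--" as in the examples, then URL-encodes the query.
    // Assumption: the server accepts a form-encoded query string.
    static String toUrl(String... clauses) {
        String query = String.join("--", clauses);
        try {
            return SERVICE + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toUrl(
                "ID * DB Uniprot",
                "WH DE contains \"histone\"",
                "LM 10",
                "FD AC,DE",
                "FM FLAT"));
    }
}
```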
 
 
 
===Simple Services-Bank ID===
 
 
 
Service :   
 
http://bird.u-strasbg.fr:8080/bird/bsearch?db=<database>&accession=<ac or id>&field=<DE,OS..> &format=<fasta/flat>
 
 
 
Example 1: get EST
 
      http://bird.u-strasbg.fr:8080/bird/bsearch?db=gbest&accession=Cj133605&field=DE,OS,OC,TISSUE_TYPE,DEV_STAGE
 
 
 
Example 2: get Protein :   
 
      http://bird.u-strasbg.fr:8080/bird/bsearch?db=uniprot&accession=Q23456
 
 
 
Example 3:  get PDB :       
 
    http://bird.u-strasbg.fr:8080/bird/bsearch?db=pdb&accession=1XDS
 
 
 
Example 4: get Fasta : 
 
    http://bird.u-strasbg.fr:8080/bird/bsearch?db=pdb&accession=1XDS&format=fasta
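The Simple Service URLs above follow one pattern: db and accession are always present, field and format are optional. A minimal sketch of that pattern follows. The class BirdSimpleService is a hypothetical helper, and it assumes the values are plain identifiers that need no URL encoding.

```java
// Hypothetical helper (not part of BIRD): builds a Simple Service URL.
final class BirdSimpleService {
    static final String BASE = "http://bird.u-strasbg.fr:8080/bird/bsearch";

    // fields and format may be null or empty, in which case they are omitted.
    static String url(String db, String accession, String fields, String format) {
        StringBuilder sb = new StringBuilder(BASE)
                .append("?db=").append(db)
                .append("&accession=").append(accession);
        if (fields != null && !fields.isEmpty()) {
            sb.append("&field=").append(fields);
        }
        if (format != null && !format.isEmpty()) {
            sb.append("&format=").append(format);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Reproduces Example 4 (PDB entry in FASTA format).
        System.out.println(url("pdb", "1XDS", null, "fasta"));
    }
}
```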
 
 
 
===WEB Server===
 
    http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do  (firefox)
 
 
 
===API JAVA - BIRDQL Client===
 
 
 
The API is a programming interface that defines how one software component communicates with another. The BIRD Java API contains reusable classes that external modules can use to access the databases. Its functions (methods) return the selected data in various formats.

Advanced users can use the API to develop new functionality that exploits the data. It can also be used to build custom graphical interfaces and Web Services. The Java code below illustrates the use of the BIRD API.

The BIRDQL engine does not return data, only the OIDs of the selected records. The content of each record must then be retrieved through the API.
 
 
 
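 // BIRD API usage; class and method names are those given on this page
 import java.util.Vector;
 import org.igbmc.bird.*;

 class ExampleUtilisationAPI {
   public static void main(String[] args) {
     InterfaceDB birddb = new InterfaceDB("my-bird");
     // BIRD-QL query: clauses are separated by "--", FM OID requests OIDs only
     String birdql = "ID * DB UniProt"
                   + "--WH OS contains \"Mus mus\""
                   + "--WH OC contains \"Eukaryota\" & not \"Metazoa\""
                   + "--FM OID";
     Vector oids = birddb.queryengine.run(birdql);
     for (int i = 0; i < oids.size(); i++) {
       // result treatment: fetch the full record behind each OID
       UniProt obj = (UniProt) birddb.getObject(oids.get(i));
       // ...
     }
   }
 }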
 
 
 
 
 
// BIRDQL client

java org.igbmc.bird.datadiscovery.BirdQLClient birdql  nameServer outFile

@birdql    : name of the file containing your BIRD-QL query

@nameServer: name of the BIRD server (d1.crihan.fr or bird.u-strasbg.fr)

@outFile   : name of the file to which the result is written
 
 
 
==BIRD business intelligence ==
 
 
 
===Knowledge Discovery in Biological Databases===

===DB2 Intelligent Miner (API)===

===Use cases===

Transcriptomics
 
 
 
Protein Protein Interaction Pattern
 
 
 
==BIRD System in Action ==
 
===Décrypthon Data Center===
 
 
 
 
 
====Overview====
 
 
 
[[Image:ddc_idea.jpg]]
 
 
 
The BIRD System is the core of the Décrypthon Data Center.

  Sharing of large-scale biological data for applications (Macsim, MS2PH, Macgos, Ordali..)
  running on the Décrypthon Grid.
  Managing the data (results) generated on the grid
  Sharing of data and services with the scientific community
 
  http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do
 
 
 
 
 
[[Image:bird_ddc.jpg]]
 
 
 
===Macsim uses BIRDQL engine===
 
MACSIMS: Multiple Alignment of Complete Sequences Information Management System (Thompson et al, 2006). MACSIMS provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist.

Macsim connects directly to the BIRD database.
 
 
 
===GPS uses BIRDQL engine===
 
http://nucleic.fr
 
 
 
===Gscope uses BIRD===

Gscope can now connect directly to BIRD.
 
 
 
 
 
* proc '''BirdFromQueryText''' {Texte {OutFile ""} {BirdUrl ""}}
 
* proc '''BirdFromQueryFile''' {Fichier {OutFile ""} {BirdUrl ""}}
 
 
 
BIRD can integrate the info records of a Gscope project. They can then be queried directly over HTTP or from Gscope or, better, through displays with the command '''BirdGscopeSearch'''.
 
 
 
==Publications==
 
 
 
To cite the BIRD System, please use the following publication:

1. Nguyen H., Berthommier G., Friedrich A., Poidevin L., Ripp R., Moulinier L. and Poch O. Introduction du nouveau centre de données biomédicales Décrypthon, CORIA 2008.

2. "Conception of the BIRD System", in preparation for .....

3. "BIRDQL-A new Biological Query Language", in preparation for....
 
 
 
==BIRD Development ==
 
 
 
[[lbgiki:BIRD_implementation|BIRD Implementation]]
 
 
 
===Origin BIRD System (SAADA)===
 
BIRD is based on the main principles of the Saada project [http://amwdb.u-strasbg.fr/saada/article.php3?id_article=32].

  SAADA - Système d’Archivage Automatique des Données Astronomiques
  First goal: archiving and exploitation of the data of the European XMM-Newton satellite [http://xcatdb.u-strasbg.fr/2xmm/home]. The 2XMM catalogue of X-ray sources,
    the largest of its kind ever, has now been released.
  Developed in the framework of the PhD (2002-2005, prototype Saada V.1.3) of Dr. NGUYEN at the University of Strasbourg I, supported by the
    CNES[www.cnes.fr] and the Alsace Region, supervised by Dr. Michel and Dr. Motch.
 
 
 
 
 
[[Image:saada_bird.jpg]]
 
 
 
===Data Model===
 
  
===Query Engine===
+
The BIRD System is not only a data  retrieval tool, but also provides a platform for Knowledge Discovery in Biological Databases or an inductive database. We use the IBM Intelligent Miner (association rules, classification, ..) in order to develop the data mining model. The user can then use BIRD-QL for mining  pertinent information or for analyzing the relational patterns based on the descriptive patterns available in the BIRD-QL engine.
  
===Data Integration===
 
  
===Architecture===
+
The first goal of the Bird System is the implementation of the Décrypthon Data Center in the framework of the Décrypthon Programme (AFM/CNRS/IBM ) [http://www.decrypthon.fr]
[[Image:bird_arch.jpg]]
 
  
===Key Technologies===
+
==[[BIRDQL]] Biological Query Language ==
Relational Core store
 
  
  IBM DB2 WareHouse V9.1
+
The heterogeneous data integrated in the BIRD System are represented by several relational tables. The exploitation of these data by SQL queries is not obvious and can only be performed by expert developers or computer scientists.  
  WebSphere Federation Server
 
  
WEB Server & Services
+
In this context, building complex queries with SQL involves the use of joins (technical term) to select data in multiple tables. This complexity can be hidden by HTML forms, but many types of queries cannot be specified with HTML forms.
  
  IBM WebSphere Application Server ( main Portal)
+
We have therefore developed our own query language ([[BIRDQL]]), which is a new biological query language that allows the biologist or clinician to create data retrieval protocols without requiring exhaustive knowledge of the data sources and their architecture. BIRDQL makes it possible for biologists to easily express queries and to extract knowledge using classical constraints and scientific functions (StructuralDistance,SequencePattern,AssociationRule...).
  Tomcat Server (services, non graphic)
 
  Hibernate and JSF-Java Server Face
 
  Object Relational Mapping
 
  Web component
 
  
XML & JAVA
+
[[BIRDQL]] is not a mathematically complete language but instead is an idiom that is adapted to the GUI and is human readable enough to be modified by hand.
 +
see more [[BIRDQL]]
  
===Project Distribution===
 
  NO
 
  
===Team===
 
  Nguyen H., Berthommier G., Friedrich A., Poidevin L. ,Ripp R. , Moulinier L. and Poch O
 
  
Contact:
+
[[Category:Bird_project]]
  Nguyen Ngoc Hoan
 
  IGBMC Strasbourg
 
  Mail:[mailto:nguyen@igbmc.fr nguyen@igbmc.fr]
 
  Tel: 0033 388653302
 
--[[User:Nguyen|Nguyen]] 15:07, 16 February 2008 (CET)---
 
==FAQ==
 

Latest revision as of 08:18, 1 October 2013

BIRD System : Biological Integration and Retrieval Data was designed by Hoan Nguyen at LBGI laboratory (POCH Team) of IGBMC[1] Strasbourg

What is the BIRD System

BIRD System Overview

The BIRD System was designed to manage large collections of biological data (Bird_Databases_List) and to perform intensive computation and simulation. BIRD has inherited some of the ideas of the Saada project [2]. A generic configurable data model has been designed and allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). The integration rules allow the easy creation of a database according to semantic topics and real requirements. BIRD is driven by a high level query engine (BIRD-QL), based on SQL and a full text engine allowing the biologist to quickly extract knowledge without programming. Thanks to such an engine, the system is capable of generating sub-databases in accordance with the real requirements of a given project.

The hosted data can be accessed by the community using various methods such as a Web interface, Http Service, an API Java or a BIRD-QL Engine Query.

The BIRD System is developed using the Java technology and uses the IBM DB2 as the data server, as well as the Websphere Federation Server for virtual databases. The web application is hosted either by a Tomcat Server or by a WebSphere Application Server.

The BIRD System is not only a data retrieval tool, but also provides a platform for Knowledge Discovery in Biological Databases or an inductive database. We use the IBM Intelligent Miner (association rules, classification, ..) in order to develop the data mining model. The user can then use BIRD-QL for mining pertinent information or for analyzing the relational patterns based on the descriptive patterns available in the BIRD-QL engine.


The first goal of the Bird System is the implementation of the Décrypthon Data Center in the framework of the Décrypthon Programme (AFM/CNRS/IBM ) [3]

BIRDQL Biological Query Language

The heterogeneous data integrated in the BIRD System are represented by several relational tables. The exploitation of these data by SQL queries is not obvious and can only be performed by expert developers or computer scientists.

In this context, building complex queries with SQL involves the use of joins (technical term) to select data in multiple tables. This complexity can be hidden by HTML forms, but many types of queries cannot be specified with HTML forms.

We have therefore developed our own query language (BIRDQL), which is a new biological query language that allows the biologist or clinician to create data retrieval protocols without requiring exhaustive knowledge of the data sources and their architecture. BIRDQL makes it possible for biologists to easily express queries and to extract knowledge using classical constraints and scientific functions (StructuralDistance,SequencePattern,AssociationRule...).

BIRDQL is not a mathematically complete language but instead is an idiom that is adapted to the GUI and is human readable enough to be modified by hand. See more: BIRDQL