Revision as of 09:00, 13 February 2008
BIRD System (Biological Integration and Retrieval Data) was designed by Hoan Nguyen (nguyen@igbmc.u-strasbg.fr) at the LBGI laboratory (IGBMC[1], Strasbourg).
What is BIRD System
Scientific Context
Since 2000, thanks to the availability of the human genome and rapid progress in biotechnology and information technology, numerous large biomedical datasets have been generated. Modern biomedical information thus consists of a high volume of heterogeneous data that doubles in size every year (NCBI statistics) and covers very different data types, including patient data (of phenotypic, environmental or behavioral origin), gene data (genomic environment, gene expression status, enzymatic activity, gene product modification…) and the processes, protocols or treatments used to generate the information. In this context, systemic approaches are being developed to analyze and compare this huge amount of information, in order to identify genes and to predict their functions in the cascades of events and networks involved, for example, in the emergence of a disease. This requires dynamic and powerful systems to store, assemble, integrate and process very large datasets from different sources. Recently, the Decrypthon initiative (Decrypthon), a collaboration between AFM, CNRS and IBM, was launched, firstly to develop a computing grid connecting hundreds of processors installed in data-processing centres at French universities and, secondly, to give the biological research community easier access to the data. Within the Decrypthon initiative, several biomedical projects are in progress that require, on the one hand, substantial computational capacity and, on the other hand, the deployment in the grid environment of a data integration system able to manage large volumes of heterogeneous data automatically, to process complex queries quickly and to handle versioning.
BIRD System Overview
BIRD System (Nguyen et al., CORIA 2008, Hermes Edition) is designed to manage collections of biological data. Its generic, configurable data model allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). These integration rules make it easy to create the database according to semantic topics and actual requirements. BIRD is driven by a high-level query engine, based on SQL and a full-text engine, that allows biologists to extract knowledge quickly without programming. Thanks to this engine, the system can generate a sub-bank of data that matches the actual requirements.
The hosted data can be accessed by the community through several methods: a Web interface, an HTTP service, a Java API, or the BIRD-QL query engine (via the HTTP service or the Java API).
BIRD is developed in Java. It uses IBM DB2 as its data server, WebSphere Federation Server for virtual databases and Intelligent Miner for knowledge discovery in databases (KDD). The web application is hosted by a Tomcat server or by a WebSphere Application Server.
Mirror at Decrypthon: [2]
Mirror at IGBMC: [3]
DATABASES List
GENBANK, EST, WGS, REFSEQ, PDB, UNIPROT, UCSC, INTERPRO, GO, TAXONOMY, MACSIM, EVI-GENORET (local user), STRING (local user), UMD Data (local user), ...
BIRDQL Biological Query Language
BIRDQL in a few words
The heterogeneous data integrated in BIRD System are stored in several relational tables. Exploiting these data with SQL queries is not straightforward except for developers and computer scientists: building SQL queries in this context requires joins (a technical term) to select data from multiple tables. HTML forms can hide this complexity, but many queries cannot be expressed through HTML forms. We therefore propose our own query language, BIRDQL, a new standard biological query language that allows biologists or clinicians to create data retrieval protocols without exhaustive knowledge of the data sources and their architecture. BIRD System is driven by a high-level query engine, BIRDQL, which lets biologists express queries easily and extract knowledge using classical constraints and scientific functions (StructuralDistance, SequencePattern, AssociationRule...). BIRDQL is not a mathematically complete language but rather an idiom adapted to the GUI, human-readable enough to be modified by hand.
BIRDQL Grammar
ID <list of id/ac/query_id > DB <bank names>
WH Field[4] Contains kw1 |& kw2 |& kw_n
WH PATTERN <function SequencePattern() >
WH PATTERN <function DiagonalMolecule()>
WH PATTERN <function InteractionProtein()>
WH PATTERN <function AssociationRule()>
WH PATTERN <function AssociationRule()> LD <Field out>
FM <n>
FM Fasta/Flat/Xml/CSV/Simple/Object
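As a rough illustration of how the ID/DB, WH and FM clauses above combine, the following Java sketch assembles a BIRDQL query string clause by clause. `BirdqlBuilder` and its methods are hypothetical helpers written for this page, not part of the BIRD API. The sketch assumes that each clause may sit on its own line, as in the examples on this page, and that every keyword is wrapped in double quotes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the BIRD API) that assembles a BIRDQL query string. */
class BirdqlBuilder {
    private final List<String> lines = new ArrayList<>();

    /** Starts the query with "ID <ids> DB <bank names>". */
    public BirdqlBuilder(String ids, String... banks) {
        lines.add("ID " + ids + " DB " + String.join(", ", banks));
    }

    /** Adds "WH <field> contains kw1 <op> kw2 ...", where op is "&" or "|". */
    public BirdqlBuilder whereContains(String field, String operator, String... keywords) {
        List<String> quoted = new ArrayList<>();
        for (String kw : keywords) {
            quoted.add("\"" + kw + "\""); // assumption: every keyword is quoted
        }
        lines.add("WH " + field + " contains " + String.join(" " + operator + " ", quoted));
        return this;
    }

    /** Adds the output clause "FM <format>". */
    public BirdqlBuilder format(String fm) {
        lines.add("FM " + fm);
        return this;
    }

    /** Returns the query, one clause per line. */
    public String build() {
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String query = new BirdqlBuilder("*", "UNIPROT")
                .whereContains("DE", "&", "synthetase", "tyrosyl")
                .format("FASTA")
                .build();
        System.out.println(query);
    }
}
```

Keeping the operator as a plain string avoids hard-coding any interpretation of `|` and `&` beyond what the grammar states.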
BIRDQL example
The examples below show how to use the BIRD-QL syntax.
Example 1: simple query, search and fasta format generation
ID * DB UNIPROT
WH DE contains "synthetase" & "tyrosyl"
WH OX contains 382
FD AC, ID, DE, OX, SQ
FM FASTA
Result
>Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) (Tyrosine--tRNA ligase) (TyrRS). | 382
MSEFKSDFLHTLSERGFIHQTSDDAGLDQLFRTETVTAYIGFDPTAASLHAGGLIQIMMLHWLQATGHRPISLMGGGTGMVGDPSFKDEARQLMTPETI...
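The FASTA header in the result above appears to carry the fields requested by the FD clause (AC, ID, DE, OX), separated by " | ". A minimal Java sketch, assuming this layout holds for every record and that no field value itself contains a "|", splits such a header into named fields. `FastaHeader` is a hypothetical class, not part of the BIRD API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the "|"-separated FASTA headers shown in Example 1. */
class FastaHeader {

    /** Maps each field name to the value found at the same position in the header. */
    public static Map<String, String> parse(String header, String... fieldNames) {
        if (!header.startsWith(">")) {
            throw new IllegalArgumentException("Not a FASTA header: " + header);
        }
        // Split on "|" with optional surrounding whitespace; -1 keeps trailing empty fields.
        String[] values = header.substring(1).split("\\s*\\|\\s*", -1);
        if (values.length != fieldNames.length) {
            throw new IllegalArgumentException(
                    "Expected " + fieldNames.length + " fields but found " + values.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            fields.put(fieldNames[i], values[i].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String header = ">Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) "
                + "(Tyrosine--tRNA ligase) (TyrRS). | 382";
        Map<String, String> fields = parse(header, "AC", "ID", "DE", "OX");
        fields.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```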
Example 2: complex query
ID * DB GENBANK, REFSEQ
WH OC Contains "Eukaryote"
WH DR Contains "GO"
WH GENE contains "GF100027"
FM OID
The query above searches GenBank and RefSeq for eukaryotic sequences that contain the GF100027 gene and have a cross-reference to Gene Ontology.
Example 3: complex query
ID * DB GENBANK, REFSEQ WH OC Contains "Eukaryote" WH DR Contains "GO" WH GENE contains "GF100027" FM FASTA
The query above performs the same search as Example 2, written on a single line, and returns the matching sequences in FASTA format.
DATA ACCESS Protocols
Data Browsing at Decrypthon Data Center
Database content can be browsed from the BIRD System web interface:
Node ENS-Lyon: [5]
Node IGBMC: [6]
Data Selection by BIRD-QL Service
Data can also be selected with BIRD-QL queries, which expert users can modify by hand. Three query services are available:
1. curl -F upload=@your_bird.ql 'http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql'
2. http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=....
3. BIRD-QL Editor.
Users can also use this engine for intensive computation: download [birdql cmd].
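When a query is passed through the `query=` URL parameter, characters such as spaces, double quotes, `&` and line breaks have to be percent-encoded, otherwise the `&` inside a WH clause would be read as a parameter separator. The Java sketch below shows one way to do this with the standard `URLEncoder` class (Java 10 or later). `BirdqlHttp` is a hypothetical helper, and it is an assumption that the BIRD server accepts standard form encoding (space sent as `+`).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the GET URL for the BIRD-QL HTTP service. */
class BirdqlHttp {
    // Service address taken from this page.
    static final String SERVICE = "http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql";

    /** Returns the service URL with the query appended as an encoded "query" parameter. */
    public static String queryUrl(String birdql) {
        // URLEncoder applies form encoding: space becomes "+", "&" becomes "%26", etc.
        return SERVICE + "&query=" + URLEncoder.encode(birdql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String birdql = "ID * DB UNIPROT\n"
                + "WH DE contains \"synthetase\" & \"tyrosyl\"\n"
                + "FM OID";
        System.out.println(queryUrl(birdql));
    }
}
```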
Simple Services
Service: http://bird.u-strasbg.fr:8080/bird/bsearch?db=<database>&accession=<ac or id>&field=<DE,OS..>&format=<fasta/flat>
Example 1: get EST
http://bird.u-strasbg.fr:8080/bird/bsearch?db=gbest&accession=Cj133605&field=DE,OS,OC,TISSUE_TYPE,DEV_STAGE
Example 2: get Protein :
http://bird.u-strasbg.fr:8080/bird/bsearch?db=uniprot&accession=Q23456
Example 3: get PDB :
http://bird.u-strasbg.fr:8080/bird/bsearch?db=pdb&idcode=1XDS
Example 4: get fasta :
http://bird.u-strasbg.fr:8080/bird/bsearch?db=pdb&idcode=1XDS&format=fasta
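The simple service URLs above all follow the same pattern: a base address followed by `name=value` pairs. As a sketch, the Java class below builds such a URL from an ordered map of parameters. `SimpleServiceUrl` is a hypothetical helper; the parameter names (`db`, `accession`, `idcode`, `field`, `format`) are the ones used in the examples on this page. Note that `URLEncoder` writes a comma as `%2C`, so a `field` list will look different from the examples even though a standard server decodes it to the same value.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that builds URLs for the BIRD simple services. */
class SimpleServiceUrl {
    // Base address taken from the examples on this page.
    static final String BASE = "http://bird.u-strasbg.fr:8080/bird/bsearch";

    /** Joins the parameters as "?name=value&name=value", encoding each value. */
    public static String build(Map<String, String> params) {
        StringJoiner url = new StringJoiner("&", BASE + "?", "");
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.add(p.getKey() + "=" + URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Same request as Example 4: a PDB entry in FASTA format.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("db", "pdb");
        params.put("idcode", "1XDS");
        params.put("format", "fasta");
        System.out.println(build(params));
    }
}
```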
WEB Server
http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do
Java API & Native SQL
An API is a programming interface that defines how one software component communicates with another. The BIRD Java API contains reusable classes that external modules can use to access the databases. Its methods return the selected data in various formats. Advanced users can use the API to develop new functionality that exploits the data, and to build custom graphical interfaces and Web Services. The Java code below illustrates the use of the BIRD API. The BIRDQL engine does not return data, only the OIDs of the selected records; the content of each record must then be retrieved through the API.
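// BIRD API
import org.igbmc.bird.*;

class ExampleUtilisationAPI {
    public static void main(String[] args) {
        InterfaceDB birddb = new InterfaceDB("my-bird");
        // BIRD-QL query: the engine returns only the OIDs of the selected records
        String birdql = "ID * DB UniProt WH DE contains .. FM OID";
        // Assumes run() returns an array of OIDs; the element type is not stated here
        var OID = birddb.queryengine.run(birdql);
        for (int i = 0; i < OID.length; i++) {
            // Result treatment: retrieve the content of each record through the API
            UniProt obj = (UniProt) birddb.getObjet(OID[i]);
            ….
        }
        …
    }
}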
BIRD business intelligence
Knowledge Discovery in Biological Databases
DB2 Intelligent Miner (API)
Example in BIRD System
BIRD Implementation
Architecture Federation
Data Model
Query Engine
Data Integration
Key Technologies
BIRD System in Action
Decrypthon Data Center Implementation
http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do
Macsim uses BIRDQL engine
Macsim can now connect directly to BIRD
GPS uses BIRDQL engine
Gscope uses BIRD
Gscope can now connect directly to BIRD
- proc BirdFromQueryText {Texte {OutFile ""} {BirdUrl ""}}
- proc BirdFromQueryFile {Fichier {OutFile ""} {BirdUrl ""}}
BIRD can integrate the info records of a Gscope project. They can then be queried directly over HTTP, through Gscope or, better, through displays with the BirdGscopeSearch command.