BIRDQL
BIRDQL Biological Query Language
BIRDQL in a few words
The heterogeneous data integrated in the BIRD System are stored in several relational tables. Exploiting these data with SQL queries is not straightforward, except for developers or computer science experts.
Building SQL queries in this context is difficult because it requires joins to select data from multiple tables. HTML forms can hide this complexity, but many queries cannot be expressed with HTML forms.
We propose our own query language, BIRDQL, a new standard biological query language that allows biologists and clinicians to create data retrieval protocols without exhaustive knowledge of the data sources and their architecture. The BIRD System is driven by this high-level query engine, which lets biologists express queries easily and extract knowledge using classical constraints and scientific functions (StructuralDistance, SequencePattern, AssociationRule, ...).
BIRDQL is not a mathematically complete language but rather an idiom adapted to the GUI, human-readable enough to be modified by hand. The BIRDQL query engine reuses some of the main ideas of SaadaQL [1]. The SaadaQL query language was developed during my PhD (Astrophysics & Virtual Observatory, 2002-2005).
Data can be selected with BIRD Data Access Protocol
BIRDQL Grammar
ID <list of id/ac/query_id> DB <bank names>
WH <Field> Contains <(kw1 & kw2) | kw_n>
WH PATTERN <function SequencePattern()>
WH PATTERN <function DiagonalMolecule()>
WH PATTERN <function InteractionProtein()>
WH PATTERN <function AssociationRule()>
WH SQLNative select from ...
FD <Field out1,Field out2,... / GET_COUNT / GET_DR(bankname)>
OF <offset, default OF=0>
LM <maximum number of results to display>
FM <Fasta/Flat/Xml/CSV/Simple/Object/OID>
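The clause layout above can be assembled programmatically. The following is a minimal Python sketch, not part of the BIRD System: the helper name `build_birdql` and its parameters are hypothetical, and it only concatenates the ID/DB, WH, FD, OF, LM and FM clauses in the order listed in the grammar.

```python
def build_birdql(db, ids="*", where=(), fields=(), offset=None, limit=None, fmt=None):
    """Assemble a BIRDQL query string from its clauses (hypothetical helper)."""
    lines = [f"ID {ids} DB {db}"]
    # Each constraint is written on its own WH line, as in the grammar.
    lines += [f"WH {clause}" for clause in where]
    if fields:
        lines.append("FD " + ",".join(fields))
    if offset is not None:
        lines.append(f"OF {offset}")
    if limit is not None:
        lines.append(f"LM {limit}")
    if fmt:
        lines.append(f"FM {fmt}")
    return "\n".join(lines)


query = build_birdql(
    db="UNIPROT",
    where=['TEXT contains "histone" & not "homo sapiens"'],
    fields=["AC", "DE", "OS"],
    limit=3,
    fmt="FLAT",
)
print(query)
```

The sketch performs no validation of field names or bank names; it only shows how the clauses fit together.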
BIRDQL example
The examples below show how to use the BIRDQL syntax.
Example 1: simple search query with FASTA output
ID * DB UNIPROT
WH TEXT contains "synthetase" & "tyrosyl" & not ("homo sapiens" & "human")
FD AC,ID,DE,OX,SQ
LM 100
FM FASTA
Result
>Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) (Tyrosine--tRNA ligase) (TyrRS). | 382
MSEFKSDFLHTLSERGFIHQTSDDAGLDQLFRTETVTAYIGFDPTAASLHAGGLIQIMMLHWLQATGHRPISLMGGGTGMVGDPSFKDEARQLMTPETI...
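The header line of the result appears to carry the requested fields separated by `|`, in the order given in the FD clause (AC, ID, DE, OX), with the sequence (SQ) on the following line. Under that assumption, a header can be split into named fields as sketched below; `parse_fasta_header` is a hypothetical helper, not a BIRD function.

```python
def parse_fasta_header(header, fields=("AC", "ID", "DE", "OX")):
    """Split a '>'-prefixed, pipe-delimited header into a dict keyed by field name."""
    values = [part.strip() for part in header.lstrip(">").split("|")]
    return dict(zip(fields, values))


header = (
    ">Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) "
    "(Tyrosine--tRNA ligase) (TyrRS). | 382"
)
record = parse_fasta_header(header)
print(record["AC"], record["ID"], record["OX"])  # Q92PK5 SYY_RHIME 382
```

A description that itself contains a `|` character would break this simple split, so the sketch is only suitable for headers shaped like the one above.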
Example 2: full-text query with the operators & and not (TEXT covers definition, organism scientific name, organism common name, dbref, ...)
ID * DB REFSEQ
WH TEXT Contains "Tyrosyl-tRNA synthetase" & "Homo sapiens"
LM 100
FM FASTA
//
ID * DB UNIPROT
WH TEXT contains "histone" & not "homo sapiens"
FD AC,DE,OS
LM 3
FM FLAT
//
ID * DB UNIPROT
WH TEXT contains not "homo sapiens"
FD AC,DE,OS
LM 3
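In the listings above, a line containing only `//` seems to separate consecutive queries. Assuming that convention, a multi-query text can be split into individual queries as in this Python sketch; `split_queries` is a hypothetical helper introduced for illustration.

```python
def split_queries(script):
    """Split a multi-query BIRDQL text into one list of clause lines per query."""
    queries, current = [], []
    for raw in script.splitlines():
        line = raw.strip()
        if line == "//":
            # A separator closes the query collected so far.
            if current:
                queries.append(current)
            current = []
        elif line:
            current.append(line)
    if current:
        queries.append(current)
    return queries


script = """ID * DB UNIPROT
WH TEXT contains "histone" & not "homo sapiens"
FD AC,DE,OS
LM 3
FM FLAT
//
ID * DB UNIPROT
WH TEXT contains not "homo sapiens"
FD AC,DE,OS
LM 3
"""
queries = split_queries(script)
print(len(queries))  # 2
```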
Example 2: complex query (GBFULL = EST + WGS + Release + New)
ID * DB GBFULL
WH OC Contains "Eukaryote"
WH DR Contains "GO"
WH GENE contains "GF100027"
FM FASTA
The query above searches the full GenBank bank (GBFULL) for eukaryotic sequences that contain the GF100027 gene and have a cross-reference to Gene Ontology.
Example 3: mining in GENBANK EST
ID * DB GBEST
WH TISSUE_TYPE contains "retina"
WH DEV_STAGE contains "adult"
LM 100
FD AC,DE,OX,OC,tissue_type,dev_stage,chr
FM FLAT
Example 4: Mining in GENBANK EST
ID CJ133635,CJ133593,CJ133659 DB GBEST
WH DE contains "AMINOTRANSFERASE"
WH OC contains "Eukaryota" & not "Metazoa"
WH TISSUE_TYPE contains "retina"
FD AC,DE,OX,OC,tissue_type,dev_stage,chr
FM FLAT
Example 5: Mining in EST
ID * DB GBEST
WH TISSUE_TYPE contains "colon"
WH DEV_STAGE contains "adult"
LM 100
FD AC,DE,OX,OC,tissue_type,dev_stage,chr,os
FM FLAT
Example 6: Mining in PDB
ID * DB PDB
WH TEXT contains "DMD" & "ERYTHRINA CORALLODENDRON"
LM 10
FM FASTA
//
ID * DB PDB
WH TEXT contains "METAL BINDING PROTEIN" & "LACTOFERRIN"
WH FUNCTION Diagnonal3D()>125
FUZZY 100
LM 100
FM FASTA
//
ID * DB PDB
WH TEXT contains "METAL BINDING PROTEIN" & "LACTOFERRIN"
WH FUNCTION Diagnonal3D()>125
FUZZY 100
LM 100
FM SIMPLE
//
ID * DB PDB
WH CL contains "METAL BINDING PROTEIN"
WH DE contains "LACTOFERRIN"
WH FUNCTION Diagnonal3D()>125
LM 10
FM FLAT
//
ID * DB PDB
WH CL contains "METAL BINDING PROTEIN"
WH DE contains "LACTOFERRIN"
WH FUNCTION Diagnonal3D()>125
FD GET_COUNT
FM FLAT
Example 7: Get GENE ONTOLOGY or DBREF
ID Q32437 DB UNIPROT
FD AC,DR(GO)
//
ID Q34215 DB UNIPROT
FD AC,DR(InterPro)
>>Result:
AC Q32437;
DR GO; GO:0009507; C:chloroplast; IEA:InterPro.
DR GO; GO:0016021; C:integral to membrane; IEA:UniProtKB-KW.
......
//
AC Q34215;
DR Pfam; PF00033; Cytochrom_B_N; 1.
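A result of this kind can be grouped by accession number. The Python sketch below assumes the flat output arrives with one tagged line per field (AC, DR) and `//` between records; `parse_flat` is a hypothetical helper, and the sample reuses only the lines shown in the result above.

```python
def parse_flat(text):
    """Group DR cross-references by accession from a flat-format result."""
    records, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("AC "):
            current = line[3:].strip().rstrip(";")
            records[current] = []
        elif line.startswith("DR ") and current is not None:
            # Drop the trailing period, then split the ';'-separated columns.
            parts = [p.strip() for p in line[3:].rstrip(".").split(";")]
            records[current].append(parts)
        elif line == "//":
            current = None
    return records


sample = """AC Q32437;
DR GO; GO:0009507; C:chloroplast; IEA:InterPro.
DR GO; GO:0016021; C:integral to membrane; IEA:UniProtKB-KW.
//
AC Q34215;
DR Pfam; PF00033; Cytochrom_B_N; 1.
"""
refs = parse_flat(sample)
print(refs["Q32437"][0][1])  # GO:0009507
```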
Example 9: DBSNP
Example 9.1: get a DBSNP entry in XML format
ID 268 DB DBSNP
Example 9.2: find SNP by position
ID * DB DBSNP
WH SQLNative select id from dbsnp_ds_ch3.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt=30466018] ' passing text as "i")
LM 1000
FM FLAT
Example 9.2: find SNPs within a position range
ID * DB DBSNP
WH SQLNative select id from dbsnp_ds_ch18.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt>=30466000 and @physMapInt<=30466200 ] ' passing text as "i")
FM FLAT
//
Example 9.3: find SNP by position and reference sequence (GRCh37.p2)
ID * DB DBSNP
WH SQLNative Select ID from dbsnp_ds_ch8.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt=19817621 and ../../@groupLabel="GRCh37.p2"] ' passing text as "i")
FM FLAT
//
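The three DBSNP position queries share one shape and differ only in the table, the position predicate and the optional group label. The Python sketch below generates such queries; `snp_position_query` is a hypothetical helper, and it assumes that the numeric suffix of the `dbsnp_ds_ch…` tables is the chromosome, which the examples suggest but do not state.

```python
def snp_position_query(chromosome, start, end=None, group_label=None):
    """Build a BIRDQL DBSNP query selecting SNPs by physical map position."""
    start = int(start)  # force integers so nothing unexpected enters the SQL text
    if end is None or int(end) == start:
        predicate = f"@physMapInt={start}"
    else:
        predicate = f"@physMapInt>={start} and @physMapInt<={int(end)}"
    if group_label:
        # Assumption: group_label contains no quote characters (no escaping here).
        predicate += f' and ../../@groupLabel="{group_label}"'
    xpath = f"$i/Rs/Assembly/Component/MapLoc[{predicate}]"
    sql = (
        f"select id from dbsnp_ds_ch{chromosome}.fulltext "
        f"where XMLEXISTS('{xpath}' passing text as \"i\")"
    )
    return "\n".join(["ID * DB DBSNP", f"WH SQLNative {sql}", "FM FLAT"])


print(snp_position_query(18, 30466000, 30466200))
print("//")
print(snp_position_query(8, 19817621, group_label="GRCh37.p2"))
```

Generating the predicate from integers keeps the position bounds consistent and avoids hand-editing the XPath expression for each lookup.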