Difference between revisions of "BIRDQL"

From Wikili
Jump to: navigation, search
(BIRDQL in few words)
(BIRDQL example)
 
(34 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
==[[BIRDQL]] Biological Query Language ==
 
==[[BIRDQL]] Biological Query Language ==
 
===BIRDQL in few words===
 
===BIRDQL in few words===
The heterogeneous data integrated in [[BIRD]] System are represented by several relational tables. The exploitation of these data by SQL queries is not obvious except for developers or computer scientist experts.  
+
This query language is conceived by Hoan Nguyen[http://lbgi.igbmc.fr/~nguyen/].
 +
 
 +
The heterogeneous data integrated in integrator system or [[BIRD]] System are represented by several relational tables. The exploitation of these data by SQL queries is not obvious except for developers or computer scientist experts.  
  
 
Building queries with SQL in this context is not easy with because that requires to use joins (terme technique) to select data in multiple tables. This complexity must be hidden by HTML forms but a lot of queries can not be setup with HTML forms.  
 
Building queries with SQL in this context is not easy with because that requires to use joins (terme technique) to select data in multiple tables. This complexity must be hidden by HTML forms but a lot of queries can not be setup with HTML forms.  
Line 7: Line 9:
 
We proposes own query language (BIRDQL), there is new standard biological query language allowing the biologist or clinician to create data retrieval protocols without exhaustive knowledge of the data sources and their architecture. BIRD System is driven with a high level query engine: BIRDQL, which makes it possible for biologists to express easily queries and to extract knowledge by classical constraints and scientific functions (StructuralDistance,SequencePattern,AssociationRule...).  
 
We proposes own query language (BIRDQL), there is new standard biological query language allowing the biologist or clinician to create data retrieval protocols without exhaustive knowledge of the data sources and their architecture. BIRD System is driven with a high level query engine: BIRDQL, which makes it possible for biologists to express easily queries and to extract knowledge by classical constraints and scientific functions (StructuralDistance,SequencePattern,AssociationRule...).  
  
BIRDQL in not a mathematically complete language but indeed an idiom adpated to the GUI, human readable enough to be modified by hand.
+
BIRDQL in not a mathematically complete language but indeed an idiom adpated to the GUI, human readable enough to be modified by hand. The construction of this BIRDQL query engine was used some main idea from SaadaQL [http://amwdb.u-strasbg.fr/saada/spip.php?article52]. SaadaQL query language was developed in the framework of my PhD ( Astrophysics & Virtual Observatory  ,2002-2005) at university of Strasbourg.
  
 
Data can be selected with [[BIRD Data Access Protocol]]
 
Data can be selected with [[BIRD Data Access Protocol]]
Line 13: Line 15:
 
===BIRDQL Grammar ===
 
===BIRDQL Grammar ===
  
ID  <list of id/ac/query_id > DB  <bank names>
+
ID  <list of id/ac/query_id > DB  [[Bird_Databases_List | <bank names>]]
  
WH  Field[http://d1.crihan.fr:8080/bird/bsearch?service=metadata&db=all] Contains (kw1 & kw2) | kw_n  
+
WH  [http://bird.u-strasbg.fr:8080/bird/bsearch?service=metadata&db=all <Field>] Contains <(kw1 & kw2) | kw_n>
  
 
WH  PATTERN <function SequencePattern() >
 
WH  PATTERN <function SequencePattern() >
Line 25: Line 27:
 
WH  PATTERN <function AssociationRule()>
 
WH  PATTERN <function AssociationRule()>
  
FD  <Field[http://d1.crihan.fr:8080/bird/bsearch?service=metadata&db=all] out>
+
WH  SQLNative select  from ...
 +
 
 +
FD  <[http://bird.u-strasbg.fr:8080/bird/bsearch?service=metadata&db=all Field out1,Field out2,...] / GET_COUNT/GET_DR(bankname)>
 +
 
 +
OF  <OFFSET, Default OF=0>
  
LM  <n>
+
LM  <number of maximum display>
  
FM  Fasta/Flat/Xml/CSV/Simple/Object/OID
+
FM  <Fasta/Flat/Xml/CSV/Simple/Object/OID>
  
  
Line 39: Line 45:
 
Data can be selected with [[BIRD Data Access Protocol]]
 
Data can be selected with [[BIRD Data Access Protocol]]
  
Two other examples below also show how to use the BIRD-QL syntax.
 
  
'''Example 1''': simple query, search and fasta format generation
+
Examples below also show how to use the BIRD-QL syntax.
 +
 
 +
 
 +
'''Example ''': simple query, Full Text search
 +
 
 +
 
 +
ID * DB MSV3d (Missense Variant Database)
 +
 
 +
WH TEXT contains "DMD"
 +
 
 +
FD ID
 +
 
 +
LM 100
 +
 
 +
FM JSON
 +
 
 +
Result
 +
 
 +
 
 +
//
 +
 
 +
'''Example ''': simple query, search and fasta format generation
  
  
 
ID * DB UNIPROT
 
ID * DB UNIPROT
  
WH DE contains "synthetase" & "tyrosyl"
+
WH TEXT contains "synthetase" & "tyrosyl" & not ("homo sapiens" & "human")
  
WH OX contains 382
+
FD AC, ID,DE,OX,SQ
  
FD AC, ID,DE,OX,SQ
+
LM 100
  
 
FM FASTA
 
FM FASTA
 
  
 
Result
 
Result
Line 61: Line 86:
 
MSEFKSDFLHTLSERGFIHQTSDDAGLDQLFRTETVTAYIGFDPTAASLHAGGLIQIMMLHWLQATGHRPISLMGGGTGMVGDPSFKDEARQLMTPETI...
 
MSEFKSDFLHTLSERGFIHQTSDDAGLDQLFRTETVTAYIGFDPTAASLHAGGLIQIMMLHWLQATGHRPISLMGGGTGMVGDPSFKDEARQLMTPETI...
  
'''Example 2''': complex query
+
//
 +
 
 +
 
 +
 
 +
'''Example ''': DBSNP
 +
 
 +
 
 +
'''Example ''':
 +
 
 +
get  DBSNP in XML by ID
 +
 
 +
//
 +
 
 +
ID 268 DB DBSNP
 +
 
 +
 
 +
 
 +
find snp by position
 +
 
 +
//
 +
 
 +
ID * DB DBSNP
 +
 
 +
WH SQLNative select id from dbsnp_ds_ch3.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt=30466018] ' passing text as "i")
 +
 
 +
LM 1000
 +
 
 +
FM FLAT
 +
 
 +
'''Example ''': find snp by position
 +
 
 +
ID * DB DBSNP
 +
WH SQLNative select id from dbsnp_ds_ch18.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt>=30466000 and @physMapInt<=30466200 ] ' passing text as "i")
 +
FM FLAT
 +
//
 +
 
 +
 
 +
 
 +
'''Example ''': find snp by position and reference sequence (GRCh37.p5)
 +
 
 +
 
 +
ID * DB  DBSNP
 +
WH SQLNative Select ID from dbsnp_ds_ch8.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt=19817621 and ../../@groupLabel="GRCh37.p5"] ' passing text as "i")
 +
FM FLAT
 +
//
 +
 
 +
 
 +
//
 +
 
 +
ID  * DB UNIPROT
 +
 
 +
WH TEXT contains "histone" & not "homo sapiens"
 +
 
 +
FD AC,DE,OS
  
ID * DB GENBANK, REFSEQ
+
LM 3
  
WH OC Contains "Eukaryote"
+
FM FLAT
  
WH DR Contains "GO"
+
//
  
WH GENE contains "GF100027"
+
ID  * DB UNIPROT
  
FM FASTA
+
WH TEXT contains  not "homo sapiens"
  
The query above allow to search in Genbank and RefSeq, the Eucaryotic sequences containing the GF100027 gene with a cross reference in GeneOntology.
+
FD AC,DE,OS
  
 +
LM 3
  
'''Example 2''': complex query
+
'''Example 2''': complex query, GBFULL=EST+ WGS +Release +New
  
ID * DB GENBANK, REFSEQ
+
ID * DB GBFULL
  
 
WH OC Contains "Eukaryote"
 
WH OC Contains "Eukaryote"
Line 88: Line 167:
 
FM FASTA
 
FM FASTA
  
The query above allow to search in Genbank and RefSeq, the Eucaryotic sequences containing the GF100027 gene with a cross reference in GeneOntology.
+
The query above allow to search in Genbank full, the Eucaryotic sequences containing the GF100027 gene with a cross reference in GeneOntology.
  
'''Example 3''': mining in EST
+
'''Example 3''': mining in GENBANK EST
  
 
ID * DB GBEST
 
ID * DB GBEST
Line 104: Line 183:
 
FM FLAT
 
FM FLAT
  
'''Example 4''': Mining in EST
+
'''Example 4''': Mining in GENBANK EST
  
 
ID CJ133635,CJ133593,CJ133659 DB GBEST
 
ID CJ133635,CJ133593,CJ133659 DB GBEST
Line 138: Line 217:
 
ID * DB PDB
 
ID * DB PDB
  
WH DE contains "ERYTHRINA CORALLODENDRON LECTIN IN COMPLEX"
+
WH TEXT contains "DMD" & "ERYTHRINA CORALLODENDRON"
 
 
WH OS contains "ERYTHRINA CORALLODENDRON"
 
 
 
WH RESO contains 1.90
 
  
 
LM 10
 
LM 10
Line 153: Line 228:
 
ID * DB PDB
 
ID * DB PDB
  
WH CL contains "METAL BINDING PROTEIN"
+
WH TEXT contains "METAL BINDING PROTEIN" & "LACTOFERRIN"
 
 
WH DE contains "LACTOFERRIN"
 
  
 
WH FUNCTION Diagnonal3D()>125
 
WH FUNCTION Diagnonal3D()>125
Line 168: Line 241:
  
 
ID * DB PDB
 
ID * DB PDB
WH CL contains "METAL BINDING PROTEIN"
 
  
WH DE contains "LACTOFERRIN"
+
WH TEXT "METAL BINDING PROTEIN" & "LACTOFERRIN"
  
 
WH FUNCTION Diagnonal3D()>125
 
WH FUNCTION Diagnonal3D()>125
Line 212: Line 284:
  
  
'''Example 7''': rmining in EVI Genoret Database
 
 
ID * DB EVImm
 
 
WH text CONTAINS "retina"
 
 
FD GET_COUNT
 
 
FM CSV
 
 
//
 
 
ID * DB EVImm
 
 
WH text CONTAINS "retina"
 
 
LM 10
 
 
FD NOM,VALIDEPROT
 
 
FM CSV
 
 
//
 
  
ID * DB EVImm
 
  
WH text CONTAINS "retina"
 
  
LM 10
+
'''Example 7''': Get GENE ONTOLOGY or DBREF
  
FD NOM,VALIDEPROT
+
ID Q32437 DB UNIPROT
  
FM FLAT
+
FD AC,DR(GO)
  
 
//
 
//
  
ID * DB EVImm
+
ID Q34215 DB UNIPROT
 
+
FD AC,DR(InterPro)
WH text CONTAINS "retina" & "chr6:127091327-127116667"
 
 
 
LM 10
 
 
 
FM XML
 
 
 
 
 
'''Example 8''': running SQL Native (authorized user)
 
  
ID * DB STRING
 
  
WH SQLNATIVE select * from items.proteins
+
>>Result:
  
Limit 100
+
    AC  Q32437;
 +
    DR  GO; GO:0009507; C:chloroplast; IEA:InterPro.
 +
    DR  GO; GO:0016021; C:integral to membrane; IEA:UniProtKB-KW.
 +
    ......
 +
    //
 +
    AC  Q34215;
 +
    DR  Pfam; PF00033; Cytochrom_B_N; 1.
  
FM CSV
 
  
  
'''Example 9''': SAGE Data (or MYGEO [http://bird.u-strasbg.fr:8080/bird/temp/mygeoschema.pdf]) (authorized user)
 
  
  
ID * DB MYGEO
 
  
WH SQLNATIVE select sample_id,tag,count from MYGEO.entry_sage where tag='GTGAAACCCC'
 
  
LM 500
 
  
FM CSV
 
  
==> results
 
  
GSM1|GTGAAACCCC|63
 
  
GSM2|GTGAAACCCC|76
 
  
GSM571|GTGAAACCCC|13
+
[[Category:Bird_project]]
 
 
GSM572|GTGAAACCCC|395
 
 
 
GSM573|GTGAAACCCC|359
 
 
 
GSM574|GTGAAACCCC|286
 
 
 
GSM668|GTGAAACCCC|132
 
 
 
GSM669|GTGAAACCCC|129
 
 
 
GSM670|GTGAAACCCC|112
 
 
 
 
 
//
 
 
 
ID * DB MYGEO
 
 
 
WH SQLNATIVE select e.sample_id,e.tag,e.count from MYGEO.entry_sage as e, MYGEO.sample as h where  (h.ac=e.sample_id  and tag='GTGAAACCCC') and h.platform_id='GPL4' and count between 1 and 3
 
 
 
LM 1000
 
 
 
FM CSV
 
 
 
 
 
Results
 
 
 
GSM135388|GTGAAACCCC|1|GPL4
 
 
 
 
 
//
 
 
 
ID * DB MYGEO
 
 
 
WH SQLNATIVE select sample_id,tag,count from MYGEO.entry_sage where tag='GTGAAACCCC' and count between 1 and 3
 
 
 
LM 1000
 
 
 
FM CSV
 
 
 
 
 
'''Example 10''': Association rules (authorized user)
 
 
 
ID * DB protein_interaction
 
 
 
WH PATTERN AssociationPattern(Right(protA,ProtB,ProtC),Left(ProtK),sup=30,conf=90))
 
 
 
FD ID,Rules
 
 
 
FM FLAT
 

Latest revision as of 10:04, 10 March 2014

BIRDQL Biological Query Language

BIRDQL in few words

This query language is conceived by Hoan Nguyen[1].

The heterogeneous data integrated in integrator system or BIRD System are represented by several relational tables. The exploitation of these data by SQL queries is not obvious except for developers or computer scientist experts.

Building queries with SQL in this context is not easy with because that requires to use joins (terme technique) to select data in multiple tables. This complexity must be hidden by HTML forms but a lot of queries can not be setup with HTML forms.

We proposes own query language (BIRDQL), there is new standard biological query language allowing the biologist or clinician to create data retrieval protocols without exhaustive knowledge of the data sources and their architecture. BIRD System is driven with a high level query engine: BIRDQL, which makes it possible for biologists to express easily queries and to extract knowledge by classical constraints and scientific functions (StructuralDistance,SequencePattern,AssociationRule...).

BIRDQL in not a mathematically complete language but indeed an idiom adpated to the GUI, human readable enough to be modified by hand. The construction of this BIRDQL query engine was used some main idea from SaadaQL [2]. SaadaQL query language was developed in the framework of my PhD ( Astrophysics & Virtual Observatory ,2002-2005) at university of Strasbourg.

Data can be selected with BIRD Data Access Protocol

BIRDQL Grammar

ID <list of id/ac/query_id > DB <bank names>

WH <Field> Contains <(kw1 & kw2) | kw_n>

WH PATTERN <function SequencePattern() >

WH PATTERN <function DiagonalMolecule()>

WH PATTERN <function InteractionProtein()>

WH PATTERN <function AssociationRule()>

WH SQLNative select from ...

FD <Field out1,Field out2,... / GET_COUNT/GET_DR(bankname)>

OF <OFFSET, Default OF=0>

LM <number of maximum display>

FM <Fasta/Flat/Xml/CSV/Simple/Object/OID>


Error creating thumbnail: Unable to save thumbnail to destination

BIRDQL example

Data can be selected with BIRD Data Access Protocol


Examples below also show how to use the BIRD-QL syntax.


Example : simple query, Full Text search


ID * DB MSV3d (Missense Variant Database)

WH TEXT contains "DMD"

FD ID

LM 100

FM JSON

Result


//

Example : simple query, search and fasta format generation


ID * DB UNIPROT

WH TEXT contains "synthetase" & "tyrosyl" & not ("homo sapiens" & "human")

FD AC, ID,DE,OX,SQ

LM 100

FM FASTA

Result


>Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) (Tyrosine--tRNA ligase) (TyrRS). | 382 MSEFKSDFLHTLSERGFIHQTSDDAGLDQLFRTETVTAYIGFDPTAASLHAGGLIQIMMLHWLQATGHRPISLMGGGTGMVGDPSFKDEARQLMTPETI...

//


Example : DBSNP


Example :

get DBSNP in XML by ID

//

ID 268 DB DBSNP


find snp by position

//

ID * DB DBSNP

WH SQLNative select id from dbsnp_ds_ch3.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt=30466018] ' passing text as "i")

LM 1000

FM FLAT

Example : find snp by position

ID * DB DBSNP WH SQLNative select id from dbsnp_ds_ch18.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt>=30466000 and @physMapInt<=30466200 ] ' passing text as "i") FM FLAT //


Example : find snp by position and reference sequence (GRCh37.p5)


ID * DB DBSNP WH SQLNative Select ID from dbsnp_ds_ch8.fulltext where XMLEXISTS('$i/Rs/Assembly/Component/MapLoc[@physMapInt=19817621 and ../../@groupLabel="GRCh37.p5"] ' passing text as "i") FM FLAT //


//

ID * DB UNIPROT

WH TEXT contains "histone" & not "homo sapiens"

FD AC,DE,OS

LM 3

FM FLAT

//

ID * DB UNIPROT

WH TEXT contains not "homo sapiens"

FD AC,DE,OS

LM 3

Example 2: complex query, GBFULL=EST+ WGS +Release +New

ID * DB GBFULL

WH OC Contains "Eukaryote"

WH DR Contains "GO"

WH GENE contains "GF100027"

FM FASTA

The query above allow to search in Genbank full, the Eucaryotic sequences containing the GF100027 gene with a cross reference in GeneOntology.

Example 3: mining in GENBANK EST

ID * DB GBEST

WH TISSUE_TYPE contains "retina"

WH DEV_STAGE contains "adult"

LM 100

FD AC,DE,OX,OC,tissue_type,dev_stage,chr

FM FLAT

Example 4: Mining in GENBANK EST

ID CJ133635,CJ133593,CJ133659 DB GBEST

WH DE contains "AMINOTRANSFERASE"

WH OC contains "Eukaryota" & not "Metazoa"

WH TISSUE_TYPE contains "retina"

FD AC,DE,OX,OC,tissue_type,dev_stage,chr

FM FLAT


Example 5: Mining in EST

ID * DB GBEST

WH TISSUE_TYPE contains "colon"

WH DEV_STAGE contains "adult"

LM 100

FD AC,DE,OX,OC,tissue_type,dev_stage,chr,os

FM FLAT


Example 6: Mining In PDB

ID * DB PDB

WH TEXT contains "DMD" & "ERYTHRINA CORALLODENDRON"

LM 10

FM FASTA


//

ID * DB PDB

WH TEXT contains "METAL BINDING PROTEIN" & "LACTOFERRIN"

WH FUNCTION Diagnonal3D()>125

FUZZY 100

LM 100

FM FASTA

//

ID * DB PDB

WH TEXT "METAL BINDING PROTEIN" & "LACTOFERRIN"

WH FUNCTION Diagnonal3D()>125

FUZZY 100

LM 100

FM SIMPLE


//

ID * DB PDB

WH CL contains "METAL BINDING PROTEIN"

WH DE contains "LACTOFERRIN"

WH FUNCTION Diagnonal3D()>125


LM 10

FM FLAT


//

ID * DB PDB

WH CL contains "METAL BINDING PROTEIN"

WH DE contains "LACTOFERRIN"

WH FUNCTION Diagnonal3D()>125

FD GET_COUNT

FM FLAT



Example 7: Get GENE ONTOLOGY or DBREF

ID Q32437 DB UNIPROT

FD AC,DR(GO)

//

ID Q34215 DB UNIPROT FD AC,DR(InterPro)


>>Result:

   AC   Q32437;
   DR   GO; GO:0009507; C:chloroplast; IEA:InterPro.
   DR   GO; GO:0016021; C:integral to membrane; IEA:UniProtKB-KW.
   ......
   //
   AC   Q34215;
   DR   Pfam; PF00033; Cytochrom_B_N; 1.