Difference between revisions of "BIRD"

From Wikili

Revision as of 06:18, 19 February 2008

The BIRD System (BIRD: Biological Integration and Retrieval Data) was designed by Hoan Nguyen at the LBGI laboratory (Poch team) of IGBMC[1], Strasbourg.

What is BIRD System

Scientific Context

Since 2000, the availability of the human genome and rapid progress in biotechnology and information technology have produced numerous large biomedical datasets. Modern biomedical information is therefore a high volume of heterogeneous data that doubles in size every year (Statistics NCBI) and covers very different data types: patient data (of phenotypic, environmental or behavioral origin), gene data (genome environment, gene expression status, enzymatic activity, gene product modification…) and the processes, protocols or treatments used to generate the information. In this context, systemic approaches are being developed to analyze and compare this huge amount of information, in order to identify genes and to predict their functions in the cascades of events and networks involved, for example, in the emergence of a disease. This requires dynamic and powerful systems to store, assemble, integrate and process very large datasets from different sources.

The Decrypthon initiative (Decrypthon), a collaboration between AFM, CNRS and IBM, was recently launched, first to develop a computing grid connecting hundreds of processors installed in data-processing centres of French universities and, second, to give the biological research community easier access to the data. Several biomedical projects are in progress within the Decrypthon initiative. They require both strong computational capacity and the deployment, in the grid environment, of a data integration system able to manage large volumes of heterogeneous data automatically, to process complex queries quickly and to handle data versioning.

BIRD System Overview

The BIRD System (Nguyen et al, CORIA 2008, Hermes Edition) was designed to manage large collections of biological data and to support intensive computation and simulation. BIRD inherits some of the main ideas of the Saada project[2]. A generic, configurable data model allows the simultaneous integration of genomics, transcriptomics and ontology datasets using a limited number of product mapping rules provided by the user (operator or system administrator). The integration rules make it easy to create the database according to semantic topics and real requirements. BIRD is driven by a high-level query engine, based on SQL and a full-text engine, which lets biologists extract knowledge quickly without programming. With this engine, the system can generate a sub-bank of data that matches the real requirements.

The hosted data can be accessed by the community through a Web interface, an HTTP service, a Java API or the BIRD-QL query engine.

The BIRD System is developed in Java. It uses IBM DB2 as the data server and WebSphere Federation Server for virtual databases. The web application is hosted by a Tomcat server or by a WebSphere Application Server.

The BIRD System is not only a data retrieval system but also a platform for Knowledge Discovery in Biological Databases. We use IBM Intelligent Miner (association rules, classification, ...) to develop the data mining models.


The first goal of the BIRD System is to implement the Décrypthon Data Center [3] [4] in the framework of the Décrypthon Programme (AFM/CNRS/IBM) [5].

DATABASES List

GENBANK, EST, WGS, REFSEQ, PDB, UNIPROT, UCSC, INTERPRO, GO, TAXONOMY, MACSIM, EVI-GENORET (local user), STRING (local user), UMD Data (local user), ...

BIRDQL Biological Query Language

BIRDQL in a few words

The heterogeneous data integrated in the BIRD System are represented by several relational tables. Exploiting these data with SQL queries is difficult for anyone other than developers or computer scientists.

Building SQL queries in this context is not easy because it requires joins to select data from multiple tables. HTML forms can hide this complexity, but many queries cannot be expressed with HTML forms.

We propose our own query language, BIRDQL, intended as a new standard biological query language that lets the biologist or clinician create data retrieval protocols without exhaustive knowledge of the data sources and their architecture. The BIRD System is driven by the high-level BIRDQL query engine, which lets biologists express queries easily and extract knowledge using classical constraints and scientific functions (StructuralDistance, SequencePattern, AssociationRule...).

BIRDQL is not a mathematically complete language but rather an idiom adapted to the GUI, human-readable enough to be modified by hand.

BIRDQL Grammar

ID <list of id/ac/query_id > DB <bank names>

WH Field[6] Contains (kw1 & kw2) | kw_n

WH PATTERN <function SequencePattern() >

WH PATTERN <function DiagonalMolecule()>

WH PATTERN <function InteractionProtein()>

WH PATTERN <function AssociationRule()>

FD <Field[7] out>

LM <n>

FM Fasta/Flat/Xml/CSV/Simple/Object/OID
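
As an illustration of how the clauses above combine, here is a minimal Java sketch that assembles a BIRDQL query string. The class `BirdQlQuery` and its methods are hypothetical helpers written for this page, not part of the BIRD API; the sketch only concatenates text and does not validate field names or contact a server. The `--` separator is the one used in the URL examples of this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the BIRD API) that assembles BIRDQL text. */
final class BirdQlQuery {
    private final List<String> clauses = new ArrayList<>();

    /** ID <ids> DB <banks>: opening clause of every query on this page. */
    BirdQlQuery(String ids, String banks) {
        clauses.add("ID " + ids + " DB " + banks);
    }

    /** WH <field> contains <expression>; the expression is passed through unchanged. */
    BirdQlQuery where(String field, String expression) {
        clauses.add("WH " + field + " contains " + expression);
        return this;
    }

    /** FD <field list>: output fields. */
    BirdQlQuery fields(String... names) {
        clauses.add("FD " + String.join(",", names));
        return this;
    }

    /** LM <n>: maximum number of records. */
    BirdQlQuery limit(int n) {
        clauses.add("LM " + n);
        return this;
    }

    /** FM <format>: output format such as FASTA or FLAT. */
    BirdQlQuery format(String name) {
        clauses.add("FM " + name);
        return this;
    }

    /** Joins the clauses: "\n" for a query file, "--" for the URL form. */
    String render(String separator) {
        return String.join(separator, clauses);
    }

    public static void main(String[] args) {
        String query = new BirdQlQuery("*", "UNIPROT")
                .where("DE", "\"synthetase\" & \"tyrosyl\"")
                .where("OX", "382")
                .fields("AC", "ID", "DE", "OX", "SQ")
                .format("FASTA")
                .render("\n");
        System.out.println(query);
    }
}
```

Keeping each clause as a separate string makes it possible to render the same query either one clause per line or joined with `--`.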

BIRDQL example

The examples below show how to use the BIRD-QL syntax.

Example 1: simple query, search and fasta format generation


ID * DB UNIPROT

WH DE contains "synthetase" & "tyrosyl"

WH OX contains 382

FD AC, ID,DE,OX,SQ

FM FASTA


Result


>Q92PK5 | SYY_RHIME | Tyrosyl-tRNA synthetase (EC 6.1.1.1) (Tyrosine--tRNA ligase) (TyrRS). | 382
MSEFKSDFLHTLSERGFIHQTSDDAGLDQLFRTETVTAYIGFDPTAASLHAGGLIQIMMLHWLQATGHRPISLMGGGTGMVGDPSFKDEARQLMTPETI...

Example 2: complex query

ID * DB GENBANK, REFSEQ

WH OC Contains "Eukaryote"

WH DR Contains "GO"

WH GENE contains "GF100027"

FM FASTA

The query above searches GenBank and RefSeq for eukaryotic sequences that contain the GF100027 gene and have a cross-reference to Gene Ontology.


Example 3: mining in EST

ID * DB GBEST

WH TISSUE_TYPE contains "retina"

WH DEV_STAGE contains "adult"

LM 100

FD AC,DE,OX,OC,tissue_type,dev_stage,chr

FM FLAT

Example 4: Mining in EST

ID CJ133635,CJ133593,CJ133659 DB GBEST

WH DE contains "AMINOTRANSFERASE"

WH OC contains "Eukaryota" & not "Metazoa"

WH TISSUE_TYPE contains "retina"

FD AC,DE,OX,OC,tissue_type,dev_stage,chr

FM FLAT


Example 5: Mining in EST

ID * DB GBEST

WH TISSUE_TYPE contains "colon"

WH DEV_STAGE contains "adult"

LM 100

FD AC,DE,OX,OC,tissue_type,dev_stage,chr,os

FM FLAT


Example 6: Mining In PDB

ID * DB PDB

WH DE contains "ERYTHRINA CORALLODENDRON LECTIN IN COMPLEX"

WH OS contains "ERYTHRINA CORALLODENDRON"

WH RESO contains 1.90

LM 10

FM FASTA


Example 7: running SQL Native (authorized user)

ID * DB STRING

WH SQLNATIVE select * from items.proteins

LM 100

FM CSV


Example 8: Association rules (authorized user)

ID * DB protein_interaction

WH PATTERN AssociationPattern(Right(protA,ProtB,ProtC),Left(ProtK),sup=30,conf=90)

FD ID,Rules

FM FLAT

Data Access Protocols

Data Browsing at Décrypthon Data Center

Database content can be browsed from the BIRD System web nodes: ENS-Lyon node [8], IGBMC node [9].

Data Selection by BIRD-QL Service

Data can also be selected with BIRD-QL queries, and expert users can modify queries by hand. Three query services are available:

     1. curl -F upload=@your_bird.ql 'http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql'


     2. http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=your_birdql
     Example:  http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=
                ID * DB Uniprot--WH DE contains "histone"--LM 10--FD AC,DE--FM FLAT
               http://bird.u-strasbg.fr:8080/bird/bsearch?service=birdql&query=
                ID * DB Uniprot--WH DE contains "Helianthinin"--LM 10--FM FASTA


     3. BIRD-QL Editor (in pres).

Users can run this engine for intensive computation; download [birdql cmd].
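
The URL form of service 2 can be sketched in Java as follows. The class `BirdQlUrl` is a hypothetical helper, and percent-encoding the query with `URLEncoder` is an assumption: the examples above show the query unencoded, and how the server decodes it is not documented here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds a URL for the BIRD-QL HTTP service. */
final class BirdQlUrl {
    static final String ENDPOINT = "http://bird.u-strasbg.fr:8080/bird/bsearch";

    /** Joins the clauses with "--" and appends them as the query parameter. */
    static String forQuery(String... clauses) {
        String query = String.join("--", clauses);
        // Assumption: the server accepts a form-encoded query value.
        return ENDPOINT + "?service=birdql&query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(forQuery(
                "ID * DB Uniprot",
                "WH DE contains \"histone\"",
                "LM 10",
                "FD AC,DE",
                "FM FLAT"));
    }
}
```

`URLEncoder.encode(String, Charset)` requires Java 10 or later. The resulting string can be passed to any HTTP client or pasted into a browser.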

Simple Services-Bank ID

Service: http://bird.u-strasbg.fr:8080/bird/bsearch?db=<database>&accession=<ac or id>&field=<DE,OS..>&format=<fasta/flat>

Example 1: get EST

     http://bird.u-strasbg.fr:8080/bird/bsearch?db=gbest&accession=Cj133605&field=DE,OS,OC,TISSUE_TYPE,DEV_STAGE

Example 2: get Protein :

     http://bird.u-strasbg.fr:8080/bird/bsearch?db=uniprot&accession=Q23456

Example 3: get PDB :

    http://bird.u-strasbg.fr:8080/bird/bsearch?db=pdb&accession=1XDS

Example 4: get Fasta :

    http://bird.u-strasbg.fr:8080/bird/bsearch?db=pdb&accession=1XDS&format=fasta

WEB Server

   http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do   (firefox)

API JAVA - BIRDQL Client

An API is a programming interface that defines how one software component communicates with another. The BIRD Java API contains reusable classes that external modules can use to access the databases. Its methods return the selected data in various formats. Advanced users can use the API to develop new functionality that exploits the data, or to build custom graphical interfaces and Web Services. The Java code below illustrates the use of the BIRD API. The BIRDQL engine does not return data, only the OIDs of the selected records; the content of each record must then be retrieved through the API.

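// BIRD API

import java.util.Vector;
import org.igbmc.bird.*;

class ExampleUtilisationAPI {
  public static void main(String[] args) {
    InterfaceDB birddb = new InterfaceDB("my-bird");
    // BIRD-QL query; the clauses are separated by "--"
    String birdql = "ID * DB UniProt"
                  + "--WH OS contains \"Mus mus\""
                  + "--WH OC contains \"Eukaryota\" & not \"Metazoa\""
                  + "--FM OID";
    // The engine returns the OIDs of the selected records, not the records
    Vector OID = birddb.queryengine.run(birdql);
    for (int i = 0; i < OID.size(); i++) {
      // Result treatment: fetch each record through the API
      UniProt obj = (UniProt) birddb.getObject(OID.get(i));
      ….
    }
  }
}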


// BIRDQL Client

java org.igbmc.bird.datadiscovery.BirdQLClient birdql nameServer outFile

@birdql  : name of the file containing your BIRD-QL query

@nameServer: name of the BIRD server (d1.crihan.fr or bird.u-strasbg.fr)

@outFile  : name of the file to which the result is written

BIRD business intelligence

Theories and Functionalities


Association rule learning

a.What Is Association Rule Mining?

Describing association relationships among the attributes in the set of relevant data

Frequent pattern mining: find all frequent patterns in a database

Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]

Frequent pattern mining: finding regularities in data

  What products were often purchased together? Beer and diapers?!
  What are the subsequent purchases after buying a product (e.g. a car)?
  Can we automatically profile patients or genes?

b.Basic

Rule Definition

   Body ==> Consequent [ Support , Confidence ]
   (IF  <>  THEN <>)
   Body: represents the examined data.
   Consequent: represents a discovered property for the examined data.
   Support: represents the percentage of the records satisfying both the body and the consequent.
   Confidence: represents the percentage of the records satisfying both the body and the
   consequent among those satisfying the body.
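
The two measures can be computed as in the following Java sketch. It uses the standard textbook definitions (support of the rule is the share of records containing body and consequent together); the class `RuleMetrics` and the shopping-basket data are invented for illustration and are unrelated to the BIRD or Intelligent Miner APIs. Values are returned as fractions; multiply by 100 for percentages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative support and confidence computation on an in-memory record list. */
final class RuleMetrics {
    /** Fraction of records that contain every item of the given itemset. */
    static double support(List<Set<String>> records, Set<String> itemset) {
        long hits = records.stream().filter(r -> r.containsAll(itemset)).count();
        return (double) hits / records.size();
    }

    /** support(body and consequent) / support(body); NaN if the body never occurs. */
    static double confidence(List<Set<String>> records,
                             Set<String> body, Set<String> consequent) {
        Set<String> both = new HashSet<>(body);
        both.addAll(consequent);
        return support(records, both) / support(records, body);
    }

    /** Convenience constructor for a set of item names. */
    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Invented toy data: four shopping baskets.
        List<Set<String>> baskets = Arrays.asList(
                items("beer", "diapers", "milk"),
                items("beer", "diapers"),
                items("beer", "bread"),
                items("milk", "bread"));
        Set<String> body = items("beer");
        Set<String> consequent = items("diapers");
        System.out.println("support    = " + support(baskets, items("beer", "diapers")));
        System.out.println("confidence = " + confidence(baskets, body, consequent));
    }
}
```

In this toy data the rule beer ==> diapers holds in two of four baskets, and beer appears in three, so the support is 2/4 and the confidence is 2/3.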



Itemset: a set of items

=>E.g., acm={a, c, m}

Support of itemsets

=>Sup(acm)=3

Given min_sup=3, acm is a frequent pattern

Frequent pattern mining: find all frequent patterns in a database



c.Apriori Algorithm


  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  return ∪k Lk; (union)
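
The pseudocode above can be turned into a small runnable Java sketch. This is an unoptimized illustration, not the implementation used by BIRD or Intelligent Miner: candidates are generated by joining pairs of frequent k-itemsets and pruned when one of their k-subsets is not frequent, and `minSupport` is an absolute transaction count. The class name and the toy database (chosen so that Sup(acm)=3) are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative Apriori: frequent itemsets of an in-memory transaction list. */
final class Apriori {
    /** Returns each itemset found in at least minSupport transactions, with its count. */
    static Map<Set<String>, Integer> frequentItemsets(List<Set<String>> db, int minSupport) {
        Map<Set<String>, Integer> result = new HashMap<>();
        // C1: every single item is a candidate.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : db) {
            for (String item : t) {
                candidates.add(new TreeSet<>(Arrays.asList(item)));
            }
        }
        while (!candidates.isEmpty()) {
            // Count each candidate over the database and keep the frequent ones (Lk).
            Map<Set<String>, Integer> level = new HashMap<>();
            for (Set<String> c : candidates) {
                int count = 0;
                for (Set<String> t : db) {
                    if (t.containsAll(c)) count++;
                }
                if (count >= minSupport) level.put(c, count);
            }
            result.putAll(level);
            candidates = generate(level.keySet()); // Ck+1 from Lk
        }
        return result;
    }

    /** Joins pairs of frequent k-itemsets into (k+1)-itemsets and prunes them. */
    private static Set<Set<String>> generate(Set<Set<String>> frequent) {
        Set<Set<String>> next = new HashSet<>();
        List<Set<String>> list = new ArrayList<>(frequent);
        for (int i = 0; i < list.size(); i++) {
            for (int j = i + 1; j < list.size(); j++) {
                Set<String> union = new TreeSet<>(list.get(i));
                union.addAll(list.get(j));
                if (union.size() == list.get(i).size() + 1
                        && allSubsetsFrequent(union, frequent)) {
                    next.add(union);
                }
            }
        }
        return next;
    }

    /** Apriori property: every k-subset of a frequent (k+1)-itemset is frequent. */
    private static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequent) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequent.contains(subset)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Invented toy database in which {a, c, m} occurs three times.
        List<Set<String>> db = Arrays.asList(
                new HashSet<>(Arrays.asList("a", "c", "m")),
                new HashSet<>(Arrays.asList("a", "c", "m", "f")),
                new HashSet<>(Arrays.asList("a", "c", "m", "p")),
                new HashSet<>(Arrays.asList("b", "f")),
                new HashSet<>(Arrays.asList("b", "c")));
        frequentItemsets(db, 3).forEach((set, n) -> System.out.println(set + " : " + n));
    }
}
```

With min_sup=3 on this toy database, the frequent itemsets are a, c, m, ac, am, cm and acm.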

Kohonen's feature maps

  A Kohonen self-organizing feature map (K-map) is built by analogy with biological neural
  structures in which the placement of neurons is orderly and reflects the structure of external
  (sensed) stimuli (e.g. in the auditory and visual pathways).
  A K-map learns when continuous-valued input vectors are presented to it without specifying the
  desired output. The connection weights adjust to regularities in the input. A large number of
  examples is needed.
  A K-map mimics learning in biological neural structures well. It can be used in speech recognizers.
  It is a flat (two-dimensional) structure with connections between neighbors and connections
  from each input node to all output nodes.
  It learns clusters of input vectors without any help from a teacher and preserves closeness (topology).

Learning in K-maps

  1. Initialize weights to small random numbers and set initial radius of neighborhood of nodes.
  2. Get an input x1, …, xn.
  3. Compute distance dj to each output node:
     dj = Σi (xi - wij)^2
  4. Select output node s with minimal distance ds. 
  5. Update weights for the node s and all nodes in its neighborhood:
     wij´= wij + h* (xi - wij), where h<1 is a gain that decreases in time.
  Repeat steps 2 - 5.
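
The learning steps above can be sketched in Java as follows. Several details are assumptions made for the sketch, since the text does not fix them: a rectangular grid of output nodes, a square neighborhood, and a gain and radius that decay linearly over the epochs. The class `KohonenMap` is hypothetical.

```java
import java.util.Random;

/** Illustrative Kohonen map on a rows x cols grid of output nodes. */
final class KohonenMap {
    final int rows, cols, inputs;
    final double[][] weights; // weights[node][i], node = row * cols + col

    KohonenMap(int rows, int cols, int inputs, long seed) {
        this.rows = rows;
        this.cols = cols;
        this.inputs = inputs;
        this.weights = new double[rows * cols][inputs];
        Random random = new Random(seed);
        // Step 1: small random initial weights.
        for (double[] w : weights) {
            for (int i = 0; i < inputs; i++) w[i] = 0.1 * random.nextDouble();
        }
    }

    /** Steps 3-4: index of the output node with minimal squared distance to x. */
    int winner(double[] x) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int j = 0; j < weights.length; j++) {
            double d = 0;
            for (int i = 0; i < inputs; i++) {
                double diff = x[i] - weights[j][i];
                d += diff * diff;
            }
            if (d < bestDistance) {
                bestDistance = d;
                best = j;
            }
        }
        return best;
    }

    /** Step 5: move the winner and its grid neighbors within radius towards x. */
    void update(double[] x, double gain, int radius) {
        int s = winner(x);
        int sr = s / cols, sc = s % cols;
        for (int j = 0; j < weights.length; j++) {
            int r = j / cols, c = j % cols;
            if (Math.abs(r - sr) <= radius && Math.abs(c - sc) <= radius) {
                for (int i = 0; i < inputs; i++) {
                    weights[j][i] += gain * (x[i] - weights[j][i]);
                }
            }
        }
    }

    /** Repeats steps 2-5; gain and radius shrink linearly (an assumed schedule). */
    void train(double[][] samples, int epochs, double initialGain, int initialRadius) {
        for (int e = 0; e < epochs; e++) {
            double remaining = 1.0 - (double) e / epochs;
            double gain = initialGain * remaining;
            int radius = (int) Math.round(initialRadius * remaining);
            for (double[] x : samples) update(x, gain, radius);
        }
    }

    public static void main(String[] args) {
        KohonenMap map = new KohonenMap(4, 4, 2, 42L);
        double[][] samples = {{0.0, 0.0}, {0.0, 1.0}, {1.0, 0.0}, {1.0, 1.0}};
        map.train(samples, 100, 0.5, 2);
        for (double[] x : samples) {
            System.out.println(x[0] + "," + x[1] + " -> node " + map.winner(x));
        }
    }
}
```

After training, `winner(x)` gives the grid position to which an input vector is mapped; inputs that are close to each other tend to land on the same or neighboring nodes.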

DB2 Intelligent Miner (API)

[Figure: data flow of the mining procedure]

Finding deviations (FindDeviations procedure)

Finding groups with similar characteristics (ClusterTable procedure)

  You can find groups with similar characteristics by using the ClusterTable procedure.
  When to do it:
  The database might contain patient data including demographic data, for example:
    - Gender
    - Age
    - Profession
    - Family status
  The information might also include the income or the socio-demographic group of the patient.


Finding relationships (FindRules procedure)

  You can find relationships in your data by using the FindRules procedure.


Predicting future behavior (PredictColumn procedure)

  In the tables or views of your database (transcriptomic or clinical data), there might
  be one column that you are particularly interested in. In the clinical data, you can find
  relations between symptoms and diseases. With this information, you can predict the potential diseases of new patients.

Finding most important fields (FindMostImpFields procedure)

  You can find most important fields by using the FindMostImpFields procedure.

Knowledge Discovery in Biological Databases

Transcriptomics

Protein Protein Interaction Pattern

BIRD System in Action

Décrypthon Data Center

Overview

The BIRD System is the core of the Décrypthon Data Center.

  Sharing of large-scale biological data for applications (Macsim, MS2PH, Macgos, Ordali..)
  running on the Décrypthon Grid.
  Managing of generated data (results) on the grid
  Sharing of data and services for the scientific community
  http://bird.u-strasbg.fr:9080/BirdSystem/HomePage.do


File:Bird ddc.jpg

Macsim uses BIRDQL engine

MACSIMS: Multiple Alignment of Complete Sequences Information Management System (Thompson et al, 2006). MACSIMS provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist.

Macsim connects directly to the BIRD database.

GPS uses BIRDQL engine

http://nucleic.fr

Gscope uses BIRD

Gscope can now connect directly to BIRD


  • proc BirdFromQueryText {Texte {OutFile ""} {BirdUrl ""}}
  • proc BirdFromQueryFile {Fichier {OutFile ""} {BirdUrl ""}}

BIRD can integrate the info records of a Gscope project. They can then be queried directly over HTTP or through Gscope or, better, through displays with the BirdGscopeSearch command.

BIRD Development

BIRD Implementation

Origin of the BIRD System

BIRD is based on the main principles of the Saada project [10].

  SAADA - Système d'Archivage Automatique des Données Astronomiques
  First goal: archiving and exploitation of data from the European XMM-Newton satellite [11]. The 2XMM catalogue of X-ray sources,
    the largest of its kind ever, has now been released.
  Developed in the framework of the PhD of Dr. Nguyen (2002-2005, prototype Saada V.1.3) at the University of Strasbourg I, supported by
   CNES [www.cnes.fr] and the Alsace Region, and supervised by Dr. Michel and Dr. Motch.


Conceptual Data Model

In order to integrate heterogeneous data automatically, we have designed several business data models corresponding to the actual formats of the data banks. The figure below illustrates the conceptual data model of the BIRD system. It can host several bank types simultaneously. Each type can in turn cover several user-defined banks that share the same format. Thanks to this conceptual model, BIRD can host different versions of a given data bank and manage them so that the programs of an application launched on the computing grid can use the same data version throughout their computation.

[Figure: conceptual data model of the BIRD system]


The data model of a data bank is predefined in an XML configuration file. This metadata is used to generate the Java and SQL code. Code generation is launched when the BIRD data bank is configured or when data are loaded or reloaded. In the example given in figure 3, the Genbank metadata are used to create Genbank-EST and Genbank-Refseq.

[Figure: business model for Genbank]


This figure illustrates the business model for Genbank. Each bank can have several associated entries. Each entry has its associated information such as Dbref, SEQData, FTSource,... By design, the Java classes of the business models are generated automatically by BIRD. Only instances of classes that inherit from the ObjectPersistence superclass are recognized by the BIRD API. This superclass contains the attributes and methods common to all generated classes.

Query Engine

Data Integration

[Figure: data integration]

The creation of a database goes through several main stages. First, the relational schema of the system (meta-model) is created when BIRD is installed. In the second phase, the configuration module creates the business data model, including the SQL and Java code corresponding to the predefined metadata given by the XML configuration. Then the system analyzes the integration rules to select the data files, convert them and load them into the relational tables.

Architecture

[Figure: BIRD System architecture]

Key Technologies

Relational Core store

  IBM DB2 WareHouse V9.1 
  WebSphere Federation Server 

WEB Server & Services

  IBM WebSphere Application Server (main portal)
  Tomcat Server (services, non-graphical)
  Hibernate (object-relational mapping)
  JSF - JavaServer Faces (web components)

XML & JAVA

Project Distribution

  Not yet public

Publications

To cite the BIRD System, please use the following publications:

1. Nguyen H., Berthommier G., Friedrich A., Poidevin L., Ripp R., Moulinier L. and Poch O. Introduction du nouveau centre de données biomédicales Décrypthon, CORIA 2008.

2. "Conception of the BIRD System", in preparation for .....

3. "BIRDQL - A new Biological Query Language", in preparation for....


Contact

  Nguyen Ngoc Hoan,PhD
  IGBMC Strasbourg
  1 rue Laurent Fries
  BP 10142
  67404 Illkirch CEDEX / France 
  Mail:nguyen@igbmc.fr
  Tel: 0033 388653302

--Nguyen 15:07, 16 February 2008 (CET)---

FAQ?