NCBI search: multiple sequence alignment editor


Three points:
- multi-scale analysis : i) taxa -- cluster - all sequences, ii) large regions - motifs - residue, iii) alignment - structure - phylogeny environment.
- alignment management provides recording of alignment snapshots
- ease of use, multi-platform -> benefits all researchers, even students


ABSTRACT
OBJECTIVE
Multiple sequence alignments play a key role in modern bioinformatics, having become cornerstones of several types of investigation such as protein family analysis, evolutionary studies and comparative genomics. Depending on the study, the alignment is used either to provide or to host information. This information ranges from domains and motifs to protein secondary structure, hydrophobicity and phylogenetic trees. Numerous standalone programs can access this information, but they generate disconnected output files.

FINDINGS :
We present here ORDALIE, an integrated workbench for manipulating and exploring the informational content of a multiple sequence alignment. ORDALIE is arranged around an internal SQLite database that allows storage and retrieval of information. User interactions with ORDALIE create modified alignments (different clusterings, different ways of aligning sequences) that can be stored and retrieved at any time. Analysis tools such as a tree builder, a 3D model viewer, feature mapping, sequence clustering and sequence conservation computation are also provided to decipher the informational content of the alignment, and their associated data are also stored in the database.

CONCLUSION:
Although many programs already exist to handle multiple sequence alignments or to extract information from them, these programs remain separate entities and do not give access to the global sphere of information attached to a given alignment. By embedding a database system together with alignment manipulation and exploration tools, the ORDALIE platform makes the data deciphered from alignment analysis persistent. It also provides interconnected tools allowing a broad variety of information mining for alignment exploitation.


INTRODUCTION:
Multiple sequence alignments (MSAs) play a key role in modern bioinformatics, having become cornerstones of several types of investigation such as protein family analysis, evolutionary studies and comparative genomics. MSAs are also used as support in studies such as the inference of point mutation pathogenicity.
As a consequence, MSAs have become hubs of information, being both the main data source for some applications (phylogeny, motif discovery, etc.) and the host for data coming from diverse studies.
Accurate alignments are therefore crucial to ensure that information is deciphered or mapped with high quality. It should be stressed that, although research on sequence alignment algorithms is still intensive and the resulting software is more and more accurate, manual inspection and/or curation of alignments is still necessary.

MSAs are not static objects. Even if, formally, a set of aligned sequences constitutes a multiple sequence alignment, in real life there are as many different alignments of the same set of sequences as there are types of study undertaken. Indeed, a study may need a specific sequence ordering or a different sequence clustering, for example. Several solutions for aligning residues may also exist, leading to different alignments.
Numerous programs for manipulating MSAs exist, even if few of them are used in practice. They usually do not allow several views of the same set of sequences while working with them, and their associated alignment editing facilities are usually limited or hard to use.

We present here ORDALIE (ORDered ALignment Information Explorer), a workbench dedicated to the analysis and exploration of the informational content of an MSA. ORDALIE manages, manipulates and stores instances of alignments, hereafter called "snapshots", and their associated feature data, if any. It also provides several tools to extract and decipher the intrinsic alignment information.


THE ORDALIE CORE
The database:
The core of ORDALIE is built around a SQLite database whose schema is given in figure \cite{db-scheme}. ORDALIE takes advantage of this underlying database to store snapshots of alignments and their associated features, allowing the evaluation of different hypotheses of residue alignment or sequence clustering, for example.
The "ordalie" table contains the settings saved at exit, allowing the user to find the same state when launching ORDALIE again.
The "seqinfo" table contains sequence information that is not linked to amino acid positions (length, molecular weight, isoelectric point, ...).
The "seqfeat" table is used to store feature data mapped onto the residue sequence.
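
As an illustration only, the table layout described above could be sketched with Python's built-in sqlite3 module. The table names "ordalie", "seqinfo" and "seqfeat" come from the text; every column name, and the helper functions create_database and features_for, are assumptions made for this example and do not reflect the actual ORDALIE schema.

```python
import sqlite3

# Hypothetical schema: table names follow the text, column names are assumptions.
SCHEMA = """
CREATE TABLE ordalie (
    param TEXT PRIMARY KEY,     -- name of a saved setting
    val   TEXT                  -- its value when the program exited
);
CREATE TABLE seqinfo (
    seq_id     INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    length     INTEGER,         -- number of residues
    mol_weight REAL,            -- molecular weight
    iso_point  REAL             -- isoelectric point
);
CREATE TABLE seqfeat (
    feat_id INTEGER PRIMARY KEY,
    seq_id  INTEGER NOT NULL REFERENCES seqinfo(seq_id),
    ftype   TEXT NOT NULL,      -- kind of feature (domain, helix, ...)
    start   INTEGER NOT NULL,   -- first residue covered by the feature
    stop    INTEGER NOT NULL    -- last residue covered by the feature
);
"""


def create_database(path=":memory:"):
    """Open a database and create the three illustrative tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def features_for(conn, name):
    """Return (type, start, stop) for every feature mapped onto a sequence."""
    rows = conn.execute(
        "SELECT f.ftype, f.start, f.stop FROM seqfeat f "
        "JOIN seqinfo s ON s.seq_id = f.seq_id "
        "WHERE s.name = ? ORDER BY f.start",
        (name,),
    )
    return rows.fetchall()


conn = create_database()
conn.execute(
    "INSERT INTO seqinfo (name, length, mol_weight, iso_point) VALUES (?, ?, ?, ?)",
    ("seq1", 120, 13250.4, 6.8),
)
conn.execute(
    "INSERT INTO seqfeat (seq_id, ftype, start, stop) VALUES (1, 'helix', 10, 25)"
)
```

Keeping position-independent values in one table and position-dependent features in another mirrors the separation described in the text: a feature row always points back to the sequence it is mapped onto.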

Alignment Management :
ORDALIE reads and writes alignments in the MSF, Fasta, ClustalW and MACSIMS/XML file formats, as well as in the specific ORDALIE file format. Once an alignment is loaded, ORDALIE allows standard alignment editing operations such as cut, copy and paste of sequences. Empty sequences ("separators") can be inserted or removed in order to create user-defined groups outside the "Cluster" tool. The whole MSA or parts of it can be printed for publication purposes.

Editing :
In order to perform fast alignment editing, we developed a dedicated Tk widget from scratch, written in C for performance reasons. The "Editor" functionalities are based on, and extend, the famous SeqLab editor \cite{seqlab} that was part of the GCG Wisconsin package. By default, the user can only insert/remove gaps inside one or several sequences. Options for grouping/ungrouping sequences and removing columns of gaps are provided. Amino acids may be edited after unlocking the sequence. After editing, the user can save or discard the modified alignment.


THE ORDALIE TOOLS
Trees :
ORDALIE allows the building of phylogenetic trees based on all or part of the aligned sequences, and on all or part of the alignment columns. ORDALIE computes a distance matrix using the selected set of sequences and columns. The phylogenetic tree is inferred from this distance matrix using the FastME program \cite{soft-fastme}. Although likelihood-based trees are more reliable, the speed and reliability of FastME are sufficient to gain a first insight into the protein phylogeny. The robustness of the tree nodes can be assessed through bootstrap scoring.
The computed tree is then displayed in a dedicated window. The tree can be viewed as a dendrogram or as a radial tree. The user can re-root the tree, swap branches, display bootstrap values, show nodes above a bootstrap threshold, change branch labelling, print the tree and more.
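
The distance-matrix step described above can be sketched as follows. The text does not state which distance ORDALIE uses, so this example assumes the simple p-distance (fraction of differing residues among the compared positions); the function names and the gap handling are likewise assumptions made for illustration.

```python
GAPS = {"-", "."}


def p_distance(a, b, columns):
    """Fraction of differing residues over the selected columns.

    Columns where either sequence has a gap are skipped (an assumption).
    """
    compared = differing = 0
    for c in columns:
        x, y = a[c], b[c]
        if x in GAPS or y in GAPS:
            continue
        compared += 1
        if x != y:
            differing += 1
    # Arbitrary choice for this sketch: no comparable position gives 0.0.
    return differing / compared if compared else 0.0


def distance_matrix(alignment, names=None, columns=None):
    """Pairwise distances for a subset of sequences and columns.

    alignment maps sequence names to aligned strings of equal length.
    names and columns default to all sequences and all columns.
    """
    names = list(alignment) if names is None else list(names)
    if columns is None:
        columns = range(len(alignment[names[0]]))
    columns = list(columns)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = p_distance(alignment[names[i]], alignment[names[j]], columns)
            matrix[i][j] = matrix[j][i] = d
    return names, matrix


alignment = {"a": "ACDE-", "b": "ACDF-", "c": "AC-FG"}
names, matrix = distance_matrix(alignment)
sub_names, sub_matrix = distance_matrix(alignment, names=["a", "c"], columns=[0, 1, 3])
```

Restricting the names and columns arguments reproduces the idea of building a tree from only part of the sequences or part of the alignment; the resulting symmetric matrix is the kind of input a distance-based tree method consumes.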

Clustering:
The analysis of the differences between sequences inside an MSA is usually an important source of information. Some sequence clusterings may be obvious, such as partitioning sequences according to their catalytic activities if several exist, or according to the life domain to which the sequences belong. ORDALIE provides a clustering mode allowing sequences to be clustered on all or part of the columns, using several criteria and five clustering algorithms. The criteria are identity percentage, isoelectric point, sequence length, hydrophobicity and amino acid composition. These criteria can be combined. The clustering algorithms are the ones provided by the Cluspack package, i.e. hierarchical clustering using Secator, k-means clustering with DPC (Density Point Clustering), and mixture model clustering with AIC or BIC criteria for group definition. The special "Life Domain" criterion clusters sequences into Eukaryota, Archaea, Bacteria, viruses or unknown. The clustering can then be saved and retrieved later.

Conservation:
Evaluating residue conservation along the alignment is of prime importance for assessing the functions of the protein under study. Sequence conservation constitutes a fingerprint of the effect of evolutionary pressure and reveals zones of interest in the sequence.
Many algorithms exist to compute conservation. ORDALIE implements the following methods: "Threshold", "Liu", "Mean Distances", "Vector Norm", "Multi" and "BILD". The "Threshold" method simply counts conserved residues and sorts them into three groups: 100% conserved, >80% conserved and >60% physicochemically conserved, using the amino acid groups PAGST, DEQN, KRH, FYW, ILMV and C. The "Mean Distances" method is the one used in the ClustalX program. Conservation can be computed using the whole set of sequences only ("Global" option) or by taking the sequence clustering into account. In the latter case, ORDALIE outputs the same three global conservation groups and also shows the main conserved columns in each sequence cluster.

3D viewer:
If a sequence name refers to a PDB (Protein Data Bank) ID, the corresponding structure is automatically downloaded and processed.

Other tools:
The "Overview" tool represents the current alignment as a pixel map onto which one or more features can be drawn. This gives a schematic representation of the distribution of the features along the alignment.
The "Search" tool finds motifs inside the alignment. The search motif follows the FindPattern syntax. The pattern may be degenerate. Occurrences of the motif appear in red in the sequence window.
The "Fetch Information" tool queries the UniProt and RefSeq databases \cite{db-uniprot, db-refseq} using the sequence IDs to retrieve relevant information such as organism, description, lineage, etc.


the end

ORDALIE takes advantage of its underlying database to store any alignment snapshot. The "snapshot" table contains the name of the snapshot and its description.

An alignment consists of a description, a set of sequences, a set of features associated with it, one or several clusterings, and one or several conservation score calculations. When an alignment is loaded for the first time, a hard copy of it is inserted into the database. This copy cannot be changed in any way and represents the reference alignment. A second copy is also created as a working copy. The user can create as many copies as desired, which may correspond to different analyses. The "Alignment" combobox in the main window allows switching between registered alignments.
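
The reference copy / working copy mechanism described above could look like the following sketch, again with Python's sqlite3 module. Only the "snapshot" table and its name and description fields are taken from the text; the read_only flag, the snapshot_seq table and all function names are hypothetical and are not ORDALIE's actual implementation.

```python
import sqlite3

# Hypothetical layout: only "snapshot", its name and its description are from the text.
SCHEMA = """
CREATE TABLE snapshot (
    snap_id     INTEGER PRIMARY KEY,
    name        TEXT UNIQUE NOT NULL,
    description TEXT,
    read_only   INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE snapshot_seq (
    snap_id  INTEGER NOT NULL REFERENCES snapshot(snap_id),
    seq_name TEXT NOT NULL,
    aligned  TEXT NOT NULL,
    PRIMARY KEY (snap_id, seq_name)
);
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def _insert_snapshot(conn, name, description, sequences, read_only):
    cur = conn.execute(
        "INSERT INTO snapshot (name, description, read_only) VALUES (?, ?, ?)",
        (name, description, int(read_only)),
    )
    snap_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO snapshot_seq VALUES (?, ?, ?)",
        [(snap_id, seq_name, aligned) for seq_name, aligned in sequences.items()],
    )
    return snap_id


def load_alignment(conn, sequences):
    """First load: store a read-only reference and an editable working copy."""
    ref = _insert_snapshot(conn, "reference", "alignment as first loaded", sequences, True)
    work = _insert_snapshot(conn, "working", "editable copy", sequences, False)
    return ref, work


def get_sequences(conn, snap_id):
    rows = conn.execute(
        "SELECT seq_name, aligned FROM snapshot_seq WHERE snap_id = ?", (snap_id,)
    )
    return dict(rows.fetchall())


def copy_snapshot(conn, snap_id, name, description=""):
    """Create a new editable snapshot from an existing one."""
    return _insert_snapshot(conn, name, description, get_sequences(conn, snap_id), False)


def update_sequence(conn, snap_id, seq_name, aligned):
    """Change one aligned sequence, refusing to touch a read-only snapshot."""
    row = conn.execute(
        "SELECT read_only FROM snapshot WHERE snap_id = ?", (snap_id,)
    ).fetchone()
    if row is None:
        raise KeyError(snap_id)
    if row[0]:
        raise PermissionError("the reference snapshot cannot be modified")
    conn.execute(
        "UPDATE snapshot_seq SET aligned = ? WHERE snap_id = ? AND seq_name = ?",
        (aligned, snap_id, seq_name),
    )
```

Because every snapshot holds its own rows, editing the working copy or any later copy leaves the reference alignment untouched, which is the property the text relies on when comparing alternative alignments of the same sequences.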


The "snapshot" table contains all the alignments created so far.


ORDALIE's core consists of a SQLite database. The first time ORDALIE is run with a given alignment, a read-only copy of this alignment and of all associated features, if any, is stored, and a working copy is generated.


###########################################


ABSTRACT
BACKGROUND :
Multiple sequence alignments play a key role in modern bioinformatics, having become cornerstones of several types of investigation such as protein family analysis, evolutionary studies and comparative genomics. As a result, their analysis is no longer restricted to bioinformaticians and is entering every wet lab. Although many tools exist for visualising and editing alignments, building phylogenetic trees or clustering sequences, these tools generally remain disconnected from each other and hard to handle for researchers or students outside the bioinformatics world.

FINDINGS :
We present here ORDALIE, an integrated workbench for manipulating and exploring the informational content of a multiple sequence alignment. ORDALIE is meant to provide tools in a user-friendly environment in order to 
All information is kept inside a SQL database that 

CONCLUSIONS :
ORDALIE has been tailored for non-bioinformatician users and will benefit researchers or students wishing to decipher the information content of their protein family. Several investigation landscapes are provided, ranging from structural to phylogenetic tree contexts.


KEYWORDS :


FINDINGS :
BACKGROUND :
For several decades now, multiple sequence alignments (MSAs) have played a crucial role in many aspects of modern bioinformatics, such as protein family annotation, evolutionary analysis, comparative genomics and orthology studies. Indeed, as an MSA is made of homologous genes, it intrinsically contains sequence - structure - function - evolution relationships that must be deciphered by the biologist. Several points arise when exploring the information contained in an MSA.
Firstly, although algorithms are becoming more and more accurate at building MSAs, manual curation is still required to ensure maximal quality. The higher the quality of the MSA, the better the information that can be retrieved from it.

Secondly, mining information inside an MSA can be a multi-scale and multi-context search. At the sequence level, important information ranges from the presence or absence of domains, as in comparative genomics, down to the residue level, for example when investigating the impact of point mutations. At the taxon level, the search may concern all taxa or groups of taxa, depending on the type of study. For example, sequences can be grouped according to their phylogeny (eukaryotes, bacteria, ...), by their physicochemical nature (thermophiles, psychrophiles, mesophiles), or in any way defined by the investigator. Finally, the MSA information can be exploited in the context of the alignment, or in the context of the structure of the sequences when known, in order to find sequence - structure - function relationships such as spatial functional patches.

Thirdly, MSA exploitation is nowadays no longer restricted to the bioinformatics world and has become part of the bench for a broad audience. Indeed, MSAs are used in secondary school to introduce phylogeny, at university, and by researchers to get an overall view of a protein under study.

FINDINGS :
We present here ORDALIE (ORDered ALignment Information Explorer), an integrated platform to manage MSAs and to extract and add information. ORDALIE tries to fulfil the three points highlighted above. A description of the alignment management, the editing facilities and the available tools is provided below.


AVAILABILITY AND REQUIREMENTS :
ORDALIE is written in Tcl/Tk. An alignment editor Tk widget (Biotext) written in C has also been developed and is included in the ORDALIE distribution. Installers and binaries are provided for Windows, Mac OS X and Linux, as well as source code and documentation, at http://www.lucmoulinier.fr/ordalie. ORDALIE is an open-source program and is distributed under the LGPL licence.


Mining an MSA can reveal, for example, conserved patterns or motifs, or the presence or absence of domains or regions that may be involved in molecular recognition or function. Mapping information retrieved at the sequence level onto a 3D structure may also give insights into the recognition or functional mechanisms of the protein.


Numerous tools already exist to visualize, edit and infer information from an MSA, to compute a phylogenetic tree, to map MSA features onto a 3D structure, to calculate residue conservation or to cluster sequences inside an MSA.

Such tools were originally developed by bioinformaticians and usually perform one precise task. Since the -omics era, bioinformatics has spread into biology labs and such tools have become part of the biologist's bench.


MANAGING MSA


The main philosophy of ORDALIE relies on three points: i) giving user-friendly access to a bioinformatics toolbox inside the context of a given MSA. The software is arranged around "modes", each one dedicated to a specific task. The information deduced from the MSA or imported into the software can be accessed in many modes to help understand the protein family under study.


There are nowadays more than 40 MSA viewers/editors (https://en.wikipedia.org/wiki/List_of_alignment_visualization_software), offering varying sets of features and capacities.


The Modes :
Following is a brief description of some of the most important modes available in ORDALIE.

Alignment editor:
Clustering mode:
Conservation analysis mode:
Tree mode:
Structure mode:
Other tools:

ORDALIE focuses on ease of use and on the interconnection of the available tools.

ORDALIE is a desktop application written in Tcl/Tk and C, available for the Linux, Windows and MacOS operating systems. Installers can be downloaded at http://www.lbgi.fr/ordalie. It does not require a web connection to run, although internet access is required for some functionalities.

Editing facilities :
The "Editor" mode is an extended emulation of the famous SeqLab editor that was part of the GCG Wisconsin package. 

Tree mode : 
ORDALIE allows the building of phylogenetic trees based on all or part of the aligned sequences, and on all or part of the alignment columns. ORDALIE computes a distance matrix using the selected set of sequences and columns. The phylogenetic tree is inferred from this distance matrix using the FastME program. Although likelihood-based trees are more reliable, the speed and reliability of FastME are sufficient to gain a first insight into the protein phylogeny. The robustness of the tree nodes can be assessed through bootstrap scoring.
The computed tree is then displayed in a separate window. The tree can be viewed as a dendrogram or as a radial tree. The user can re-root the tree, swap branches, display bootstrap values, show nodes above a bootstrap threshold, change branch labelling, print the tree and more.

Clustering mode :
The analysis of the differences between sequences inside an MSA is usually an important source of information. Some sequence clusterings may be obvious, such as partitioning sequences according to their catalytic activities if several exist, or according to the life domain to which the sequences belong. ORDALIE provides a clustering mode allowing sequences to be clustered on all or part of the columns, with several criteria and 4 clustering algorithms. The criteria are identity percentage, isoelectric point, sequence length, hydrophobicity and amino acid composition. These criteria can be combined. The clustering algorithms are the ones provided by the Cluspack package, i.e. hierarchical clustering using Secator, k-means clustering with DPC (Density Point Clustering), and mixture model clustering with AIC or BIC criteria for group definition. The special "Life Domain" criterion clusters sequences into Eukaryota, Archaea, Bacteria, viruses or unknown. The clustering can then be saved and retrieved later.

Other tools:
The Overview tool represents the current alignment as a pixel map onto which features can be drawn. This gives a schematic representation of the distribution of the features along the alignment.
The Search tool finds motifs inside the alignment. The motif follows the FindPattern syntax and may be degenerate.
The Fetch Information tool queries the UniProt and RefSeq databases using the sequence IDs to retrieve relevant information such as organism, description, lineage, etc.