Residue Conservation

Introduction

A "protein family" is generally understood to be a set of protein sequences sharing the same function. It has also been shown that the protein structures within a family share a common fold. These structural and functional characteristics are reflected in the conservation of residues observed in a multiple sequence alignment of the protein family. Traces of the functional and structural features are embedded in the conservation of residues common to all sequences, or of stretches of residues defining motifs or patterns. Conserved residues may also define sub-families of proteins, in which some properties of the whole family are changed or new features are added. It is therefore essential to develop tools able to identify such residues. As more and more sequences become available from very diverse origins, the variability observed in a given protein family increases, and the simple equation of conserved residues with identical residues no longer holds.

Methods

Several prediction methods have been developed to identify conserved residues. Many of them are based on, for example, entropy calculation, similarity matrix comparison, evolutionary tree trace, or free energy calculation. In Ordalie, two types of methods are implemented: score-based methods and a physico-chemical based one.
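As an illustration of the entropy-based family mentioned above, the following minimal sketch computes the Shannon entropy of each column of a small alignment. It is not taken from Ordalie: the function name column_entropy, the toy alignment, and the choice to count a gap as an ordinary symbol are assumptions made for the example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    Assumption: a gap ('-') is counted like any other symbol.
    """
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Toy alignment (hypothetical data): four sequences, five columns.
alignment = [
    "MKVLA",
    "MKILA",
    "MRVLG",
    "MKVL-",
]

# Transpose the alignment so that each string is one column.
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
entropies = [column_entropy(col) for col in columns]

for index, (col, h) in enumerate(zip(columns, entropies), start=1):
    print(f"column {index}: {col}  entropy = {h:.3f} bits")
```

A fully identical column has an entropy of zero, and the entropy grows with the variability of the column. To be used as a score in which a high value means a conserved column, the entropy would therefore have to be inverted, for instance by subtracting it from its maximum possible value.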

Score-based methods
All the following methods have in common that a conserved column is associated with a high positive score. Ordalie implements:

For all these methods, conserved residues are determined from the scores by clustering the columns according to their scores. The cluster with the highest mean score corresponds to the highly conserved residues, the cluster with the second highest mean score corresponds to columns of strongly conserved residues, and so on. Ordalie keeps only the top two clusters, denoted highly conserved and conserved columns. Columns belonging to these two clusters are displayed in black and gray, respectively, in the main window.
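The clustering step described above can be sketched as follows. This is only an illustration under stated assumptions, not Ordalie's actual code: the text does not say which clustering algorithm or how many clusters are used, so a plain one-dimensional k-means with three clusters is assumed here, and the names kmeans_1d and classify_columns as well as the example scores are hypothetical.

```python
def kmeans_1d(scores, k, iterations=100):
    """Plain 1-D k-means; returns one cluster label per score and the centroids."""
    lo, hi = min(scores), max(scores)
    # Assumption: initial centroids are spread evenly over the score range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(scores)
    for _ in range(iterations):
        # Assign every score to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(s - centroids[c])) for s in scores]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [s for s, lab in zip(scores, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def classify_columns(scores, k=3):
    """Label each column from its score, keeping only the top two clusters."""
    labels, centroids = kmeans_1d(scores, k)
    # Rank the clusters by mean score, highest first.
    ranked = sorted(range(k), key=lambda c: centroids[c], reverse=True)
    names = {ranked[0]: "highly conserved", ranked[1]: "conserved"}
    return [names.get(lab, "not conserved") for lab in labels]


# Hypothetical per-column scores (a high score means a conserved column).
scores = [9.1, 8.8, 9.4, 5.2, 4.9, 0.3, 0.1, 0.6, 5.0, 9.0]
classes = classify_columns(scores)
for index, (score, name) in enumerate(zip(scores, classes), start=1):
    print(f"column {index}: score = {score:4.1f}  {name}")
```

In this sketch, the columns of the best cluster would be the ones displayed in black and those of the second cluster the ones displayed in gray; all remaining clusters are merged into a single "not conserved" class.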

A special method, called Three Dimensional Cluster, associates three scores with each column: the free energy score, the mean distance score, and the norm ratio score. The clustering is then performed in a three-dimensional space, and has been proven to give the best results.

The Threshold method

The Threshold method is based on the physico-chemical nature of the amino acids appearing in a column. Basically, given a conservation threshold x, a column is considered conserved if x% of the residues, gaps included, are identical in the column. Three types of conservation are considered: