Residue Conservation

Introduction

A "protein family" is generally understood as a set of protein sequences sharing the same function. It has also been shown that the protein structures within a family share a common fold. These structural and functional characteristics are reflected in the conservation of residues observed in a multiple sequence alignment of the family. The traces of functional and structural features are embedded in the residues conserved across all sequences, or in stretches of residues defining motifs or patterns. Conserved residues may also specify sub-families of proteins, either by changing some properties of the whole family or by adding new features. It is therefore essential to develop tools able to identify such residues. As more and more sequences become available from very diverse origins, the variability observed in a given protein family increases, and the simple equation "conserved residue means identical residue" no longer holds.

Methods

Several prediction methods have been developed to identify conserved residues. Many of them are based on, for example, entropy calculation, similarity matrix comparison, evolutionary tree tracing, or free energy. In Ordalie, two types of methods are implemented: score-based methods and a physico-chemical one.

Score-based methods

All the following methods have in common that a conserved column is associated with a high positive score. Ordalie implements:

a modified version of Lockless et al. [ref], which computes the free energy associated with each column. In Ordalie, the free energy is weighted by the number of amino acids present in the column.
the mean distance [ref], as in the ClustalX program
the norm ratio (to be published): each amino acid is represented as a vector in two dimensions, where x corresponds to the volume and y to the polarity of the amino acid. Along a column, the norm of the mean vector of all amino acids is divided by the norm of the most represented vector in the column. This ratio is then weighted by the number of amino acids present in that column.

For all these methods, conserved residues are determined by clustering the columns according to their scores. The cluster with the highest mean score corresponds to the highly conserved residues, the cluster with the second highest mean score corresponds to columns of strongly conserved residues, and so on. Ordalie keeps only the top two clusters, denoted highly conserved and conserved columns. Columns belonging to these two clusters are displayed in black and gray, respectively, in the main window.
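The clustering step can be illustrated with a minimal sketch. The text does not say which clustering algorithm or how many clusters Ordalie uses, so the one-dimensional k-means below, the default of three clusters, and the function names kmeans_1d and label_columns are all assumptions made for illustration only.

```python
def kmeans_1d(scores, k=3, iterations=50):
    """Plain 1-D k-means (assumed algorithm, k >= 2); returns (assignments, centres)."""
    lo, hi = min(scores), max(scores)
    # Deterministic start: centres evenly spaced between the lowest and highest score.
    centres = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assign = []
    for _ in range(iterations):
        # Assign each column score to its nearest centre.
        assign = [min(range(k), key=lambda j: abs(s - centres[j])) for s in scores]
        # Recompute each centre as the mean of its members (empty clusters keep their centre).
        new_centres = []
        for j in range(k):
            members = [s for s, a in zip(scores, assign) if a == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return assign, centres


def label_columns(scores, k=3):
    """Label each column 'highly conserved', 'conserved', or None."""
    assign, centres = kmeans_1d(scores, k)
    # Rank the non-empty clusters by mean score, highest first.
    ranked = sorted(set(assign), key=lambda j: centres[j], reverse=True)
    names = {}
    if ranked:
        names[ranked[0]] = "highly conserved"
    if len(ranked) > 1:
        names[ranked[1]] = "conserved"
    # Only the top two clusters are kept; every other column gets None.
    return [names.get(a) for a in assign]
```

With the hypothetical scores [9.8, 9.5, 6.1, 5.9, 1.0, 0.8, 1.2], label_columns marks the first two columns as highly conserved, the next two as conserved, and leaves the remaining three unlabelled.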

A special method, called Three Dimensional Cluster, associates three scores with each column: the free energy score, the mean distance score, and the norm ratio score. The clustering is then performed in a three-dimensional space, and this method has been found to give the best results.

The Threshold method

The Threshold method is based on the physico-chemical nature of the amino acids appearing in a column. Given a conservation threshold x, a column is considered conserved if x% of the residues in the column, gaps included in the count, are identical. Three types of conservation are considered:

a threshold of 100 identifies "identity residues", displayed in white on a black background
a threshold of 80 (the default) identifies "conserved residues", displayed in white on a gray background
a threshold of 60, combined with the equivalence of residues according to their physico-chemical properties (six groups of residues are defined: ILMV, FYW, KRH, DEQN, PAGST, C), defines the "similarity residues", displayed in black on a gray background
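
The three levels above can be sketched as a small classification function. This is only an illustration under stated assumptions, not Ordalie's actual code: the name classify_column, the "-" gap character, the reading of "x%" as "at least x%", and the order in which the levels are tested are all hypothetical choices.

```python
from collections import Counter

# The six physico-chemical groups listed in the text.
GROUPS = ["ILMV", "FYW", "KRH", "DEQN", "PAGST", "C"]
GROUP_OF = {aa: group for group in GROUPS for aa in group}


def classify_column(column, identity=100, conserved=80, similarity=60, gap="-"):
    """Return 'identity', 'conserved', 'similarity', or None for one alignment column.

    Assumptions: gaps count toward the column size, and a level is reached
    when AT LEAST the given percentage of positions agree.
    """
    n = len(column)  # gaps are included in the total
    residues = [c for c in column.upper() if c != gap]
    if not residues:
        return None
    # Count of the most frequent residue in the column.
    top = Counter(residues).most_common(1)[0][1]
    # Integer comparison (top / n >= threshold / 100) avoids float rounding.
    if top * 100 >= identity * n:
        return "identity"
    if top * 100 >= conserved * n:
        return "conserved"
    # Similarity: residues of the same physico-chemical group count as equivalent.
    groups = Counter(GROUP_OF[r] for r in residues if r in GROUP_OF)
    if groups and groups.most_common(1)[0][1] * 100 >= similarity * n:
        return "similarity"
    return None
```

For example, the column "AAAAG" has 4 identical residues out of 5 (80%) and is classified as "conserved", while "ILMVA" has no dominant residue but 4 of 5 residues in the ILMV group, so it is classified as "similarity".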

Case of sub-families

When the alignment has been divided into sub-families, all the above methods try to identify conserved residues inside each sub-family. A background color is associated with each sub-family. In a sub-family context, only the first cluster, that of highly conserved residues, is displayed. For the Threshold method, residues within a sub-family that are 100% present, or 80% present if another sub-family also displays a conserved residue, are shown.