Sequence Clustering

One of the main features of Ordalie is the ability to analyse a protein family hierarchically along the taxonomic axis. The analysis can focus on the entire family, on sub-families, or on individual taxa. For sub-family analysis, the information retrieved depends on how the sub-families were determined. Two main clustering schemes are implemented in Ordalie: a sequence based clustering and a phylum based clustering.

Sequence based clustering

The sequence based clustering uses two programs, Secator [ref] and DPC [ref], developed by N. Wicker. Both programs rely on pairwise sequence similarity.

These two methods automatically detect a suitable number of clusters. Nevertheless, it is possible to constrain the clustering to a user-defined number of clusters.
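
To illustrate what clustering on pairwise similarity with a user-defined number of clusters involves, here is a minimal Python sketch. It is not the Secator or DPC algorithm: it swaps in plain average-linkage agglomerative clustering on percent identity, and the names pairwise_identity, cluster_by_similarity and n_clusters are hypothetical, chosen only for this example. It assumes the sequences are already aligned and that gaps are written as "-".

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def cluster_by_similarity(aligned, n_clusters):
    """Average-linkage agglomerative clustering (illustrative only, not Secator/DPC).

    aligned    : dict mapping sequence name -> aligned sequence string
    n_clusters : user-defined number of clusters at which merging stops
    """
    names = list(aligned)
    # Precompute the similarity of every pair of sequences.
    sim = {
        frozenset((p, q)): pairwise_identity(aligned[p], aligned[q])
        for p, q in combinations(names, 2)
    }
    # Start with one cluster per sequence.
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean similarity over all cross-cluster pairs.
        total = sum(sim[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    # Repeatedly merge the two most similar clusters until the target is reached.
    while len(clusters) > max(n_clusters, 1):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    toy_alignment = {
        "s1": "MKTAYIAK",
        "s2": "MKTAYIAR",
        "s3": "GGSLLPWE",
        "s4": "GGSLLPWD",
    }
    print(cluster_by_similarity(toy_alignment, n_clusters=2))
```

The toy alignment is invented for the example. Detecting the number of clusters automatically, as Secator and DPC do, would need a stopping criterion that this sketch does not attempt to reproduce.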

Phylum based clustering

The clustering is done according to 4 main phyla, arbitrarily chosen as:

Sequences for which the phylum is unknown are clustered together. The phylum information is present by default in a Macsims file. Otherwise, Ordalie queries sequence databanks through a web server to retrieve the taxonomic information for each sequence. Be aware that this process can take some time, depending on the number of sequences in the alignment and on internet traffic.
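
The grouping rule described above, with unknown phyla pooled into a single cluster, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ordalie's code: the function name cluster_by_phylum, the "unknown" label and the phylum_A / phylum_B labels are hypothetical placeholders, and the sketch assumes the phylum of each sequence has already been read from the Macsims file or retrieved from a databank.

```python
from collections import defaultdict


def cluster_by_phylum(phylum_of, unknown_label="unknown"):
    """Group sequence identifiers by phylum label.

    phylum_of     : dict mapping sequence id -> phylum label (None or "" if unknown)
    unknown_label : name of the cluster that collects sequences with no phylum
    """
    groups = defaultdict(list)
    for seq_id, phylum in phylum_of.items():
        # Missing or blank labels all fall into the same "unknown" cluster.
        key = phylum.strip() if phylum and phylum.strip() else unknown_label
        groups[key].append(seq_id)
    return {label: sorted(ids) for label, ids in groups.items()}


if __name__ == "__main__":
    # Placeholder labels; the real phylum names come from the annotation.
    annotations = {
        "seq1": "phylum_A",
        "seq2": "phylum_B",
        "seq3": None,
        "seq4": "phylum_A",
        "seq5": "",
    }
    for label, members in cluster_by_phylum(annotations).items():
        print(label, members)
```

How a sequence with a known phylum outside the 4 main ones is handled is not covered here; the sketch simply gives every distinct label its own cluster.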