Sequence Clustering

One of the main features of Ordalie is the ability to analyse a protein family hierarchically along the taxa axis. The analysis can focus on the entire family, on sub-families, or on individual taxa. For sub-family analysis, the information retrieved depends on how the sub-families were determined. Ordalie implements two main clustering schemes: sequence based clustering and phylum based clustering.

Sequence based clustering

Sequence based clustering uses two programs, Secator [ref] and DPC [ref], both developed by N. Wicker. Both programs are based on pairwise sequence similarity.
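
As an illustration of what a pairwise similarity can look like, here is a minimal Python sketch that computes the fraction of identical residues between aligned sequences. This is an assumption made for the example: the exact similarity measure used by Secator and DPC is not described here, and the function names `pairwise_identity` and `similarity_matrix` are hypothetical.

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical residues over the columns where neither sequence has a gap.

    Illustrative measure only; not necessarily the one used by Secator or DPC.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
    return identical / compared if compared else 0.0


def similarity_matrix(aligned):
    """Build a {(name1, name2): similarity} table from a {name: aligned sequence} dict."""
    names = list(aligned)
    return {(p, q): pairwise_identity(aligned[p], aligned[q])
            for p in names for q in names}


# Toy alignment (made-up sequences)
toy = {"seqA": "ACD-F", "seqB": "ACE-F"}
print(pairwise_identity(toy["seqA"], toy["seqB"]))  # 0.75
```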

Secator uses these similarities to compute a distance matrix, from which it builds a phylogenetic tree. The clusters are formed by cutting the tree at the nodes that maximize the loss of inertia inside the tree.
DPC uses the similarities as coordinates. The set of points is then cut repeatedly until the two new clusters created by a cut have an overlapping density lower than that of their parent cluster.

These two methods automatically detect a suitable number of clusters. Nevertheless, it is possible to constrain the clustering to a user-defined number of clusters.
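
To give a feel for clustering constrained to a user-defined number of clusters, the sketch below converts similarities to distances and merges the closest groups until the requested number remains. It uses plain average-linkage agglomeration as a stand-in: it does not implement Secator's inertia criterion or DPC's density criterion, and the names `build_distance` and `cluster_to_k` as well as the toy similarity values are hypothetical.

```python
def build_distance(similarity):
    """Turn a {(a, b): similarity in [0, 1]} table into a symmetric distance table."""
    distance = {}
    for (a, b), s in similarity.items():
        distance[(a, b)] = 1.0 - s
        distance[(b, a)] = 1.0 - s
    return distance


def cluster_to_k(names, distance, k):
    """Merge the two closest clusters (average linkage) until k clusters remain."""
    if not 1 <= k <= len(names):
        raise ValueError("k must be between 1 and the number of sequences")
    clusters = [[n] for n in names]  # start with one cluster per sequence

    def avg_dist(c1, c2):
        return sum(distance[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = avg_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy similarities (made-up values): s1/s2 are close, s3/s4 are close
names = ["s1", "s2", "s3", "s4"]
toy_similarity = {
    ("s1", "s2"): 0.9, ("s1", "s3"): 0.2, ("s1", "s4"): 0.3,
    ("s2", "s3"): 0.25, ("s2", "s4"): 0.2, ("s3", "s4"): 0.8,
}
toy_distance = build_distance(toy_similarity)
print(cluster_to_k(names, toy_distance, 2))
```

With these toy values, asking for two clusters groups s1 with s2 and s3 with s4.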

Phylum based clustering

The clustering is done according to 4 main phyla, arbitrarily chosen as:

eukaryota
archaea
prokaryota
viruses

Sequences whose phylum is unknown are clustered together. The phylum information is present by default in a Macsims file. Otherwise, Ordalie queries sequence databanks through a web server to retrieve the taxa information for each sequence. Be aware that this process can take some time, depending on the number of sequences in the alignment and on internet traffic.
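
The grouping rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the phylum of each sequence is already available as a string (for example read from a Macsims file or retrieved from a databank); the function name `cluster_by_phylum`, the "unknown" label, and the sequence names are hypothetical and do not reflect Ordalie's internal code.

```python
from collections import defaultdict

# The four phyla listed above, in lower case
KNOWN_PHYLA = ("eukaryota", "archaea", "prokaryota", "viruses")


def cluster_by_phylum(phylum_of):
    """Group sequence names by phylum; anything unrecognised goes to 'unknown'."""
    groups = defaultdict(list)
    for name, phylum in phylum_of.items():
        key = (phylum or "").strip().lower()
        if key not in KNOWN_PHYLA:
            key = "unknown"  # missing or unrecognised phylum
        groups[key].append(name)
    return dict(groups)


# Toy input (made-up sequence names)
example = {
    "seq1": "Eukaryota",
    "seq2": None,
    "seq3": "archaea",
    "seq4": "Eukaryota",
}
print(cluster_by_phylum(example))
```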