BAliBASE Reference Set 10: benchmark alignments containing sequences with subfamily-specific features, motifs in disordered regions, and fragmentary or erroneous sequences

Access

Download all the alignments by FTP.

Download the C program to compare a test alignment with the BAliBASE reference.
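
To illustrate what comparing a test alignment with a reference can involve, here is a minimal Python sketch of a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. It is an assumption-laden illustration, not the downloadable C program or its method. The function names (aligned_pairs, sum_of_pairs_score), the dictionary input format, and the toy sequences are all hypothetical, and the sketch scores every reference column without any further filtering.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    alignment: dict mapping sequence name -> gapped string (hypothetical
    input format; all rows must have the same length).
    Each pair is ((name_a, residue_index_a), (name_b, residue_index_b)).
    """
    names = sorted(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    n_cols = lengths.pop() if lengths else 0

    residue_index = {name: 0 for name in names}  # ungapped position per row
    pairs = set()
    for col in range(n_cols):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, residue_index[name]))
                residue_index[name] += 1
        # Every two residues in the same column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs also aligned in the test."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy example (made-up sequences, not BAliBASE data).
reference = {"s1": "AC-GT", "s2": "ACAGT"}
test = {"s1": "ACG-T", "s2": "ACAGT"}
print(sum_of_pairs_score(test, reference))  # 0.75
```

In the toy example the reference aligns four residue pairs and the test alignment recovers three of them, so the score is 0.75. A score of 1.0 means every residue pair aligned in the reference is also aligned in the test alignment.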

BAliBASE Reference Set 10 includes 218 large, complex protein families. It is designed to reflect today's sequence exploration requirements and addresses three major points:

(i) Most existing MSA benchmarks, and consequently most MSA construction algorithms, have focused on the patterns conserved in the majority of the sequences. Too little attention has been paid to the less frequent patterns, or specificity-determining positions (SDPs), that might indicate subfamily-specific or context-specific functions.

(ii) Current MSA programs for protein sequences generally model the structure and evolution of globular domains. Nevertheless, many proteins, particularly in eukaryotes, are unstructured (natively disordered) or contain large unstructured regions. These regions frequently contain motifs, such as signalling sequences or sites of post-translational modification, that are involved in the regulatory functions of a cell.

(iii) High-throughput sequencing technologies have produced huge volumes of very noisy data, including fragmentary or otherwise erroneous sequences, that affect MSA program performance.


All alignments

View all benchmark alignments


If you have any problems, comments, or questions, please e-mail Julie Thompson.