BAliBASE Reference Set 9 includes four alignment reference subsets containing protein families with linear motifs (LMs). LMs include important functional sites such as protein interaction sites, cell compartment targeting signals, post-translational modification sites or cleavage sites. These sites are often found in disordered regions that are difficult to align by classical multiple sequence alignment methods. The majority of LMs are between 3 and 10 amino acids in length and most have one or more ambiguous (variable) or wildcard (totally variable) residues. Their short and degenerate nature makes real LMs difficult to distinguish from the background distribution of randomly occurring false positive motifs.
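As a minimal sketch of why short, degenerate motifs produce chance matches, the snippet below scans a sequence with a regular expression and estimates how often a random window would match. The pattern `[RK].L.[FL]`, the helper names and the assumption of uniform residue frequencies are all illustrative and are not taken from BAliBASE or ELM.

```python
import re

# Hypothetical motif for illustration only:
# [RK] and [FL] are ambiguous positions, '.' is a wildcard, L is fixed.
MOTIF = "[RK].L.[FL]"


def find_motif(sequence, pattern=MOTIF):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets finditer report overlapping hits.
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


def random_match_probability(position_sizes, alphabet_size=20):
    """Chance that one random window matches, assuming uniform residue frequencies."""
    p = 1.0
    for size in position_sizes:
        p *= size / alphabet_size
    return p


# Number of residues allowed at each position: [RK]=2, '.'=20, L=1, '.'=20, [FL]=2
p = random_match_probability([2, 20, 1, 20, 2])

# Expected chance matches in a 500-residue sequence (496 windows of length 5).
expected_hits = p * (500 - 5 + 1)

hits = find_motif("MKRLSLFAA")
```

Under these assumptions a random window matches with probability 0.0005, so a single 500-residue sequence is expected to carry about 0.25 chance matches. Real residue frequencies are not uniform, so this should be read only as an order-of-magnitude estimate.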
Subset 1 contains only sequences with true positive motifs. It is further organised into three groups according to sequence identity: <20% identity, 20-40% identity and 40-80% identity.
Sequences with <20% identity: Subset 1 (1)
Sequences with 20-40% identity: Subset 1 (2)
Sequences with 40-80% identity: Subset 1 (3)
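The grouping by identity can be sketched as follows. How BAliBASE computes percent identity and how it treats the 20%, 40% and 80% boundaries is not stated here, so the definition (identical residues over columns where neither sequence has a gap) and the bin edges below are assumptions, and the function names are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity of two aligned sequences over columns without gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_group(pid):
    """Map a percent identity to a Subset 1 group (bin edges are an assumption)."""
    if pid < 20:
        return "Subset 1 (1)"
    if pid < 40:
        return "Subset 1 (2)"
    if pid <= 80:
        return "Subset 1 (3)"
    return None  # above 80% identity: outside the three groups


pid = percent_identity("AC-DE", "ACFDQ")
group = identity_group(pid)
```

In this toy pair, four columns are compared and three are identical, giving 75% identity and placing the pair in the 40-80% group.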
Subset 2 contains sequences with possible 'errors' (badly predicted sequences, fragments, splice variants). These sequences share some similarity with the reference sequence, but do not contain the ELM motif.
Sequences with true positive motifs only: Subset 2 (1)
Sequences with true positive motifs and sequences with errors: Subset 2 (2)
Subset 3 contains true positive sequences aligned with sequences containing false positive motifs.
Sequences with true positive motifs only: Subset 3 (1)
Sequences with true positive motifs and sequences with false positive motifs: Subset 3 (2)
Subset 4 contains true positive sequences aligned with sequences that do not contain any examples of the motif.
Sequences with true positive motifs only: Subset 4 (1)
Sequences with true positive motifs and sequences with no motifs: Subset 4 (2)