SIBIS

The prediction of protein-coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes, and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a large number of inconsistencies, due to both natural variants and sequence prediction errors.

SIBIS is designed to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences from the rhesus macaque genome with experimentally validated errors and showed that the sensitivity of SIBIS is significantly higher than that of previous methods, with only a small loss of specificity.
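To make the idea of estimating amino acid probabilities with a Dirichlet mixture prior more concrete, the sketch below computes the posterior mean probabilities for a single alignment column and flags residues that look unlikely given the rest of the column. It is only an illustration and not the SIBIS implementation: the two mixture components, the names `MIXTURE`, `posterior_probs` and `flag_unlikely`, and the 0.05 threshold are all assumptions made for this example, and it scores single columns rather than sequence segments.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical two-component mixture as (weight, alpha vector) pairs.
# Illustrative values only, NOT the priors used by SIBIS.
HYDROPHOBIC = set("AILMFVW")
MIXTURE = [
    (0.5, [2.0 if aa in HYDROPHOBIC else 0.1 for aa in AMINO_ACIDS]),
    (0.5, [0.1 if aa in HYDROPHOBIC else 1.0 for aa in AMINO_ACIDS]),
]


def log_marginal(counts, alpha):
    """Log P(counts | alpha), dropping the multinomial coefficient,
    which is identical for every component and cancels on normalisation."""
    n, a = sum(counts), sum(alpha)
    score = math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        score += math.lgamma(c + al) - math.lgamma(al)
    return score


def posterior_probs(column, mixture=MIXTURE):
    """Posterior mean amino acid probabilities for one alignment column.
    Gaps and unknown characters are ignored."""
    tally = Counter(ch for ch in column.upper() if ch in AMINO_ACIDS)
    counts = [tally[aa] for aa in AMINO_ACIDS]
    n = sum(counts)

    # Posterior weight of each component given the observed counts.
    logs = [math.log(q) + log_marginal(counts, alpha) for q, alpha in mixture]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted average of the per-component posterior means.
    probs = [0.0] * len(AMINO_ACIDS)
    for w, (_, alpha) in zip(weights, mixture):
        a = sum(alpha)
        for i, (c, al) in enumerate(zip(counts, alpha)):
            probs[i] += w * (c + al) / (n + a)
    return dict(zip(AMINO_ACIDS, probs))


def flag_unlikely(column, threshold=0.05):
    """Row indices whose residue has a low probability given the other rows
    (leave-one-out). The threshold is an arbitrary illustrative choice."""
    column = column.upper()
    flagged = []
    for i, residue in enumerate(column):
        if residue not in AMINO_ACIDS:
            continue  # skip gaps and unknown characters
        rest = column[:i] + column[i + 1:]
        if posterior_probs(rest)[residue] < threshold:
            flagged.append(i)
    return flagged
```

For example, `flag_unlikely("LLLLLLLLLD")` flags only the last row, since an aspartate is improbable in a column otherwise made up of leucines. A real detector would combine evidence across neighbouring columns to identify erroneous segments.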

The integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences.

Right click to download an archive of the reference set of protein sequences.

Right click to download an archive of the source code.




If you have any problems or questions, please feel free to contact us at thompson@unistra.fr