Help page

Introduction

Unrecognized frameshifts, in-frame stop codons and sequencing errors lead to Interrupted CoDing Sequences (ICDS), which can seriously affect all subsequent steps of functional characterisation, from in silico analysis to high-throughput proteomic projects. The ICDS database contains ICDS detected by a similarity-based approach. The definition of each interrupted gene is provided, as well as the genomic localisation of the ICDS with its surrounding sequence. Furthermore, to facilitate the experimental characterisation of ICDS, we propose optimised primers for re-sequencing purposes.

ICDS Prediction Principle

The principle underlying our program is the detection of adjacent genes on the same DNA strand that share common homologs. The program can scan annotated microbial genomes or raw genomic sequences. In the latter case, the genomic sequence is first analysed by a gene prediction program such as Glimmer (1). Genes are translated and the protein sequences are compared to a public protein database using blastp (2). The program proceeds as follows: i) the 10 top BLAST hits (E < 10^-3) are extracted for each gene; ii) the list of homologs of a gene is then compared to the lists obtained for adjacent genes. The comparison has been extended to the four neighbouring genes to limit the effects of small overpredicted genes; iii) pairs of proteins exhibiting at least one common homolog are retained. Such a pair can correspond to an ICDS or to paralogous adjacent genes. If a significant similarity (E < 10^-3) is detected between the components of a pair, those genes are considered paralogs and discarded from the analysis, while the absence of similarity defines an ICDS; iv) the approximate genomic localisation of the CDS rupture is calculated from the blastp HSPs. A region of 500 bp surrounding the CDS rupture is extracted from the genomic sequence and scanned to automatically design optimal sequencing primers.
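Steps ii) and iii) above can be illustrated with a minimal Python sketch. It is not the actual ICDS program: the function name, the input structures and the reading of "four neighbouring genes" as the four genes that follow on the genome are all assumptions made for illustration. The BLAST searches themselves are assumed to have been run beforehand, and their results are passed in as plain sets.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def find_icds_candidates(
    gene_order: List[Tuple[str, str]],      # (gene_id, strand), in genomic order
    homologs: Dict[str, Set[str]],          # gene_id -> ids of its top hits (E < 1e-3)
    paralog_pairs: Set[FrozenSet[str]],     # gene pairs significantly similar to each other
    max_neighbours: int = 4,                # assumption: look at the 4 following genes
) -> List[Tuple[str, str]]:
    """Return pairs of same-strand neighbouring genes that share a homolog
    but are not similar to each other (hypothetical helper)."""
    candidates = []
    for i, (gene_a, strand_a) in enumerate(gene_order):
        last = min(i + 1 + max_neighbours, len(gene_order))
        for j in range(i + 1, last):
            gene_b, strand_b = gene_order[j]
            if strand_a != strand_b:
                continue  # genes must lie on the same DNA strand
            shared = homologs.get(gene_a, set()) & homologs.get(gene_b, set())
            if not shared:
                continue  # no common homolog: not a candidate
            if frozenset((gene_a, gene_b)) in paralog_pairs:
                continue  # similar to each other: adjacent paralogs, discarded
            candidates.append((gene_a, gene_b))
    return candidates
```

For example, two adjacent genes on the same strand that both hit the same database protein, but show no similarity to each other, are reported as a candidate pair, whereas two adjacent genes listed in `paralog_pairs` are skipped.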

Automatic design of primers

The sequencing primers have been designed using an optimized version of the CADO4MI program (Computer Assisted Design of Oligonucleotide for MIcroarray) (Muller, manuscript in preparation). The query sequence is scanned using a sliding window analysis with the window length set to the primer size (e.g. 21 nt) and a step size of 10. The melting temperature (Tm) is calculated using the Wallace rule (3). Only 21-mers with Tm = 63 ± 5 °C are considered. The 21-mers are compared to the complete reference genome using the BLASTN program to assess specificity and to avoid hybridization with another part of the genome. Sequence selection is done automatically by selecting the primer pairs that have high specificity and the shortest amplification area. Primers are searched outside the 50 bp surrounding the ICDS and within a maximum length of 500 bp.
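The sliding-window scan and Tm filter can be sketched as follows. This is an illustration only, not the CADO4MI code: the function names are hypothetical, the step is assumed to be counted in nucleotides, and the Wallace rule is taken in its commonly cited form, Tm = 2 x (A + T) + 4 x (G + C) °C, which this page does not spell out. The BLASTN specificity check and the final pairing of primers are left out.

```python
from typing import List, Tuple


def wallace_tm(primer: str) -> int:
    """Melting temperature in degrees C by the commonly cited Wallace rule:
    2 degrees per A/T base plus 4 degrees per G/C base."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def candidate_primers(
    sequence: str,
    size: int = 21,        # primer length used as window length
    step: int = 10,        # assumption: step counted in nucleotides
    target_tm: int = 63,
    tolerance: int = 5,
) -> List[Tuple[int, str, int]]:
    """Slide a window over the sequence and keep the windows whose Tm
    falls within target_tm +/- tolerance (hypothetical helper)."""
    hits = []
    for start in range(0, len(sequence) - size + 1, step):
        window = sequence[start:start + size]
        tm = wallace_tm(window)
        if abs(tm - target_tm) <= tolerance:
            hits.append((start, window, tm))  # (offset, primer, Tm)
    return hits
```

Under this rule a 21-mer with 10 G/C and 11 A/T bases has a Tm of 62 °C and passes the filter, while a 21-mer made only of A has a Tm of 42 °C and is rejected.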

How to use this database

The first page allows the user to search for ICDS by species browsing or by similarity search using the BLASTN or TBLASTN program. The species browsing approach displays all the ICDS detected for a given genome. For each ICDS, we indicate the genomic localisation, the optimised primers for re-sequencing purposes and the predicted function of the protein. Detailed information is available for each ICDS: the length of the ICDS region, the orientation of the genes, and the length and Tm of the predicted primers. Moreover, we provide the surrounding sequence with the predicted frameshift region (indicated in blue) and the primer locations (indicated in red).

References

1. Delcher, A.L., Harmon, D., Kasif, S., White, O. and Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res, 27, 4636-4641.
2. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-3402.
3. Wallace, R.B., Shaffer, J., Murphy, R.F., Bonner, J., Hirose, T. and Itakura, K. (1979) Hybridization of synthetic oligodeoxyribonucleotides to phi chi 174 DNA: the effect of single base pair mismatch. Nucleic Acids Res, 6, 3543-3557.