AnnotSV requires several data sources to annotate SV. To provide a ready-to-use installation of AnnotSV, each annotation source listed below that does not require a commercial license is automatically downloaded during installation. The aim of each source is explained below. How to update each source is explained in the README file.

Annotation can be performed using either the GRCh37 or GRCh38 build of the human genome (user defined), but depending on the availability of some data sources there might be some limitations. Some of the annotations are linked to the gene name and thus provided independently of the genome build.

For the versions of the annotation sources, please refer to the README file.

Four different types of annotations are provided, which are summarized here:

1. Genes-based annotations:
Each gene overlapped by the SV to annotate is reported (even with 1bp overlap)
- Gene annotations
- DDD gene annotations
- OMIM annotations
- ACMG annotations
- Gene intolerance annotations (ExAC)
- Haploinsufficiency annotations (DDD)
- Haploinsufficiency and Triplosensitivity annotations (ClinGen)
- Phenotype-driven annotations (Exomiser)
- External Gene annotation files (Optional)
2. Breakpoints annotations:
Specific to each annotation
- GC content annotations
- Repeated sequences annotations
3. Annotations with features overlapping the SV:
Overlap (%) = (length of overlap between the SV to annotate and the feature) * 100 / (SV to annotate length)
- DGV Gold Standard frequency annotations
- gnomAD frequency annotations
- I.M. Hall's lab frequency annotations
- DDD frequency annotations
- 1000 genomes frequency annotations
- External BED annotation files (Optional)
4. Annotations with features overlapped with the SV:
Overlap (%) = (length of overlap between the SV to annotate and the feature) * 100 / (feature length)
- Promoters annotations
- dbVar pathogenic NR SV annotations
- GeneHancer annotations
- TAD boundaries annotations
- COSMIC annotations
- Homozygous and heterozygous SNV/indel annotations
- Compound heterozygosity annotations
- External BED annotation files (Optional)
Aim:
The “Gene annotation” provides information on the known genes overlapped by the SV, listing genes from the well-annotated RefSeq or ENSEMBL databases. These annotations include the definition of the genes and the corresponding NM transcripts from NCBI (default; ENST transcripts from ENSEMBL can be selected by the user), the length of the CoDing Sequence (CDS) and of the transcript, the location of the SV in the gene (e.g. « txStart-exon3 ») and the coordinates of the intersection between the SV and the transcript.

Method:
For each gene, only a single transcript among all those available for this gene is reported, in the following order of preference:
- The transcript selected by the user with the "-txFile" option is reported
- Otherwise, the transcript with the longest CDS is reported (considering the region overlapping the SV)
- If there is no difference in CDS length, the longest transcript is reported.
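
The order of preference above can be sketched in Python as follows. This is a minimal illustration, not AnnotSV's actual code: the `Transcript` record, its field names and the `select_transcript` helper are hypothetical, and the CDS and transcript lengths are assumed to be already restricted to the region overlapping the SV.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Transcript:
    # Hypothetical record; field names are illustrative only.
    name: str
    cds_length: int  # CDS length (bp) overlapping the SV
    tx_length: int   # transcript length (bp) overlapping the SV


def select_transcript(transcripts: Iterable[Transcript],
                      preferred: Optional[Set[str]] = None) -> Transcript:
    """Pick one transcript for a gene, following the documented order of preference."""
    candidates = list(transcripts)
    if not candidates:
        raise ValueError("no transcript available for this gene")
    # 1. A transcript selected by the user (as with "-txFile") wins.
    for tx in candidates:
        if preferred and tx.name in preferred:
            return tx
    # 2. Longest CDS; 3. ties broken by the longest transcript.
    return max(candidates, key=lambda tx: (tx.cds_length, tx.tx_length))
```

Sorting on the pair (CDS length, transcript length) expresses the last two rules in a single comparison.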

Annotation columns:
Gene name: Gene symbol
tx: Transcript symbol
CDS length: Length of the CoDing Sequence (CDS) (bp) overlapping with the SV
tx length: Length of transcript (bp) overlapping with the SV
location: SV location in the gene (e.g. « txStart-exon1 », « intron3-exon7 »)
location2: SV location in the gene’s coding regions (e.g. « 3’UTR-CDS »)
intersectStart: Start position of the intersection between the SV and the transcript
intersectEnd: End position of the intersection between the SV and the transcript
Aim:
The Deciphering Developmental Disorders (DDD) Study has recruited nearly 14,000 children with severe undiagnosed developmental disorders, and their parents, from around the UK and Ireland. The patients have been deeply phenotyped by their referring clinicians via DECIPHER using the Human Phenotype Ontology. The DNA from these children has been explored using high-resolution exon-arrayCGH and exome sequencing (trio) to investigate the genetic causes of their abnormal development. These annotations give additional information on each gene overlapped by an SV (independently of the genome build version).

Annotation columns:
DDD_status: Deciphering Developmental Disorders (DDD) category, e.g. confirmed, probable, possible, …
DDD_mode: Deciphering Developmental Disorders (DDD) allelic requirement, e.g. biallelic, hemizygous, …
DDD_consequence: Deciphering Developmental Disorders (DDD) mutation consequence, e.g. "loss of function", uncertain, …
DDD_disease: Deciphering Developmental Disorders (DDD) disease name, e.g. "OCULOAURICULAR SYNDROME"
DDD_pmids: Deciphering Developmental Disorders (DDD) pmids
Aim:
OMIM (Online Mendelian Inheritance in Man) focuses on the relationship between phenotype and genotype. These annotations give additional information on each gene overlapped by an SV (independently of the genome build version). Moreover, a morbid genes list is provided.

Annotation columns:
Mim Number: OMIM unique six-digit identifier
Phenotypes: e.g. Charcot-Marie-Tooth disease
Inheritance: e.g. AD (= "Autosomal dominant")
morbidGenes
morbidGenesCandidates
Aim:
The American College of Medical Genetics and Genomics has published recommendations for reporting incidental or secondary findings in genes with a medical benefit. The most recent version of the recommendations is the ACMG SF v2.0 including 59 genes.

Annotation columns:
ACMG: ACMG gene
Aim:
Gene intolerance annotations from ExAC give the significance of the deviation between the observed and the expected number of variants for each gene:

Annotation columns:
synZ_ExAC: A positive synZ_ExAC (Z score) indicates gene intolerance to synonymous variation
misZ_ExAC: A positive misZ_ExAC (Z score) indicates gene intolerance to missense variation
pLI_ExAC: Score computed in the ExAC database indicating the probability that a gene is intolerant to a loss-of-function variation (nonsense, splice acceptor and donor variants caused by SNV). ExAC considers genes with pLI >= 0.9 as an extremely LoF-intolerant set of genes
delZ_ExAC: A positive delZ_ExAC (Z score) indicates greater deletion intolerance
dupZ_ExAC: A positive dupZ_ExAC (Z score) indicates greater duplication intolerance
cnvZ_ExAC: A positive cnvZ_ExAC (Z score) indicates greater CNV intolerance
Aim:
Haploinsufficiency, wherein a single functional copy of a gene is insufficient to maintain normal function, is a major cause of dominant disease. As detailed in DECIPHER, over 17,000 protein coding genes have been scored according to their predicted probability of exhibiting haploinsufficiency:
- High ranks (e.g. 0-10%) indicate a gene is more likely to exhibit haploinsufficiency
- Low ranks (e.g. 90-100%) indicate a gene is more likely to NOT exhibit haploinsufficiency.

Annotation columns:
HI_DDDpercent: Haploinsufficiency ranks
Aim:
The ClinGen Consortium Rating System is curating genes and regions of the genome to assess whether there is evidence to support that these genes/regions are dosage sensitive.
Haploinsufficiency and triplosensitivity scores range as follows:

Annotation columns:
HI_CGscore: HaploInsufficiency Score
TriS_CGscore: TriploSensitivity Score
Aim:
To score genes overlapped with an SV on biological relevance to the individual phenotype, AnnotSV makes use of Exomiser.
For a given phenotype, an HPO-based score corresponding to a damaging probability is provided for each gene overlapped with an SV so that:
- Genes previously associated with disease can be highlighted easily
- Genes not previously associated with disease can be highlighted
- Genes associated with diseases that have little or no similarity to the observed phenotypes can be removed

HPO:
AnnotSV uses the Human Phenotype Ontology (version reported in the AnnotSV output).
Find out more at http://www.human-phenotype-ontology.org

Please cite the 3 following articles if you use these data in your work:
- AnnotSV: An integrated tool for Structural Variations annotation.
Geoffroy V, Herenger Y, Kress A, Stoetzel C, Piton A, Dollfus H, Muller J. Bioinformatics (2018) doi:10.1093/bioinformatics/bty304
- Next-generation diagnostics and disease-gene discovery with the Exomiser.
Smedley, D., Jacobsen, J.O.B., Jager, M., Köhler, S., Holtgrewe, M., Schubach, M., Siragusa, E., Zemojtel, T., Buske, O.J., Washington, N.L., et al. Nat Protoc (2015) doi:10.1038/nprot.2015.124
- Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources.
Sebastian Köhler, Leigh Carmody, Nicole Vasilevsky, Julius O B Jacobsen, et al. Nucleic Acids Research (2018) doi:10.1093/nar/gky1105

Annotation columns:
EXOMISER_GENE_PHENO_SCORE: Exomiser score for how close each overlapped gene is to the phenotype
HUMAN_PHENO_EVIDENCE: Phenotypic evidence from Human model
MOUSE_PHENO_EVIDENCE: Phenotypic evidence from Mouse model
FISH_PHENO_EVIDENCE: Phenotypic evidence from Fish model
Usage:
The user enters a human phenotype as a list of HPO terms. The HPO terms should be as specific as possible.
In our own (limited) experience, a gene with an EXOMISER_GENE_PHENO_SCORE >= 0.7 can be considered to be associated with the disease. For a gene that has not been previously associated with a disease, the threshold can be lowered to 0.5.
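
The two thresholds above can be applied as in the following sketch. The function name and the `known_disease_gene` flag are hypothetical. How a gene is classified as previously associated with a disease is left to the user.

```python
def is_phenotype_candidate(score: float, known_disease_gene: bool) -> bool:
    """Apply the empirical EXOMISER_GENE_PHENO_SCORE thresholds described above.

    0.7 for genes already associated with a disease, 0.5 otherwise.
    """
    threshold = 0.7 if known_disease_gene else 0.5
    return score >= threshold
```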

Exomiser information:
Exomiser is a tool to annotate and prioritize exome variants:
- The Exomiser web site is available here
- The Exomiser development pages are hosted on GitHub
- How to cite Exomiser?
In order to further enrich the annotation for each SV gene, AnnotSV can integrate external annotations imported from tab separated values file(s) into the output file. The first line should be a header including a column entitled "genes".

The following example has been set to provide annotation for the interacting partners of a gene.
genes Interacting genes
BBS1 BBS7, TTC8, BBS5, BBS4, BBS9, ARL6, BBS2, RAB3IP, BBS12, BBS10
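
A file like the one above could be read with Python's standard `csv` module, as in this minimal sketch. It only illustrates the expected layout (tab-separated, header with a "genes" column). The `load_gene_annotations` helper is hypothetical and is not how AnnotSV itself loads the file.

```python
import csv
import io


def load_gene_annotations(handle):
    """Read a tab-separated file whose header contains a "genes" column."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "genes" not in reader.fieldnames:
        raise ValueError('header must include a column entitled "genes"')
    # Every other column is treated as an annotation column.
    extra_columns = [name for name in reader.fieldnames if name != "genes"]
    table = {row["genes"]: {col: row[col] for col in extra_columns} for row in reader}
    return extra_columns, table


# The example from the text, written with explicit tab separators.
example = io.StringIO(
    "genes\tInteracting genes\n"
    "BBS1\tBBS7, TTC8, BBS5, BBS4, BBS9, ARL6, BBS2, RAB3IP, BBS12, BBS10\n"
)
columns, annotations = load_gene_annotations(example)
```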
"Interacting genes" annotation columns are then available in the output file.
Aim:
The Database of Genomic Variants (DGV) provides SV defined as DNA elements with a size >50 bp. The content of DGV represents only SV identified in healthy control samples from large cohorts, published and integrated by the DGV team. These annotations indicate whether your SV is a rare or a common variant.

Annotation columns:
DGV_GAIN_IDs: DGV Gold Standard GAIN IDs overlapped with the annotated SV
DGV_GAIN_n_samples_with_SV: Number of individuals with a shared DGV_GAIN_ID
DGV_GAIN_n_samples_tested: Number of individuals tested
DGV_GAIN_Frequency: Relative GAIN Frequency: (DGV_GAIN_n_samples_with_SV / DGV_GAIN_n_samples_tested)
DGV_LOSS_IDs: DGV Gold Standard LOSS IDs overlapped with the annotated SV
DGV_LOSS_n_samples_with_SV: Number of individuals with a shared DGV_LOSS_ID
DGV_LOSS_n_samples_tested: Number of individuals tested
DGV_LOSS_Frequency: Relative LOSS Frequency: (DGV_LOSS_n_samples_with_SV / DGV_LOSS_n_samples_tested)
Method:
- First, AnnotSV searches for DGV Gold Standard variants overlapping the SV to annotate.
- Second, only the DGV variants overlapping at least 70% of your SV in size/location are selected (default value, a different percentage or a reciprocal overlap can also be user defined).
- Third, the DGV IDs are reported and all DGV sample information is processed and merged. The counts of unique samples with gains and losses, the number of non-redundant samples tested in the related studies and the subsequent relative frequencies are calculated and reported (genotype data are not considered).
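
The overlap filter and the relative frequency described in these steps can be sketched as follows. This is an illustration under stated assumptions, not AnnotSV's implementation: the helper names are hypothetical, intervals are `(start, end)` tuples with an exclusive end, and both the SV and the feature are assumed to have a non-zero length.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Length (bp) shared by two intervals, 0 if they are disjoint."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def passes_overlap(sv, feature, min_percent=70.0, reciprocal=False):
    """True if the feature covers at least min_percent of the SV.

    With reciprocal=True, the SV must also cover min_percent of the feature.
    """
    shared = overlap_length(sv[0], sv[1], feature[0], feature[1])
    sv_percent = shared * 100 / (sv[1] - sv[0])
    if not reciprocal:
        return sv_percent >= min_percent
    feature_percent = shared * 100 / (feature[1] - feature[0])
    return sv_percent >= min_percent and feature_percent >= min_percent


def relative_frequency(n_samples_with_sv, n_samples_tested):
    """n_samples_with_SV / n_samples_tested, or 0.0 when nothing was tested."""
    return n_samples_with_sv / n_samples_tested if n_samples_tested else 0.0
```

For instance, an SV at (100, 200) and a DGV variant at (120, 400) share 80 bp, that is 80% of the SV but under 29% of the DGV variant, so the pair passes the default filter and fails the reciprocal one.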



Warning:
- Exceptional overestimation of the relative frequencies:
In DGV Gold Standard (March 2016), ~10% of the supporting variants are released without sample information, which prevents AnnotSV from determining whether some variants are redundant or not. Consequently, some relative frequencies can be exceptionally overestimated by AnnotSV.
- Gain/Loss:
An SV call in DGV can be relative to a specific reference sample, a pool of reference samples or the reference assembly. Since different reference samples may have been used in different studies, what is called a gain in one study may actually be called a loss in another.
Aim:
A reference atlas of SV from deep WGS of 14,891 individuals across diverse global populations has been constructed as a component of gnomAD. These data provide an initial step to help SV analysis and interpretation in the era of WGS (Collins et al., 2020).

Annotation columns:
GD_ID: gnomAD IDs overlapping the annotated SV with the same SV type
GD_AN: gnomAD total number of alleles genotyped (for biallelic sites) or individuals with copy-state estimates (for multiallelic sites)
GD_N_HET: gnomAD number of individuals with heterozygous genotypes
GD_N_HOMALT: gnomAD number of individuals with homozygous alternate genotypes
GD_AF: Maximum of the gnomAD allele frequency (for biallelic sites) and copy-state frequency (for multiallelic sites)
GD_POPMAX: Maximum of the gnomAD maximum allele frequency across any population
GD_ID_others: Other gnomAD IDs overlapping the annotated SV (with a different SV type)
Aim:
Ira M. Hall’s lab characterized SV in 17,795 deeply sequenced human genomes from common disease trait mapping studies (Abel et al., 2020). They publicly released SV frequency annotations to guide SV analysis and interpretation in the era of WGS.

Annotation columns:
IMH_ID: Ira M. Hall’s lab IDs overlapping the annotated SV
IMH_AF: IMH Allele Frequency
IMH_ID_others: Other IMH IDs overlapping the annotated SV (with a different SV type)
Aim:
AnnotSV takes advantage of the DDD study (national blood service controls + generation Scotland controls), representing the 845 samples currently available (an update is planned in the near future).

Annotation columns:
DDD_SV: Deciphering Developmental Disorders (DDD) SV coordinates from the DDD study (data control sets) overlapped with the annotated SV
DDD_DUP_n_samples_with_SV: Number of individuals with a shared DDD_DUP
DDD_DUP_Frequency: DUP Frequency
DDD_DEL_n_samples_with_SV: Number of individuals with a shared DDD_DEL
DDD_DEL_Frequency: DEL Frequency
Aim:
The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied. Analyses covered both short variations (up to 50 base pairs in length) and CNV. These annotations give the allele frequencies of the CNV from the 1000 genomes database that overlap with the SV to annotate.

Annotation columns:
1000g_event: 1000 genomes events (e.g. DEL, DUP, ALU...) overlapped with the annotated SV
1000g_AF: 1000 genomes allele frequency
1000g_max_AF: Maximum observed allele frequency across the 1000 genomes populations
Aim:
Some users might want to add their own private annotations to those already provided by AnnotSV.
AnnotSV can integrate external annotations for specific regions that will be imported from a BED file into the output file.

Input user BED file:
The following example has been set to provide the SV overlap with Regions of Homozygosity (RoH) of 2 individuals (sample1 and sample2):

"example.bed"
#Chrom Start End RoH
12806107107058351sample1, sample2
122568753625699754sample2
"RoH" annotation column is then available in the output file.
Aim:
The contribution of SV overlapping promoters to disease etiology, through their effect on gene expression, is well established, although understanding the consequences of these regulatory variants on the human transcriptome remains a major challenge. AnnotSV reports the list of the genes whose promoters are overlapped by the SV.

Annotation columns:
promoters: List of the genes whose promoters are overlapped by the SV
Method:
Promoters are defined by default as the 500 bp upstream of the transcription start sites (using the RefGene data). The user can define a different length.
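
As an illustration of this definition, the sketch below derives a promoter interval from a transcript and tests it against an SV. The helper names are hypothetical, and two points are assumptions not stated in the text: "upstream" is interpreted relative to the transcript strand, and coordinates are 0-based with an exclusive end.

```python
def promoter_interval(tx_start, tx_end, strand, size=500):
    """Return (start, end) of the `size` bp upstream of the transcription start site."""
    if strand == "+":
        # TSS is tx_start; upstream lies at lower coordinates.
        return max(0, tx_start - size), tx_start
    if strand == "-":
        # TSS is tx_end; upstream lies at higher coordinates.
        return tx_end, tx_end + size
    raise ValueError("strand must be '+' or '-'")


def overlaps(sv_start, sv_end, start, end):
    """True if the SV and the interval share at least 1 bp."""
    return sv_start < end and start < sv_end
```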
Aim:
dbVar is the NCBI database of genomic structural variation, collecting insertions, deletions, duplications, mobile element insertions and translocations from large initiatives, including medically relevant variations. The non-redundant version of the database, the dbVar non-redundant SV (NR SV) datasets, includes more than 2.2 million deletions, 1.1 million insertions, and 300,000 duplications. These data are aggregated from over 150 studies including 1000 Genomes Phase 3, Simons Genome Diversity Project, ClinGen, ExAC, and others.
By selecting pathogenic SV records from the dbVar NR SV database, AnnotSV obtained a clinically-relevant human SV dataset.

Annotation columns:
dbVar_event: dbVar NR SV event types (e.g. deletion, duplication…)
dbVar_variant: dbVar NR SV accession (e.g. nssv1415016)
dbVar_status: dbVar NR SV clinical assertion (e.g. pathogenic, likely pathogenic)
Aim:
Enhancer and promoter genomic aberrations have been reported to underlie genetic diseases. A current challenge in medically oriented next generation sequencing is the ability to assess regulatory elements affected by SV. To this end, AnnotSV includes GeneHancer (Fishilevich et al., 2017), an integrated compendium of human promoters, enhancers and their inferred target genes.

WARNING:
GeneHancer data, as part of the GeneCards Suite, cannot be redistributed. Thus, GeneHancer annotation cannot be supplied as part of the AnnotSV sources. Users need to request the up-to-date GeneHancer data dedicated to AnnotSV (“GeneHancer_version_for_annotsv.zip”) by contacting the GeneCards team:
Academic users: genecards@weizmann.ac.il
Commercial users: support@lifemapsc.com


Annotation columns:
GHid_elite: List of the GeneHancer (GH) IDs for each “elite” element overlapped with the annotated SV
GHid_not_elite: List of the GeneHancer (GH) IDs for each “not elite” element overlapped with the annotated SV
GHtype: Type of the overlapped GH element(s) (Enhancer or Promoter)
GHgene_elite: List of the genes for which an “elite” element-gene relation was identified
GHgene_not_elite: List of the genes for which a “not elite” element-gene relation was identified
GHtissue: List of the tissues in which elements were identified
Aim:
The spatial organization of the human genome helps to accommodate the DNA in the nucleus of a cell and plays an important role in the control of the gene expression. In this nonrandom organization, topologically associating domains (TAD) emerge as a fundamental structural unit able to separate domains and define boundaries. Disruption of these structures especially by SV can result in gene misexpression (Lupianez, et al., 2016).
Annotation columns:
TADcoordinates: Coordinates of the TAD whose boundaries overlap with the annotated SV (boundaries included in the coordinates)
ENCODEexperiments: ENCODE experiments from which the TAD have been defined
Aim:
COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world's largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer.

WARNING:
COSMIC data cannot be redistributed. Thus, COSMIC annotation cannot be supplied as part of the AnnotSV sources. Users are required to register in order to download COSMIC data files, but only non-academic organisations need to pay a license fee. More information can be found on their licensing page.

Annotation columns:
COSMIC_ID: COSMIC identifier
COSMIC_MUT_TYP: Defined as Gain or Loss
AnnotSV can take VCF file(s) with SNV/indel as input on the command line.

These annotations report the counts of homozygous and heterozygous SNV/indel identified from the patients' NGS data (user-defined samples) and present in the interval of the SV to annotate.

Annotation columns:
#hom(sample): Number of homozygous variants in the individual “sample” which are present:
- in the SV for the “full” annotation
- between intersectStart and intersectEnd for the “split” annotation.
Values are extracted from the input VCF file(s)
#htz(sample): Number of heterozygous variants in the individual “sample” which are present:
- in the SV for the “full” annotation
- between intersectStart and intersectEnd for the “split” annotation.
Values are extracted from the input VCF file(s)
#htz/allHom(sample): Ratio for QC filtering: #htz(sample)/#allHom(sample)
#htz/total(cohort): Ratio for QC filtering: #htz(sample)/#total(cohort)
#total(cohort): Total count of SNV/indel called from the sample and present in the interval of the deletion
Aim:
These annotations can be used to filter out false positive SV calls or to confirm events, as follows:

- A homozygous deletion can be identified as a false positive by the presence of SNV/indel called at the predicted locus of the deletion in a sample.
- A heterozygous deletion can be identified as a false positive by the presence of heterozygous SNV/indel called at the predicted locus of the deletion in a sample. If no heterozygous SNV/indel is present, the heterozygous deletion can be confirmed by the presence of homozygous SNV/indel at that locus in the sample.
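
These two rules can be written as a small decision function. This is a sketch only: the function name, the "hom"/"htz" labels and the "undetermined" outcome (used for cases the rules above do not cover) are assumptions, and the counts are taken to be the #hom(sample) and #htz(sample) values for the deletion interval.

```python
def assess_deletion(zygosity, n_hom, n_htz):
    """Return 'false positive', 'confirmed' or 'undetermined' for a deletion call.

    zygosity: 'hom' or 'htz' (predicted zygosity of the deletion)
    n_hom, n_htz: homozygous / heterozygous SNV/indel counts at the locus
    """
    if zygosity == "hom":
        # Any SNV/indel called inside a homozygous deletion contradicts it.
        return "false positive" if (n_hom + n_htz) > 0 else "undetermined"
    if zygosity == "htz":
        # Heterozygous calls require two copies of the locus.
        if n_htz > 0:
            return "false positive"
        return "confirmed" if n_hom > 0 else "undetermined"
    raise ValueError("zygosity must be 'hom' or 'htz'")
```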
AnnotSV can take VCF file(s) with SNV/indel as input on the command line, already filtered for genotype, frequency and effect at the protein level. AnnotSV can report the heterozygous SNV/indel called (by any sequencing experiment) in the gene overlapped by the SV to annotate, as well as in ‘healthy’ and ‘affected’ samples (user-defined samples). AnnotSV offers an efficient way to highlight compound heterozygotes with one SNV/indel and one SV in the same gene. Indeed, in recessive genetic disorders, both copies of the gene are malfunctioning: the maternally as well as the paternally inherited copy of an autosomal gene harbors a pathogenic mutation. In addition, if the parents are non-consanguineous, compound heterozygosity is the best explanation for a recessive disease.

User challenge:
The challenge for the user in filtering variants for compound heterozygotes is to know whether the two heterozygous variants (the SNV/indel and the SV) are in cis or in trans. When sequencing data from more than one family member are available, one can exclude certain variants based on the rules of Mendelian inheritance (transmitted in a compound heterozygous mode from parents to the patient(s)).

compound-htz(sample): List of heterozygous SNV/indel (reported as “chrom_position”) present in the gene overlapped by the annotated SV
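
A minimal sketch of how such a list could be built is shown below. The helper and its inputs are hypothetical: the gene is given as a `(chrom, start, end)` tuple with inclusive coordinates, and the heterozygous SNV/indel of one sample as `(chrom, position)` pairs already extracted from the filtered VCF.

```python
def compound_htz_candidates(gene, htz_variants):
    """List heterozygous SNV/indel located in a gene overlapped by the SV.

    Each hit is reported as "chrom_position".
    """
    chrom, start, end = gene
    return [
        f"{var_chrom}_{position}"
        for var_chrom, position in htz_variants
        if var_chrom == chrom and start <= position <= end
    ]
```

Whether a reported SNV/indel is in trans with the SV still has to be checked by the user, for example with the parents' data.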
Aim:
GC content (as well as repeated sequences, DNA sequence identity and concentration of the PRDM9 homologous recombination hotspot motif 5′-CCNCCNTNNCCNC-3′) is positively correlated with the frequency of nonallelic homologous recombination (NAHR). Indeed, NAHR hot spots have a significantly higher GC content (Dittwald et al, 2013). Together with other information, this could help identify a novel locus for recurrent NAHR-mediated SV.

Method:
The GC content is calculated around each SV breakpoint (+/- 100bp) then reported.
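
The calculation can be illustrated as follows. This is a sketch with hypothetical helper names. It assumes the chromosome sequence is available as a Python string, that the breakpoint is a 0-based index into it, and that the result is expressed as a fraction between 0 and 1 (the text does not say whether AnnotSV reports a fraction or a percentage).

```python
def gc_content(sequence):
    """Fraction of G and C bases in a sequence (case-insensitive); 0.0 if empty."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def breakpoint_window(chrom_sequence, breakpoint, flank=100):
    """Slice `flank` bp on each side of a breakpoint, clipped to the sequence ends."""
    return chrom_sequence[max(0, breakpoint - flank):breakpoint + flank]
```

The left and right values are then obtained by calling `gc_content(breakpoint_window(...))` once per breakpoint.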

Annotation columns:
GCcontent_left: GC content around the left SV breakpoint (+/- 100bp)
GCcontent_right: GC content around the right SV breakpoint (+/- 100bp)
Aim:
Repeated sequences (as well as GC content, DNA sequence identity and presence of the PRDM9 homologous recombination hotspot motif 5′-CCNCCNTNNCCNC-3′) play a major role in the formation of structural variants.

Method:
The overlapping repeats are identified at the SV breakpoint (+/- 100bp) and reported (coordinates and type).

Annotation columns:
Repeats_coord_left: Repeats coordinates around the left SV breakpoint (+/- 100bp)
Repeats_type_left: Repeats type around the left SV breakpoint (+/- 100bp), e.g. AluSp, L2b, L1PA2, LTR12C, SVA_D, …
Repeats_coord_right: Repeats coordinates around the right SV breakpoint (+/- 100bp)
Repeats_type_right: Repeats type around the right SV breakpoint (+/- 100bp), e.g. AluSp, L2b, L1PA2, LTR12C, SVA_D, …