Q: What are Structural Variations (SV)?

SV are generally defined as variations in a DNA region that range in length from ~50 base pairs to many megabases. They include several classes, such as translocations, inversions, insertions and deletions.

Q: What are Copy Number Variations (CNV)?

CNV are deletions and duplications in the genome (unbalanced SV) that vary in length from ~50 base pairs to many megabases.

Q: What are the differences between SV and CNV?

CNV are unbalanced SV, with a gain or loss of genomic material. For example, a heterozygous duplication reported as a CNV is characterized by its start and end coordinates and by a copy number of 3.

Q: Can AnnotSV annotate every format of SV?

AnnotSV accepts either VCF or BED format as input.
- VCF format supports complex rearrangements with breakends, which can be arbitrarily summarized as a set of novel adjacencies, as described in the Variant Call Format Specification VCFv4.3 (Jul 2017).
- BED format doesn’t allow inter-chromosomal feature definitions (e.g. inter-chromosomal translocation). A newer file format (BEDPE) has been proposed to concisely describe disjoint genome features, but it is not supported by AnnotSV.

Q: I would like to annotate my SV with new annotation sources but I don’t know how to do that…

No problem. AnnotSV is under active and continuous development. You can email me with a detailed request and I will answer as quickly as possible.

Q: I have just updated AnnotSV or the annotation sources and the annotation process takes longer than usual. Is this normal?

After an update of AnnotSV or its annotation sources, some files are reprocessed, which takes additional time. Subsequent runs of AnnotSV will be quicker!

Q: How to cite AnnotSV in my work?

If you are using AnnotSV, please cite our work using the following reference:

AnnotSV: An integrated tool for Structural Variations annotation.
Geoffroy V, Herenger Y, Kress A, Stoetzel C, Piton A, Dollfus H, Muller J.
Bioinformatics. 2018 Apr 14. doi: 10.1093/bioinformatics/bty304

Q: What are the WARNINGs that AnnotSV mentions while running?

AnnotSV writes the progress of the analysis to the standard output, including warnings about issues or missing information. These warnings can be either blocking or simply informative.

Q: Why are some values empty or set to -1 in the output files?

When no information is available for a specific type of annotation, then the value is empty. Regarding the frequencies, the default is set to -1.
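When post-processing the output files, both conventions can be handled in one place. The following is a minimal sketch, assuming the frequency field has been read as text; the function name `parse_frequency` is hypothetical and not part of AnnotSV.

```python
from typing import Optional


def parse_frequency(raw: str) -> Optional[float]:
    """Convert a frequency field to a float, or None when no data is available.

    Assumption: an empty field and the default value -1 both mean
    "no information available".
    """
    text = raw.strip()
    if text == "":
        return None
    value = float(text)
    if value == -1:
        return None
    return value
```

Treating -1 as missing data avoids mistaking it for a real (and impossible) frequency when filtering or averaging.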

Q: Why do some SV have an empty gene annotation in the output file?

If an SV is located in an intergenic region and so doesn’t cover any gene, then the SV is reported in the output file but without gene annotation.

Q: Why can we have several gene annotations for one SV?

In some cases, one SV overlaps a large portion of the genome including several genes. In these cases, the annotation of the SV is split over several lines.

Annotation example for the deletion 1:16892807-17087595
AnnotSV keeps all gene annotations, with only one transcript annotation for each gene:
SV chrom  SV start  SV end    SV type  Gene name     NM            CDS length  tx length  location
1         16892807  17087595  DEL      CROCCP2       NR_026752     1           12652      txStart-txEnd
1         16892807  17087595  DEL      ESPNP         NR_026567     1           28941      txStart-txEnd
1         16892807  17087595  DEL      FAM231A       NM_001282321  511         511        txStart-txEnd
1         16892807  17087595  DEL      FAM231C       NM_001310138  511         656        txStart-txEnd
1         16892807  17087595  DEL      LOC102724562  NR_135824     1           2998       txStart-txEnd
1         16892807  17087595  DEL      MIR3675       NR_037446     1           75         txStart-txEnd
1         16892807  17087595  DEL      MST1L         NM_001271733  2015        6468       txStart-exon14
1         16892807  17087595  DEL      MST1P2        NR_027504     1           4848       txStart-txEnd
1         16892807  17087595  DEL      NBPF1         NM_017940     2912        47294      intron3-txEnd

Q: I am confused by the difference between the 'full' and the 'split' AnnotSV type mode. CNVs have been split into several lines, but each line gets a different DB annotation (DGV, 1000g…). I thought that the same region should have the same annotations (excluding gene/transcript)?

AnnotSV builds 2 types of annotations, one based on the full-length SV (corresponding to the AnnotSV type = "full") and one based on each gene within the SV (corresponding to the AnnotSV type = "split"). Thus you will have access to:
Be careful: the first 3 columns (SV chrom, SV start and SV end) remain the same whether the line is of the "full" or the "split" type.

Regarding these "split" lines:

Q: Why does AnnotSV only report overlapping SV (from gnomAD, IMH…) with the same type?

Because reporting ever more columns is problematic, we decided to report in detail only the overlapping SV of the same type as the SV being annotated (e.g. a duplication with a duplication, a deletion with a deletion…). However, to keep the user aware of rearrangements of a different type overlapping the SV to annotate, the IDs of such events are reported in dedicated annotation columns (e.g. GD_ID_others, IMH_ID_others…).

Q: What do OMIM Inheritance annotations mean?

AD = "Autosomal dominant"
AR = "Autosomal recessive"
XLD = "X-linked dominant"
XLR = "X-linked recessive"
YLD = "Y-linked dominant"
YLR = "Y-linked recessive"
XL = "X-linked"
YL = "Y-linked"

Q: Why do I get this error message: “Feature (10:134136286-134136486) beyond the length of 10 size (133797422 bp). Skipping.”

One possibility is that you are using the wrong value for the “-genomeBuild” option. For example, you are providing a BED input file with SV coordinates on GRCh37 while running with the “-genomeBuild GRCh38” option.
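A quick way to confirm a build mismatch is to compare the feature end with the chromosome length in each build. The sketch below is illustrative only: the GRCh38 length of chromosome 10 is the one quoted in the error message, the GRCh37 length is added for comparison, and the function name is hypothetical.

```python
from typing import List

# Chromosome 10 lengths in bp. The GRCh38 value comes from the error message;
# the GRCh37 value is added here for comparison.
CHR10_LENGTH = {"GRCh37": 135_534_747, "GRCh38": 133_797_422}


def builds_compatible_with(end: int) -> List[str]:
    """Return the builds in which a chromosome 10 feature ending at `end` fits."""
    return [build for build, length in CHR10_LENGTH.items() if end <= length]
```

For the feature in the error message (end = 134136486), only GRCh37 is returned, which suggests that the coordinates were produced on GRCh37.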

Q: How to interpret the presence of my SV in DGV or DDD databases?

DGV is populated with healthy samples, whereas DDD contains affected patients.
The presence of an SV from your sample in DGV or DDD does not necessarily imply a disease-causing event. Healthy carriers of pathogenic SV exist in both databases. When available, the allele frequency can help to decide on the status.

Q: Is AnnotSV available for other organisms?

The main objective of AnnotSV is to annotate SV information from human data. By default, all the annotations are based on human specific databases. Nevertheless, some additional annotation files can be added for mouse. If you are interested, please see the specific mouse README file.

Q: Is there an option to just generate SV “split” by gene?

Yes. You can keep only the split annotation lines by using the "-typeOfAnnotation" option.
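If an output file containing both types of lines has already been produced, the split lines can also be extracted afterwards. This is a minimal sketch, assuming a tab-separated output whose header contains a column named "AnnotSV type"; check the header of your own file, since the column name is an assumption here and the function name is hypothetical.

```python
import csv
import io
from typing import Dict, List


def keep_split_lines(tsv_text: str, type_column: str = "AnnotSV type") -> List[Dict[str, str]]:
    """Return only the rows whose annotation type is "split".

    Assumption: the text is tab-separated with a header line that
    contains `type_column`.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[type_column] == "split"]
```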

Q: I am unable to run the code on the input files provided. It crashes on the Repeat annotation step due to a bad_alloc error. Do you have any ideas on why this is happening?

AnnotSV needs to be run with enough RAM (the amount depends on the annotations used). Setting your system to allocate 10 GB should solve the problem.

Q: I'm getting the error: “ANNOTSV environment variable not specified. Please define it before running AnnotSV. Exit”. How can I fix this problem?

ANNOTSV is the environment variable defining the installation path of the software.

• In csh, you can define it with the following command line:
setenv ANNOTSV /path_of_AnnotSV_installation/bin
• In bash, you can define it with the following command line:
export ANNOTSV=/path_of_AnnotSV_installation/bin

I advise you to save the appropriate command in your .cshrc or .bashrc file.

Q: My annotated SV is intersecting both a benign SV and a pathogenic SV. How can I explain that?

Several possible explanations can be considered:
• The pathogenicity may concern a recessive disease, so the pathogenic SV can be present in the heterozygous state in the healthy population (with a low frequency in DGV)
• The pathogenic region of the dbVar SV is not overlapping the DGV SV

Q: I’m getting the error: “-- max size for a Tcl value (2147483647 bytes) exceeded”. How can I fix this problem?

You are probably using AnnotSV to annotate a very large SV input file (from a large cohort). You are therefore facing a memory issue caused either by the specification of the current machine or by the programming language used for AnnotSV (Tcl). To solve this, you can split your input file into smaller files, run AnnotSV on each of them and then merge the results into a single output file. This will be fixed in a future release.
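The split and merge steps can be sketched as follows. This is only an illustration under stated assumptions: the input is a BED file without header lines (for a VCF, the header lines would have to be repeated in every chunk), and each output starts with one identical header line. The function names are hypothetical.

```python
from typing import Iterable, Iterator, List


def chunk_lines(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `chunk_size` input lines."""
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # last, possibly shorter, chunk
        yield chunk


def merge_outputs(outputs: Iterable[List[str]]) -> List[str]:
    """Concatenate per-chunk output lines, keeping the header line only once.

    Assumption: every output starts with the same single header line.
    """
    merged: List[str] = []
    for index, lines in enumerate(outputs):
        merged.extend(lines if index == 0 else lines[1:])
    return merged
```

Each chunk would be written to its own file and annotated separately; the chunk size has to be chosen according to the memory available.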

Q: For a VCF with only “BND” events, which refers to breakpoints, how are these being shown in the AnnotSV output when SVminSize is set to 50bp? Since a breakpoint start and stop positions only differ by 1bp, I am wondering why these are not filtered out by AnnotSV.

AnnotSV is designed to annotate SV, not SNV/indel, from a VCF. Excluding SNV/indel is the purpose of the "SVminSize" option.
Actually, SV can be described in three different ways in a VCF file:
• Type1: ref="G" and alt="ACTGCTAACGATCCGTTTGCTGCTAACGATCTAACGATCGGGATTGCTAACGATCTCGGG" (length > SVminSize)
• Type2: alt="<INS>", "<DEL>", "<BND>"...
• Type3: complex rearrangements with breakends: alt="G]17:1584563]"
The “SVminSize” parameter is only used to exclude SNV/indel from the SV of Type1.
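The rule above can be sketched as follows. This is not AnnotSV's code: the function names are hypothetical, the size of a Type1 SV is assumed to be the difference between the alt and ref lengths, and the comparison with SVminSize is assumed to be inclusive.

```python
def describe_alt(alt: str) -> str:
    """Classify a VCF ALT value into the three description types listed above."""
    if alt.startswith("<") and alt.endswith(">"):
        return "Type2"  # symbolic allele such as <DEL> or <BND>
    if "[" in alt or "]" in alt:
        return "Type3"  # breakend notation such as G]17:1584563]
    return "Type1"      # explicit sequence


def passes_min_size(ref: str, alt: str, sv_min_size: int = 50) -> bool:
    """Apply the size filter to Type1 only; Type2 and Type3 are always kept."""
    if describe_alt(alt) != "Type1":
        return True
    # Assumption: size = |alt length - ref length|, kept when >= sv_min_size.
    return abs(len(alt) - len(ref)) >= sv_min_size
```

With this logic, a "BND" record is never removed by the size filter, whatever the distance between its start and stop positions.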

Q: How is the “SV length” annotation calculated?

• AnnotSV reports the “SVLEN” value if given in a VCF input file.
• Otherwise, when the SV is described with explicit sequences in a VCF input file, AnnotSV calculates the SV length as "alt length" - "ref length". Example: ref="G" and alt="ACTGCTAACGATCCGTTTGCTGCTAACGATCTAACGATCGGGATTGCTAATCTCGGG"
• Otherwise, AnnotSV calculates the SV length only for deletions, duplications and inversions (as "SVend - SVstart", with a negative value for deletions). This calculation cannot be done for insertions, breakends, translocations...
• Otherwise, the SV length is left blank.
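The cascade described above can be sketched as a single function. It is an illustration, not AnnotSV's implementation: the function name and arguments are hypothetical, and an "explicit sequence" is assumed to be an ALT made only of the letters A, C, G, T or N.

```python
from typing import Optional


def sv_length(svlen: Optional[int], ref: str, alt: str,
              svtype: str, start: int, end: int) -> Optional[int]:
    """Return the SV length following the cascade above, or None if left blank."""
    # 1. Use SVLEN when it is given in the VCF.
    if svlen is not None:
        return svlen
    # 2. Explicit sequences: "alt length" - "ref length".
    if alt != "" and all(base in "ACGTNacgtn" for base in alt):
        return len(alt) - len(ref)
    # 3. Deletion, duplication, inversion: "SVend - SVstart" (negative for DEL).
    if svtype in ("DEL", "DUP", "INV"):
        length = end - start
        return -length if svtype == "DEL" else length
    # 4. Insertion, breakend, translocation...: no length can be computed.
    return None
```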

Q: What does the candidateGenesFile parameter refer to?

The candidateGenesFile contains the candidate genes of the user. This information is used:
• To improve the ranking of the SV (see the “SV RANKING/CLASSIFICATION” section)
• To filter out the SV annotations that do not overlap a candidate gene (-candidateGenesFiltering yes). In this configuration, only “split” annotations can be reported.

Q: My input bed file contains ~10000 SV, but only ~2000 SV are annotated. Why?

AnnotSV does not annotate:
• The SNV/indel (size < 50 bp)
• The SV given in an invalid format
• The SV for which the “END” is not defined.
AnnotSV creates a report of unannotated variants (“.unannotated.tsv” file).
If you want to annotate SNV/indel, please set the -SVminSize to 1.
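Before running AnnotSV, a rough count of the lines likely to be skipped can be obtained from the BED file itself. This is a sketch under assumptions: the size is taken as end - start, a line is considered invalid when it has fewer than 3 tab-separated fields or non-numeric coordinates, and the function name is hypothetical. The “.unannotated.tsv” report remains the reference.

```python
from collections import Counter
from typing import Iterable


def tally_bed_lines(lines: Iterable[str], sv_min_size: int = 50) -> Counter:
    """Count BED lines per category: "candidate", "too small" or "bad format"."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # ignore blank and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            counts["bad format"] += 1
            continue
        # Assumption: the SV size is end - start.
        size = int(fields[2]) - int(fields[1])
        if size < sv_min_size:
            counts["too small"] += 1
        else:
            counts["candidate"] += 1
    return counts
```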

Q: How are overlaps (%) calculated?

AnnotSV provides 3 different types of annotations:

- An annotation with features overlapping the SV (DGV, 1000 genomes…):


- An annotation with features overlapped by the SV (pathogenic SV from dbVar, promoters, enhancers…):


- A gene-based annotation:
Each gene overlapped by the SV to annotate is reported (even with 1bp overlap).

Q: Why not use a reciprocal overlap with features overlapped by the SV to annotate?

Let’s take the example of pathogenic SV as features.

=> AnnotSV would lose some information if using a reciprocal overlap.
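The loss of information can be illustrated with simple interval arithmetic. The coordinates and the 70% threshold below are hypothetical values chosen for the example, interval lengths are computed as end - start, and the function names are not part of AnnotSV.

```python
def overlap_bp(start_a: int, end_a: int, start_b: int, end_b: int) -> int:
    """Number of bases shared by two intervals (0 if they do not overlap)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def percent_covered(start: int, end: int, other_start: int, other_end: int) -> float:
    """Percentage of the interval [start, end] covered by the other interval."""
    return 100.0 * overlap_bp(start, end, other_start, other_end) / (end - start)


def reciprocal_overlap_ok(start_a: int, end_a: int, start_b: int, end_b: int,
                          threshold: float) -> bool:
    """True when each interval covers at least `threshold` percent of the other."""
    return (percent_covered(start_a, end_a, start_b, end_b) >= threshold
            and percent_covered(start_b, end_b, start_a, end_a) >= threshold)


# Hypothetical example: a 1 Mb SV that fully contains a 100 kb pathogenic SV.
sv = (1_000_000, 2_000_000)
pathogenic = (1_400_000, 1_500_000)
feature_covered = percent_covered(*pathogenic, *sv)   # 100.0
sv_covered = percent_covered(*sv, *pathogenic)        # 10.0
reported_with_reciprocal = reciprocal_overlap_ok(*sv, *pathogenic, threshold=70.0)  # False
```

In this example the pathogenic SV is entirely included in the SV to annotate, yet a reciprocal criterion would discard it because it covers only 10% of the larger SV.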

Q: What are the minimal info/headers needed in a VCF input file to run AnnotSV?

AnnotSV expects VCF input that follows the official VCF v4.3 specification. Nevertheless, some flexibility is allowed:
- No meta-information line (prefixed with “##”) is required
But the following is mandatory:
- A header line (prefixed with “#CHROM”)
- The following keys are required: SVLEN, END and SVTYPE (INFO keys) and GT (the genotype key, which the VCF specification places in the FORMAT column).

In order to be able to classify the SV, the "SVTYPE" values should be one of DEL, INS, DUP, INV, CNV, BND, LINE1, SVA, ALU.
In addition, the <CN0>, <CN2>, <CN3>... angle-bracketed ID from the "ALT" column should be used in case of SVTYPE=CNV in the INFO column.

In order to use the “snvIndelPASS” option (which uses the variants only if they passed all filters during the calling), the FILTER column value is mandatory.
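The requirements above can be checked before running AnnotSV. The sketch below is an assumption-laden illustration, not AnnotSV's own validation: it looks for SVTYPE, END and SVLEN in the INFO column and for GT in the FORMAT column (where the VCF specification places the genotype key), and the function name is hypothetical.

```python
from typing import List

REQUIRED_INFO_KEYS = ("SVTYPE", "END", "SVLEN")


def check_minimal_vcf(lines: List[str]) -> List[str]:
    """Return the problems found; an empty list means the basic checks passed."""
    problems: List[str] = []
    if not any(line.startswith("#CHROM") for line in lines):
        problems.append("missing #CHROM header line")
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # meta-information, header or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            problems.append(f"line {number}: fewer than 8 columns")
            continue
        info_keys = {entry.split("=", 1)[0] for entry in fields[7].split(";")}
        for key in REQUIRED_INFO_KEYS:
            if key not in info_keys:
                problems.append(f"line {number}: INFO key {key} missing")
        # Assumption: GT is expected in the FORMAT column (9th column).
        if len(fields) < 9 or "GT" not in fields[8].split(":"):
            problems.append(f"line {number}: GT missing from FORMAT")
    return problems
```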

Q: I’m getting the error: “ERROR: chromosome sort ordering for file … is inconsistent with other files”. How can I fix this problem?

The locale specified by your environment can affect the traditional “sort” order that uses native byte values. Please, set LC_ALL=C.
In csh, you can define it with the following command line:
setenv LC_ALL C
In bash:
export LC_ALL=C

Q: I’m getting the error: « unexpected token "END" at position 0; expecting VALUE » while running Exomiser. How can I fix this problem?

You are facing a memory issue. Please, try increasing RAM/MEM on your compute node.