- Source: http://www.ensembl.org/info/docs/tools/vep/script/index.html
Latest revision as of 14:40, 15 October 2013
Date : 2013/10/14 Author : kchennen
Variant Effect Predictor
Installation
- Installation on studio with Raymond
- installation in /biolo/vep
- Download the latest archive (v73)
> curl "http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-tools/scripts/variant_effect_predictor.tar.gz?view=tar&root=ensembl&pathrev=branch-ensembl-73" | tar xz
> cd variant_effect_predictor
- Install the API with a local cache in /biolo/vep/cache
> perl INSTALL.pl -c /biolo/vep/cache
    Hello! This installer is configured to install v73 of the Ensembl API for use by the VEP.
    It will not affect any existing installations of the Ensembl API that you may have.
    It will also download and install cache files from Ensembl's FTP server.
    Checking for installed versions of the Ensembl API...done
    It looks like you already have v73 of the API installed.
    You shouldn't need to install the API
    Skip to the next step (n) to install cache files
    Do you want to continue installing the API (y/n)?y
    Setting up directories
    Downloading required files
     - fetching ensembl
     - unpacking ./Bio/tmp/ensembl.tar.gz
     - moving files
     - fetching ensembl-variation
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation.tar.gz?root=ensembl&view=tar&only_with_tag=branch-ensembl-73 ==> 301 Moved
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation.tar.gz?pathrev=branch-ensembl-73&root=ensembl&view=tar ==> 200 OK (8s)
     - unpacking ./Bio/tmp/ensembl-variation.tar.gz
     - moving files
     - fetching ensembl-functgenomics
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics.tar.gz?root=ensembl&view=tar&only_with_tag=branch-ensembl-73 ==> 301 Moved
    ** GET http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics.tar.gz?pathrev=branch-ensembl-73&root=ensembl&view=tar ==> 200 OK (5s)
     - unpacking ./Bio/tmp/ensembl-functgenomics.tar.gz
     - moving files
     - fetching BioPerl
    ** GET http://bioperl.org/DIST/BioPerl-1.6.1.tar.gz ==> 200 OK (15s)
     - unpacking ./Bio/tmp/BioPerl-1.6.1.tar.gz
     - moving files
    Testing VEP script
     - OK!
- Install local cache for database connections for homo sapiens
    The VEP can either connect to remote or local databases, or use local cache files.
    Using local cache files is the fastest and most efficient way to run the VEP
    Cache files will be stored in /biolo/vep/cache
    Do you want to install any cache files (y/n)? y
    Cache directory /biolo/vep/cache does not exists - do you want to create it (y/n)? y
    Downloading list of available cache files
    The following species/files are available; which do you want (can specify multiple separated by spaces):
    1 : ailuropoda_melanoleuca_vep_73.tar.gz
    2 : anas_platyrhynchos_vep_73.tar.gz
    3 : anolis_carolinensis_vep_73.tar.gz
    ...
    25 : homo_sapiens_refseq_vep_73.tar.gz
    26 : homo_sapiens_vep_73.tar.gz
    ...
    ? 26
     - downloading ftp://ftp.ensembl.org/pub/release-73/variation/VEP/homo_sapiens_vep_73.tar.gz
    ** GET ftp://ftp.ensembl.org:21/pub/release-73/variation/VEP/homo_sapiens_vep_73.tar.gz ==> 200 OK (305s)
     - unpacking homo_sapiens_vep_73.tar.gz
- Download FASTA files for homo sapiens
    The VEP can use FASTA files to retrieve sequence data for HGVS notations and reference sequence checks.
    FASTA files will be stored in /biolo/vep/cache
    Do you want to install any FASTA files (y/n)? y
    FASTA files for the following species are available; which do you want (can specify multiple separated by spaces, "0" to install for species specified for cache download):
    1 : ailuropoda_melanoleuca
    2 : anas_platyrhynchos
    3 : ancestral_alleles
    ...
    26 : homo_sapiens
    ...
    ? 26
    Downloading Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz
    ** GET ftp://ftp.ensembl.org:21/pub/release-73/fasta//homo_sapiens/dna/Homo_sapiens.GRCh37.73.dna.primary_assembly.fa.gz ==> 200 OK (99s)
    Extracting data
    The FASTA file should be automatically detected by the VEP when using --cache or --offline. If it is not, use "--fasta /biolo/vep/cache/homo_sapiens/73/Homo_sapiens.GRCh37.73.dna.primary_assembly.fa"
    Success
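Before moving on to configuration, it can be worth confirming that the installer really left the cache and FASTA file where its log says. The following is a minimal Python sketch under stated assumptions: the helper name `missing_cache_items` is hypothetical, and the `<cache>/<species>/<release>/` layout is inferred only from the `--fasta` path printed in the log above. The demo at the end runs against a throwaway temporary directory, not the real /biolo/vep/cache.

```python
import tempfile
from pathlib import Path
from typing import List

# File name taken from the installer log above.
FASTA_NAME = "Homo_sapiens.GRCh37.73.dna.primary_assembly.fa"


def missing_cache_items(cache_root: str,
                        species: str = "homo_sapiens",
                        release: str = "73",
                        fasta_name: str = FASTA_NAME) -> List[Path]:
    """Return the expected cache paths that do not exist under cache_root.

    Assumed layout (inferred from the log): <cache_root>/<species>/<release>/
    """
    species_dir = Path(cache_root) / species / release
    expected = [species_dir, species_dir / fasta_name]
    return [path for path in expected if not path.exists()]


# Self-check on a temporary directory standing in for the cache root.
with tempfile.TemporaryDirectory() as root:
    before = missing_cache_items(root)           # nothing installed yet
    species_dir = Path(root) / "homo_sapiens" / "73"
    species_dir.mkdir(parents=True)
    (species_dir / FASTA_NAME).touch()           # fake FASTA file
    after = missing_cache_items(root)            # everything present now

print(len(before), len(after))
```

On the real installation the call would be `missing_cache_items("/biolo/vep/cache")`; an empty list means both the species directory and the FASTA file were found.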
- Configure
  - Add plugins
    - Download the latest archive of VEP plugins (https://github.com/ensembl-variation/VEP_plugins)
    - Move all the plugins into the plugin directory /biolo/vep/cache/Plugins
  - Create the configuration file vep.ini in /biolo/vep/cache
    ##########################
    ## general features flags
    ##########################
    force_overwrite 1
    verbose 1
    species homo_sapiens
    fork 4

    ###########################
    ## output annotation flags
    ###########################
    sift b          # The SIFT prediction and score, with both given as prediction(score).
    polyphen b      # The PolyPhen prediction and score.
    regulatory 1    # Look for overlaps with regulatory regions. The script can also call if a variant falls in a high information position within a transcription factor binding site.
    numbers 1       # Adds affected exon and intron numbering to the output.
    domains 1       # Adds names of overlapping protein domains to the output.

    terms so

    ################################
    ## output identification flags
    ################################
    hgvs 1          # Add HGVS nomenclature based on Ensembl stable identifiers to the output.
    symbol 1        # Adds the gene symbol (e.g. HGNC) (where available) to the output.
    ccds 1          # Adds the CCDS transcript identifier (where available) to the output.
    protein 1       # Add the Ensembl protein identifier to the output where appropriate.
    canonical 1     # Adds a flag indicating if the transcript is the canonical transcript for the gene.
    biotype 1       # Adds the biotype of the transcript. Not used by default.
    xref_refseq 1   # Output aligned RefSeq mRNA identifier for transcript.

    #############################
    ## Co-located variants flags
    #############################
    gmaf 1          # Add the global minor allele frequency (MAF) from 1000 Genomes Phase 1 data for any existing variant to the output.
    #maf_1kg 1      # Add MAF from continental populations (AFR,AMR,ASN,EUR) of 1000 Genomes Phase 1 to the output.
    maf_esp 1       # Include MAF from NHLBI-ESP populations.
    pubmed 1        # Report PubMed IDs for publications that cite an existing variant.
    check_alleles 1 # When checking for existing variants, only report a co-located variant if none of the alleles supplied are novel.
    check_svs 1     # Checks for the existence of structural variants that overlap your input.
    ##failed 1      # When checking for co-located variants, by default the script will exclude variants that have been flagged as failed.

    #############################
    ## Filtering and QC options
    #############################
    #check_ref 1    # Force the script to check the supplied reference allele against the sequence stored in the Ensembl Core database.
    #coding_only 1  # Only return consequences that fall in the coding regions of transcripts.
    no_intergenic 1 # Do not include intergenic consequences in the output.
    #most_severe 1  # Output only the most severe consequence per variation.
    #summary 1      # Output only a comma-separated list of all observed consequences per variation.
    #per_gene 1     # Output only the most severe consequence per gene.
    filter_common 1 # Shortcut flag for the filters below - this will exclude variants that have a co-located existing variant with global MAF > 0.01 (1%). May be modified using any of the following freq_* filters.
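With this many flags, some of them commented out, it is easy to lose track of what is actually switched on. The sketch below is a small Python reader for the `flag value # comment` layout used in the vep.ini above. It is an illustration only and not VEP's own config parser: the function name `parse_vep_ini` is hypothetical, and treating a flag with no value as `"1"` is an assumption.

```python
from typing import Dict


def parse_vep_ini(text: str) -> Dict[str, str]:
    """Return {flag: value} for every active (non-commented) line."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and lines that are commented out entirely.
        if not line or line.startswith("#"):
            continue
        # Drop a trailing "# comment", then split into flag and value.
        tokens = line.split("#", 1)[0].split()
        if not tokens:
            continue
        flag = tokens[0]
        # Assumption: a flag written without a value counts as enabled ("1").
        value = tokens[1] if len(tokens) > 1 else "1"
        settings[flag] = value
    return settings


# A short excerpt in the same layout as the vep.ini on this page.
SAMPLE = """\
##########################
## general features flags
##########################
species homo_sapiens
fork 4
sift b # the SIFT prediction and score
#maf_1kg 1 # disabled
"""

settings = parse_vep_ini(SAMPLE)
print(settings)
```

For the real file, the text would come from reading /biolo/vep/cache/vep.ini; the resulting dictionary lists only the flags that are not commented out.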
- Creation of an alias
vep: aliased to /biolo/vep/variant_effect_predictor.pl --force_overwrite --cache --dir /biolo/vep/cache
Usage
- Set environment
> setvep
> vep -i myfile.vcf
- Run VEP on a VCF file and keep the log
> vep -i input.vcf -o output.vcf > output.log
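When VEP has to be called from a script rather than an interactive shell, the alias is not available, so the full command has to be spelled out. The following Python sketch assembles the same command that the alias expands to, using only options that appear on this page (`--force_overwrite`, `--cache`, `--dir`, `-i`, `-o`, `--fasta`). The function name `build_vep_command` is hypothetical, and the sketch assumes the script is executable at the path given in the alias.

```python
import shlex
from typing import List, Optional

# Paths taken from the alias definition on this page.
VEP_SCRIPT = "/biolo/vep/variant_effect_predictor.pl"
CACHE_DIR = "/biolo/vep/cache"


def build_vep_command(input_vcf: str,
                      output_file: str,
                      fasta: Optional[str] = None) -> List[str]:
    """Assemble the argument list the `vep` alias would run."""
    cmd = [
        VEP_SCRIPT,
        "--force_overwrite",
        "--cache",
        "--dir", CACHE_DIR,
        "-i", input_vcf,
        "-o", output_file,
    ]
    # Only needed if the FASTA file is not detected automatically.
    if fasta is not None:
        cmd += ["--fasta", fasta]
    return cmd


cmd = build_vep_command("input.vcf", "output.vcf")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True, stdout=log_file)` with `log_file` an open file handle, which corresponds to the `> output.log` redirection in the usage line.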