Latest revision as of 10:09, 10 January 2018
Architecture of Gscope
To understand how Gscope is organised today, we need a brief overview of its historical evolution (or evolutionary history).
Gscope from the beginning
Odile Lecompte, Olivier Poch and Raymond Ripp had to annotate the genome of Pyrococcus abyssi.
Starting with the DNA sequence of Pyrococcus abyssi (1765120 bases) we determined the genes and tried to find the function of each protein.
For that we needed an interactive visualization tool able to show the sequences, blast outputs, multiple alignments and many other things.
Naming and general organisation
PAB
The Pabyssi gscope project handles nucleic and protein sequences. Each one is represented as a rectangular box on the GscopeBoard.
We called it a PAB (from Pyrococcus AByssi) and were never able to find a more generic name (it could be Box or SeqEntity or ???).
Each one has an id: PAB0001, PAB0002, ... (the numbering may not be consecutive).
The procedure ListeDesPABs returns the list of all these ids. We very often use:
foreach Nom [ListeDesPABs] { DoSomething $Nom }
I have not changed the name of this central procedure since Pabyssi.
To name each 'PAB' of a project we use a prefix (e.g. PAB, BOX or EHomsa) followed by a number of 1 to 5 digits (e.g. PAB2359 or EHomsa12345).
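The naming rule can be illustrated with a small Python sketch. This is not Gscope code (Gscope is written in Tcl); the function names are hypothetical, and the sketch assumes that a prefix contains only letters.

```python
import re

# Assumed pattern: a prefix made of letters only, followed by 1 to 5 digits.
GSCOPE_ID = re.compile(r"^([A-Za-z]+)(\d{1,5})$")


def split_gscope_id(name):
    """Return (prefix, number) for a name such as 'PAB2359', or None if it does not match."""
    match = GSCOPE_ID.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def make_gscope_id(prefix, number, width):
    """Build a name from a prefix and a number zero-padded to `width` digits."""
    if not 1 <= width <= 5:
        raise ValueError("width must be between 1 and 5 digits")
    if number < 0 or number >= 10 ** width:
        raise ValueError("number does not fit in the requested width")
    return f"{prefix}{number:0{width}d}"
```

For example, split_gscope_id("EHomsa12345") separates the prefix EHomsa from the number 12345, and make_gscope_id("PAB", 1, 4) rebuilds PAB0001.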
Gscope File Organisation
See more at Gscope Project
Each Gscope project (here called MyProject) lives in a single directory tree, rooted at RepertoireDuGenome (normally /genomics/link/MyProject).
Suppose the prefix is MP and the project concerns 2345 proteins, from MP0001 to MP2345.
In the directory /genomics/link/MyProject you will find the directories
- nuctfa containing the fasta file for each nucleic sequence (from MP0001 to MP2345)
- nucembl containing the same sequences in embl format
- prottfa containing the fasta file for each protein sequence (from MP0001 to MP2345)
- protembl containing the same sequences in embl format
- blastp
- ballast
- msf
- msfleon
- macsimXml
- macsimcRsf
These subdirectories are the defaults, each holding the corresponding file type, BUT we could create several blast outputs for different databases. In that case we could have
- blastpProtall
- blastpUniref
and, to keep the default directory, we use a symbolic link
blastp -> blastpProtall
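As a minimal sketch of that link (in Python, run in a throwaway directory rather than a real project), the default name can be pointed at one of the variants like this. The function name point_default_to is hypothetical and not part of Gscope.

```python
import os
import tempfile


def point_default_to(project_dir, default_name, variant_name):
    """Make project_dir/default_name a symbolic link to the sibling directory variant_name."""
    link_path = os.path.join(project_dir, default_name)
    if os.path.islink(link_path):
        os.remove(link_path)  # replace an existing link only, never a real directory
    # A relative target keeps the link valid if the project directory is renamed.
    os.symlink(variant_name, link_path, target_is_directory=True)
    return link_path


# Demonstration in a temporary directory standing in for /genomics/link/MyProject.
project = tempfile.mkdtemp()
for variant in ("blastpProtall", "blastpUniref"):
    os.mkdir(os.path.join(project, variant))
link = point_default_to(project, "blastp", "blastpProtall")
```

Calling point_default_to again with "blastpUniref" switches the default without touching either real directory.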
The most important thing is that directory names contain blastp, blastn, msf, etc. This is how Gscope knows which kind of file a directory contains (unfortunately we don't use file extensions!).
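The idea of recognising the file type from the directory name can be sketched in Python as follows. The list of kinds and the "longest match wins" rule are assumptions made for this illustration; they are not taken from the Gscope source.

```python
# Kinds drawn from the directory names mentioned above (illustrative, not exhaustive).
KNOWN_KINDS = ["blastp", "blastn", "ballast", "msf", "msfleon"]


def kind_of_directory(dirname):
    """Return the longest known kind contained in dirname, or None if there is none."""
    # Longest first, so that "msfleon" is not mistaken for plain "msf".
    for kind in sorted(KNOWN_KINDS, key=len, reverse=True):
        if kind in dirname:
            return kind
    return None
```

With this rule, both blastpProtall and blastpUniref are recognised as blastp directories, while a directory such as infos matches no kind.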
Another important subdirectory is infos. It contains the most used information for each PAB. This information is provided by ExtraitInfo:
ExtraitInfo EHomsa00001 lists all available infos
ExtraitInfo EHomsa00001 AC: returns the AC field
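To make the two call forms concrete, here is a Python sketch of a lookup that behaves similarly. It is not ExtraitInfo itself: it assumes, purely for illustration, that each PAB has a file named after it in infos and that each line has the form "Field: value". The function read_info and the sample values are hypothetical.

```python
import os
import tempfile


def read_info(infos_dir, name, field=None):
    """Return every line of the info file for `name`, or the value of one field."""
    with open(os.path.join(infos_dir, name), encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    if field is None:
        return lines
    for line in lines:
        if line.startswith(field):
            return line[len(field):].strip()
    return ""  # field absent


# Demonstration with a made-up info file in a temporary directory.
infos = tempfile.mkdtemp()
with open(os.path.join(infos, "EHomsa00001"), "w", encoding="utf-8") as handle:
    handle.write("Nom: EHomsa00001\nAC: X00000\n")

all_infos = read_info(infos, "EHomsa00001")
accession = read_info(infos, "EHomsa00001", "AC:")
```

The field is passed with its trailing colon, as in the ExtraitInfo call above.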
beton and fiches
- the beton subdirectory contains things which should never change
  - typically the miniconfig file
- the fiches subdirectory contains things concerning the project itself
  - bornesdespabs lists all PABs with their names (EHomsa00001 to EHomsa21006) and their positions on the GscopeBoard
  - lesgenomescomplets gives the list of the interesting complete genomes (mainly an empty list)
  - niag.txt contains the lookup table between the GscopeId EHomsa12345, the Uniprot id MET_HUMAN, the Uniprot access Q86W50 and the genename METTL16
  - MyGenesFromGo.txt
  - MyGOsFromGenes.txt
  - etc.
FOLLOWING FILES MUST BE THERE
Notice that each Gscope project must have
- a project name MyProject (the name of the directory /genomics/link/MyProject)
- a /genomics/link/MyProject/beton/miniconfig file
- the /genomics/link/MyProject/fiches/bornesdespabs file
otherwise Gscope can't start; instead it asks you for the information needed to create these compulsory data.
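A quick way to check a project directory before starting Gscope could look like the Python sketch below. It is not part of Gscope; it assumes that miniconfig sits in the project's beton subdirectory and bornesdespabs in its fiches subdirectory, as described in the "beton and fiches" section. The function name is hypothetical.

```python
import os
import tempfile

# Paths relative to the project directory (assumed layout, see lead-in).
COMPULSORY = [
    os.path.join("beton", "miniconfig"),
    os.path.join("fiches", "bornesdespabs"),
]


def missing_compulsory_files(project_dir):
    """Return the compulsory files that do not exist under project_dir."""
    return [rel for rel in COMPULSORY
            if not os.path.isfile(os.path.join(project_dir, rel))]


# Demonstration on a temporary directory standing in for /genomics/link/MyProject.
project = tempfile.mkdtemp()
os.mkdir(os.path.join(project, "beton"))
open(os.path.join(project, "beton", "miniconfig"), "w").close()
still_missing = missing_compulsory_files(project)
```

An empty result means both files are present; here only fiches/bornesdespabs is reported as missing.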
Notice that all Gscope projects are located in /genomics/link, therefore they must have different project names.
Notice also that the project name is in no way linked to the prefix (naming them alike is only a habit).
The project name is only the name of the directory in /genomics/link, so you can rename the directory whenever you want (e.g. to MyNewProj) ... but don't forget to run
setgscoperr MyNewProj