AmdConsortium

From Wiki

Revision as of 08:54, 9 February 2012 by Ripp (Talk | contribs)
Jump to: navigation, search

The AMD Consortium within EVI-Genoret Database by Raymond Ripp

The Genoret Database aims to host all data for the Amd Consortium.

See the AMD Consortium Welcome Page

Contents

Introduction

  • I got a lot of different files from all centers, from 2006 up to october 2010.
  • Files came from Baltimore, Bonn, Creteil, Jerusalem, London, Paris, Southampton and from the CNG (Evry)
  • Very often people didn't use the same identities, column title and content ...
  • I tried to run as much verifications as possible to detect inconsistency, error, missings, etc.

Additionnal information from Diana 20111004

Diana sent us the Excell file Final_dat_AMD_P28_29_collections.xls containing additionnal and existing CngStatus.

I wrote the tcl program VerificationP28 :

  • 1 new Creteil Nat2-285
  • 1 Baltimore has wrong sex dasn AMdCommon (or CgnStatus ...) 464001
  • 262 new London for AmdCommon
  • 1718 new London for CngStatus (0 before)
  • 18 new Southapton

Dianan sent us also AMD_JYates_110529.xls containing Indexes, Barcode and LabCenter. I wrote CngBarcode

Sql tables

Original table

I created a Sql table for each center independently from others with the raw data.

So I create the tables Baltimore, BaltimoreSeptember, Bonn, Creteil, Jerusalem, London, Paris, Southampton.

AmdCommon

These tables where very different and I decided to create a common table AmdCommon with following columns

 amdcentre amdid amdbirth amdsex amdsmoke amddiagod amddiagos amddiagodos amdwhicheye amdownpk pk

amdownpk is the link to the original data.

CngStatus

Only a subset of these patients were genotyped. So I created the CngStatus table with the information I got grom the CNG

 lab_center pedigree individual father mother sex status coll_comment barcode pk diag basicdiag birth 

The most important columns are

  • barcode corresponding to the blood sample
  • diag obtained by a mix of amddiagod amddiagos amddiagodos
  • basicdiag obtained from diag indicating ga cnv control amd

Managing BasicDiag from Diag

My Tcl functions BasicDiag and SetBasicDiagForCng allow to manage it.

  • BasicDiag uses the lookup table from the file CngStatusDiag.txt
  • SetBasicDiagForCng calls BasicDiag for all diag and if it doesn't find it, it appends the missing diag to CngStatusDiagToDo.txt
  • CngStatusDiagToDo.txt can be viewed from the AmdConsortium welcome page ... and I hope we will solve all remaining problems ...

NoFundus and ambigious patients from Bonn 2011/02/09

Jürgen Kaschkötö sent us two files

  • DeBonn/Bonn-Regensburg-Phenotype.xls containing 84 lines as for example
 H-028-GA;;;;Late AMD GA;Late AMD GA
 H-030-GA;SO25121919;H-030-2-OS;Individual was SO25121919 , now H-030-2-OS;Late AMD GA;Late AMD GA
 H-053-GA;NM27021930;H-053-2-MN;Individual was NM27021930 , now H-053-2-MN;Late AMD GA;Late AMD GA
 etc.

these patients where all with NoFundus

  • DeBonn/AmbigiousSolved.txt cointaining 20 lines as forexample
 L-003-GA L-003-2-AK
 L-007-GA L-007-2-GH
 L-018-GA L-018-2-EN
 etc.

the entries L-123-GA were ambigious because L-003-GA <==> ambigious L-003-1-HW L-003-2-AK

I decide NOT TO UPDATE the origial database which was obtained by a sql request to Tuebingen.

I updated AmdCommon and CngStatus on 2011/02/16 with the NoFundus patient. I'll do it soon for the ambigious.

Integration of AMD_COLL_xxx files containing all genotypes 2010/06/18

On https://beaune.cng.fr/amd/ I found the files

	 	 -
 AMD_COLL_AUDO.tar	17-Jun-2010 11:50 	35M
 AMD_COLL_LOTERY.tar	17-Jun-2010 11:49 	128M
 AMD_COLL_SCHOLL.tar	15-Jun-2010 15:29 	70M
 AMD_COLL_SOUIED.tar	17-Jun-2010 11:52 	70M
 AMD_COLL_ZACK.tar	17-Jun-2010 11:53 	41M
 results_AMD.txt.gz	17-Sep-2009 10:40 	11M
  • result_AMD.txt.gz was the same I already had.
  • I did tar -xvf AMD_COLL_xxxx.tar creating the correspondig directory containing (ie Audo)
64465487 2008-09-01 10:46 Final_dat_AMD_COLL_AUDO.txt
     717 2008-08-12 10:01 report.txt
26731497 2008-08-12 09:52 success_mark.txt
    3139 2008-08-12 09:52 success_samp.txt

The Final_dat_AMD_COLL_xxx contain

 Family Individual Father Mother Sex Status rs12354060 rs12184279 rs12564807 rs3115860 ... ... ... (with 332224 rs)

I created a small one called xxxx.txt (Audo, Lotery, Scholl, Souied, Zack) with the first 10 columns, run gscope CngGenoPheno and got the CngGenoPheno list available on the AmdConsortium welcome page 'Phenotypic Information.

  • Only a few references from Bonn are still missing

but why are most of Individual like 'ZM23041920' (I found them because I had the coll_comment field containing 'Individual was ZM23041920 , now H-208-2-MZ' )

  • Paris, Creteil and Southampton are ok, Baltimore seems to be ok but there are still multiple diagnosis ...

Integration of the CNG Status 2008/10/01

  1. Reading the file AMD_verif_statuts_Leveillard_envoye_1oct08.xls
    1. I normalized the headers (without blank, no /)
    2. I set sex to M or F instead of 1 or 2
    3. I replaced the centre with respectively Bonn, Creteil ...
    4. I replaced NAT2-xyz by xyz for Creteil (some sex values don't correpond)
    5. I remove TL to obtain the Paris ID
    6. I took the beginning of the Bonn's Ids (L-060-GA => L-060- but some Ids are ambiguous L-060-1-BH or L-060-2-KS ?
  2. Reading the files snplist_NXNL2.xlt snplist_TXNL6.xlt genotypeNXNL2.xls genotypeTXNL6.xls
    1. convert to .csv
    2. remove the comment lines at beginning

Integration of the Quality Control by Raymond Ripp 2008/04

2010/10/06 I have to integrate QC_results_AMD_Don_Zack_complete_sent_7april08.xls in the AllTogether table

With this qc_alltogether table I could feed the AmdCommon in which we mist a lot Baltimore

Comparing diagnosis from sep 2007 with apr 2008 I did minor corrections in Apr 2008 (upper case lower cae etc,) and major differences are listed below

   2	004001	DiagSep="	"	DiagApr2008="High Risk AMD"
   3	004001	DiagSep="	"	DiagApr2008="High Risk AMD"
  31	034002	DiagSep="	"	DiagApr2008="Not confirmed"
  32	034002	DiagSep="	"	DiagApr2008="Not confirmed"
  42	043002	DiagSep="	"	DiagApr2008="Unaffected/No Pathology"
  50	049001	DiagSep="Advanced GA"	DiagApr2008="	"
  52	052004	DiagSep="	"	DiagApr2008="Not confirmed"
  88	107010	DiagSep="Low risk AMD"	DiagApr2008="Unaffected/No Pathology"
 129	139002	DiagSep="Exudative AMD"	DiagApr2008="Advanced GA and Exudative AMD"
 206	185002	DiagSep="	"	DiagApr2008="Exudative AMD"
 301	247020	DiagSep="	"	DiagApr2008="Not confirmed"
 300	247020	DiagSep="	"	DiagApr2008="Not confirmed"
 325	258001	DiagSep="Can't Confirm Dx"	DiagApr2008="Can't Grade"
 457	316002	DiagSep="	"	DiagApr2008="Exudative AMD"
 586	406002	DiagSep="	"	DiagApr2008="Not confirmed"
 631	435006	DiagSep="	"	DiagApr2008="Not confirmed"
 630	435006	DiagSep="	"	DiagApr2008="Not confirmed"
 666	447002	DiagSep="	"	DiagApr2008="High Risk AMD"
 678	452003	DiagSep="	"	DiagApr2008="Not confirmed"
 677	452003	DiagSep="	"	DiagApr2008="Not confirmed"

I set Advanced GA for 049001 because it was already for Pk 49

Important changes are 107010 139002 258001 (written 2010/09/20) I got following files

2007-12-12 10:08 QualityControlCreteil.xls
2007-12-12 10:20 QualityControlCreteil.csv
2007-12-12 10:36 QualityControlBaltimore.csv
2007-12-12 10:40 QualityControlBonn.csv
2007-12-12 10:46 QualityControl.csv
2008-04-07 13:44 QC_results_AMD_Audo_sent_7april08.xls
2008-04-07 14:52 QC_results_AMD_Don_Zack_complete_sent_7april08.xls (this was not integrated before 2010/10/06)
2008-05-14 09:23 QualityControlBonn.xls
2008-05-14 09:23 QualityControlBaltimore.xls
2008-05-15 18:08 AMD_QC_results_sent_batch2_may08.zip
2009-04-07 14:29 QualityControlParis.csv

and created the Quality Control tables (see 'List all available Quality Controls' and per center on the AmdConsortium welcome page)

Integration of the Phenotyping Data by Raymond 2008/02/11

  • Baltimore January version
  1. Betsy Campochiaro sent several Excel files corresponding to an Access database
  2. Creation of a .csv file containing nearby all fields pointed as secondary tables (Genoret Tcl program)
  3. Integration of this file in the csvschema table csvt8
  4. Detection of errors with the Genoret Tcl program
  5. Corrections and updates of small errors
  6. Add ped to father_id and mother_id (2008/01/11)
  7. I replaced the numbers corresponding to the diagnosis with the text of the diagnosis (2008/01/10)
  • Bonn
  1. Direct connection to the Phenotyping Database in Tuebingen (2007-11)
  2. Save the display of all patients as .csv file
  3. Integration of this file in the csvschema table csvt16
  4. They are 3 sub_table in our table ... the 3rd oe should be ok
  5. It seems that now a connection to Tuebingen gives only the 3rd table ...
  6. Keep only the year of birth (Genoret Tcl program)
  7. Keep only the centre Bonn or the FamStudy (ok)
  8. Some values were stored as boolean and couldn't be displayed correctly (a small square). I replaced nul by NO and not null by YES
  • Creteil
  1. Eric Souied sent an Excel file
  2. I removed birthdays and created a csvt table
  3. As the file contains only CNV patients (without this diagnosis) I add CNV to the AmdDiagOdOs column in the common table.
  • Paris CIC
  1. Isabelle Audo sent an Excel file
  • Jerusalem
  1. Itay Chowers sent a new file 2007/12/24 and a .doc file db_codes
    1. I deleted all empty rows especially at the end
    2. and I added the missing empty columns in the rows at the end
    3. I got the db_codes and I replaced the initial_stage_fellow_eye_areds (2 with J=DRY-2, etc.) in the commonview (2008/02/05).
  2. 2008/02/15 I merged different columns to get an integrated diagnosis. It depends now on the firsteyewithamd, etc. See the AmdConsortium Welcome page.
  • London
  1. I got the data from Andrew Webster as Excel file corresponding to his Access database.
  2. I removed names and birthdates
  3. Guillaume created a local SQL database for these level2 data.
  4. He created also a level1 database (as defiened by Tuebingen) with the data from Montpellier and Tuebingen.
  5. I'll extract the London data to create a csvt table (not yet done)
  • Southampton
  1. Angela sent the .xls file
  2. Raymond upload it in AmdConsortium Gallery
  3. Open Excel
  4. Delete the 'nearly empty' columns on the right
  5. Rename the duplicated column Project no.2
  6. Save as .csv
  7. Integrate it as csvt27
  8. Modify birthday to keep only the year. And some dates were written as dd.mm.yy
  9. I merged the cohort, diag_dry, diag_wet_amd, amd_, consolidated_areds in one value (2008/02/05)
Personal tools