Whole Genome Data

AMP PD includes whole genome sequencing data for most study participants. All sequencing was performed by Macrogen and USUHS using the Illumina HiSeq XTen sequencer with samples coming from whole blood.

Quality control of sequenced data was performed by Hampton and Hirotaka from Datatecnica.

 

Data Processing

Data processing was performed on the Google Cloud Platform. All data processing was performed against Build 38 of the Human Genome reference (GRCh38DH, 1000 Genomes Project version).


Single Sample Processing

FASTQs were processed using the Broad Institute's implementation of the Functional Equivalence Pipeline to produce alignments (output as CRAM files) and variant calls (output as gVCF files).


Joint Genotyping

After single sample processing was completed, joint genotyping was performed on the gVCF files using the Broad Institute's Joint Genotyping pipeline.


Variant Annotations

Variant annotations add variant identifiers and gene identifiers to the joint genotyped variants. Annotations were generated using the Variant Effect Predictor (VEP). The annotation fields are listed on the WGS Variant Effect Predictor Fields page.

 

Whole Genome Sequencing Methodology: Alignment approach; Variant calling approach; Variant annotation

Processed WGS Totals

Cohort | Control, No Mutations | Control, With Mutations | Case, No Mutations | Case, With Mutations
BioFIND | 69 | 1 | 90 | 9
HBS | 217 | 10 | 585 | 55
PDBP | 454 | 26 | 772 | 88
PPMI | 194 | 373 | 420 | 364
Total | 934 | 410 | 1867 | 516

Processed WGS Totals - Alternative View

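Cohort | Recruitment | Case (PD), Mutation + | Case (PD), Mutation - | Control (HC), Mutation + | Control (HC), Mutation -
BioFIND | | 9 | 90 | 1 | 69
HBS | | 55 | 585 | 10 | 217
PPMI | Original | 47 | 365 | 10 | 184
PPMI | SWEDD | 0 | 0 | 0 | 0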

 

WGS Data Dictionary

If you want to download a version of the full AMP PD Whole Genome Sequencing Data Dictionary, click one of the buttons below for a specific format. 

Data Availability

The following are available to registered researchers at this time:

Per sample
  • CRAM and gVCF files: gs://amp-pd-genomics/single_sample/wgs

Joint genotyped and annotated variants
  • VCF files: gs://amp-pd-genomics/releases/2019_v1beta_genomics
  • BigQuery table: amp-pd-research:2019_v1beta_genomic


As part of the continuous improvement of the AMP PD Knowledge Platform, AMP PD expects to provide variant calls and individual genotypes based on whole genome sequence data provided by the AMP PD consortium for the previously sequenced and de-identified DNA samples. AMP PD will provide electronic access to TOPMed sequence data stored in Google Cloud in the near future.

 


WGS Workflow Overview & Execution

Cromwell: execution engine from the Broad Institute. Runs workflows written in the Workflow Description Language (WDL)

MySQL: database of submitted, running, and completed jobs  

Cromwell Workspace: Directory in Google Cloud Storage used by Cromwell to communicate with workflow tasks

Pipelines API: <todo>

The following steps detail the process of turning FASTQs into CRAMs and gVCFs, which uses two workflows from the Broad:

  1. Operator submits a workflow request to Cromwell's REST API, which listens on port 8000
  2. Cromwell creates a subdirectory in gs://<bucket>/cromwell_executions for each workflow
  3. Repeat until workflow completes
    • Cromwell creates a task-specific directory in the workflow directory and populates it with the script to run
    • Cromwell calls the Pipelines API to launch a VM to run the step
    • Pipelines API downloads input files, executes the task, and writes outputs back to the task-specific directory in the workflow.
    • Cromwell gets "job status" information both from the Pipelines API and the task-specific directory of the workflow
  4. Operator copies outputs to "final" location
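
The "repeat until workflow completes" loop in step 3 can also be followed from the operator's side by polling Cromwell for the workflow status. The following is a minimal sketch, not the AMP PD tooling. It assumes Cromwell's standard status endpoint (`GET /api/workflows/v1/{id}/status`) on the port 8000 mentioned above. The host name, the function names, and the injectable `fetch_status` and `sleep` parameters are hypothetical and exist only to make the sketch testable without a running server.

```python
import json
import time
import urllib.request

# Assumed base URL: the text only states that Cromwell listens on port 8000.
CROMWELL_URL = "http://localhost:8000/api/workflows/v1"

# Workflow states after which Cromwell does no further work.
TERMINAL_STATES = {"Succeeded", "Failed", "Aborted"}


def fetch_status_http(workflow_id, base_url=CROMWELL_URL):
    """Ask Cromwell for the current status of one workflow."""
    with urllib.request.urlopen(f"{base_url}/{workflow_id}/status") as resp:
        return json.load(resp)["status"]


def wait_for_workflow(workflow_id, fetch_status=fetch_status_http,
                      poll_seconds=60, sleep=time.sleep):
    """Poll until the workflow reaches a terminal state, then return it."""
    while True:
        status = fetch_status(workflow_id)
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

Once `wait_for_workflow` returns "Succeeded", the operator would proceed to step 4 and copy the outputs to their final location.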

WGS Workflow Overview

WGS Quality Control Process

Hampton and Hirotaka from Datatecnica have performed QC analysis on 4,047 AMP PD WGS samples. This analysis has included:

  • Sample quality
    • Contamination (fail at FREEMIX >= 3%)
    • Coverage (fail at mean coverage < 25)
    • WGS metric outliers (fail at TiTv < 2)
    • Missingness (fail at per-sample missing genotype rate > 5%)
  • Genetic data checks
    • Duplication check
    • Concordance against NeuroX data
    • Clinically reported sex
    • Excessive heterozygosity
    • Clinically reported race/ethnicity

WGS QC Process

Sample Quality Checks

  1. Contamination - Some samples show clear signs of contamination as reported by VerifyBamID. Contaminated samples were removed from AMP PD.
    Pass/Fail Criteria: Samples fail at VerifyBamID FREEMIX >= 0.03
  2. Read Coverage - In sequencing experiments, some samples have low mean coverage. Data from the run associated with such a sample are excluded from joint calling and flagged for wet laboratory follow-up.
    Pass/Fail Criteria: Samples pass at Mean_Coverage >= 25 reads per variant
  3. WGS metric outliers - Low transition/transversion ratio (TiTv)
    Pass/Fail Criteria: Samples fail at TiTv values < 2 based on dbSNP variants
  4. Missingness - Refers to missing genotype rates per sample
    Pass/Fail Criteria: Samples fail at > 5% missingness (a rate of 0.05)
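
The four sample quality criteria above can be read as one filter applied per sample. The sketch below is illustrative only and is not the code used for AMP PD QC. The `SampleMetrics` record, its field names, and the function name are hypothetical, and it assumes that samples fail on FREEMIX >= 0.03, mean coverage < 25, TiTv < 2, or a missing genotype rate above 0.05 (5%, as given in the QC overview list).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleMetrics:
    """Hypothetical per-sample QC record; field names are illustrative."""
    sample_id: str
    freemix: float        # VerifyBamID FREEMIX, as a fraction
    mean_coverage: float  # mean coverage for the sample
    titv: float           # transition/transversion ratio
    missingness: float    # fraction of missing genotypes for the sample


def failed_checks(m: SampleMetrics) -> List[str]:
    """Return the names of the sample quality checks this sample fails."""
    failures = []
    if m.freemix >= 0.03:      # contamination
        failures.append("contamination")
    if m.mean_coverage < 25:   # low read coverage
        failures.append("coverage")
    if m.titv < 2:             # low TiTv ratio
        failures.append("titv")
    if m.missingness > 0.05:   # assumed to mean 5% missing genotypes
        failures.append("missingness")
    return failures
```

A sample passes sample quality QC when `failed_checks` returns an empty list.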

Genetic Data Checks

  1. Duplication check - Some samples matched their NeuroX data, but also matched another WGS sample (which matched its NeuroX data). This indicates that the same individual has been included in AMP PD more than once. Some samples had no NeuroX data to match against, but matched another WGS sample. This indicates either that the same individual has been included in AMP PD more than once, or that one of the samples was mislabeled. The higher quality WGS samples were used in joint genotyping; the lower quality WGS samples were made available in AMP PD but excluded from joint genotyping.
    Pass/Fail Criteria: KING software relatedness = dup/MZTwin
  2. Concordance against NeuroX data - For some samples, there was NeuroX data available, but the WGS sample did not match this NeuroX data based on rates of genotype concordance. The WGS data is a superset of the NeuroX data, so samples with only WGS were not included in this phase of analysis. Discordant samples were removed from AMP PD, as this suggests a problem with the DNA itself.
    Pass/Fail Criteria: KING software relatedness != dup/MZTwin
  3. Clinically reported sex - Sex estimated from WGS was checked against self-reported sex. Discordant sex generally suggests a sample mix-up. These samples were removed from joint calling, and the biological data used for the assay and for further assays were flagged for caution going forward.
    Pass/Fail Criteria: Samples fail when reported male but genetically female, or reported female but genetically male; blanks are ignored
  4. Excessive heterozygosity - For each sample, observed and expected autosomal homozygous genotype counts are computed and a method-of-moments F coefficient estimate is reported, i.e. F = (<observed hom. count> - <expected count>) / (<total observations> - <expected count>).
    Pass/Fail Criteria: Samples fail at F > 0.15 or F < -0.15
  5. Clinically reported race/ethnicity - Identifies samples from subjects who reported white but are genetically admixed, or who reported multiracial but are genetically European. Ancestry outliers are determined using PCA and comparison to HapMap samples. Any sample within plus/minus 6 standard deviations of a population's mean in PC1 and PC2 is considered to be part of that population genetically.

    Excluded:
      • Genetically inferred African or Asian = clinically reported “white”
      • Genetically inferred European = clinically reported “african/asia”
    Flagged:
      • Genetically inferred Admix = anything other than “mixed race”
      • Clinically reported “unknown”
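
Two of the genetic data checks above are simple arithmetic and can be illustrated directly: the F coefficient from check 4 and the 6 standard deviation ancestry window from check 5. This is a minimal sketch under stated assumptions, not the AMP PD QC code. The function names and arguments are hypothetical, and using the sample standard deviation of the reference population's PC values is an assumption the source does not specify.

```python
import statistics


def f_coefficient(observed_hom, expected_hom, total_obs):
    """Method-of-moments F: (observed - expected) / (total - expected)."""
    return (observed_hom - expected_hom) / (total_obs - expected_hom)


def fails_f_check(f, threshold=0.15):
    """A sample fails when F lies outside the range -0.15 to +0.15."""
    return abs(f) > threshold


def within_population(pc1, pc2, ref_pc1, ref_pc2, n_sd=6.0):
    """True if the sample lies within n_sd standard deviations of the
    reference population mean on both PC1 and PC2.

    ref_pc1 and ref_pc2 hold the PC values of the reference samples
    (for example, one HapMap population).
    """
    for value, ref in ((pc1, ref_pc1), (pc2, ref_pc2)):
        mean = statistics.mean(ref)
        sd = statistics.stdev(ref)  # assumption: sample standard deviation
        if abs(value - mean) > n_sd * sd:
            return False
    return True
```

For example, a sample with 900 observed homozygous genotypes, 800 expected, and 1000 total observations has F = 100 / 200 = 0.5 and would fail the check.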