AMP PD Use Cases

Quality Control Using Terra & Jupyter Notebooks

"Cloud computing allowed us to speed up the quality control process.  We collaborated with Verily and the Broad Institute and used the Terra platform to both implement standard processing pipeline for whole genome sequence data and perform the quality controls needed for us to be confident in our data.  We expect researchers will find these and other notebooks available useful.  We look forward to future collaborations and seeing what tools the community creates for the analysis of this data."

- Lead WGS WG Scientist

The AMP PD public-private partners worked together through the AMP PD Whole Genome Sequencing Working Subgroup (WGS WG) to perform quality control (QC), ensuring that the thousands of whole genome sequences they generated met the inclusion criteria for release on the AMP PD Knowledge Platform.

After a due diligence process, the WGS WG elected to carry out this QC work in Terra and Jupyter Notebooks, which enabled fast collaboration on the analysis of the 4,047 genomes.

As part of the QC, the WGS WG scientists performed concordance checks with NeuroX data and with the gender identified in associated clinical data. As a result of this effort, the following Jupyter notebooks were created and can now be shared with others in the research community to ensure transparency and reproducibility.
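The concordance check against clinically reported gender can be illustrated with a minimal sketch. This is not the WGS WG's notebook code. It assumes, purely for illustration, that sex is inferred from an X-chromosome inbreeding coefficient (F) using PLINK-style cutoffs (F below 0.2 called female, F above 0.8 called male). The function names, the `SexCheckResult` type, and the sample IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SexCheckResult:
    sample_id: str
    reported: str    # value recorded in the clinical data
    inferred: str    # value inferred from the genome
    concordant: bool


def infer_sex_from_f(f_stat: float,
                     female_max: float = 0.2,
                     male_min: float = 0.8) -> str:
    """Classify a sample from its X-chromosome F statistic.

    The cutoffs are assumed, PLINK-style defaults; a real QC run
    would tune them against the observed distribution.
    """
    if f_stat < female_max:
        return "female"
    if f_stat > male_min:
        return "male"
    return "ambiguous"


def check_concordance(reported: Dict[str, str],
                      x_f_stats: Dict[str, float]) -> List[SexCheckResult]:
    """Compare reported and inferred values for samples present in both inputs."""
    results = []
    for sample_id in sorted(reported.keys() & x_f_stats.keys()):
        inferred = infer_sex_from_f(x_f_stats[sample_id])
        results.append(SexCheckResult(
            sample_id=sample_id,
            reported=reported[sample_id],
            inferred=inferred,
            # Ambiguous calls never match, so they are flagged for review.
            concordant=(inferred == reported[sample_id]),
        ))
    return results


def discordant_samples(results: List[SexCheckResult]) -> List[str]:
    """Return the IDs of samples whose inferred value disagrees with the record."""
    return [r.sample_id for r in results if not r.concordant]
```

Samples returned by `discordant_samples` would be candidates for manual review or exclusion. Treating ambiguous calls as discordant is a conservative design choice in this sketch, not something stated by the working group.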

Accelerating Discovery Through Cloud Computing

"Cloud computing allowed us to speed up discovery. We collaborated with verily and the Broad Institute to test varying implementations of the standard processing pipeline for exome sequence data on the cohort and population scale."

- AMP PD Collaborator

To make real scientific discoveries possible from so many sources of data, the data had to be reanalyzed for consistency. To reduce the possibility of technical artifacts, scientists had to perform realignment, recalibration, and re-genotyping of the exomes. But there was a problem: none of the consortium members had enough local computational resources.

The team decided to use a fully managed service on the Google Cloud Platform. Scientists ran the Broad Institute’s GATK Best Practices pipeline using Google Genomics, processing the exomes from raw, unaligned sequence data through to a set of variant calls in just three and a half weeks. The dataset was subsequently used to identify six new risk loci for Parkinson’s disease, helping scientists better understand genetic risks for the disease.

Even if hardware could have been procured, the effort would have taken months of compute time using local infrastructure. With the Google Cloud Platform, massive datasets can now be analyzed, giving scientists access to virtually unlimited compute resources for large-scale projects.