Chemical Sciences

Genome Analysis Data Platform

A mid-sized pharmaceutical company focused on developing and commercializing innovative therapies and drugs for rare genetic disorders.

Business Goals

  • Develop a large-scale whole-genome genotyping and annotation platform.
  • Achieve end-to-end processing of one sample in hours instead of the usual days or weeks.
  • Ensure latency under 2 seconds for analytical queries.

Challenge

  • Hundreds of TB of data to ingest, transform, store, and analyze.
  • Dozens of conditional steps that the workflow itself has to manage.
  • Parallel execution, management, and monitoring of hundreds of high-load computational tasks.
  • High autoscaling demands due to spiky loads.
  • Fast analytical SQL queries over hundreds of GB of semi-structured genomic data.
  • Cost-effective and secure storage of hundreds of TB of data.
  • User-friendly web interface for scientists.

Results

  • A data platform for rapid WGS genotyping, annotation, and analysis was delivered.
  • Sequencing reads from thousands of patients have been collected, preprocessed, stored, and analyzed.
  • The correlation of genetic factors with disease expression and drug response was studied and contributed to the development of vaccines (under NDA).

Implementation Details

  • Raw reads generated by Illumina sequencers were uploaded to object storage and processed in a multi-step workflow to identify variation in a biological sample relative to a standard reference genome. The resulting variants were combined with other information to identify genomic variants highly correlated with the disease and drug response.
  • The workflow included steps such as variant calling (DeepVariant), genome annotation (VEP and LOFTEE), variant classification (CADD and DANN deep learning extension), and phenotype-to-genotype correspondence.
  • Apache Airflow was used for authoring, scheduling, and monitoring the workflows. Computational genetics algorithms and their environments were containerized (Docker) and pushed to a cloud container registry. Airflow launched a separate GPU-enabled compute environment for each task, monitored the execution status, and automatically tore the environment down as soon as the task completed.
  • Raw WGS reads (FASTQ) were transformed into structured formats (VCF, annotated VCF) and loaded into a columnar OLAP data warehouse.
  • Lifecycle policies were implemented to automatically separate artifacts into “hot” (used in analytics) and “cold” (not currently used, but possibly needed for audit) storage.
  • The web portal was designed, implemented, and deployed to a serverless, auto-scaling cloud compute engine.
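
As a rough illustration of the step that turns VCF output into rows for a columnar warehouse, the sketch below flattens VCF records into one dictionary per alternate allele. It is not the platform's actual loader: the function names (parse_info, vcf_to_rows), the output column names, and the sample records are all hypothetical, and it uses only the Python standard library. Per-allele INFO values are not split across alleles here, which a production loader would need to handle.

```python
from typing import Dict, Iterable, Iterator

# The eight fixed columns defined by the VCF format.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> Dict[str, str]:
    """Split an INFO field (key=value;FLAG;...) into a dict; flags map to 'true'."""
    if info == ".":
        return {}
    result: Dict[str, str] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else "true"
    return result


def vcf_to_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat row per ALT allele, skipping header and empty lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
        info = parse_info(record["INFO"])
        # Multi-allelic sites become several rows (hypothetical design choice).
        for alt in record["ALT"].split(","):
            yield {
                "chrom": record["CHROM"],
                "pos": int(record["POS"]),
                "ref": record["REF"],
                "alt": alt,
                "qual": None if record["QUAL"] == "." else float(record["QUAL"]),
                "filter": record["FILTER"],
                "info": info,
            }


# Made-up records for illustration only.
SAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=30;DB",
    "chr2\t67890\trs1\tC\tT\t.\tPASS\t.",
]

rows = list(vcf_to_rows(SAMPLE_VCF))
for row in rows:
    print(row)
```

Emitting one row per alternate allele keeps each row addressable by (chrom, pos, ref, alt), which tends to suit SQL filtering in a columnar store; whether the delivered platform used this layout is not stated in the case study.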