
Business Goals

  • Develop a large-scale whole-genome genotyping and annotation platform.
  • Process one sample end to end in hours rather than the usual days or weeks.
  • Keep latency under 2 seconds for analytical requests.

Challenge

  • Hundreds of TB of data to ingest, transform, store, and analyze.
  • Dozens of conditional steps that the workflow itself has to manage.
  • Parallel execution, management, and monitoring of hundreds of high-load computational tasks.
  • High autoscaling demands due to spiky loads.
  • Fast analytical SQL queries over hundreds of GB of semi-structured genomic data.
  • Cost-effective and secure storage of hundreds of TB of data.
  • User-friendly web interface for scientists.

Results

  • A data platform for rapid WGS genotyping, annotation, and analysis was delivered.
  • Reads of thousands of patients have been collected, preprocessed, stored, and analyzed.
  • Correlation of genetic factors with disease expression and drug response was studied and contributed to the development of vaccines (under NDA).

Implementation Details

  • Raw reads generated by Illumina sequencers were uploaded to object storage and processed in a multi-step workflow to identify the variation in a biological sample relative to a standard reference genome. The resulting variants were combined with other information to identify genomic variants highly correlated with disease and drug response.
  • The workflow included steps such as variant calling (DeepVariant), genome annotation (VEP and LOFTEE), variant classification (CADD and its deep learning extension, DANN), and phenotype-to-genotype correspondence.
  • Apache Airflow was used for authoring, scheduling, and monitoring the workflows. Computational genetics algorithms and environments were containerized (Docker) and pushed to a cloud container registry. Airflow launched a separate GPU-equipped compute environment for each task, monitored the execution status, and automatically tore the environment down as soon as the task completed.
  • Raw WGS reads (FASTQ) were transformed into structured formats (VCF, annotated VCF) and loaded into the columnar OLAP data warehouse.
  • Lifecycle policies were implemented to automatically separate the artifacts into “hot” storage (used in analytics) and “cold” storage (not currently used, but possibly needed for audit).
  • The web portal was designed, implemented, and deployed to a serverless, auto-scaling cloud compute service.
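
The VCF-to-warehouse step above can be illustrated with a minimal sketch. It assumes that each VCF data line is flattened into one row per alternate allele before loading; the function names and the row layout are hypothetical and are not taken from the project. Only the eight fixed VCF columns are handled, and per-allele INFO values are not split.

```python
from typing import Dict, List, Union

InfoValue = Union[str, bool]


def parse_info(info: str) -> Dict[str, InfoValue]:
    """Parse a VCF INFO field: ';'-separated KEY=VALUE pairs and bare flags."""
    if info == ".":  # "." marks an empty INFO field
        return {}
    parsed: Dict[str, InfoValue] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # A key without "=" is a flag; record it as True.
        parsed[key] = value if sep else True
    return parsed


def vcf_line_to_rows(line: str) -> List[dict]:
    """Flatten one tab-separated VCF data line into one row per ALT allele.

    Hypothetical row layout, chosen for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_map = parse_info(info)
    rows = []
    for allele in alt.split(","):  # multi-allelic sites list ALT alleles with ","
        rows.append({
            "chrom": chrom,
            "pos": int(pos),
            "id": None if variant_id == "." else variant_id,
            "ref": ref,
            "alt": allele,
            "qual": None if qual == "." else float(qual),
            "filter": filt,
            "info": info_map,
        })
    return rows


if __name__ == "__main__":
    example = "chr1\t12345\trs1\tA\tG,T\t50.0\tPASS\tDP=100;DB"
    for row in vcf_line_to_rows(example):
        print(row)
```

One row per allele keeps each record flat, which tends to suit columnar storage; a real loader would also need to handle header lines, the FORMAT and sample columns, and annotation fields.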

Type

  • Case Study

Keywords

  • Genetics
  • Multi-Omics
  • Pharmaceuticals and Biotech

Roadmap

  • Business Goal Validation: Solutions Architect
  • Solution Design: Solutions Architect
  • OLAP Warehouse Design: Data Architect, Bioinformatician
  • Web Service Design: UI/UX Designer, Frontend Architect
  • Sample Data Collection: Data Architect, Data Engineer
  • Genotyping Pipeline Development: Data Engineer, Bioinformatician
  • Annotation Pipeline Development: Data Engineer, Bioinformatician
  • Workflow Orchestration: Data Engineer
  • Web Development: Frontend Developer
  • Backend Development: Backend Developer
  • Analytics Engine Development: Data Architect, Data Engineer, Bioinformatician
  • Infrastructure Deployment: DevOps
  • Testing: Bioinformatician
  • Bugfix & Updates: Data Engineer, Frontend Developer, Backend Developer
  • Deployment Automation: DevOps
  • Setting up CI/CD: DevOps
  • Documentation & Knowledge Transfer
  • Release
