Genome Analysis Data Platform
Business Goals
- Develop a large-scale whole-genome genotyping and annotation platform.
- Achieve end-to-end processing of one sample in hours instead of the usual days or weeks.
- Keep latency for analytical requests under 2 seconds.
Challenge
- Hundreds of TB of data to ingest, transform, store, and analyze.
- Dozens of conditional workflow steps, to be managed by the workflow itself.
- Parallel execution, management, and monitoring of hundreds of high-load computational tasks.
- High autoscaling demands due to spiky loads.
- Fast analytical queries in SQL format over hundreds of GB of semi-structured genomic data.
- Cost-effective and secure storage of hundreds of TB of data.
- User-friendly web interface for scientists.
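The conditional steps and parallel tasks listed above can be illustrated with a minimal, standard-library sketch. It is not the project's orchestration code: the step graph, the `qc_report` step, and the `run_workflow` and `should_run` names are assumptions made for illustration. Steps run in dependency order, steps that become ready together are submitted to a thread pool together, and a step is skipped when its condition is false or a dependency did not succeed.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical step graph: each step maps to the set of steps it depends on.
STEPS = {
    "variant_calling": set(),
    "annotation": {"variant_calling"},
    "qc_report": {"variant_calling"},  # hypothetical step, added to show a parallel branch
    "classification": {"annotation"},
    "phenotype_matching": {"classification"},
}


def run_workflow(steps, actions, should_run, max_workers=4):
    """Run steps in dependency order and return a {step: status} dict.

    A step is "skipped" if should_run() returns False for it or if any of
    its dependencies did not finish with "success".
    """
    status = {}
    sorter = TopologicalSorter(steps)
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()
            futures = {}
            for step in ready:
                deps_ok = all(status[dep] == "success" for dep in steps[step])
                if deps_ok and should_run(step, status):
                    futures[step] = pool.submit(actions[step])
                else:
                    status[step] = "skipped"
            # Wait for the submitted batch and record each outcome.
            for step, future in futures.items():
                try:
                    future.result()
                    status[step] = "success"
                except Exception:
                    status[step] = "failed"
            sorter.done(*ready)
    return status


# Placeholder no-op tasks; a real task would launch a containerized job.
actions = {name: (lambda: None) for name in STEPS}
# Example condition: do not run classification, so its downstream step is skipped too.
status = run_workflow(STEPS, actions, lambda step, done: step != "classification")
print(status)
```

In this run `classification` is skipped by its condition and `phenotype_matching` is skipped because its dependency did not succeed, while the other steps complete.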
Results
- A data platform for rapid whole-genome sequencing (WGS) genotyping, annotation, and analysis was delivered.
- Sequencing reads from thousands of patients were collected, preprocessed, stored, and analyzed.
- Correlation of genetic factors with disease expression and drug response was studied and contributed to the development of vaccines (under NDA).
Implementation Details
- Raw reads generated by Illumina sequencers were uploaded to object storage and processed in a multi-step workflow to identify the variation in a biological sample relative to a reference genome. The resulting variants were combined with other information to identify genomic variants highly correlated with the disease and drug response.
- The workflow included steps such as variant calling (DeepVariant), genome annotation (VEP and LOFTEE), variant classification (CADD and the DANN deep learning extension), and phenotype-to-genotype correspondence.
- Apache Airflow was used for authoring, scheduling, and monitoring the workflows. Computational genetics algorithms and their environments were containerized (Docker) and pushed to a cloud container registry. Airflow started a separate GPU-equipped compute environment for each task, monitored the execution status, and automatically tore the environment down as soon as the task completed.
- Raw WGS reads (FASTQ) were transformed into structured formats (VCF, annotated VCF) and loaded into the columnar OLAP data warehouse.
- Lifecycle policies were implemented for automatic separation of the artifacts into “hot” (used in analytics) and “cold” (not used currently, but may be needed for audit).
- The web portal was designed, implemented, and deployed to a serverless, auto-scaling cloud compute engine.
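As a rough illustration of the step that turns VCF output into warehouse rows for SQL analytics, the sketch below parses a tiny made-up VCF fragment and loads it into a table. Everything specific here is an assumption: SQLite stands in for the unnamed columnar OLAP warehouse, the table layout and the `parse_vcf` and `load_variants` helpers are hypothetical, and the `GENE` INFO key is a simplification (real annotated VCFs carry richer annotation fields).

```python
import sqlite3

# Made-up VCF fragment for illustration only (not real patient data).
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;GENE=GENE_A\n"
    "chr1\t22222\t.\tC\tT,G\t12\tLowQual\tDP=8;GENE=GENE_A\n"
    "chr2\t33333\t.\tG\tA\t99\tPASS\tDP=41;GENE=GENE_B\n"
)


def parse_vcf(text):
    """Yield one dict per (record, ALT allele), skipping header lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts, qual, flt, info = line.split("\t")[:8]
        # INFO is a semicolon-separated list; keep only key=value entries.
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        for alt in alts.split(","):  # split multi-allelic records into rows
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt,
                "qual": None if qual == "." else float(qual),
                "filter": flt,
                "depth": int(info_map.get("DP", 0)),
                "gene": info_map.get("GENE"),  # hypothetical annotation key
            }


def load_variants(conn, rows):
    """Create the (hypothetical) variants table if needed and insert rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, "
        "qual REAL, filter TEXT, depth INTEGER, gene TEXT)"
    )
    conn.executemany(
        "INSERT INTO variants VALUES "
        "(:chrom, :pos, :ref, :alt, :qual, :filter, :depth, :gene)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_variants(conn, parse_vcf(VCF_TEXT))

# Example analytical query: passing variants per gene.
passing_per_gene = conn.execute(
    "SELECT gene, COUNT(*) FROM variants "
    "WHERE filter = 'PASS' GROUP BY gene ORDER BY gene"
).fetchall()
print(passing_per_gene)
```

Splitting multi-allelic records into one row per alternate allele keeps each row a single variant, which tends to make per-variant filters and aggregations simpler to express in SQL.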
Industry
Service
Type
- Case Study
Keywords
- Genetics
- Multi-Omics
- Pharmaceuticals and Biotech
Roadmap
- Business Goal Validation: Solutions Architect
- Solution Design: Solutions Architect
- OLAP Warehouse Design: Data Architect, Bioinformatician
- Web Service Design: UI/UX Designer, Frontend Architect
- Sample Data Collection: Data Architect, Data Engineer
- Genotyping Pipeline Development: Data Engineer, Bioinformatician
- Annotation Pipeline Development: Data Engineer, Bioinformatician
- Workflow Orchestration: Data Engineer
- Web Development: Frontend Developer
- Backend Development: Backend Developer
- Analytics Engine Development: Data Architect, Data Engineer, Bioinformatician
- Infrastructure Deployment: DevOps
- Testing: Bioinformatician
- Bugfix & Updates: Data Engineer, Frontend Developer, Backend Developer
- Deployment Automation: DevOps
- Setting up CI/CD: DevOps
- Documentation & Knowledge Transfer
- Release