Genomics Data Pipeline

A serverless, event-driven AWS pipeline for processing genomics VCF files, transforming them into Parquet, and enabling analytics with Athena, built using Terraform.

📖 Project Overview

The Genomics Data Pipeline is a scalable, secure, and automated AWS-based solution designed to process Variant Call Format (VCF) files, transform them into Parquet format, and make them queryable via AWS Athena. It leverages a serverless architecture with AWS Lambda, Glue, and S3, orchestrated by Terraform for infrastructure-as-code (IaC) deployments. The pipeline validates VCF files, processes them into a structured format, and updates a Glue Data Catalog for analytics.

Key features:

  • VCF Processing: Validate and transform VCF files into Parquet using AWS Glue.
  • Event-Driven Trigger: Automatically process files uploaded to S3 via Lambda.
  • Analytics: Query processed data using Athena with results stored in S3.
  • Secure Storage: Use KMS-encrypted S3 buckets for raw and processed data.
  • IaC Automation: Deploy and manage infrastructure with Terraform.

🛠️ Tech Stack

Technology  | Purpose
AWS Lambda  | Validates VCF files and triggers Glue jobs
AWS Glue    | Transforms VCF files to Parquet and updates Data Catalog
AWS Athena  | Queries processed genomics data
AWS S3      | Stores raw VCF files, processed Parquet files, and Athena query results
AWS KMS     | Encrypts S3 buckets for secure storage
AWS IAM     | Manages permissions for Lambda, Glue, and Athena
Terraform   | Infrastructure as Code for provisioning AWS resources
GitHub      | Version control for Terraform configurations

🏗️ Architecture

The pipeline follows a modular, serverless, and event-driven architecture:

  • Input: S3 bucket (genomics-raw-data-bucket) stores uploaded VCF files and triggers Lambda via S3 notifications.
  • Processing:
    • Lambda Function (validate-vcf): Validates the VCF file format and triggers the Glue job, handling Glue concurrency limits (a sketch follows this list).
    • Glue Job (genomics-vcf-etl-job): Reads VCF files, transforms them to Parquet, and partitions the output by CHROM (a PySpark sketch follows the architecture diagram below).
  • Storage: Processed Parquet files are stored in s3://genomics-processed-data-bucket/processed/.
  • Cataloging: Glue Crawler (genomics-raw-data-bucket-crawler) updates the Data Catalog (genomics_vcf_database.vcf_data).
  • Analytics: Athena queries the cataloged data, storing results in s3://genomics-processed-data-bucket/athena-queries/.
  • Security: KMS encryption secures S3 buckets, and IAM roles restrict access.
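The Lambda source isn't reproduced in this README, so the block below is only a minimal sketch of what validate-vcf might look like, assuming a Python 3.9 handler that checks the uploaded object's extension and #CHROM header line and then starts the Glue job via boto3. The GLUE_JOB_NAME environment variable and the --input_path job argument are illustrative, not taken from the repo.

```python
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Hypothetical environment variable; the real function may be configured differently.
GLUE_JOB_NAME = os.environ.get("GLUE_JOB_NAME", "genomics-vcf-etl-job")


def handler(event, context):
    """Triggered by S3 ObjectCreated notifications on the raw-data bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Basic validation: only .vcf files containing a #CHROM header line are processed.
        if not key.lower().endswith(".vcf"):
            print(f"Skipping non-VCF object: {key}")
            continue

        head = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-4095")
        first_lines = head["Body"].read().decode("utf-8", errors="replace").splitlines()
        if not any(line.startswith("#CHROM") for line in first_lines):
            print(f"Rejecting {key}: missing #CHROM header line")
            continue

        # Hand the validated file off to the Glue ETL job.
        run = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
        print(f"Started Glue job: {run['JobRunId']}")
```

The real function also deals with Glue concurrency limits, as noted above; one common pattern for that is sketched under Lessons Learned.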

Architecture Diagram
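Likewise, the Glue ETL script lives in the repo's Glue scripts bucket rather than in this README, so the PySpark block below is only a rough sketch of what genomics-vcf-etl-job could do, assuming the VCF starts with its #CHROM header line (as the test file in step 6 of the setup does) and that the job receives hypothetical input_path/output_path arguments.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job arguments; the input_path/output_path names are illustrative.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the VCF as tab-separated text, using the '#CHROM ...' line as the header.
df = (
    spark.read
    .option("sep", "\t")
    .option("header", True)
    .csv(args["input_path"])
)

# Spark keeps the leading '#' on the first column name; strip it so the
# partition column is plain CHROM.
df = df.withColumnRenamed("#CHROM", "CHROM")

# Write Parquet partitioned by chromosome for efficient Athena queries.
df.write.mode("overwrite").partitionBy("CHROM").parquet(args["output_path"])

job.commit()
```

Partitioning by CHROM means Athena only scans the chromosomes a query actually filters on, which keeps per-query scan costs down as the dataset grows.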

📂 Repository Structure

genomics-data-pipeline/
├── terraform/
│    ├── main.tf                   # Root Terraform configuration
│    ├── variables.tf              # Root input variables
│    ├── outputs.tf                # Root output values
│    ├── terraform.tfvars          # Terraform variable definitions
│    ├── s3_notification.tf        # S3 bucket notification configuration
│    ├── modules/                  # Terraform modules
│    │    ├── s3/                  # S3 buckets for input, output, and Glue scripts
│    │    ├── lambda/              # Lambda function for VCF validation
│    │    ├── iam/                 # IAM roles for Lambda and Glue
│    │    ├── glue/                # Glue job and crawler for processing and cataloging
│    │    └── athena/              # Athena database and workgroup
├── README.md                      # Project documentation
├── .gitignore                     # Git ignore rules
├── test.vcf                       # Sample VCF file used for pipeline testing
└── LICENSE                        # MIT License

🚀 Setup Instructions

Follow these steps to set up the project locally:

Prerequisites

  • AWS Account with programmatic access (Access Key and Secret Key).
  • Terraform v1.5.7 or later installed (terraform -version).
  • Git installed (git --version).
  • AWS CLI v2 installed and configured (aws configure).
  • Python 3.9 for Lambda and Glue scripts.

Steps

  1. Clone the Repository

    • Clone the repo locally:
      git clone https://github.com/ankurshashwat/genomics-data-pipeline.git
      cd genomics-data-pipeline
  2. Configure AWS Credentials

    • Set up AWS CLI with your credentials:
      aws configure
      • Provide Access Key, Secret Key, region (us-east-1), and output format (json).
  3. Initialize Terraform

    • Initialize the Terraform working directory (the configuration lives in terraform/):
      cd terraform
      terraform init
  4. Set Up Terraform Variables

    • Create a terraform.tfvars file in the terraform/ directory:
      aws_region           = "us-east-1"
      input_bucket_name    = "genomics-raw-data-bucket"
      output_bucket_name   = "genomics-processed-data-bucket"
      lambda_function_name = "validate-vcf"
      glue_job_name        = "genomics-vcf-etl-job"
      glue_crawler_name    = "genomics-raw-data-bucket-crawler"
      • A random suffix (e.g., klc6) is appended to the bucket names so they are globally unique.
  5. Deploy Infrastructure

    • Run Terraform plan to preview changes:
      terraform plan
    • Apply the configuration:
      terraform apply --auto-approve
    • Outputs (e.g., input_bucket_name, output_bucket_name) will be displayed.
  6. Prepare Test Data

    • Create a test.vcf file in the project root:
      printf '#CHROM\tPOS\tREF\tALT\tQUAL\tsample1\tsample2\n' > test.vcf
      printf 'chr1\t1002345\tA\tG\t45.7\t0/1\t1/1\n' >> test.vcf
      printf 'chr2\t2003456\tC\tT\t50.2\t0/0\t0/1\n' >> test.vcf
      • printf with quoted \t escapes keeps the fields tab-separated, as VCF requires; a bare echo would collapse the tabs and treat #CHROM as a shell comment.
  7. Test the Pipeline

    • Upload test.vcf to the input bucket:
      aws s3 cp test.vcf s3://<input_bucket_name>/test.vcf --region us-east-1
    • Verify Lambda trigger:
      aws logs tail /aws/lambda/validate-vcf --since 10m --region us-east-1
      • Look for Started Glue job: <JobRunId>.
    • Check Glue job status:
      aws glue get-job-runs --job-name genomics-vcf-etl-job --region us-east-1
    • Verify Parquet output:
      aws s3 ls s3://<output_bucket_name>/processed/ --recursive --region us-east-1
    • Run Glue crawler:
      aws glue start-crawler --name genomics-raw-data-bucket-crawler --region us-east-1
    • Query data in Athena:
      aws athena start-query-execution \
        --query-string "SELECT * FROM genomics_vcf_database.vcf_data LIMIT 10;" \
        --work-group genomics_workgroup \
        --region us-east-1
      • Check results once the query has finished (see the polling sketch after these steps):
        aws athena get-query-results --query-execution-id <QueryExecutionId> --region us-east-1
  8. Clean Up

    • Destroy resources to avoid costs:
      terraform destroy --auto-approve
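A note on step 7: start-query-execution returns as soon as the query is submitted, so get-query-results fails until the query reaches a terminal state. If you script the check, a small boto3 poll like the sketch below (not part of the repo; it reuses the workgroup and database names from this README) avoids asking for results too early.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the same query as in step 7.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM genomics_vcf_database.vcf_data LIMIT 10;",
    WorkGroup="genomics_workgroup",
)
query_id = execution["QueryExecutionId"]

# Poll until Athena reports a terminal state (SUCCEEDED, FAILED, or CANCELLED).
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"Query ended in state {state}")
```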

🧪 Testing and Validation

  • Infrastructure: Verified S3 buckets, Lambda, Glue job, crawler, and Athena workgroup in AWS Console.
  • Lambda Trigger: Confirmed S3 notifications trigger validate-vcf Lambda function.
  • Glue Processing: Validated Parquet output for test.vcf in s3://<output_bucket_name>/processed/.
  • Athena Queries: Executed queries in genomics_workgroup and verified results in s3://<output_bucket_name>/athena-queries/.
  • Security: Ensured KMS encryption and IAM roles restrict access appropriately.

📚 Lessons Learned

  • Serverless Pipelines: Mastered event-driven workflows with S3, Lambda, and Glue.
  • Terraform Modules: Structured reusable modules for S3, Lambda, Glue, and Athena.
  • Concurrency Management: Implemented Lambda logic to handle Glue job concurrency limits (one possible pattern is sketched below).
  • Athena Integration: Learned to configure query result locations and bucket policies for analytics.
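The concurrency handling itself isn't shown in this README, so the block below is just one illustrative pattern: catching Glue's ConcurrentRunsExceededException in the Lambda and re-raising it so the asynchronous S3 invocation is retried instead of the file being dropped. Function and argument names are hypothetical.

```python
import boto3

glue = boto3.client("glue")


def start_glue_job(job_name: str, input_path: str) -> str:
    """Start the ETL job, surfacing concurrency limits so the S3 event is retried."""
    try:
        run = glue.start_job_run(
            JobName=job_name,
            Arguments={"--input_path": input_path},
        )
        return run["JobRunId"]
    except glue.exceptions.ConcurrentRunsExceededException:
        # Too many runs of this job are already in flight. Re-raising makes the
        # asynchronous Lambda invocation fail, so the S3 event is retried later
        # rather than silently lost.
        print(f"Glue concurrency limit hit for {input_path}; deferring to Lambda retry")
        raise
```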

🚧 Future Improvements

  • Add SNS notifications for pipeline status updates.
  • Implement CloudWatch monitoring for Lambda and Glue job metrics.
  • Support larger VCF files with Glue job optimization.
  • Integrate QuickSight for data visualization.

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/your-feature).
  3. Commit changes with descriptive messages.
  4. Push to the branch (git push origin feature/your-feature).
  5. Open a pull request.

📬 Contact

📄 License

This project is licensed under the MIT License.
