From Pipeline to Product: Building a Genomic Data App on AWS
Introduction
Building on our previous articles—Genomics on AWS and Mapping Out Genomic Variation—this post shows how we turned a genomic pipeline into a simple web app. Users upload a VCF file, trigger a cloud-based analysis with Nextflow, and receive an email with a link to view and download their results. We explain how this was built to make complex bioinformatics workflows accessible and production-ready.
Make sure to check out the tool here!
What is Nextflow?
Nextflow is a workflow orchestration tool tailored for scalable and reproducible data analysis. Widely used in the bioinformatics community, it enables the automation of complex pipelines across different computational environments—whether local machines, high-performance clusters, or the cloud. In our application, Nextflow serves as the engine that runs the genomic analysis using the uploaded VCF file, making it an ideal fit for cloud-based, data-driven workflows.
What is a VCF File?
A VCF (Variant Call Format) file is a standard text file used in bioinformatics to store gene sequence variations, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants. Each entry in a VCF file describes a position in the genome and the variant observed at that location. This format is widely used in genomic studies and clinical research to represent individual or population-level genetic differences, making it a common input for downstream analysis pipelines like the one in our application.
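For illustration, here is what a minimal VCF file might look like, with meta-information lines, the tab-separated column header, and two variant records (the positions, IDs, and INFO values shown are illustrative):

```
##fileformat=VCFv4.2
##contig=<ID=chr1,length=248956422>
#CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO
chr1    10177  rs367896724  A    AC   100   PASS    AF=0.425
chr1    13116  rs62635286   T    G    100   PASS    AF=0.097
```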
Pipeline results
The results consist of the same types of graphs presented in the Conclusion section of the Mapping Out Genomic Variation article.
Architecture & Tech Stack
The application is built as a modern, serverless web platform. The front end is developed using React, styled with TailwindCSS, and bundled with Vite. It interacts with a backend API hosted on AWS API Gateway, with business logic handled by Python-based AWS Lambda functions.
User-uploaded VCF files are stored in Amazon S3, while job metadata and user information are persisted in Amazon DynamoDB. The genomic pipeline itself is orchestrated using Nextflow, which runs on AWS Batch. Once a job is completed, an email notification is sent to the user using Amazon SES, including a link to view the job details and download the results.
User Workflow
The application is designed to guide the user through a smooth, hands-off experience from file upload to result delivery. Here’s how the workflow unfolds:
- Drag & Drop: The user begins by dragging and dropping a VCF file into the web application’s interface.
- File Upload Preparation: The frontend requests a pre-signed URL from the backend via an API call, allowing the VCF file to be securely uploaded directly to Amazon S3 (see the pre-signed URL sketch after this list).
- File Upload: Using the pre-signed URL, the file is uploaded to S3.
- Job Submission: After the upload completes, the frontend makes a second API call to initiate the processing job. The backend responds with a unique job ID (UUIDv4) that can be used to track the job (a job-record sketch also follows the list).
- Starting the Job: The backend sends a remote command via AWS SSM to an EC2 instance with Nextflow installed. This command instructs the instance to start a new job on AWS Batch to run the Nextflow pipeline.
- Pipeline Execution: The Nextflow pipeline is executed on AWS Batch, using the uploaded VCF file stored in the S3 bucket as input.
- Result Handling: Once the pipeline finishes, the output report is stored back in S3.
- User Notification: The backend uses Amazon SES to send an email to the user, including a link to a job detail page on the frontend. From this page, the user can view job metadata and download the results via a pre-signed S3 URL (see the notification sketch below).
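As a sketch of the File Upload Preparation step above, the backend Lambda can mint a pre-signed PUT URL with boto3; the bucket and key names here are assumptions for illustration:

```python
import boto3

s3 = boto3.client("s3")

def create_upload_url(bucket: str, key: str) -> str:
    """Return a pre-signed URL the browser can PUT the VCF file to."""
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=900,  # link stays valid for 15 minutes
    )

# Usage: the frontend uploads the file with a plain HTTP PUT to this URL.
url = create_upload_url("vcf-uploads", "uploads/sample.vcf")
```

Because the browser talks to S3 directly, no AWS credentials ever reach the frontend, and the file bytes never pass through the Lambda.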
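The Job Submission step can be as small as minting a UUIDv4 and persisting a job record. A minimal sketch, assuming a DynamoDB table and attribute layout that are hypothetical here:

```python
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("genomic-jobs")  # hypothetical table name

def submit_job(email: str, vcf_key: str) -> str:
    """Create a trackable job record and return its ID to the frontend."""
    job_id = str(uuid.uuid4())
    jobs_table.put_item(Item={
        "job_id": job_id,    # returned to the user for tracking
        "email": email,
        "vcf_key": vcf_key,  # S3 key of the uploaded VCF file
        "status": "IN_PROGRESS",
    })
    return job_id
```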
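And for the User Notification step, a sketch of the SES call; the sender address and frontend URL are placeholders:

```python
import boto3

ses = boto3.client("ses")

def notify_user(email: str, job_id: str) -> None:
    """Email the user a link to the job detail page once results are in S3."""
    detail_url = f"https://example.com/jobs/{job_id}"  # hypothetical frontend route
    ses.send_email(
        Source="noreply@example.com",  # must be an SES-verified identity
        Destination={"ToAddresses": [email]},
        Message={
            "Subject": {"Data": "Your genomic analysis results are ready"},
            "Body": {"Text": {"Data": f"View and download your results: {detail_url}"}},
        },
    )
```

The download link on the job detail page can be generated the same way as the upload URL, using ClientMethod="get_object" instead of "put_object".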
This streamlined process ensures that users don’t need to interact with any backend infrastructure directly—they simply upload a file and receive an email once the job is complete.
Why use an EC2 instance to trigger the Nextflow pipeline?
The answer comes down to the 15-minute maximum execution time that AWS Lambda imposes on a single invocation. As mentioned earlier, the backend consists of AWS Lambda functions. If the pipeline were triggered from within a Lambda handler, the Nextflow head process that orchestrates the run would live inside that handler and would need to complete within 15 minutes, even though the actual compute runs remotely on AWS Batch. Exceeding the limit would terminate the Nextflow process, which in turn would cancel the associated AWS Batch job, so the pipeline could never finish.
By using a long-running EC2 instance, we can work around this limitation. Another advantage of triggering the pipeline from the EC2 instance is that it allows us to run arbitrary code after the Nextflow process completes — for example, sending an email to the user with the results.
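A minimal sketch of that trigger, assuming the instance ID, pipeline name, and Nextflow profile shown here (all hypothetical):

```python
import boto3

ssm = boto3.client("ssm")

def start_pipeline(instance_id: str, job_id: str, vcf_uri: str) -> None:
    """Ask the long-running Nextflow EC2 host to launch the pipeline."""
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",  # built-in SSM document for shell commands
        Parameters={
            "commands": [
                # Nextflow runs on the instance and submits its tasks to AWS Batch.
                f"nextflow run my-org/vcf-pipeline -profile awsbatch "
                f"--input {vcf_uri} --job_id {job_id}"
            ]
        },
    )
```

send_command returns immediately, so the Lambda finishes in milliseconds while the pipeline keeps running on the instance and on AWS Batch.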
Security and rate-limiting
One of the requirements for submitting a VCF file is being subscribed to the eCellula newsletter. In addition, each subscribed user can have at most one job in progress at a time. This means that you cannot process multiple VCF files in parallel using the same email address. This restriction is intended to rate-limit users who might otherwise abuse the system.
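One way to enforce the one-job-at-a-time rule is a quick lookup before accepting a submission. This is a sketch under assumed names: a genomic-jobs table with a global secondary index on the email attribute, both hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("genomic-jobs")  # hypothetical table name

def can_submit(email: str) -> bool:
    """Allow a new job only if the user has no job currently in progress."""
    response = jobs_table.query(
        IndexName="email-index",  # hypothetical GSI keyed on email
        KeyConditionExpression=Key("email").eq(email),
        FilterExpression=Attr("status").eq("IN_PROGRESS"),
    )
    return response["Count"] == 0
```

A check-then-insert like this leaves a small race window; a conditional write on a per-user lock item would close it.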
Conclusion
Turning pipelines into usable products is about more than just infrastructure—it’s about workflow design. If you’re ready to build tools around your bioinformatics data, let’s talk.