Amazon Genomics CLI is a tool that makes it easier to process genomics data at petabyte scale on AWS. Earlier this year, the public cloud vendor shared a preview of the tool, and it is now open source and generally available. The company’s goal with the Genomics CLI is to remove the heavy lifting from setting up and running genomics workflows in the cloud by allowing software developers and researchers to automatically provision, configure, and scale cloud resources.
In general, sequencing genomes generates a lot of data. For instance, the human genome is composed of over 3 billion letters of code. Analyzing the sequences of one or more people to track infectious diseases, food pathogens, and toxins therefore requires various tools to be orchestrated as a specific sequence of steps, known as a workflow. The genomics and bioinformatics communities have developed specialized workflow definition languages such as WDL and Snakemake, yet they still struggle with the massive amount of data processing, which requires scaling infrastructure such as compute and storage.
With Amazon Genomics CLI, users can run workflows written in a language like WDL on optimized infrastructure on AWS. The workflows are executed in one or more so-called contexts. Danilo Poccia, chief evangelist (EMEA) at Amazon Web Services, explains the concept of a context in an AWS blog post on the Amazon Genomics CLI as follows:
A context encapsulates and automates time-consuming tasks to configure and deploy workflow engines, create data access policies, and tune compute clusters (managed using AWS Batch) for operation at scale.
A user can install the Amazon Genomics CLI on their laptop and activate it with their AWS account. Activation creates the core infrastructure that Amazon Genomics CLI needs to operate, including an S3 bucket, a virtual private cloud (VPC), and a DynamoDB table. Note that the VPC can also be an existing one. Next, the user can create a project or use a sample project from the CLI installation. A project (a YAML file) in the Amazon Genomics CLI links workflows, datasets, and the contexts used to process them. Once a project is deployed and its contexts are ready, the user can execute the workflows. Results produced during execution are stored in S3 and can be explored using the Amazon Genomics CLI.
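To make the relationship between a project, its workflows, its datasets, and its contexts more concrete, here is a minimal Python sketch. It is only an illustration of the linkage described above, not the CLI's actual YAML schema or API: the class names, field names, the example bucket URI, and the context names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workflow:
    name: str
    language: str  # workflow definition language, e.g. "wdl" or "snakemake"
    source: str    # hypothetical path to the workflow definition


@dataclass
class Context:
    name: str
    engine_languages: List[str]  # languages this context's engines can run


@dataclass
class Project:
    # Illustrative stand-in for the project file: it ties together
    # workflows, datasets, and the contexts used to process them.
    name: str
    workflows: List[Workflow] = field(default_factory=list)
    datasets: List[str] = field(default_factory=list)  # e.g. S3 URIs
    contexts: List[Context] = field(default_factory=list)

    def contexts_for(self, workflow_name: str) -> List[str]:
        """Return the names of contexts able to run the named workflow."""
        for wf in self.workflows:
            if wf.name == workflow_name:
                return [c.name for c in self.contexts
                        if wf.language in c.engine_languages]
        raise KeyError(f"unknown workflow: {workflow_name}")


# Hypothetical example project; none of these names come from the CLI.
project = Project(
    name="Demo",
    workflows=[Workflow("hello", "wdl", "workflows/hello")],
    datasets=["s3://example-bucket/reads"],
    contexts=[
        Context("wdlContext", ["wdl"]),
        Context("snakemakeContext", ["snakemake"]),
    ],
)

print(project.contexts_for("hello"))
```

In this sketch, a workflow can only run in a context that provides an engine for its language, which mirrors the idea that a context packages the engine and compute configuration a workflow needs.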
Let’s prepare #hacktoberfest2021 by having a look at this cool AWS Open Source project: Amazon Genomics CLI. It’s a way to scale analysis of genomics & biological data to study biodiversity, drug discovery, health issues using the power of AWS resources!
Currently, Amazon Genomics CLI is available in all AWS Regions except for AWS GovCloud (US) and Regions located in China. Customers pay only for the AWS resources created by the CLI. More details on the Amazon Genomics CLI are available in the documentation.