Quark: Enabling Reproducibility in Bioinformatics Workflows

Background

High-throughput technologies like Next-Generation Sequencing (NGS) generate gigabases of data. This data is multivariate, and analysing it is compute-intensive, requiring sophisticated analytic workflows before it can drive drug discovery. Researchers carry out these data analyses in their institution’s computational environment. 

Life science institutes and companies have scattered bioinformatics teams that develop data analysis pipelines independently. This results in different sets of tools and versions being used for similar pipelines, leading to inconsistencies in downstream analysis. The same flexibility that allows researchers to modify their workflows also challenges computational reproducibility. 

Reproducibility Lays the Foundation for Scalable Workflows

Drug discovery pipelines depend on robust and reproducible results from initial data analysis. However, researchers often struggle to replicate their study outcomes, even when using the same bioinformatics workflows. 

Scientists performing downstream data analytics based on published research may be unable to replicate the results because of differences in their computational environment. Even in the same environment, researchers attempting to scale their processes to larger samples encounter inconsistencies in various output parameters. 

Inconsistent outcomes arise because each step of pipeline assembly involves multiple complexities, making it hard for researchers to develop a structured and consistent approach. Inconsistencies compound into major errors that:

  • undermine the reliability of a workflow,
  • discourage collaboration in research, and
  • complicate troubleshooting.

Since bioinformatics workflows are multistep and multivariate, like their data, it’s impossible to develop a single one-size-fits-all analytics pipeline. The ideal solution to ensure reproducibility would be a workflow that is both modifiable to a project’s needs and reproducible in any environment.

Quark delivers this solution and does not require user expertise in coding, bioinformatics, cloud computing or HPC. Following a multi-pronged approach, Quark addresses various risks simultaneously, ensuring scalable and portable workflows that prioritise reproducibility.

Quark: Addressing Challenges in Computational Reanalysis

Manually assembling a bioinformatics pipeline increases the margin of error, since multiple considerations must be taken into account. For example, tools such as FastQC run only on certain operating systems (Linux). Frequent updates also produce different versions of the same tool, resulting in inconsistent outcomes. Unless a painstaking log of all assembly steps is also maintained manually, it becomes impossible to replicate a data analysis workflow. 

Additionally, the same tool run from different scripting languages (Python, R) can produce significantly different outcomes in data analysis. Even within the same computational workspace, retrieving the original computational tools, software and pipeline versions used to process data may not be possible, which further complicates reproducibility.

Such inconsistencies may not be immediately apparent, making it difficult to troubleshoot errors in later stages of biomarker validation and thereby increasing the cost of drug discovery. 

Reproducibility thus requires best practices that enforce consistency in workflows and coding environments: controlling code versions and the compute environment, persistent data sharing, literate programming, and documentation. Each best practice applies to every tool in the pipeline, requiring rigorous monitoring and risk assessment. 

Quark delivers a simple and comprehensive solution that integrates all best practices listed above, by addressing reproducibility at two levels:

  1. Pipeline-reproducibility (containerisation and versioning)
  2. Data-reproducibility (consistent documentation and sharing)

1. Quark: Delivering Containerised and Versioned Pipelines

Bioinformatics pipelines are complex, with multiple assembly steps and the following requirements:

  1. specialised tools for data processing and extraction;
  2. a computational environment: OS (Windows, Linux, Mac) and scripting language (e.g., R, Python, Java), including the appropriate versions to run them;
  3. integration between different tools, and conversion of output/input into compatible formats;
  4. extracting final data in a user-friendly format.
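The requirements above can be made concrete with a minimal sketch. The class and field names below are illustrative, not Quark's internal model: each step records its tool, version, runtime language and input/output formats, and a simple check verifies that adjacent steps are format-compatible (requirement 3).

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One tool in the pipeline, with the details reproducibility depends on."""
    tool: str
    version: str
    language: str       # e.g. "Python", "R", "Java"
    input_format: str
    output_format: str

def check_compatibility(steps):
    """Return the first pair of adjacent steps whose formats do not match, or None."""
    for a, b in zip(steps, steps[1:]):
        if a.output_format != b.input_format:
            return (a.tool, b.tool)
    return None

# Illustrative RNA-Seq-style chain: QC -> trimming -> alignment -> quantification.
# Tool versions here are examples only.
pipeline = [
    Step("FastQC", "0.12.1", "Java", "fastq", "fastq"),
    Step("Trim Galore", "0.6.10", "Python", "fastq", "fastq"),
    Step("STAR", "2.7.11a", "C++", "fastq", "bam"),
    Step("Salmon", "1.10.2", "C++", "bam", "tsv"),
]
print(check_compatibility(pipeline))  # None: every step feeds the next a compatible format
```

Pinning the version on every step is what lets a later rerun reconstruct the exact same chain.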

A formidable challenge is the constant updating of pipelines to accommodate rapidly evolving tools, which creates multiple versions of the same pipeline. This renders much of the earlier code redundant, and often completely obsolete, leading to poor workflow reproducibility.

The solution lies in containerisation, where Quark leverages Nextflow pipelines to deliver end-to-end automated workflows on the platform. Nextflow composes pipeline scripts that are containerised to run in any environment (e.g. RNA-Sequencing pipeline). Containerisation enables modularisation of different tools that can be stitched together into complex workflows.

Both bench-scientists and bioinformaticians can access and run Nextflow pipelines end-to-end on Quark.
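The reproducibility benefit of containerisation can be sketched generically. This is not how Quark invokes containers internally; it simply illustrates that pinning a tool to an exact container image tag fixes the whole environment, so the same command behaves identically anywhere a container runtime is available. The image name follows the BioContainers convention, and the tag shown is an example.

```python
def containerised_command(image, tag, tool_args, workdir="/data"):
    """Build a docker run invocation that pins the tool to an exact image tag,
    so the same command reproduces the same environment on any host."""
    return ["docker", "run", "--rm",
            "-v", f"{workdir}:{workdir}", "-w", workdir,
            f"{image}:{tag}"] + tool_args

# Example: run FastQC from a pinned BioContainers image (tag is illustrative).
cmd = containerised_command("quay.io/biocontainers/fastqc", "0.12.1--hdfd78af_0",
                            ["fastqc", "sample_R1.fastq.gz"])
print(" ".join(cmd))
```

Because each tool lives in its own pinned image, steps become interchangeable modules that can be stitched into larger workflows without environment conflicts.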

Quark for Bench-Scientists

Quark provides multiple omics data analysis pipelines from nf-core and AWS HealthOmics out-of-the-box. Quark allows bench-scientists to identify and run pipelines from the platform itself, without requiring prior knowledge of workflows or technical expertise to develop, deploy and run pipelines. 

For example, Quark offers an nf-core RNA-Sequencing pipeline that is fully containerised, versioned and automated. Quark maintains multiple versions of each pipeline seamlessly. The bench scientist can choose which pipeline version they want to use for their secondary analysis.

Users can return anytime after a pipeline run completes to check the pipeline and tool versions, along with the datasets used to produce the results. This simplifies versioning and ensures reproducibility.

Quark for Bioinformaticians

Quark allows bioinformaticians to build their own workflows, using Quark’s no-code Visual Builder, or Pipeline Builder. Some advantages of using the Pipeline Builder include:

  • New pipelines are built using an intuitive drag-and-drop feature.
  • All available bioinformatics tools are versioned.
  • Workflows can be tailored to a particular input reference data-type/source.
  • Pipelines are reproducible and easily accessible. 
  • Scaling workflows to larger sample sizes is simplified, reducing manual errors.
  • Pipelines are portable, and can be shared with other users on Quark once published in the launchpad menu.

Another way to develop a new reproducible pipeline is Quark’s “Import Pipelines” feature. Bioinformaticians can use it to import any publicly available pipeline, or their own, from a git repository. All git-imported pipelines are versioned and can be published in the Quark Launchpad.

Quark: no-code Pipeline Builder. Import pipelines from git repositories out of the box
Quark: no-code Visual Pipeline Builder. Build workflows using the intuitive drag-and-drop feature

2. Quark: Empowering Data Documentation and Sharing

Quark simplifies data retrieval, pipeline versioning and code sharing: all extremely important for maintaining transparency in data science research. In addition to versioning, Quark maintains detailed logs that enable effective reference data and metadata management. The platform delivers detailed logs of: 

  • files generated, 
  • pipeline version, 
  • genome reference dataset used, and
  • any other input parameters used while running the pipeline.
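A run log covering the fields above might look like the following sketch. The schema and values are illustrative (the pipeline name and parameters mirror an nf-core-style RNA-Seq run), not Quark's actual log format.

```python
import json

# Hypothetical run log mirroring the fields listed above; not Quark's actual schema.
run_log = {
    "pipeline": "nf-core/rnaseq",
    "pipeline_version": "3.14.0",
    "reference_genome": "GRCh38",
    "input_parameters": {
        "input": "s3://bucket/samplesheet.csv",   # illustrative storage path
        "pseudo_aligner": "salmon",
    },
    "files_generated": ["results/multiqc_report.html", "results/salmon/quant.sf"],
}

# Persisting the log as JSON means a later rerun can recover every parameter exactly.
serialized = json.dumps(run_log, indent=2, sort_keys=True)
assert json.loads(serialized) == run_log
```

With a record like this attached to every run, replicating an analysis reduces to reading the log back rather than reconstructing the setup from memory.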

Quark: Retrievable Results Pane. Once a pipeline run is complete, secondary analysis reports are retrievable/downloadable. The Pipeline version used for the analysis is already displayed (circled in red)
Quark: Retrievable Input Parameters. Easily retrieve the log of input parameters to replicate the run

Researchers re-running their analysis at a later date can easily retrieve the associated pipeline run logs on Quark’s platform, rather than combing through their development environments for details of the software installed or repositories used.

Log details also include input data parameters such as the cloud storage path of the raw sequencing file, the reference genome used, and optional parameters such as the pseudo-aligner used in an RNA-Seq pipeline run. 

Quark provides a MultiQC report for every pipeline run: a user-friendly report that can later be retrieved to find the details of a run (including the versions of the pipeline and individual tools deployed on that date). Detailed logs maintained and retrieved from a single access point on the platform reduce data redundancy and enforce best practices that empower collaboration. 

For example, a researcher may prefer the GRCh38 reference genome over GRCh37, or a pre-defined panel-of-normals variant file. All changes to the default parameters can be made and are saved in a pipeline run log on the researcher’s Quark account. Quark thus ensures both flexibility and consistency in a researcher’s data analysis pipeline, enabling effective data sharing between different users. 
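The interplay of defaults and overrides can be sketched as follows. The parameter names and file names are hypothetical; the point is that logging the defaults, the researcher's overrides, and the resulting effective configuration keeps the run both flexible and replayable.

```python
# Illustrative defaults and a researcher's overrides; parameter names are hypothetical.
defaults = {"genome": "GRCh37", "panel_of_normals": None, "aligner": "star"}
overrides = {"genome": "GRCh38", "panel_of_normals": "pon_v4.vcf.gz"}

# The effective run uses the defaults updated by the overrides; recording all
# three dictionaries in the run log makes the exact configuration reproducible.
effective = {**defaults, **overrides}
log_entry = {"defaults": defaults, "overrides": overrides, "effective": effective}
print(effective["genome"])  # GRCh38
```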

Conclusion

Reproducibility is often overlooked in bioinformatics because of the multiple complexities involved in diverse steps of data analysis workflows. Quark’s multi-pronged approach addresses each obstacle by prioritising reproducibility and laying the foundation for scalable and reliable workflows. 

Quark’s environment is streamlined to automate end-to-end data analysis, but it also offers flexible no-code pipeline features that bioinformaticians can use to build their own workflows. Such personalised changes are formalised, fully retrievable for reruns, and logged to enable later modification or troubleshooting. Quark offers multiple solutions within a single platform that, combined, enhance reproducibility in bioinformatics workflows.

Thus, by simplifying the user’s experience of data analysis, Quark lets researchers focus fully on experimental validation rather than worrying about the veracity and reproducibility of their foundational analysis. 

Request a demo to learn more about Quark.
