Quark: Democratizing Access to RNA Sequencing Data Analytics

RNA Sequencing in Early Drug Discovery

We are in the post-genomics era, where sequencing technologies are routinely used to develop high-resolution snapshots of a disease state at the molecular level. DNA and RNA sequencing are central to profiling a disease state, allowing researchers to delineate genes and variants (DNA-Seq.) and gene expression patterns (RNA-Seq.) associated with a disease by comparison against healthy or control samples.

The early phase of a drug discovery pipeline targets biomarker identification and validation. This requires complex multiomics data analysis, which lays the foundation for later drug discovery phases and ensures successful clinical trials.

The increasing affordability of DNA and RNA sequencing is transforming drug discovery pipelines, paving the way for an unprecedented accumulation of data (up to peta- or exabytes) in warehouses. The simultaneous explosion in big data and AI/ML-based data analytics has accelerated drug discovery pipelines, enabling researchers to streamline drug targets and generate leads entirely in silico. For example, a generative AI platform, the NVIDIA BioNeMo Drug Discovery platform, recently discovered and generated an entirely new drug molecule, which is currently in clinical trials.

Bottlenecks in Early Drug Discovery

The rapid advancement of computational approaches comes with a drawback. To analyse their large-scale omics data, scientists need either their own expertise in cloud computing, HPC and bioinformatics, or access to a large team of software engineers and bioinformaticians. This raises several issues that:

preclude direct insights researchers can gain from early exploratory analysis of their samples;
prolong data analysis workflows and timelines;
potentially compromise reproducibility, data privacy and security due to multiple handovers; and
increase the risk of drug attrition.

Other bottlenecks pose critical problems, delaying project timelines for researchers who are required to analyse their own data. The multitude of bioinformatics workflows and software needed for every step of DNA/RNA Seq. analysis further complicates research workflows. It also increases the scope for error in deploying appropriate tools for a study’s unique requirements.

Furthermore, the sheer number of data analytics tools, and the pace at which they evolve, puts the reproducibility of data analyses at risk: pipelines and tools undergo frequent updates or replacements. Data retrieval and analysis are further complicated by the use of multiple tools and workspaces.

For these reasons, the complexity and inaccessibility of data analysis workflows cause delays, preclude potentially actionable data-driven insights, and increase the margin of error in biomarker discovery and validation. Errors in the early stages of drug discovery are crucial rate limiters that prevent drug discovery pipelines from scaling effectively.

Quark: Making Complex Data Analytics Accessible

Using bioinformatics workflows, gene expression patterns related to a disease condition can be identified from RNA Seq. data. The goal is to quantify relative gene expression levels from RNA Seq. data, which can then be visualised, interpreted, and functionally profiled across different datasets or cohorts to identify differentially expressed genes (biomarker identification in early drug discovery).

This goal of accelerating data-driven insights into the molecular factors that cause a disease condition may be inaccessible to many researchers, because analytical pipelines are complex and require expertise in bioinformatics and coding. The diversity of tools available for processing RNA Seq. data poses a further challenge: inconsistent workflows can lower the reproducibility of exploratory data analysis.

We strongly believe that scientists should be able to focus their time on analysing their data and getting actionable insights, rather than worrying about compute, storage, scalability, reproducibility and the technical know-how for analysing large-scale data.

Bridging this gap in accessibility, Quark provides a self-service, scalable, reproducible and secure platform for RNA Seq. secondary and tertiary analysis that requires no technical expertise. Following the Quark workflow outlined below, researchers can independently conduct complex analytics by simply uploading their sample data sheets, choosing relevant parameters, and running the pipeline.

The figure below is an overview of the RNA Seq. pipeline (nf-core/rnaseq) that converts raw RNA Seq. files to a gene expression matrix (and QC reports). nf-core/rnaseq takes a modular approach that integrates the RNA Seq. workflow steps into a single pipeline.

We present a step-by-step outline for researchers using Quark for their secondary and tertiary RNA Seq. analysis.

1: Log in to Quark

Log in with the registered Quark credentials. The platform looks as shown [Fig.1], with the left tab listing ‘Pipelines’; ‘Workspaces’; ‘Apps’; ‘Search’; ‘Files’; ‘Analytics’.

Select ‘Pipelines’. The window opens the ‘Dashboard’, which summarises the status of all pipelines run or running on the platform.

**Fig.1: Dashboard displaying run status of pipelines on the platform.**

2: Launch RNA-Seq.

Click the ‘Launchpad’ tab under ‘Pipelines’. A complete list of all workflows available on the platform is displayed.

Use the search tab to find the RNA-Seq. pipeline that will import and analyse the uploaded data [Fig.2]. “Run” the RNA-Seq. pipeline.

**Fig.2: Pipelines Launchpad. Use ‘Search’ tab to find the RNA Seq pipeline.**

3: Running the RNA Seq. pipeline

Name the analysis. The name can describe the two cohorts to be compared in tertiary analysis (e.g., cohort_name vs. control/comparator).

The example below is a comparison between patient cohort 1 (UHR) vs. patient cohort 2 (BHR).

**Fig.3: Input parameters required to run the RNA Seq. pipeline.**

Upload the sample data file (select the RNA Seq. csv files from the directory).
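As an illustration of what such a sample data file can look like, the sketch below builds a CSV sample sheet in the layout used by the public nf-core/rnaseq pipeline (columns `sample`, `fastq_1`, `fastq_2`, `strandedness`). That Quark expects exactly this layout is an assumption, since the article does not specify the format, and the FASTQ file names are hypothetical.

```python
import csv
import io

# Column layout of the public nf-core/rnaseq sample sheet.
# Assumption: the platform accepts the same layout.
FIELDS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(rows):
    """Serialise a list of sample dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical paired-end samples for the two example cohorts.
rows = [
    {"sample": f"{cohort}_rep{rep}",
     "fastq_1": f"{cohort}_rep{rep}_R1.fastq.gz",
     "fastq_2": f"{cohort}_rep{rep}_R2.fastq.gz",
     "strandedness": "auto"}
    for cohort in ("UHR", "BHR")
    for rep in (1, 2)
]

samplesheet = build_samplesheet(rows)
print(samplesheet)
```

Generating the sheet programmatically, rather than editing it by hand, keeps sample names and file names consistent across replicates.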

Select the Genome, Reference Genome, and alignment parameters for RNA Seq. secondary data analysis from the drop-down menus (the example above uses the GRCh37 genome and the Salmon pseudo-alignment tool to calculate gene expression/transcript abundance).
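For readers familiar with the command line, the drop-down choices above correspond roughly to parameters of the public nf-core/rnaseq pipeline. The sketch below only assembles (and does not execute) such a command; how Quark launches the pipeline internally is not described in this article, and the file name, output directory and `docker` profile are assumptions.

```python
import shlex


def rnaseq_command(samplesheet, outdir, genome="GRCh37", pseudo_aligner="salmon"):
    """Assemble a nextflow command line for the public nf-core/rnaseq pipeline."""
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "--input", samplesheet,              # CSV sample sheet
        "--outdir", outdir,                  # where results are written
        "--genome", genome,                  # reference genome key
        "--pseudo_aligner", pseudo_aligner,  # transcript quantification tool
        "-profile", "docker",                # assumed execution profile
    ]


# Hypothetical input and output locations.
command = rnaseq_command("samplesheet.csv", "results")
print(shlex.join(command))
```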

4: Review pipeline parameters and run analysis

Review the run parameters, then click “Submit”:

**Fig.4: Review and Run the pipeline by clicking ‘Submit’.**

5: Click ‘Runs’ tab under ‘Pipelines’ to search and retrieve results

Once the pipeline run completes (the status can be checked on the ‘Dashboard’), retrieve results by going to the ‘Runs’ tab. Search for the name given to the workflow analysis.

The window displays a summary of the secondary data analysis.

**Fig.5: Summary of secondary RNA Seq. data analysis.**

The ‘Results’ tab allows researchers to view and download their output files [Fig.6] to review sample alignment, statistics, and QC reports.

**Fig.6: Result files from secondary RNA Seq. analysis.**

Input and output data parameters can be reviewed [Fig.7].

**Fig.7: Review of data input and output from secondary RNA Seq. analysis.**

Once the pipeline has completed the run, results are accessed and analysed in the ‘Analytics’ tab. The goal is to build cohorts that can be compared to assess differential gene expression (DGE).

**Fig.8: Analytics dashboard displaying list of cohorts and completed analyses.**

To add a new cohort for comparison, select ‘Add New’ in the top right corner.

Load the results of the run from the Analytics window, by selecting the ‘RNA-Seq.’ pipeline, and input a comma-separated list of genes (e.g. BRCA1) to stratify patients based on the gene of interest.

**Fig. 9: Patient stratification based on genes of interest.**

Patient cohorts can also be built to identify biomarkers specific to a genotypic or phenotypic trait (such as in a treated vs. untreated cohort or normal vs. tumour samples).

**Fig.10 a: Building cohorts from the sample data.**

To create a cohort of, for example, treated patient samples, sort the sample data to find patients of interest [Fig. 10 b]. Create and name the cohort (e.g. UHR).

**Fig.10 b: Building cohorts from the sample data.**
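Conceptually, building a cohort amounts to filtering the sample metadata on a trait of interest. The following is a minimal sketch of that idea; the sample IDs, the `condition` field and its values are hypothetical and do not reflect Quark's internal data model.

```python
# Hypothetical sample metadata; field names and values are illustrative only.
samples = [
    {"sample_id": "S1", "condition": "treated"},
    {"sample_id": "S2", "condition": "treated"},
    {"sample_id": "S3", "condition": "untreated"},
    {"sample_id": "S4", "condition": "untreated"},
    {"sample_id": "S5", "condition": "treated"},
]


def build_cohort(samples, field, value):
    """Return the IDs of all samples whose metadata field equals value."""
    return [s["sample_id"] for s in samples if s.get(field) == value]


# Name each cohort so it can be selected later for comparison.
cohorts = {
    "UHR": build_cohort(samples, "condition", "treated"),
    "BHR": build_cohort(samples, "condition", "untreated"),
}
print(cohorts)
```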

Once cohorts have been created, they are compared on the ‘Analytics’ page. Navigate to the ‘Cohorts’ tab, select the cohorts of interest (for example, UHR and BHR) and run analytics by selecting ‘+’ to compare them (UHR vs. BHR).

**Fig. 11 a: Cohort comparison to determine differential gene expression**

**Fig. 11 b: Cohort comparison to determine differential gene expression**

When the comparison has finished running, the results may be accessed from the ‘Analyses’ tab. Click the chart icon of the comparison to conduct exploratory data analysis.

**Fig. 12: Access the comparison and select the chart icon to run analytics.**

Exploratory Data Analysis

The analytical results will look as follows. The first tab, ‘Overlap’, is a Venn diagram that shows the number of patients shared between the two cohorts. Ensure that there are no overlaps between cohorts before proceeding to principal component analysis (PCA).

**Fig.13: Venn diagram ensures no patient overlap between selected cohorts.**
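The overlap check shown in the Venn diagram is, in essence, a set intersection of the two cohorts' sample IDs. A minimal sketch, using hypothetical sample IDs:

```python
def cohort_overlap(cohort_a, cohort_b):
    """Return the set of sample IDs that appear in both cohorts."""
    return set(cohort_a) & set(cohort_b)


# Hypothetical cohorts; S3 has been assigned to both by mistake.
uhr = ["S1", "S2", "S3"]
bhr = ["S3", "S4", "S5"]

shared = cohort_overlap(uhr, bhr)
if shared:
    print("Cohorts overlap; remove shared samples first:", sorted(shared))
else:
    print("No overlap; safe to proceed to PCA.")
```

A non-empty intersection means the same sample would contribute to both sides of the comparison, which is why the overlap should be empty before continuing.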

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an ML-based approach that reduces high-dimensional datasets to a small number of dimensions (principal components), so that patterns hidden in multidimensional datasets become visible.

PCA plots are a crucial data analysis step that accounts for internal sample variance, highlighting outliers, data clusters and other unusual data trends that might otherwise be lost in large data.

**Fig.14: PCA analysis of UHR vs BHR cohorts. The figure illustrates a 3-component PCA plot.**

PCA plots are made to account for sample variance: resolving the data into 2-3 principal components should generally explain at least 50-70% of the variance in the data. Increasing the number of principal components increases the noise in identifying variance (which, in RNA Seq. analysis, correlates with underestimated DGE).

For example, gene expression data can form clusters that will not be visible in a cohort comparison that includes tens of thousands of genes. By resolving the data into its principal components, it is adjusted for internal sample variance so that ‘true’ differential expression becomes visible.

Unusual data patterns include data clusters; other patterns, such as outliers and jumps, are likewise made visible by PCA.

On Quark, the number of PCs available ranges from 2 to 6. Ideally, the selected components should account for 50-70% of variance, so that true data trends become more visible. In the example above, 2-3 principal components appear best suited to this data, accounting for 70-80% of sample variance [Fig.14 graph].
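To make the notion of "variance explained" concrete, the following is a minimal pure-Python sketch of PCA on a toy expression matrix (rows are samples, columns are genes). The toy values, the function names and the use of power iteration with deflation are illustrative assumptions; the article does not describe how Quark computes its principal components.

```python
import math


def covariance_matrix(samples):
    """Gene-by-gene sample covariance of a samples x genes matrix."""
    n = len(samples)
    means = [sum(col) / n for col in zip(*samples)]
    centered = [[x - m for x, m in zip(row, means)] for row in samples]
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    vec = [float(i + 1) for i in range(len(matrix))]  # arbitrary start vector
    for _ in range(iterations):
        product = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm < 1e-12:  # no variance left to explain
            return 0.0, vec
        vec = [x / norm for x in product]
    eigenvalue = sum(v * p for v, p in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec


def explained_variance_ratios(samples, n_components):
    """Fraction of total variance captured by each leading component."""
    cov = covariance_matrix(samples)
    size = len(cov)
    total = sum(cov[i][i] for i in range(size))
    ratios = []
    for _ in range(n_components):
        eigenvalue, vec = top_eigenpair(cov)
        ratios.append(eigenvalue / total)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(size)]
               for i in range(size)]
    return ratios


# Toy data: two samples per cohort, three genes (the third is constant).
toy_expression = [
    [10.0, 1.0, 5.0],
    [11.0, 2.0, 5.0],
    [1.0, 10.0, 5.0],
    [2.0, 11.0, 5.0],
]
ratios = explained_variance_ratios(toy_expression, 2)
print([round(r, 3) for r in ratios], "cumulative:", round(sum(ratios), 3))
```

In this toy matrix almost all of the variance lies along the axis that separates the two cohorts, so the first component alone explains nearly everything. Real expression data is noisier, which is why the cumulative share across 2-3 components is the quantity to inspect.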

The number of principal components can later be adjusted based on DGE plots (volcano plots), through a visual examination of data clustering and outliers.

Differential Gene Expression (DGE)

The crux of RNA Seq. analytical pipelines is DGE and Enrichment Analysis, where researchers can draw insights from how expression patterns vary between two different cohorts. DGE accomplishes two goals:

identifies the gene(s) differentially expressed and associated with the disease condition;
provides statistical significance to validate whether the gene(s) identified are relevant.

On Quark, the comparative distribution of gene expression between two cohorts is visualised as a volcano plot. Volcano plots display the statistical significance of each gene against its expression fold change (the logarithmic ratio of abundance) between two cohorts or samples.

Significance values (q-values) are calculated based on the distribution. The log fold-changes are calculated, then plotted against the q-values to visualise gene expression data as a volcano plot [Fig.15].
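The two axes of a volcano plot can be sketched as follows. The article does not state which statistical method Quark uses, so this example assumes a simple mean-based log2 fold change with a pseudocount, and Benjamini-Hochberg adjustment as one common way of turning p-values into q-values. All input numbers are made up for illustration.

```python
import math


def log2_fold_change(test_values, control_values, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    mean_test = sum(test_values) / len(test_values)
    mean_control = sum(control_values) / len(control_values)
    return math.log2((mean_test + pseudocount) / (mean_control + pseudocount))


def benjamini_hochberg(p_values):
    """Adjust p-values for multiple testing (Benjamini-Hochberg procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def volcano_point(log2_fc, q_value):
    """x is the log2 fold change, y is -log10(q), as on a volcano plot."""
    return log2_fc, -math.log10(q_value)


# Made-up p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
lfc = log2_fold_change([7.0, 7.0], [1.0, 1.0])
print(q_values, volcano_point(lfc, q_values[0]))
```

Plotting -log10(q) rather than q itself places the most significant genes at the top of the plot, which gives the figure its characteristic shape.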

**Fig.15: Volcano plot of UHR vs BHR cohorts. The table above lists, with significance values, the genes that are identified as differentially expressed between the two cohorts.**

Quark provides a dynamic way to change the log2 fold change thresholds. A table lists genes that are significantly differentially expressed in the test cohort, compared against the control arm.
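Changing the fold-change threshold amounts to re-filtering the result table. A minimal sketch, with hypothetical gene names, values and default cut-offs (the article does not state the defaults Quark applies):

```python
def significant_genes(results, lfc_threshold=1.0, q_threshold=0.05):
    """Genes passing both the |log2 fold change| and the q-value cut-offs."""
    return [
        gene
        for gene, (log2_fc, q_value) in results.items()
        if abs(log2_fc) >= lfc_threshold and q_value <= q_threshold
    ]


# Hypothetical results: gene -> (log2 fold change, q-value).
results = {
    "GENE_A": (2.5, 0.001),
    "GENE_B": (0.3, 0.0004),  # significant, but the change is small
    "GENE_C": (-1.8, 0.02),   # down-regulated in the test cohort
    "GENE_D": (3.0, 0.4),     # large change, but not significant
}

print(significant_genes(results))
print(significant_genes(results, lfc_threshold=2.0))
```

Taking the absolute value of the fold change keeps both up-regulated and down-regulated genes, which correspond to the two arms of the volcano plot.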

This functional profiling analysis enables researchers to gain early insights into:

identifying target genes/biomarkers of interest;
quantifying the magnitude of their fold-change, and;
assessing the significance of the differences.

Heatmap

The heatmap tab provides a list of genes, and depicts their expression levels in individual samples represented in the cohorts. This allows a direct comparison of the different expression levels between the genes of interest.

Since the number of genes examined runs into the thousands, the search tab allows users to filter to the top genes with the highest variance (for example, the top 30 or 50 genes with significant differences in heatmap distribution between cohorts).
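Filtering to the most variable genes can be sketched as ranking genes by the variance of their expression across samples and keeping the top N. The gene names and values below are hypothetical, and the exact ranking criterion Quark applies is not specified in this article.

```python
from statistics import variance


def top_variable_genes(expression, n):
    """Return the n genes with the highest sample variance.

    expression maps each gene name to its expression values across samples.
    """
    ranked = sorted(expression, key=lambda gene: variance(expression[gene]),
                    reverse=True)
    return ranked[:n]


# Hypothetical expression values: three UHR samples, then three BHR samples.
expression = {
    "GENE_A": [1, 1, 1, 9, 9, 9],  # differs strongly between cohorts
    "GENE_B": [5, 5, 5, 5, 5, 5],  # constant, uninformative
    "GENE_C": [2, 3, 2, 4, 3, 4],  # mild variation
}

print(top_variable_genes(expression, 2))
```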

**Fig.16: Heatmap of individual gene expression between samples. The figure shows gene clade distribution between three replicates each of the UHR and BHR cohorts. The color distribution indicates expression levels above normalised sample distribution.**

Enrichment Analysis

Finally, the ‘Enrichment Analysis’ tab allows researchers to identify the diseases, molecular pathways and Gene Ontology (GO) terms that are enriched in their differentially expressed gene sets.

For example, if a specific signalling pathway associated with inflammation is enriched, or genes related to the transcription of a specific protein associated with tumorigenesis are over-represented, researchers can easily visualise and download the enriched GO terms to draw further insights about their samples.

**Fig.17: GO terms enriched for the cohort UHR vs. BHR**

Select the database associated with the querying sample in the drop-down menu.

This tab integrates data from different databases, such as the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG Pathway database) and the Gene Set Enrichment Analysis (GSEA) database.

Genes are clustered according to their ontologies and the enrichment analysis displays significant differences in gene expression classifications between the two cohorts.
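One common way to decide whether a pathway or GO term is over-represented in a gene list is a hypergeometric test. The article does not say which test Quark applies, so the sketch below is an assumption, and its counts are invented for illustration.

```python
from math import comb


def enrichment_p_value(population, pathway_size, selected, overlap):
    """Probability of seeing at least `overlap` pathway genes by chance.

    population:   number of genes measured in total
    pathway_size: number of those genes annotated to the pathway or GO term
    selected:     number of differentially expressed genes
    overlap:      number of differentially expressed genes in the pathway
    """
    upper = min(pathway_size, selected)
    favourable = sum(
        comb(pathway_size, i) * comb(population - pathway_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, selected)


# Invented counts: 20 genes measured, 5 in the pathway,
# 4 differentially expressed, 3 of which belong to the pathway.
p_value = enrichment_p_value(population=20, pathway_size=5, selected=4, overlap=3)
print(round(p_value, 4))
```

A small p-value indicates that the overlap is larger than would be expected from a random draw of genes. In practice such p-values are computed for many terms at once and then corrected for multiple testing.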

Conclusion

RNA Sequencing, combined with the computational power of bioinformatics workflows, has revolutionised drug discovery pipelines by enabling researchers to mine large data volumes. Large data analysis lays the foundation for robust data validation and provides early insights into results that would otherwise be buried in the ‘noise’ of big data. Computational and statistical visualisation tools continue to increase the sensitivity of data analysis, exponentially increasing the transformational power of big data.

However, harnessing this transformational power remains a challenge for researchers. Scientists wishing to draw direct and validated insights from their own data are either delayed or shut out entirely, owing to the complexity of bioinformatics pipelines and the paucity of infrastructure for compute-intensive workflows.

With the advent of cloud computing services, the challenge that infrastructure poses has been largely addressed, but doing so requires institutional collaborations. Quark offers a self-service platform that researchers can readily access to get insights from their genomics data. Without having to set up sophisticated infrastructure or worry about compatibility issues, researchers can focus entirely on data analysis and accelerate experimental validation and lead discovery from their data.

Quark offers a simplified solution by providing a modular workflow design that researchers can deploy for RNA Seq. data analysis, without prior knowledge or expertise in bioinformatics. Quark empowers researchers to draw actionable insights from their data in early exploratory analyses, thus accelerating drug discovery through simplified accessibility of complex resources.

Schedule a demo with us to learn more about Quark’s data analytics platform.
