The integration of single-cell sequencing datasets across experimental batches, donors, or conditions is often an important step in transcriptomics workflows. Integrative analysis helps to match shared cell types and states across datasets, which facilitates accurate comparative analysis. Quark enables researchers to horizontally integrate their data using an end-to-end analysis workflow, accelerating biological insights for biomarker discovery.
Introduction: What is single-cell transcriptomics?
Single-cell RNA sequencing (scRNA-seq) is an omics approach that quantifies the transcriptomic state of individual cells at high resolution, enabling researchers to profile and cluster cells based on their gene expression patterns and to study cell lineage and cellular heterogeneity.
Single-cell technologies have revolutionized multi-omics research by capturing cellular heterogeneity, which is pivotal in providing mechanistic insights for complex diseases like cancer.
Cellular heterogeneity refers to the inherent biological differences between cells of the same type. It is especially observable during dynamic processes like the cell cycle, embryonic development, cell signaling, disease pathogenesis, and cancer.
Capturing cellular heterogeneity helps to delineate cancer subpopulations and map the tumor microenvironment (TME). In a fully functional workflow, scRNA-seq can therefore provide high-resolution, sensitive, and precise insights into cancer biology by:
- characterizing the TME;
- determining cancer evolution;
- capturing tumor heterogeneity; and
- defining cancer-predisposing states.
scRNA-seq has reshaped cancer treatment pipelines by helping researchers to:
- stratify patients at the molecular level;
- validate drug discovery targets;
- monitor and predict therapeutic trajectories; and
- determine clinical end-points at a cellular level.
When combined with AI/ML data analytics, scRNA-seq and other Next Generation Sequencing (NGS) techniques can accelerate precision medicine and predictive drug discovery, with the potential to improve patient outcomes. This was evidenced in an oncology trial, where scientists leveraged the high sensitivity of scRNA-seq to monitor the response of 226 cancer patients to chemotherapy.
scRNA-seq allowed them to capture early indicators of treatment resistance at the molecular level. As a result, cancer treatment strategies could be modified in real time.
This clinical study offered an exciting glimpse into the transformative potential of single-cell transcriptomics. However, navigating the complex landscape of scRNA-seq workflows presents a formidable challenge to many bench scientists, which limits its scalability and application in clinical settings.
scRNA-seq data analysis workflows remain largely inaccessible, which limits the value that can be extracted from both clinical and multi-omics data assets.
Challenges in scRNA-seq Data Analysis
scRNA-seq offers higher resolution and precision than bulk RNA sequencing in capturing the transcriptomic state of a cell. Where bulk RNA-seq averages gene expression across a mixed population of cells, scRNA-seq measures expression in each cell separately, so differences between subpopulations are not averaged out.
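The difference between an averaged and a per-cell readout can be illustrated with a minimal sketch. The counts below are hypothetical toy values for a single gene, and the function names are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical toy counts for one gene across six cells:
# three cells express it strongly, three do not express it at all.
cells = {
    "cell_1": 20, "cell_2": 22, "cell_3": 18,  # expressing subpopulation
    "cell_4": 0, "cell_5": 0, "cell_6": 0,     # non-expressing subpopulation
}

def bulk_signal(counts):
    """Bulk-style readout: one averaged value for the whole population."""
    return mean(counts.values())

def single_cell_groups(counts, threshold=1):
    """Single-cell readout: split cells by whether the gene is detected."""
    expressing = sorted(c for c, v in counts.items() if v >= threshold)
    silent = sorted(c for c, v in counts.items() if v < threshold)
    return expressing, silent

bulk = bulk_signal(cells)
expressing, silent = single_cell_groups(cells)
print(f"bulk average: {bulk}")
print(f"expressing cells: {expressing}")
print(f"silent cells: {silent}")
```

The single averaged value suggests moderate expression everywhere, while the per-cell view shows two distinct subpopulations.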
However, the same high-resolution capacity means that scRNA-seq generates large volumes of ‘noisy’ data, which presents numerous challenges to the scRNA-seq workflow. Some of these are highlighted below.
- scRNA-seq samples require rigorous quality control due to batch effects, challenging the design of end-to-end data analysis workflows
scRNA-seq involves processing cells from different samples across multiple runs. These runs may have used, for example, different methods to collect scRNA-seq data (droplet-based or microfluidic-based methods), or the samples may be from different donors and conditions.
These differences introduce batch-effect variation in gene expression, which can cause cells to cluster by technical origin rather than biological origin. In other words, clusters may appear to be distinct cell types even when they are actually the same.
Therefore, additional steps of rigorous quality control and batch correction are necessary to harmonize data and reveal underlying biological patterns.
Several benchmarking studies have compared bioinformatics software and tools available for batch correction and quality control.
However, selecting the appropriate tool complicates the design of end-to-end data analysis workflows, which challenges bench scientists who may not have bioinformatics expertise and delays scRNA-seq data analysis.
- scRNA-seq data analysis includes several time-consuming steps like quality control, dimensionality reduction, clustering, and cell annotation, affecting reproducibility
scRNA-seq captures relatively low RNA input from a single cell compared to bulk RNA-seq. Thus, low-abundance transcripts may not be captured as read-outs, causing what is known as the ‘drop-out effect,’ and sparse expression matrices.
This makes it difficult to distinguish true biological zeros (genes not expressed) from technical zeros (genes expressed but not detected).
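As a rough illustration of what a sparse expression matrix looks like, the sketch below measures the fraction of zeros in a toy count matrix and the per-gene detection rate. The matrix values and function names are hypothetical. The code only counts zeros and cannot tell biological zeros from technical ones, which is the difficulty described above.

```python
# Hypothetical toy count matrix: rows are cells, columns are genes.
counts = [
    [0, 3, 0, 0, 1],
    [0, 0, 0, 2, 0],
    [5, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

def zero_fraction(matrix):
    """Fraction of all entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

def detection_rate_per_gene(matrix):
    """For each gene (column), the fraction of cells with a non-zero count."""
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    return [
        sum(1 for row in matrix if row[g] > 0) / n_cells
        for g in range(n_genes)
    ]

sparsity = zero_fraction(counts)
detection = detection_rate_per_gene(counts)
print(f"fraction of zeros: {sparsity}")
print(f"detection rate per gene: {detection}")
```

A gene with a detection rate of zero (the third column here) could be truly silent or simply missed in every cell; the counts alone do not say which.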
Various imputation methods have been developed to mitigate the drop-out effect, but selecting the right one is a challenge for those unfamiliar with scRNA-seq data and introduces a high degree of variability into the data analysis workflow.
Similarly, each subsequent step in downstream analysis, including batch correction and dimensionality reduction (covered below), has various tools and algorithms that address it.
However, there is a lack of consensus guidelines on tool selection, which can affect the reproducibility of outcomes and their interpretation.
scRNA-seq data analysis: Steps in QC and Downstream Analysis (figure from Kim et al., 2024)
- The high dimensionality of scRNA-seq data poses a challenge in accurate data interpretation
scRNA-seq data captures cellular heterogeneity and is therefore high-dimensional, covering tens of thousands of genes across thousands to millions of cells.
Techniques like principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP) help visualize data and identify cell populations in a lower-dimensional space.
But interpreting the output from a static and non-interactive visualisation is challenging and time-consuming, especially if the researcher is unfamiliar with the working principles of dimensionality reduction and clustering algorithms.
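For readers unfamiliar with how PCA works, the following minimal sketch finds the first principal component of a tiny toy matrix by power iteration on the covariance matrix. It is an illustration only: the data values and function names are assumptions, and real workflows would rely on an optimized library rather than hand-written loops.

```python
import math

def center(rows):
    """Subtract the per-gene mean so each column has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (genes x genes) of centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize.

    Assumes cov is non-zero and the start vector is not orthogonal
    to the leading eigenvector (true for the toy data below).
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """One-dimensional coordinates of each cell along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]

# Hypothetical toy data: 4 cells x 3 genes. Genes 0 and 1 vary together,
# gene 2 is constant and carries no information.
cells = [[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5]]
centered = center(cells)
pc1 = first_principal_component(covariance(centered))
scores = project(centered, pc1)
print("PC1 loadings:", [round(x, 4) for x in pc1])
print("PC1 scores:", [round(s, 4) for s in scores])
```

The loadings show which genes drive the main axis of variation, and the scores are the lower-dimensional coordinates that a PCA plot would display for each cell.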
- scRNA-seq data analysis presents cost-prohibitive computational challenges
As explained above, scRNA-seq data is high-dimensional and its analysis is complex and time-consuming, requiring sophisticated compute resources such as High Performance Computing (HPC) clusters, cloud infrastructure, or Kubernetes. These resources may not be readily accessible to many bench scientists.
Scaling scRNA-seq data analysis for multiple samples (for data integration) further increases computational burden and requires IT expertise to allocate resources appropriately.
Thus, although scRNA-seq has transformed our understanding of cellular heterogeneity, the unique characteristics of its data present numerous analytical hurdles that make it difficult for many bench scientists and researchers to extract meaningful biological insights in a timely manner.
Overcoming Challenges in scRNA-seq Analytics
The manifold technical challenges in scRNA-seq downstream analysis can be attributed to two major causes: the stochasticity of gene expression (causing “drop-outs”) and batch effects.
One approach that addresses both challenges simultaneously is horizontal integration, where multiple scRNA-seq datasets from different experiments or conditions are combined and reconstructed in a shared space.
Pioneered by Haghverdi et al., horizontal integration ‘de-noises’ data by establishing connections between datasets and aligning them into a common, integrated space.
Horizontal Integration: Denoising technical variability (figure from Jackson et al., 2022)
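One way to picture how connections between datasets are established is a mutual nearest neighbors (MNN) search: a pair of cells from two batches is linked when each is among the other's nearest neighbors. The sketch below is a deliberately simplified illustration, not the published algorithm. It estimates a single global shift from the linked pairs, whereas practical methods compute locally weighted corrections. All coordinates and function names are hypothetical.

```python
import math

def k_nearest(query, reference, k):
    """Indices of the k reference cells closest to the query cell."""
    order = sorted(range(len(reference)),
                   key=lambda j: math.dist(query, reference[j]))
    return order[:k]

def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Pairs (i, j) where a_i and b_j are in each other's k nearest neighbors."""
    nn_ab = [set(k_nearest(a, batch_b, k)) for a in batch_a]
    nn_ba = [set(k_nearest(b, batch_a, k)) for b in batch_b]
    return sorted(
        (i, j)
        for i in range(len(batch_a))
        for j in nn_ab[i]
        if i in nn_ba[j]
    )

def estimate_shift(batch_a, batch_b, pairs):
    """Average difference (b - a) over linked pairs: a crude batch offset."""
    dims = len(batch_a[0])
    return [
        sum(batch_b[j][d] - batch_a[i][d] for i, j in pairs) / len(pairs)
        for d in range(dims)
    ]

def apply_shift(batch, shift):
    """Move every cell of the batch back by the estimated offset."""
    return [[v - s for v, s in zip(cell, shift)] for cell in batch]

# Hypothetical toy embeddings: batch_b is batch_a displaced by (1, 1).
batch_a = [[0.0, 0.0], [10.0, 10.0]]
batch_b = [[1.0, 1.0], [11.0, 11.0]]

pairs = mutual_nearest_pairs(batch_a, batch_b, k=1)
shift = estimate_shift(batch_a, batch_b, pairs)
corrected_b = apply_shift(batch_b, shift)
print("MNN pairs:", pairs)
print("estimated shift:", shift)
print("corrected batch B:", corrected_b)
```

After the shift is removed, matching cells from the two batches occupy the same position in the shared space, so any remaining differences are more likely to be biological.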
Advantages of Horizontal Integration in scRNA-seq data analysis
- Batch effects correction: Horizontal integration reduces technical noise or batch effects. As shown in the figure above, multiple datasets are combined and de-noised, following which the remaining variability is used to identify cell-types and gene expression differences.
- Increased statistical power: By combining multiple datasets, the statistical power to detect drop-outs, subtle gene expression changes, and rare cell-types increases. This allows for a more comprehensive and robust interpretation of results.
- Consistent identification of conserved patterns: Horizontal integration can also help to identify and define cell types consistently across different datasets, even if some cell types are rare in individual datasets. It also helps discover conserved patterns of gene expression across datasets.
- Direct data comparison: Most importantly, by combining datasets into a single shared space, researchers can directly compare datasets from different conditions (e.g., normal vs. disease cells, treated vs. untreated cells, and so on).
Horizontal integration thus accelerates results interpretation and disease insights, with the added advantage of de-noising data.
To summarize, horizontal integration creates a normalized and de-noised gene expression matrix which can be processed further for downstream analysis (for dimensionality reduction, cell clustering and annotation).
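As background on what a normalized expression matrix means in practice, the sketch below shows one common convention: scaling each cell's counts to a fixed library size and applying a log transform. This is an assumption chosen for illustration, not a description of the specific normalization used in any particular pipeline. The scale factor of 10,000 and the function name are hypothetical choices.

```python
import math

def log_normalize(counts, scale=10_000):
    """Scale each cell to a common total, then apply log(1 + x).

    counts: list of cells, each a list of raw gene counts.
    Cells with no counts at all are returned as all zeros.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            normalized.append([0.0] * len(cell))
            continue
        normalized.append([math.log1p(v / total * scale) for v in cell])
    return normalized

# Hypothetical toy counts: the second cell was sequenced ten times deeper
# than the first but has the same relative expression profile.
raw = [[1, 3], [10, 30], [0, 0]]
normalized = log_normalize(raw)
for cell in normalized:
    print([round(v, 4) for v in cell])
```

After normalization, the first two cells have identical values, so the difference in sequencing depth no longer looks like a difference in expression.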
The integrated data can be visualized more effectively in lower-dimensional visualizations (e.g., PCA, UMAP, t-SNE), revealing true biological relationships between cells that might otherwise be obscured.
Horizontal integration thus enables researchers to pool information from various scRNA-seq experiments, leading to a more comprehensive, accurate, and biologically meaningful understanding of cellular heterogeneity.
Use-Case: Horizontal Integration on Quark
Quark enables researchers to use end-to-end horizontal integration for direct comparison of datasets from different scRNA-seq experiments, donors, or conditions.
Key features of scRNA-seq integrative analysis on Quark
- End-to-end data ingestion, integration, and analysis that encompasses all steps, from generation of the count matrix through batch correction, normalization, scaling, dimensionality reduction, and clustering to differential gene expression analysis
- Easily accessible MultiQC reports from a single dashboard
- Automated scaling of compute resources based on the input files
- Interactive no-code visualizations for UMAP and t-SNE plots for effective and accelerated exploratory data analysis
Use-case illustrating scRNA-seq horizontal integration on Quark
The following use-case illustrates the scope and scale of integrative data analysis on Quark. Sample data was obtained from Lambrechts et al., 2018, a single-cell study that characterizes the tumor microenvironment (TME) of patients with lung carcinoma.
The use-case includes three different datasets from patients with lung carcinoma, where:
- Batch 1 (BT1294) refers to the normal non-tumor patient sample
- Batch 2 (BT1290) refers to the patient sample (from core lung tissue)
- Batch 3 (BT1298) refers to the patient sample (from lung TME)
Horizontal Integration on Quark: Patient Characteristics Overview
The samples were analysed on Quark using nf-core/scrnaseq, with Harmony for scRNA-seq data integration.
Harmony uses iterative clustering in the PCA space to integrate data, and has the advantages of:
- being fast;
- handling large datasets effectively;
- preserving global structure; and
- avoiding the computationally expensive nearest neighbor searches that mutual nearest neighbors (MNN)-based methods like Seurat require.
In addition, Harmony applies a linear correction to the PCA embeddings and uses soft k-means clustering with a diversity penalty, which encourages each cluster to contain cells from multiple batches.
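The idea of soft clustering with a diversity penalty can be sketched in a few lines. The code below is a simplified, single-pass illustration and not Harmony's actual implementation. It assumes a penalty of the form (expected / observed) raised to a power theta for each cell's own batch, uses one-dimensional toy embeddings, and omits the linear correction step entirely. The parameter names sigma and theta and all data values are assumptions.

```python
import math

def soft_assign(points, centroids, sigma, weights=None):
    """Soft k-means: membership proportional to exp(-dist^2 / sigma) * weight."""
    assignments = []
    for i, p in enumerate(points):
        scores = []
        for k, c in enumerate(centroids):
            d2 = sum((a - b) ** 2 for a, b in zip(p, c))
            w = 1.0 if weights is None else weights[i][k]
            scores.append(math.exp(-d2 / sigma) * w)
        total = sum(scores)
        assignments.append([s / total for s in scores])
    return assignments

def diversity_weights(assignments, batches, theta):
    """Per-cell, per-cluster weight (expected / observed) ** theta.

    observed: soft count of the cell's batch in the cluster.
    expected: cluster size times the batch's overall frequency.
    """
    n_clusters = len(assignments[0])
    batch_ids = sorted(set(batches))
    freq = {b: batches.count(b) / len(batches) for b in batch_ids}
    mass = [sum(r[k] for r in assignments) for k in range(n_clusters)]
    observed = {
        b: [sum(r[k] for r, bb in zip(assignments, batches) if bb == b)
            for k in range(n_clusters)]
        for b in batch_ids
    }
    return [
        [(mass[k] * freq[b] / max(observed[b][k], 1e-12)) ** theta
         for k in range(n_clusters)]
        for b in batches
    ]

# Hypothetical 1-D embeddings: batch B is displaced from batch A.
points = [(0.0,), (0.2,), (1.0,), (1.2,)]
batches = ["A", "A", "B", "B"]
centroids = [(0.1,), (1.1,)]

plain = soft_assign(points, centroids, sigma=0.5)
weights = diversity_weights(plain, batches, theta=1.0)
penalized = soft_assign(points, centroids, sigma=0.5, weights=weights)
print("batch B cell, cluster 0, no penalty:", round(plain[2][0], 3))
print("batch B cell, cluster 0, with penalty:", round(penalized[2][0], 3))
```

Without the penalty, each cluster is dominated by one batch. With it, cells are nudged toward clusters where their batch is under-represented, which is the mixing behavior the diversity penalty is meant to encourage.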
Following the completion of horizontal integration, researchers can leverage Quark’s in-built, no-code analytics to directly visualize the results of their tertiary data analysis from a single dashboard.
Exploratory data analysis on Quark
On Quark, researchers may navigate to their results and select between different clustering visualisations (UMAP, t-SNE, PCA) from a single dashboard.
Horizontal integration on Quark: Access exploratory insights from a single dashboard
Researchers can select between different views (Cluster, Cohort, Cell Type and Batch), to compare cell cluster data or directly visualise differences between their sample datasets.
For example, the three use-case datasets specified earlier cluster into multiple cell types, as shown in the figure below. Cancer subpopulations may not be directly apparent, but can be visualised using up to four marker genes (e.g., EPCAM).
Horizontal integration on Quark: UMAP cell type view of integrated datasets
The figure above groups cells by cell type (macrophages, endothelial cells, alveolar fibroblasts, CD4 T cells, CD8 T cells, and so on).
In contrast, in the following Batch view of the integrated datasets, sample variations due to biological origin (tumor vs. healthy) are directly apparent.
Horizontal integration on Quark: UMAP Batch view of integrated datasets
Clustering by batch enhances and accelerates insights about variations in the distribution of cellular populations between normal and tumor cells. For example, the TME (BT1298) appears enriched with distinct subpopulations of CD8 T cells, alveolar fibroblasts, CD4 T cells, AT2 proliferating cells, plasma cells, and basal resting cells, compared to the normal and cancer-core lung tissue samples.
The annotated cell subpopulations can be further validated using marker genes. For example, the distribution of the cancer marker gene Epithelial Cell Adhesion Molecule (EPCAM) showed that it is expressed primarily in cells from the TME sample.
Horizontal integration on Quark: Distribution of EPCAM in clustering visualizations
Similarly, a visual comparison of the distribution of EPCAM and an endothelial tumor marker gene using UMAP (Cluster and Batch views) allows the direct comparison of cellular subpopulations based on differential expression patterns.
Horizontal integration on Quark: EPCAM primarily distributes to the TME, whereas endothelial tumor marker gene HSPG2 appears to be predominant within the core lung tissue
Thus, with horizontal integration, researchers can get accelerated biological insights within a larger biological context.
Leveraging Quark’s instant no-code visualizations, researchers can compare and contrast marker gene expression, directly interpreting their data in the context of their research.
Differential gene expression: visualization and analysis
Early exploratory insights about differentially expressed genes can be inferred from individual volcano plots and heat maps.
The volcano plot below illustrates the enrichment of differentially expressed genes in the TME. The table lists significantly enriched genes in the cluster (TME subpopulation of cells), accelerating early exploratory gene expression insights.
Differential gene expression: Volcano plot (of the TME cluster) and the scRNA-seq heatmap show top differentially expressed genes in each cluster
For example, the volcano plot for the TME cluster of cells lists LYPD3 (LY6/PLAUR domain containing 3), MARK4 (microtubule affinity regulating kinase 4), and TMEM140 (transmembrane protein 140) as the top three significantly enriched genes, which are markers known to be associated with cancer metastasis.
Further exploration of other cell clusters by subpopulation allows for a direct comparison of gene expression data, enabling researchers to get early insights into altered pathways in the tumor microenvironment.
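For readers new to volcano plots, each gene is placed by two numbers: the log2 fold change between the cluster and all other cells (x-axis) and the negative log10 of its p-value (y-axis). The sketch below computes these coordinates for hypothetical genes with made-up summary values. The gene labels, pseudocount, and cutoffs are assumptions, and the p-values are taken as given rather than computed from a statistical test.

```python
import math

def volcano_coordinates(mean_in_cluster, mean_elsewhere, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p-value) for one gene.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    log2_fc = math.log2(
        (mean_in_cluster + pseudocount) / (mean_elsewhere + pseudocount)
    )
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p

def is_significant(log2_fc, neg_log10_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Flag genes beyond both the fold-change and p-value cutoffs."""
    return abs(log2_fc) >= fc_cutoff and neg_log10_p >= -math.log10(p_cutoff)

# Hypothetical per-gene summaries: (mean in cluster, mean elsewhere, p-value).
genes = {
    "gene_A": (7.0, 1.0, 0.001),  # strongly enriched, small p-value
    "gene_B": (3.0, 3.0, 0.5),    # unchanged
    "gene_C": (0.0, 3.0, 0.01),   # depleted in the cluster
}

results = {}
for name, (inside, outside, p) in genes.items():
    x, y = volcano_coordinates(inside, outside, p)
    results[name] = (x, y, is_significant(x, y))
    print(f"{name}: log2FC={x:.2f}, -log10(p)={y:.2f}, significant={results[name][2]}")
```

Genes flagged as significant land in the upper-left or upper-right corners of the plot, which is where enriched or depleted genes would be read off.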
Conclusion
Single-cell technologies have revolutionized the multi-omics landscape, enabling researchers to comprehensively map individual cells and capture the full scope of their heterogeneity. This plays a crucial role in characterizing cancer subpopulations and identifying cancer-predisposing states.
However, the inherent complexity of scRNA-seq data complicates the design of end-to-end analysis workflows that can adequately address the requirements of quality control, batch effects, and downstream analysis.
One approach to simultaneously denoise data while enabling a direct comparison of multiple datasets is horizontal integration.
Quark leverages horizontal integration to allow for a more comprehensive and robust interpretation of scRNA-seq data. With this workflow, users can directly compare their datasets and interpret gene expression within a larger biological context.
With Quark, researchers can leverage the platform’s intuitive design to streamline their data analysis, using in-built analytics and visualizations to get accelerated biological insights.
Request a demo to learn more about Quark.