Tutorial 1: Data integration Using Harmony¶

Import Packages and load datasets¶

[1]:

import os
import sys

[2]:

sys.path.append(os.getcwd())

[3]:

import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData

[4]:

dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])

Preprocessing datasets¶

[5]:

stereoAlign.pp.summarize_counts(dataset)

[6]:

stereoAlign.pp.norma_log(dataset)

Optional:¶

Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.

[7]:

dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")

Visualization original dataset¶

Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.

[8]:

stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)

PCA
Nearest Neigbours
UMAP

[9]:

sc.tl.leiden(dataset)

[10]:

sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])

../_images/notebook_Tutorial-1-Data-integration-Using-Harmony_14_0.png

Calculation metric score using original datasets¶

[11]:

stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    dataset,
    key="batch",
    use_rep="X_umap",
    alpha=0.1,
    n_neighbors=15)

[12]:

stat_mean, pvalue_mean, accept_rate

[12]:

(17.30697483381354, 0.08259827579077056, 0.1910868868226996)

[13]:

stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")

[13]:

0.9896508327875588

[14]:

stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")

[14]:

0.5907675176858902

Using `Harmony` to integration datasets¶

The corrected feature matrix will be saved in *.obsm["aligned_harmony"]

[15]:

harmony_dataset = dataset.copy()
harmony_corrected = stereoAlign.alg.harmony_alignment(harmony_dataset, batch_key="batch")

        Initialization is completed.
        Completed 1 / 10 iteration(s).
        Completed 2 / 10 iteration(s).
        Completed 3 / 10 iteration(s).
        Completed 4 / 10 iteration(s).
        Completed 5 / 10 iteration(s).
        Completed 6 / 10 iteration(s).
        Completed 7 / 10 iteration(s).
        Completed 8 / 10 iteration(s).
        Completed 9 / 10 iteration(s).
        Completed 10 / 10 iteration(s).

[16]:

harmony_corrected

[16]:

AnnData object with n_obs × n_vars = 3119 × 20341
    obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden'
    var: 'n_cells', 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
    uns: 'pca', 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors', 'celltype_colors'
    obsm: 'spatial', 'X_pca', 'X_umap', 'X_umap_knn_connectivity', 'X_umap_knn_distances', 'aligned_harmony'
    obsp: 'distances', 'connectivities'

Visualization of integrated datasets¶

[17]:

stereoAlign.pp.reduce_data(
    harmony_corrected, pca=False, pca_comps=100, neighbors=True, use_rep="aligned_harmony", umap=True)

Nearest Neigbours
UMAP

[18]:

sc.tl.leiden(harmony_corrected)

[19]:

sc.pl.umap(harmony_dataset, color=["batch", "leiden", "celltype"])

../_images/notebook_Tutorial-1-Data-integration-Using-Harmony_26_0.png

Calculation metirc score using integrated datasets by `Harmony`¶

In order to quantify the performance and efficacy of the integrated datasets using the powerful Harmony, we can calculate the metric score.

This score serves as a reliable measure of the harmonious blending and seamless integration achieved through the amalgamation process.

Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.
Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.

[20]:

stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    harmony_corrected,
    key="batch",
    use_rep="aligned_harmony",
    alpha=0.05,
    n_neighbors=15)

[21]:

stat_mean, pvalue_mean, accept_rate

[21]:

(11.041335043865018, 0.07790559676142805, 0.2792561718499519)

Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.
The larger the ilisi_mean value, the better.

[22]:

ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
    harmony_corrected,
    key="batch",
    use_rep="X_umap",
    n_neighbors=15)

[23]:

ilisi_mean, lower, upper

[23]:

(2.501600357575848, 2.4877128083610844, 2.5154879067906113)

The smaller the clisi_mean value, the better.

[24]:

clisi_mean, lower, upper = stereoAlign.metrics.lisi(
    harmony_corrected,
    key="celltype",
    use_rep="X_umap",
    n_neighbors=15)

[25]:

clisi_mean, lower, upper

[25]:

(1.2669337564931125, 1.2536753704133714, 1.2801921425728535)

Quantify the connectivity of the subgraph per cell type label.

[26]:

stereoAlign.metrics.graph_connectivity(harmony_corrected, label_key="celltype")

[26]:

0.985720075040572

Average silhouette width (ASW)

The values range from [-1, 1] with

* 1 indicates distinct, compact clusters
* 0 indicates overlapping clusters
* -1 indicates core-periphery (non-cluster) structure

By default, the score is scaled between 0 and 1 (scale=True).

[27]:

stereoAlign.metrics.silhouette(harmony_corrected, label_key="celltype", embed="X_umap")

[27]:

0.6264222264289856

Modified average silhouette width (ASW) of batch

This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.

[28]:

stereoAlign.metrics.silhouette_batch(
    harmony_corrected, batch_key="batch", label_key="celltype", embed="X_umap")

mean silhouette per group:                                 silhouette_score
group
glomerular layer (GL)                   0.726641
granule cell layer (GCL)                0.937189
mitral cell layer (MCL)                 0.885625
olfactory nerve layer (ONL)             0.907767
rostral migratory stream (RMS)          0.616016

[28]:

0.8146478069247627

[ ]:

Tutorial 1: Data integration Using Harmony¶

Import Packages and load datasets¶

Preprocessing datasets¶

Optional:¶

Visualization original dataset¶

Calculation metric score using original datasets¶

Using `Harmony` to integration datasets¶

Visualization of integrated datasets¶

Calculation metirc score using integrated datasets by `Harmony`¶

stereoAlign

Navigation

Related Topics

Tutorial 1: Data integration Using Harmony¶

Import Packages and load datasets¶

Preprocessing datasets¶

Optional:¶

Visualization original dataset¶

Calculation metric score using original datasets¶

Using Harmony to integration datasets¶

Visualization of integrated datasets¶

Calculation metirc score using integrated datasets by Harmony¶

Using `Harmony` to integration datasets¶

Calculation metirc score using integrated datasets by `Harmony`¶