Tutorial 1: Data integration Using Harmony¶
Import Packages and load datasets¶
[1]:
import os
import sys
[2]:
sys.path.append(os.getcwd())
[3]:
import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData
[4]:
dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])
Preprocessing datasets¶
[5]:
stereoAlign.pp.summarize_counts(dataset)
[6]:
stereoAlign.pp.norma_log(dataset)
Optional:¶
Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.
[7]:
dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")
Visualization original dataset¶
Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.
[8]:
stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)
PCA
Nearest Neigbours
UMAP
[9]:
sc.tl.leiden(dataset)
[10]:
sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])
Calculation metric score using original datasets¶
[11]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
dataset,
key="batch",
use_rep="X_umap",
alpha=0.1,
n_neighbors=15)
[12]:
stat_mean, pvalue_mean, accept_rate
[12]:
(17.30697483381354, 0.08259827579077056, 0.1910868868226996)
[13]:
stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")
[13]:
0.9896508327875588
[14]:
stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")
[14]:
0.5907675176858902
Using Harmony to integration datasets¶
The corrected feature matrix will be saved in *.obsm["aligned_harmony"]
[15]:
harmony_dataset = dataset.copy()
harmony_corrected = stereoAlign.alg.harmony_alignment(harmony_dataset, batch_key="batch")
Initialization is completed.
Completed 1 / 10 iteration(s).
Completed 2 / 10 iteration(s).
Completed 3 / 10 iteration(s).
Completed 4 / 10 iteration(s).
Completed 5 / 10 iteration(s).
Completed 6 / 10 iteration(s).
Completed 7 / 10 iteration(s).
Completed 8 / 10 iteration(s).
Completed 9 / 10 iteration(s).
Completed 10 / 10 iteration(s).
[16]:
harmony_corrected
[16]:
AnnData object with n_obs × n_vars = 3119 × 20341
obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden'
var: 'n_cells', 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
uns: 'pca', 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors', 'celltype_colors'
obsm: 'spatial', 'X_pca', 'X_umap', 'X_umap_knn_connectivity', 'X_umap_knn_distances', 'aligned_harmony'
obsp: 'distances', 'connectivities'
Visualization of integrated datasets¶
[17]:
stereoAlign.pp.reduce_data(
harmony_corrected, pca=False, pca_comps=100, neighbors=True, use_rep="aligned_harmony", umap=True)
Nearest Neigbours
UMAP
[18]:
sc.tl.leiden(harmony_corrected)
[19]:
sc.pl.umap(harmony_dataset, color=["batch", "leiden", "celltype"])
Calculation metirc score using integrated datasets by Harmony¶
Harmony, we can calculate the metric score.Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.
Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.
[20]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
harmony_corrected,
key="batch",
use_rep="aligned_harmony",
alpha=0.05,
n_neighbors=15)
[21]:
stat_mean, pvalue_mean, accept_rate
[21]:
(11.041335043865018, 0.07790559676142805, 0.2792561718499519)
Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.
The larger the
ilisi_meanvalue, the better.
[22]:
ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
harmony_corrected,
key="batch",
use_rep="X_umap",
n_neighbors=15)
[23]:
ilisi_mean, lower, upper
[23]:
(2.501600357575848, 2.4877128083610844, 2.5154879067906113)
The smaller the
clisi_meanvalue, the better.
[24]:
clisi_mean, lower, upper = stereoAlign.metrics.lisi(
harmony_corrected,
key="celltype",
use_rep="X_umap",
n_neighbors=15)
[25]:
clisi_mean, lower, upper
[25]:
(1.2669337564931125, 1.2536753704133714, 1.2801921425728535)
Quantify the connectivity of the subgraph per cell type label.
[26]:
stereoAlign.metrics.graph_connectivity(harmony_corrected, label_key="celltype")
[26]:
0.985720075040572
Average silhouette width (ASW)
The values range from [-1, 1] with
* 1 indicates distinct, compact clusters * 0 indicates overlapping clusters * -1 indicates core-periphery (non-cluster) structure
By default, the score is scaled between 0 and 1 (
scale=True).
[27]:
stereoAlign.metrics.silhouette(harmony_corrected, label_key="celltype", embed="X_umap")
[27]:
0.6264222264289856
Modified average silhouette width (ASW) of batch
This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.
[28]:
stereoAlign.metrics.silhouette_batch(
harmony_corrected, batch_key="batch", label_key="celltype", embed="X_umap")
mean silhouette per group: silhouette_score
group
glomerular layer (GL) 0.726641
granule cell layer (GCL) 0.937189
mitral cell layer (MCL) 0.885625
olfactory nerve layer (ONL) 0.907767
rostral migratory stream (RMS) 0.616016
[28]:
0.8146478069247627
[ ]: