Tutorial 2: Data integration Using BBKNN¶
Import Packages and load datasets¶
[1]:
import os
import sys
from warnings import filterwarnings
[2]:
sys.path.append(os.getcwd())
filterwarnings("ignore")
[3]:
import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData
[4]:
dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])
Preprocessing datasets¶
[5]:
stereoAlign.pp.summarize_counts(dataset)
[6]:
stereoAlign.pp.norma_log(dataset)
Optional:¶
Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.
[7]:
dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")
Visualization original dataset¶
Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.
[8]:
stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)
PCA
Nearest Neigbours
UMAP
[9]:
sc.tl.leiden(dataset)
[10]:
sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])
Calculation metric score using original datasets¶
[11]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
dataset,
key="batch",
use_rep="X_umap",
alpha=0.1,
n_neighbors=15)
[12]:
stat_mean, pvalue_mean, accept_rate
[12]:
(17.377788396500012, 0.08517801527945788, 0.19493427380570696)
[13]:
stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")
[13]:
0.9917491749174918
[14]:
stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")
[14]:
0.5771105885505676
Using BBKNN to integration datasets¶
[15]:
bbknn_dataset = dataset.copy()
bbknn_corrected = stereoAlign.alg.bbknn_alignment(bbknn_dataset, batch_key="batch")
/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/faiss/loader.py:28: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion(numpy.__version__) >= "1.19":
/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/setuptools/_distutils/version.py:345: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)
[16]:
bbknn_corrected
[16]:
AnnData object with n_obs × n_vars = 3119 × 20341
obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden'
var: 'n_cells', 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
uns: 'pca', 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors', 'celltype_colors'
obsm: 'spatial', 'X_pca', 'X_umap', 'X_umap_knn_connectivity', 'X_umap_knn_distances'
obsp: 'distances', 'connectivities'
Visualization of integrated datasets¶
[17]:
stereoAlign.pp.reduce_data(
bbknn_corrected, pca=True, pca_comps=100, neighbors=False, use_rep="X_pca", umap=True)
PCA
UMAP
[18]:
sc.tl.leiden(bbknn_corrected)
[19]:
sc.pl.umap(bbknn_corrected, color=["batch", "leiden", "celltype"])
Calculation metirc score using integrated datasets by BBKNN¶
BBKNN, we can calculate the metric score.Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.
Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.
[20]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
bbknn_corrected,
key="batch",
use_rep="X_umap",
alpha=0.05,
n_neighbors=15)
[21]:
stat_mean, pvalue_mean, accept_rate
[21]:
(7.685325943884262, 0.18556349233809058, 0.5338249438922732)
Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.
The larger the
ilisi_meanvalue, the better.
[22]:
ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
bbknn_corrected,
key="batch",
use_rep="X_umap",
n_neighbors=15)
[23]:
ilisi_mean, lower, upper
[23]:
(2.099519537611782, 2.0818878012324475, 2.1171512739911167)
The smaller the
clisi_meanvalue, the better.
[24]:
clisi_mean, lower, upper = stereoAlign.metrics.lisi(
bbknn_corrected,
key="celltype",
use_rep="X_umap",
n_neighbors=15)
[25]:
clisi_mean, lower, upper
[25]:
(1.2433578961733485, 1.2310056237428135, 1.2557101686038834)
Quantify the connectivity of the subgraph per cell type label.
[26]:
stereoAlign.metrics.graph_connectivity(bbknn_corrected, label_key="celltype")
[26]:
0.9910141958232087
Average silhouette width (ASW)
The values range from [-1, 1] with
* 1 indicates distinct, compact clusters * 0 indicates overlapping clusters * -1 indicates core-periphery (non-cluster) structure
By default, the score is scaled between 0 and 1 (
scale=True).
[27]:
stereoAlign.metrics.silhouette(bbknn_corrected, label_key="celltype", embed="X_umap")
[27]:
0.6331341564655304
Modified average silhouette width (ASW) of batch
This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.
[28]:
stereoAlign.metrics.silhouette_batch(
bbknn_corrected, batch_key="batch", label_key="celltype", embed="X_umap")
mean silhouette per group: silhouette_score
group
glomerular layer (GL) 0.721066
granule cell layer (GCL) 0.872616
mitral cell layer (MCL) 0.845666
olfactory nerve layer (ONL) 0.872244
rostral migratory stream (RMS) 0.698698
[28]:
0.8020580056902566
[ ]: