Tutorial 4: Data integration Using scvi¶
Import Packages and load datasets¶
[1]:
import os
import sys
from warnings import filterwarnings
[2]:
sys.path.append(os.getcwd())
filterwarnings("ignore")
[3]:
import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData
[4]:
dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])
[5]:
dataset.layers["counts"] = dataset.X
Preprocessing datasets¶
[6]:
stereoAlign.pp.summarize_counts(dataset)
[7]:
stereoAlign.pp.norma_log(dataset)
Optional:¶
Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.
[8]:
dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")
Visualization original dataset¶
Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.
[9]:
stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)
PCA
Nearest Neigbours
UMAP
[10]:
sc.tl.leiden(dataset)
[11]:
sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])
Calculation metric score using original datasets¶
[12]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
dataset,
key="batch",
use_rep="X_umap",
alpha=0.1,
n_neighbors=15)
[13]:
stat_mean, pvalue_mean, accept_rate
[13]:
(17.007220016904856, 0.07179906985685233, 0.18820134658544405)
[14]:
stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")
[14]:
0.9879300381400007
[15]:
stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")
[15]:
0.5871943160891533
Using scvi to integration datasets¶
[31]:
scvi_dataset = dataset.copy()
scvi_corrected = stereoAlign.alg.scvi_alignment(scvi_dataset, batch_key="batch")
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Epoch 400/400: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:34<00:00, 1.86it/s, loss=1.52e+04, v_num=1]
[32]:
scvi_corrected
[32]:
AnnData object with n_obs × n_vars = 3119 × 20341
obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden'
var: 'n_cells', 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
uns: 'pca', 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors', 'celltype_colors'
obsm: 'spatial', 'X_pca', 'X_umap', 'X_umap_knn_connectivity', 'X_umap_knn_distances', 'aligned_scvi'
layers: 'counts'
obsp: 'distances', 'connectivities'
Visualization of integrated datasets¶
[33]:
stereoAlign.pp.reduce_data(
scvi_corrected, pca=False, pca_comps=100, neighbors=True, use_rep="aligned_scvi", umap=True)
Nearest Neigbours
UMAP
[34]:
sc.tl.leiden(scvi_corrected)
[35]:
sc.pl.umap(scvi_corrected, color=["batch", "leiden", "celltype"])
Calculation metirc score using integrated datasets by scvi¶
scvi, we can calculate the metric score.Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.
Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.
[36]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
scvi_corrected,
key="batch",
use_rep="X_umap",
alpha=0.05,
n_neighbors=15)
[37]:
stat_mean, pvalue_mean, accept_rate
[37]:
(11.201341167662713, 0.10808238860817768, 0.3202949663353639)
Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.
The larger the
ilisi_meanvalue, the better.
[38]:
ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
scvi_corrected,
key="batch",
use_rep="X_umap",
n_neighbors=15)
[39]:
ilisi_mean, lower, upper
[39]:
(1.8549566822968488, 1.8345105450240138, 1.875402819569684)
The smaller the
clisi_meanvalue, the better.
[40]:
clisi_mean, lower, upper = stereoAlign.metrics.lisi(
scvi_corrected,
key="celltype",
use_rep="X_umap",
n_neighbors=15)
[41]:
clisi_mean, lower, upper
[41]:
(1.2340061715642556, 1.2223720004958119, 1.2456403426326994)
Quantify the connectivity of the subgraph per cell type label.
[42]:
stereoAlign.metrics.graph_connectivity(scvi_corrected, label_key="celltype")
[42]:
0.9984489211045815
Average silhouette width (ASW)
The values range from [-1, 1] with
* 1 indicates distinct, compact clusters * 0 indicates overlapping clusters * -1 indicates core-periphery (non-cluster) structure
By default, the score is scaled between 0 and 1 (
scale=True).
[43]:
stereoAlign.metrics.silhouette(scvi_corrected, label_key="celltype", embed="X_umap")
[43]:
0.6253883689641953
Modified average silhouette width (ASW) of batch
This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.
[44]:
stereoAlign.metrics.silhouette_batch(
scvi_corrected, batch_key="batch", label_key="celltype", embed="X_umap")
mean silhouette per group: silhouette_score
group
glomerular layer (GL) 0.702815
granule cell layer (GCL) 0.924770
mitral cell layer (MCL) 0.827885
olfactory nerve layer (ONL) 0.880480
rostral migratory stream (RMS) 0.651708
[44]:
0.7975316708632418
[ ]: