Tutorial 3: Data integration Using scgen¶
Import Packages and load datasets¶
[1]:
import os
import sys
from warnings import filterwarnings
[2]:
sys.path.append(os.getcwd())
filterwarnings("ignore")
[3]:
import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData
[4]:
dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])
Preprocessing datasets¶
[5]:
stereoAlign.pp.summarize_counts(dataset)
[6]:
stereoAlign.pp.norma_log(dataset)
Optional:¶
Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.
[7]:
dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")
Visualization original dataset¶
Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.
[8]:
stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)
PCA
Nearest Neigbours
UMAP
[9]:
sc.tl.leiden(dataset)
[10]:
sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])
Calculation metric score using original datasets¶
[11]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
dataset,
key="batch",
use_rep="X_umap",
alpha=0.1,
n_neighbors=15)
[12]:
stat_mean, pvalue_mean, accept_rate
[12]:
(17.322861939374246, 0.0798263433989164, 0.18820134658544405)
[13]:
stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")
[13]:
0.9899808657908593
[14]:
stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")
[14]:
0.586473137140274
Using scgen to integration datasets¶
[15]:
scgen_dataset = dataset.copy()
scgen_corrected = stereoAlign.alg.scgen_alignment(scgen_dataset, batch_key="batch", cell_type="celltype")
/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/torchmetrics/utilities/imports.py:24: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_PYTHON_LOWER_3_8 = LooseVersion(_PYTHON_VERSION) < LooseVersion("3.8")
/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if not hasattr(tensorboard, '__version__') or LooseVersion(tensorboard.__version__) < LooseVersion('1.15'):
/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/__init__.py:38: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('pytorch_lightning')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
__import__("pkg_resources").declare_namespace(__name__)
Global seed set to 0
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Epoch 43/200: 22%|█████████████████████████████████████████████████ | 43/200 [00:59<03:38, 1.39s/it, loss=9.05e+03, v_num=1]
Monitored metric elbo_validation did not improve in the last 25 records. Best score: 18560.293. Signaling Trainer to stop.
INFO Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
[16]:
scgen_corrected
[16]:
AnnData object with n_obs × n_vars = 3119 × 20341
obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden', '_scvi_batch', '_scvi_labels'
uns: '_scvi_uuid', '_scvi_manager_uuid'
obsm: 'latent', 'corrected_latent'
Visualization of integrated datasets¶
[17]:
stereoAlign.pp.reduce_data(
scgen_corrected, pca=False, pca_comps=100, neighbors=True, use_rep="corrected_latent", umap=True)
Nearest Neigbours
UMAP
[18]:
sc.tl.leiden(scgen_corrected)
[19]:
sc.pl.umap(scgen_corrected, color=["batch", "leiden", "celltype"])
Calculation metirc score using integrated datasets by scgen¶
scgen, we can calculate the metric score.Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.
Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.
[20]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
scgen_corrected,
key="batch",
use_rep="X_umap",
alpha=0.05,
n_neighbors=15)
[21]:
stat_mean, pvalue_mean, accept_rate
[21]:
(8.770617830664426, 0.21016846007911097, 0.51651170246874)
Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.
The larger the
ilisi_meanvalue, the better.
[22]:
ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
scgen_corrected,
key="batch",
use_rep="X_umap",
n_neighbors=15)
[23]:
ilisi_mean, lower, upper
[23]:
(2.0729667767712017, 2.051760463522227, 2.0941730900201763)
The smaller the
clisi_meanvalue, the better.
[24]:
clisi_mean, lower, upper = stereoAlign.metrics.lisi(
scgen_corrected,
key="celltype",
use_rep="X_umap",
n_neighbors=15)
[25]:
clisi_mean, lower, upper
[25]:
(1.2174320270506576, 1.2052937441683194, 1.2295703099329958)
Quantify the connectivity of the subgraph per cell type label.
[26]:
stereoAlign.metrics.graph_connectivity(scgen_corrected, label_key="celltype")
[26]:
0.992148750162326
Average silhouette width (ASW)
The values range from [-1, 1] with
* 1 indicates distinct, compact clusters * 0 indicates overlapping clusters * -1 indicates core-periphery (non-cluster) structure
By default, the score is scaled between 0 and 1 (
scale=True).
[27]:
stereoAlign.metrics.silhouette(scgen_corrected, label_key="celltype", embed="X_umap")
[27]:
0.6196084395051003
Modified average silhouette width (ASW) of batch
This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.
[28]:
stereoAlign.metrics.silhouette_batch(
scgen_corrected, batch_key="batch", label_key="celltype", embed="X_umap")
mean silhouette per group: silhouette_score
group
glomerular layer (GL) 0.774641
granule cell layer (GCL) 0.939101
mitral cell layer (MCL) 0.918451
olfactory nerve layer (ONL) 0.897297
rostral migratory stream (RMS) 0.731618
[28]:
0.8522213947665455
[ ]: