Tutorial 3: Data integration Using scgen¶

Import Packages and load datasets¶

[1]:

import os
import sys

from warnings import filterwarnings

[2]:

sys.path.append(os.getcwd())
filterwarnings("ignore")

[3]:

import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData

[4]:

dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])

Preprocessing datasets¶

[5]:

stereoAlign.pp.summarize_counts(dataset)

[6]:

stereoAlign.pp.norma_log(dataset)

Optional:¶

Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.

[7]:

dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")

Visualization original dataset¶

Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.

[8]:

stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)

PCA
Nearest Neigbours
UMAP

[9]:

sc.tl.leiden(dataset)

[10]:

sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])

../_images/notebook_Tutorial-3-Data-integration-Using-scgen_14_0.png

Calculation metric score using original datasets¶

[11]:

stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    dataset,
    key="batch",
    use_rep="X_umap",
    alpha=0.1,
    n_neighbors=15)

[12]:

stat_mean, pvalue_mean, accept_rate

[12]:

(17.322861939374246, 0.0798263433989164, 0.18820134658544405)

[13]:

stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")

[13]:

0.9899808657908593

[14]:

stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")

[14]:

0.586473137140274

Using `scgen` to integration datasets¶

[15]:

scgen_dataset = dataset.copy()
scgen_corrected = stereoAlign.alg.scgen_alignment(scgen_dataset, batch_key="batch", cell_type="celltype")

/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/torchmetrics/utilities/imports.py:24: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  _PYTHON_LOWER_3_8 = LooseVersion(_PYTHON_VERSION) < LooseVersion("3.8")
/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, '__version__') or LooseVersion(tensorboard.__version__) < LooseVersion('1.15'):
/home/zhangchao/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/__init__.py:38: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('pytorch_lightning')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  __import__("pkg_resources").declare_namespace(__name__)
Global seed set to 0
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

Epoch 43/200:  22%|█████████████████████████████████████████████████                                                                                                                                                                                   | 43/200 [00:59<03:38,  1.39s/it, loss=9.05e+03, v_num=1]
Monitored metric elbo_validation did not improve in the last 25 records. Best score: 18560.293. Signaling Trainer to stop.
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

[16]:

scgen_corrected

[16]:

AnnData object with n_obs × n_vars = 3119 × 20341
    obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden', '_scvi_batch', '_scvi_labels'
    uns: '_scvi_uuid', '_scvi_manager_uuid'
    obsm: 'latent', 'corrected_latent'

Visualization of integrated datasets¶

[17]:

stereoAlign.pp.reduce_data(
    scgen_corrected, pca=False, pca_comps=100, neighbors=True, use_rep="corrected_latent", umap=True)

Nearest Neigbours
UMAP

[18]:

sc.tl.leiden(scgen_corrected)

[19]:

sc.pl.umap(scgen_corrected, color=["batch", "leiden", "celltype"])

../_images/notebook_Tutorial-3-Data-integration-Using-scgen_26_0.png

Calculation metirc score using integrated datasets by `scgen`¶

In order to quantify the performance and efficacy of the integrated datasets using the powerful scgen, we can calculate the metric score.

This score serves as a reliable measure of the harmonious blending and seamless integration achieved through the amalgamation process.

Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.
Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.

[20]:

stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    scgen_corrected,
    key="batch",
    use_rep="X_umap",
    alpha=0.05,
    n_neighbors=15)

[21]:

stat_mean, pvalue_mean, accept_rate

[21]:

(8.770617830664426, 0.21016846007911097, 0.51651170246874)

Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.
The larger the ilisi_mean value, the better.

[22]:

ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
    scgen_corrected,
    key="batch",
    use_rep="X_umap",
    n_neighbors=15)

[23]:

ilisi_mean, lower, upper

[23]:

(2.0729667767712017, 2.051760463522227, 2.0941730900201763)

The smaller the clisi_mean value, the better.

[24]:

clisi_mean, lower, upper = stereoAlign.metrics.lisi(
    scgen_corrected,
    key="celltype",
    use_rep="X_umap",
    n_neighbors=15)

[25]:

clisi_mean, lower, upper

[25]:

(1.2174320270506576, 1.2052937441683194, 1.2295703099329958)

Quantify the connectivity of the subgraph per cell type label.

[26]:

stereoAlign.metrics.graph_connectivity(scgen_corrected, label_key="celltype")

[26]:

0.992148750162326

Average silhouette width (ASW)

The values range from [-1, 1] with

* 1 indicates distinct, compact clusters
* 0 indicates overlapping clusters
* -1 indicates core-periphery (non-cluster) structure

By default, the score is scaled between 0 and 1 (scale=True).

[27]:

stereoAlign.metrics.silhouette(scgen_corrected, label_key="celltype", embed="X_umap")

[27]:

0.6196084395051003

Modified average silhouette width (ASW) of batch

This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.

[28]:

stereoAlign.metrics.silhouette_batch(
    scgen_corrected, batch_key="batch", label_key="celltype", embed="X_umap")

mean silhouette per group:                                 silhouette_score
group
glomerular layer (GL)                   0.774641
granule cell layer (GCL)                0.939101
mitral cell layer (MCL)                 0.918451
olfactory nerve layer (ONL)             0.897297
rostral migratory stream (RMS)          0.731618

[28]:

0.8522213947665455

[ ]:

Tutorial 3: Data integration Using scgen¶

Import Packages and load datasets¶

Preprocessing datasets¶

Optional:¶

Visualization original dataset¶

Calculation metric score using original datasets¶

Using `scgen` to integration datasets¶

Visualization of integrated datasets¶

Calculation metirc score using integrated datasets by `scgen`¶

stereoAlign

Navigation

Related Topics

Tutorial 3: Data integration Using scgen¶

Import Packages and load datasets¶

Preprocessing datasets¶

Optional:¶

Visualization original dataset¶

Calculation metric score using original datasets¶

Using scgen to integration datasets¶

Visualization of integrated datasets¶

Calculation metirc score using integrated datasets by scgen¶

Using `scgen` to integration datasets¶

Calculation metirc score using integrated datasets by `scgen`¶