Tutorial 4: Data integration Using scvi¶

Import Packages and load datasets¶

[1]:

import os
import sys

from warnings import filterwarnings

[2]:

sys.path.append(os.getcwd())
filterwarnings("ignore")

[3]:

import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData

[4]:

dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])

[5]:

dataset.layers["counts"] = dataset.X

Preprocessing datasets¶

[6]:

stereoAlign.pp.summarize_counts(dataset)

[7]:

stereoAlign.pp.norma_log(dataset)

Optional:¶

Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.

[8]:

dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")

Visualization original dataset¶

Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.

[9]:

stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)

PCA
Nearest Neigbours
UMAP

[10]:

sc.tl.leiden(dataset)

[11]:

sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])

../_images/notebook_Tutorial-4-Data-integration-Using-scvi_15_0.png

Calculation metric score using original datasets¶

[12]:

stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    dataset,
    key="batch",
    use_rep="X_umap",
    alpha=0.1,
    n_neighbors=15)

[13]:

stat_mean, pvalue_mean, accept_rate

[13]:

(17.007220016904856, 0.07179906985685233, 0.18820134658544405)

[14]:

stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")

[14]:

0.9879300381400007

[15]:

stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")

[15]:

0.5871943160891533

Using `scvi` to integration datasets¶

[31]:

scvi_dataset = dataset.copy()
scvi_corrected = stereoAlign.alg.scvi_alignment(scvi_dataset, batch_key="batch")

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

Epoch 400/400: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:34<00:00,  1.86it/s, loss=1.52e+04, v_num=1]

[32]:

scvi_corrected

[32]:

AnnData object with n_obs × n_vars = 3119 × 20341
    obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden'
    var: 'n_cells', 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
    uns: 'pca', 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors', 'celltype_colors'
    obsm: 'spatial', 'X_pca', 'X_umap', 'X_umap_knn_connectivity', 'X_umap_knn_distances', 'aligned_scvi'
    layers: 'counts'
    obsp: 'distances', 'connectivities'

Visualization of integrated datasets¶

[33]:

stereoAlign.pp.reduce_data(
    scvi_corrected, pca=False, pca_comps=100, neighbors=True, use_rep="aligned_scvi", umap=True)

Nearest Neigbours
UMAP

[34]:

sc.tl.leiden(scvi_corrected)

[35]:

sc.pl.umap(scvi_corrected, color=["batch", "leiden", "celltype"])

../_images/notebook_Tutorial-4-Data-integration-Using-scvi_27_0.png

Calculation metirc score using integrated datasets by `scvi`¶

In order to quantify the performance and efficacy of the integrated datasets using the powerful scvi, we can calculate the metric score.

This score serves as a reliable measure of the harmonious blending and seamless integration achieved through the amalgamation process.

Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.
Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.

[36]:

stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    scvi_corrected,
    key="batch",
    use_rep="X_umap",
    alpha=0.05,
    n_neighbors=15)

[37]:

stat_mean, pvalue_mean, accept_rate

[37]:

(11.201341167662713, 0.10808238860817768, 0.3202949663353639)

Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.
The larger the ilisi_mean value, the better.

[38]:

ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
    scvi_corrected,
    key="batch",
    use_rep="X_umap",
    n_neighbors=15)

[39]:

ilisi_mean, lower, upper

[39]:

(1.8549566822968488, 1.8345105450240138, 1.875402819569684)

The smaller the clisi_mean value, the better.

[40]:

clisi_mean, lower, upper = stereoAlign.metrics.lisi(
    scvi_corrected,
    key="celltype",
    use_rep="X_umap",
    n_neighbors=15)

[41]:

clisi_mean, lower, upper

[41]:

(1.2340061715642556, 1.2223720004958119, 1.2456403426326994)

Quantify the connectivity of the subgraph per cell type label.

[42]:

stereoAlign.metrics.graph_connectivity(scvi_corrected, label_key="celltype")

[42]:

0.9984489211045815

Average silhouette width (ASW)

The values range from [-1, 1] with

* 1 indicates distinct, compact clusters
* 0 indicates overlapping clusters
* -1 indicates core-periphery (non-cluster) structure

By default, the score is scaled between 0 and 1 (scale=True).

[43]:

stereoAlign.metrics.silhouette(scvi_corrected, label_key="celltype", embed="X_umap")

[43]:

0.6253883689641953

Modified average silhouette width (ASW) of batch

This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.

[44]:

stereoAlign.metrics.silhouette_batch(
    scvi_corrected, batch_key="batch", label_key="celltype", embed="X_umap")

mean silhouette per group:                                 silhouette_score
group
glomerular layer (GL)                   0.702815
granule cell layer (GCL)                0.924770
mitral cell layer (MCL)                 0.827885
olfactory nerve layer (ONL)             0.880480
rostral migratory stream (RMS)          0.651708

[44]:

0.7975316708632418

[ ]:

Tutorial 4: Data integration Using scvi¶

Import Packages and load datasets¶

Preprocessing datasets¶

Optional:¶

Visualization original dataset¶

Calculation metric score using original datasets¶

Using `scvi` to integration datasets¶

Visualization of integrated datasets¶

Calculation metirc score using integrated datasets by `scvi`¶

stereoAlign

Navigation

Related Topics

Tutorial 4: Data integration Using scvi¶

Import Packages and load datasets¶

Preprocessing datasets¶

Optional:¶

Visualization original dataset¶

Calculation metric score using original datasets¶

Using scvi to integration datasets¶

Visualization of integrated datasets¶

Calculation metirc score using integrated datasets by scvi¶

Using `scvi` to integration datasets¶

Calculation metirc score using integrated datasets by `scvi`¶