Tutorial 4: Data integration Using scvi

Import Packages and load datasets

[1]:
import os
import sys

from warnings import filterwarnings
[2]:
sys.path.append(os.getcwd())
filterwarnings("ignore")
[3]:
import stereoAlign
import scanpy as sc
import numpy as np
from anndata import AnnData
[4]:
dataset = AnnData.concatenate(*[sc.read_h5ad(os.path.join("demo_data", f)) for f in sorted(os.listdir("demo_data/"))])
[5]:
dataset.layers["counts"] = dataset.X

Preprocessing datasets

[6]:
stereoAlign.pp.summarize_counts(dataset)
[7]:
stereoAlign.pp.norma_log(dataset)

Optional:

Scaling counts to a mean of 0 and standard deviation of 1 using scanpy.pp.scale for each batch separately. It is possible to effectively alleviate the impact of minor batch effects.

[8]:
dataset = stereoAlign.pp.scale_batch(dataset, batch="batch")

Visualization original dataset

Before proceeding with the integration process, it is essential to visualize the original dataset in its unadulterated form. This visualization serves as a valuable reference point for understanding the inherent characteristics and structure of the data.

[9]:
stereoAlign.pp.reduce_data(dataset, pca=True, pca_comps=100, neighbors=True, use_rep="X_pca", umap=True)
PCA
Nearest Neigbours
UMAP
[10]:
sc.tl.leiden(dataset)
[11]:
sc.pl.umap(dataset, color=["batch", "leiden", "celltype"])
../_images/notebook_Tutorial-4-Data-integration-Using-scvi_15_0.png

Calculation metric score using original datasets

[12]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    dataset,
    key="batch",
    use_rep="X_umap",
    alpha=0.1,
    n_neighbors=15)
[13]:
stat_mean, pvalue_mean, accept_rate
[13]:
(17.007220016904856, 0.07179906985685233, 0.18820134658544405)
[14]:
stereoAlign.metrics.graph_connectivity(dataset, label_key="celltype")
[14]:
0.9879300381400007
[15]:
stereoAlign.metrics.silhouette(dataset, label_key="celltype", embed="X_umap")
[15]:
0.5871943160891533

Using scvi to integration datasets

[31]:
scvi_dataset = dataset.copy()
scvi_corrected = stereoAlign.alg.scvi_alignment(scvi_dataset, batch_key="batch")
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Epoch 400/400: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [03:34<00:00,  1.86it/s, loss=1.52e+04, v_num=1]
[32]:
scvi_corrected
[32]:
AnnData object with n_obs × n_vars = 3119 × 20341
    obs: 'celltype', 'batch', 'n_counts', 'log_counts', 'n_genes', 'leiden'
    var: 'n_cells', 'mean-0', 'std-0', 'mean-1', 'std-1', 'mean-2', 'std-2'
    uns: 'pca', 'neighbors', 'umap', 'leiden', 'batch_colors', 'leiden_colors', 'celltype_colors'
    obsm: 'spatial', 'X_pca', 'X_umap', 'X_umap_knn_connectivity', 'X_umap_knn_distances', 'aligned_scvi'
    layers: 'counts'
    obsp: 'distances', 'connectivities'

Visualization of integrated datasets

[33]:
stereoAlign.pp.reduce_data(
    scvi_corrected, pca=False, pca_comps=100, neighbors=True, use_rep="aligned_scvi", umap=True)
Nearest Neigbours
UMAP
[34]:
sc.tl.leiden(scvi_corrected)
[35]:
sc.pl.umap(scvi_corrected, color=["batch", "leiden", "celltype"])
../_images/notebook_Tutorial-4-Data-integration-Using-scvi_27_0.png

Calculation metirc score using integrated datasets by scvi

In order to quantify the performance and efficacy of the integrated datasets using the powerful scvi, we can calculate the metric score.
This score serves as a reliable measure of the harmonious blending and seamless integration achieved through the amalgamation process.
  • Calculate the K-nearest neighbors Batch Effects Test (K-BET) metric of the data regarding a specific sample attribute and embedding. The K-BET metric measures if cells from different samples mix well in their local neighborhood.

  • Should the p-value surpass the predetermined alpha threshold, it is indicative that the data batch effect has indeed been successfully eradicated. This serves as a testament to the efficacy of the applied methodologies and techniques employed in the removal process.

[36]:
stat_mean, pvalue_mean, accept_rate = stereoAlign.metrics.kbet(
    scvi_corrected,
    key="batch",
    use_rep="X_umap",
    alpha=0.05,
    n_neighbors=15)
[37]:
stat_mean, pvalue_mean, accept_rate
[37]:
(11.201341167662713, 0.10808238860817768, 0.3202949663353639)
  • Calculate the Local inverse Simpson’s Index (LISI) metric of the data regarding a specific sample attribute and embedding. The LISI metric measures if cells from different samples mix well in their local neighborhood.

  • The larger the ilisi_mean value, the better.

[38]:
ilisi_mean, lower, upper = stereoAlign.metrics.lisi(
    scvi_corrected,
    key="batch",
    use_rep="X_umap",
    n_neighbors=15)
[39]:
ilisi_mean, lower, upper
[39]:
(1.8549566822968488, 1.8345105450240138, 1.875402819569684)
  • The smaller the clisi_mean value, the better.

[40]:
clisi_mean, lower, upper = stereoAlign.metrics.lisi(
    scvi_corrected,
    key="celltype",
    use_rep="X_umap",
    n_neighbors=15)
[41]:
clisi_mean, lower, upper
[41]:
(1.2340061715642556, 1.2223720004958119, 1.2456403426326994)
  • Quantify the connectivity of the subgraph per cell type label.

[42]:
stereoAlign.metrics.graph_connectivity(scvi_corrected, label_key="celltype")
[42]:
0.9984489211045815
  • Average silhouette width (ASW)

    The values range from [-1, 1] with

    * 1 indicates distinct, compact clusters
    * 0 indicates overlapping clusters
    * -1 indicates core-periphery (non-cluster) structure
    

    By default, the score is scaled between 0 and 1 (scale=True).

[43]:
stereoAlign.metrics.silhouette(scvi_corrected, label_key="celltype", embed="X_umap")
[43]:
0.6253883689641953
  • Modified average silhouette width (ASW) of batch

    This metric measures the silhouette of a given batch. It assumes that a silhouette width close to 0 represents perfect overlap of the batches, thus the absolute value of the silhouette width is used to measure how well batches are mixed.

[44]:
stereoAlign.metrics.silhouette_batch(
    scvi_corrected, batch_key="batch", label_key="celltype", embed="X_umap")
mean silhouette per group:                                 silhouette_score
group
glomerular layer (GL)                   0.702815
granule cell layer (GCL)                0.924770
mitral cell layer (MCL)                 0.827885
olfactory nerve layer (ONL)             0.880480
rostral migratory stream (RMS)          0.651708
[44]:
0.7975316708632418
[ ]: