Using CellHint for annotation-aware data integration

This notebook showcases how to perform single-cell data integration in a supervised manner using CellHint.

Only the main steps and key parameters are introduced in this notebook. Refer to the detailed Usage documentation if you want to learn more.

Download and process four spleen datasets

[1]:
import scanpy as sc
[2]:
adata = sc.read('cellhint_demo_folder/Spleen.h5ad', backup_url = 'https://celltypist.cog.sanger.ac.uk/Resources/Organ_atlas/Spleen/Spleen.h5ad')
adata
[2]:
AnnData object with n_obs × n_vars = 200664 × 74369
    obs: 'Dataset', 'donor_id', 'development_stage', 'sex', 'suspension_type', 'assay', 'Original_annotation', 'CellTypist_harmonised_group', 'cell_type', 'Curated_annotation'
    var: 'exist_in_Madissoon2020', 'exist_in_Tabula2022', 'exist_in_DominguezConde2022', 'exist_in_He2020'
    uns: 'schema_version', 'title'
    obsm: 'X_umap'

This dataset combines cells from four studies, and is one of the organ atlases in CellTypist.

[3]:
adata.obs.Dataset.value_counts()
[3]:
Madissoon et al. 2020          92049
Dominguez Conde et al. 2022    70099
Tabula Sapiens 2022            34004
He et al. 2020                  4512
Name: Dataset, dtype: int64

Log-normalised gene expression (to 10,000 counts per cell) is in .X, and raw counts are in .raw. CellHint does not rely on the latter, but here we still start from raw counts for completeness of a single-cell pipeline.

[4]:
adata = adata.raw.to_adata()
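
As a reminder of what "log-normalised to 10,000 counts per cell" means, the sketch below reproduces the arithmetic with NumPy on a toy count matrix. The matrix and variable names are made up for illustration; in the workflow itself, sc.pp.normalize_total and sc.pp.log1p perform these steps on the AnnData.

```python
import numpy as np

# Toy raw count matrix: 3 cells (rows) x 4 genes (columns). Values are illustrative.
counts = np.array([
    [10, 0, 5, 5],
    [100, 50, 25, 25],
    [1, 1, 1, 1],
], dtype=float)

# Scale every cell so that its counts sum to 10,000.
per_cell_total = counts.sum(axis=1, keepdims=True)
normalised = counts / per_cell_total * 1e4

# Apply the natural-log transform log(1 + x).
log_normalised = np.log1p(normalised)

# Undoing the log transform recovers per-cell totals of 10,000.
recovered_totals = np.expm1(log_normalised).sum(axis=1)
print(recovered_totals)
```

Reversing the log transform and summing per cell is a quick way to check whether a matrix is in the expected log-normalised format.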

Delete everything in the adata except .X and .obs for clarity of this tutorial.

[5]:
del adata.var
del adata.uns
del adata.obsm
adata
[5]:
AnnData object with n_obs × n_vars = 200664 × 74369
    obs: 'Dataset', 'donor_id', 'development_stage', 'sex', 'suspension_type', 'assay', 'Original_annotation', 'CellTypist_harmonised_group', 'cell_type', 'Curated_annotation'

Perform a canonical single-cell workflow: normalisation, identification of highly variable genes (HVGs), scaling, PCA, neighborhood graph construction, and low-dimensional visualisation.

[6]:
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, batch_key = 'Dataset', subset = True)
sc.pp.scale(adata, max_value = 10)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
/opt/conda/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:2487: FutureWarning: The `inplace` parameter in pandas.Categorical.remove_unused_categories is deprecated and will be removed in a future version.
  res = method(*args, **kwargs)

N.B.: Selection of HVGs is important for most single-cell tasks. Choose the HVG selection approach that best suits your data.

Visualise the dataset of origin (Dataset) and donor distribution (donor_id).

[7]:
sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)
../_images/notebook_cellhint_tutorial_integration_15_0.png

As expected, strong batch effects exist.

Data integration guided by ground truth

The input AnnData needs two columns in .obs representing the batch confounder and unified cell annotation respectively. The aim is to integrate cells by correcting batches and preserving biology (cell annotation) using cellhint.integrate.

Here we use Dataset as the batch confounder and Curated_annotation (ground-truth cell types) as the cell annotation.
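
Before integrating, it can be useful to confirm that both columns are present and to see which cell types are shared between batches. The sketch below does this with pandas on a toy table standing in for adata.obs; the helper name summarise_inputs and the toy values are hypothetical, and the missing-value check is a precaution assumed here rather than a documented CellHint requirement.

```python
import pandas as pd

def summarise_inputs(obs, batch_key, annotation_key):
    """Hypothetical helper: check both columns, then count cells per
    (annotation, batch) pair."""
    for key in (batch_key, annotation_key):
        if key not in obs.columns:
            raise KeyError(f"'{key}' not found in obs")
        if obs[key].isna().any():
            raise ValueError(f"'{key}' contains missing values")
    return pd.crosstab(obs[annotation_key], obs[batch_key])

# Toy stand-in for adata.obs (values are illustrative only).
obs = pd.DataFrame({
    "Dataset": ["A", "A", "B", "B", "B", "C"],
    "Curated_annotation": ["T cell", "B cell", "T cell", "T cell", "NK cell", "B cell"],
})

table = summarise_inputs(obs, "Dataset", "Curated_annotation")
print(table)

# Cell types observed in more than one batch.
shared = table.index[(table > 0).sum(axis=1) > 1].tolist()
print(shared)  # ['B cell', 'T cell']
```

With the real data, the same call would be summarise_inputs(adata.obs, 'Dataset', 'Curated_annotation').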

[8]:
import cellhint
[9]:
# Integrate cells with `cellhint.integrate`.
cellhint.integrate(adata, 'Dataset', 'Curated_annotation')
👀 `use_rep` is not specified, will use `'X_pca'` as the search space

With this function, CellHint builds the neighborhood graph by searching for neighbors across matched cell groups in different batches, on the basis of a low-dimensional representation provided via the argument use_rep (defaults to PCA coordinates).
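
To make the idea of searching neighbors across matched cell groups in different batches more concrete, here is a toy NumPy illustration. It is a conceptual sketch only and not CellHint's actual implementation: the function cross_batch_neighbors, the allowed mapping of matched cell types, and the brute-force Euclidean search are all assumptions made for this example.

```python
import numpy as np

def cross_batch_neighbors(rep, batches, labels, allowed, k=2):
    """For each cell, collect up to k nearest cells from every batch,
    restricted to cells whose label is matched to the query cell's label."""
    rep = np.asarray(rep, dtype=float)
    batches = np.asarray(batches)
    labels = np.asarray(labels)
    neighbors = []
    for i in range(rep.shape[0]):
        dist = np.linalg.norm(rep - rep[i], axis=1)
        matched = np.isin(labels, list(allowed[labels[i]]))
        found = []
        for batch in np.unique(batches):
            candidates = np.flatnonzero(matched & (batches == batch))
            candidates = candidates[candidates != i]  # exclude the cell itself
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            found.extend(nearest.tolist())
        neighbors.append(found)
    return neighbors

# Toy low-dimensional representation (stand-in for PCA coordinates).
rep = np.array([
    [0.0, 0.0], [0.1, 0.0],   # batch b1, T cells
    [5.0, 5.0], [5.1, 5.0],   # batch b1, B cells
    [0.0, 3.0], [0.1, 3.0],   # batch b2, T cells (shifted by a batch effect)
    [5.0, 8.0], [5.1, 8.0],   # batch b2, B cells
])
batches = ["b1"] * 4 + ["b2"] * 4
labels = ["T", "T", "B", "B"] * 2
allowed = {"T": {"T"}, "B": {"B"}}  # each cell type is matched only to itself

nbrs = cross_batch_neighbors(rep, batches, labels, allowed, k=1)
print(nbrs[0])  # [1, 4]
```

The first T cell is linked to a T cell in its own batch and to a T cell in the other batch, even though the batch shift places the latter further away. In the tutorial none of this is written by hand; cellhint.integrate builds the graph that sc.tl.umap then uses.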

Generate a UMAP based on the reconstructed neighborhood graph.

[10]:
sc.tl.umap(adata)

Visualise the batches.

[11]:
sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)
../_images/notebook_cellhint_tutorial_integration_26_0.png

Visualise the cell types.

[12]:
sc.pl.umap(adata, color = 'Curated_annotation')
../_images/notebook_cellhint_tutorial_integration_28_0.png

Alternatively, use the donor_id rather than Dataset as the batch confounder.

[13]:
cellhint.integrate(adata, 'donor_id', 'Curated_annotation')
sc.tl.umap(adata)
👀 `use_rep` is not specified, will use `'X_pca'` as the search space

Visualise the batches.

[14]:
sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)
../_images/notebook_cellhint_tutorial_integration_32_0.png

Visualise the cell types.

[15]:
sc.pl.umap(adata, color = 'Curated_annotation')
../_images/notebook_cellhint_tutorial_integration_34_0.png

Therefore, CellHint is able to tune the data structure towards predetermined cell types while at the same time mitigating the effects of batch confounders.

Of note, the influence of cell annotation on the data structure can range from forcibly merging the same cell types to a more lenient cell grouping. This is controlled by the parameter n_meta_neighbors.

With n_meta_neighbors of 1, each cell type only has one neighboring cell type, that is, itself. This will result in strongly separated cell types in the final UMAP.

[16]:
cellhint.integrate(adata, 'donor_id', 'Curated_annotation', n_meta_neighbors = 1)
sc.tl.umap(adata)
👀 `use_rep` is not specified, will use `'X_pca'` as the search space
[17]:
sc.pl.umap(adata, color = 'Curated_annotation')
../_images/notebook_cellhint_tutorial_integration_38_0.png

Increasing n_meta_neighbors loosens this restriction. For example, an n_meta_neighbors of 2 allows each cell type to have, in addition to itself, one nearest neighboring cell type based on the transcriptomic distances calculated by CellHint. This parameter defaults to 3, meaning that a linear spectrum of transcriptomic structure can possibly exist for each cell type.

[18]:
# Not run, as the default value of `n_meta_neighbors` is 3.
#cellhint.integrate(adata, 'donor_id', 'Curated_annotation', n_meta_neighbors = 3)
#sc.tl.umap(adata)
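
The effect of n_meta_neighbors can be pictured with a small NumPy sketch that, for each cell type, picks its nearest cell types (itself included). The distance used here, Euclidean distance between cell-type centroids, is an assumption of this example; CellHint calculates its own transcriptomic distances, and the function meta_neighbors and the toy coordinates are hypothetical.

```python
import numpy as np

def meta_neighbors(rep, labels, n_meta_neighbors=3):
    """Map each cell type to its n nearest cell types (itself included),
    using Euclidean distances between cell-type centroids."""
    rep = np.asarray(rep, dtype=float)
    labels = np.asarray(labels)
    types = np.unique(labels)
    centroids = np.vstack([rep[labels == t].mean(axis=0) for t in types])
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    order = np.argsort(dist, axis=1, kind="stable")[:, :n_meta_neighbors]
    return {str(t): [str(types[j]) for j in row] for t, row in zip(types, order)}

# Toy coordinates: two similar T cell types and one distant B cell type.
rep = np.array([
    [0.0, 0.0], [0.0, 0.2],    # CD4 T
    [1.0, 0.0], [1.0, 0.2],    # CD8 T
    [10.0, 0.0], [10.0, 0.2],  # B
])
labels = ["CD4 T", "CD4 T", "CD8 T", "CD8 T", "B", "B"]

print(meta_neighbors(rep, labels, n_meta_neighbors=1))
print(meta_neighbors(rep, labels, n_meta_neighbors=2))
```

With a value of 1 every cell type maps only to itself, which corresponds to the strongly separated UMAP above. With a value of 2, CD4 T and CD8 T become each other's neighbor, so their cells may connect in the graph.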

Ground-truth annotation is usually not available when you collect a number of datasets and simply concatenate them. The sections below demonstrate two approaches to obtain cell annotations for use in cellhint.integrate.

Data integration guided by CellTypist classification

cellhint.integrate requires cell annotation to be stored in the AnnData. This information can be obtained by different means. One quick way is to use available CellTypist models to annotate the data of interest (see the CellTypist model list here).

Here we annotate the data with the Immune_All_Low.pkl model as the spleen is a major lymphoid organ. Find out how to conduct automatic CellTypist classification here.

[19]:
import celltypist
adata = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True).to_adata()
👀 Invalid expression matrix in `.X`, expect log1p normalized expression to 10000 counts per cell; will try the `.raw` attribute
🔬 Input data has 200664 cells and 74369 genes
🔗 Matching reference genes in the model
🧬 6632 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!

Through this function, three columns (predicted_labels, majority_voting, and conf_score) are appended to the .obs of the adata. You can also set majority_voting = False to decrease runtime, but the majority_voting column will then not be produced.

[20]:
adata.obs[['predicted_labels', 'majority_voting', 'conf_score']]
[20]:
                                      predicted_labels           majority_voting            conf_score
AAACCTGCACATTTCT-1-HCATisStab7463846  MAIT cells                 MAIT cells                 0.996386
AAACCTGCACCGCTAG-1-HCATisStab7463846  CD16- NK cells             CD16- NK cells             0.990720
AAACCTGCAGTCCTTC-1-HCATisStab7463846  Tem/Trm cytotoxic T cells  Tem/Trm cytotoxic T cells  0.989506
AAACCTGCATTGGCGC-1-HCATisStab7463846  Memory B cells             Memory B cells             0.997790
AAACCTGCATTTCACT-1-HCATisStab7463846  CD16- NK cells             CD16- NK cells             0.996669
...                                   ...                        ...                        ...
Spleen_cDNA_TTTGGTTTCGTCCAGG-1        Tem/Trm cytotoxic T cells  Tem/Trm cytotoxic T cells  0.294200
Spleen_cDNA_TTTGTCAAGGAGTTTA-1        Naive B cells              Naive B cells              0.999520
Spleen_cDNA_TTTGTCACAAGCGATG-1        CD16+ NK cells             CD16+ NK cells             0.999981
Spleen_cDNA_TTTGTCACAGGGATTG-1        CD16+ NK cells             CD16+ NK cells             0.091442
Spleen_cDNA_TTTGTCATCGTTGCCT-1        Tcm/Naive helper T cells   Tcm/Naive helper T cells   0.840560

200664 rows × 3 columns
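
The relationship between predicted_labels and majority_voting can be sketched with pandas: within each over-cluster, every cell receives the most frequent per-cell prediction. This is a simplified illustration of the idea rather than CellTypist's exact procedure; the column name over_cluster and the toy rows are hypothetical.

```python
import pandas as pd

# Toy per-cell predictions grouped into two over-clusters.
obs = pd.DataFrame({
    "over_cluster": ["c0", "c0", "c0", "c1", "c1", "c1", "c1"],
    "predicted_labels": [
        "Naive B cells", "Naive B cells", "Memory B cells",
        "CD16+ NK cells", "CD16- NK cells", "CD16+ NK cells", "CD16+ NK cells",
    ],
})

# Most frequent predicted label within each over-cluster.
winner = obs.groupby("over_cluster")["predicted_labels"].agg(
    lambda s: s.value_counts().idxmax()
)

# Broadcast the winning label back to every cell of the cluster.
obs["majority_voting"] = obs["over_cluster"].map(winner)
print(obs)
```

In this toy table the single "Memory B cells" call in c0 and the single "CD16- NK cells" call in c1 are overruled by their clusters' dominant labels.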

Integrate cells using cellhint.integrate, with donor_id as the batch confounder and majority_voting or predicted_labels as the cell annotation.

[21]:
# You can also use 'predicted_labels' here instead of 'majority_voting'.
cellhint.integrate(adata, 'donor_id', 'majority_voting')
👀 `use_rep` is not specified, will use `'X_pca'` as the search space

Generate a UMAP based on the reconstructed neighborhood graph.

[22]:
sc.tl.umap(adata)

Visualise the batches.

[23]:
sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)
../_images/notebook_cellhint_tutorial_integration_53_0.png

Visualise the predicted cell types from CellTypist.

[24]:
sc.pl.umap(adata, color = 'majority_voting')
../_images/notebook_cellhint_tutorial_integration_55_0.png

Overlay the ground-truth annotations onto the UMAP.

[25]:
sc.pl.umap(adata, color = 'Curated_annotation')
../_images/notebook_cellhint_tutorial_integration_57_0.png

Even if the model does not exactly match the data (e.g., using a lung model to annotate spleen data), this approach can still be useful: cells of the same cell type will probably be assigned the same identity by the model, so the predictions still carry information about which cells should be placed together in the neighborhood graph.
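
One way to check whether predicted labels group cells consistently, regardless of whether the label names are correct, is to cross-tabulate them against a reference annotation and compute a per-label purity. The sketch below uses pandas on made-up labels; "Model type X" and "Model type Y" are hypothetical names standing in for labels from a mismatched model.

```python
import pandas as pd

# Toy example: labels from a mismatched model versus a reference annotation.
obs = pd.DataFrame({
    "majority_voting": ["Model type X"] * 4 + ["Model type Y"] * 3,
    "Curated_annotation": [
        "NK cells", "NK cells", "NK cells", "T cells",
        "B cells", "B cells", "B cells",
    ],
})

# Rows: predicted labels; columns: reference cell types.
table = pd.crosstab(obs["majority_voting"], obs["Curated_annotation"])

# Fraction of each predicted label taken up by its dominant reference type.
purity = table.max(axis=1) / table.sum(axis=1)
print(purity)
```

A purity close to 1 means that the label, whatever its name, mostly collects cells of a single reference type and is therefore informative for integration.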

Data integration guided by CellHint harmonisation

In this section, we will harmonise cell types across datasets, and reannotate all cells into the harmonised cell types.

Cell type harmonisation is achieved through cellhint.harmonize. Refer to this notebook for details.

[26]:
alignment = cellhint.harmonize(adata, 'Dataset', 'Original_annotation')
👀 Detected PCA coordinates in the object, will use these to calculate distances
🏆 Reordering datasets
🖇 Harmonizing cell types of Dominguez Conde et al. 2022 and Madissoon et al. 2020
🖇 Harmonizing cell types of Tabula Sapiens 2022
🖇 Harmonizing cell types of He et al. 2020
🖋️ Reannotating cells
✅ Harmonization done!

Visualise the harmonisation result.

[27]:
cellhint.treeplot(alignment)
../_images/notebook_cellhint_tutorial_integration_64_0.png

This plot indicates which cell types can be treated as counterparts during data integration.

Importantly, the cell reannotation information is stored as alignment.reannotation.

[28]:
alignment.reannotation
[28]:
                                      dataset                cell_type                       reannotation                                       group
ID
AAACCTGCACATTTCT-1-HCATisStab7463846  Madissoon et al. 2020  T_CD8_MAIT                      MAIT = T_CD8_MAIT ∈ cd8-positive, alpha-beta m...  Group19
AAACCTGCACCGCTAG-1-HCATisStab7463846  Madissoon et al. 2020  NK_CD160pos                     NK_CD56bright_CD16- = NK_CD160pos = UNRESOLVED...  Group21
AAACCTGCAGTCCTTC-1-HCATisStab7463846  Madissoon et al. 2020  T_CD8_activated                 Trm/em_CD8 = T_CD8_activated ∈ cd8-positive, a...  Group19
AAACCTGCATTGGCGC-1-HCATisStab7463846  Madissoon et al. 2020  B_mantle                        Naive B cells = B_mantle = naive b cell = B Ce...  Group18
AAACCTGCATTTCACT-1-HCATisStab7463846  Madissoon et al. 2020  NK_CD160pos                     NK_CD56bright_CD16- = NK_CD160pos = UNRESOLVED...  Group21
...                                   ...                    ...                             ...                                                ...
Spleen_cDNA_TTTGGTTTCGTCCAGG-1        He et al. 2020         CD8 T Cell TRBV4-2_high Spleen  Trm/em_CD8 = T_CD8_activated ∈ cd8-positive, a...  Group19
Spleen_cDNA_TTTGTCAAGGAGTTTA-1        He et al. 2020         B Cell TCL1A_high Spleen        Naive B cells = B_mantle = naive b cell = B Ce...  Group18
Spleen_cDNA_TTTGTCACAAGCGATG-1        He et al. 2020         NK/T Cell Spleen                NK_CD16+ = NK_FCGR3Apos = nk cell ∈ NK/T Cell ...  Group21
Spleen_cDNA_TTTGTCACAGGGATTG-1        He et al. 2020         NK/T Cell Spleen                NK_CD16+ = NK_FCGR3Apos = nk cell ∈ NK/T Cell ...  Group21
Spleen_cDNA_TTTGTCATCGTTGCCT-1        He et al. 2020         CD4 T Cell Spleen               Teffector/EM_CD4 = T_CD4_conv ∈ naive thymus-d...  Group19

200664 rows × 4 columns
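
Each reannotation string above chains cell type names from the individual datasets, joined by relation symbols such as "=" and "∈". A small sketch for splitting such a string into its parts is given below. The helper split_reannotation is hypothetical, and including the mirrored symbol "∋" in the pattern is an assumption of this example.

```python
import re

def split_reannotation(name):
    """Hypothetical helper: split a harmonised name into cell type names
    and the relation symbols that join them."""
    tokens = re.split(r" (=|∈|∋) ", name)
    cell_types = tokens[0::2]  # names sit at even positions
    relations = tokens[1::2]   # symbols sit at odd positions
    return cell_types, relations

cell_types, relations = split_reannotation(
    "NK_CD16+ = NK_FCGR3Apos = nk cell ∈ NK/T Cell Spleen"
)
print(cell_types)  # ['NK_CD16+', 'NK_FCGR3Apos', 'nk cell', 'NK/T Cell Spleen']
print(relations)   # ['=', '=', '∈']
```

Splitting on the symbols with their surrounding spaces leaves names such as "NK/T Cell Spleen" or "NK_CD16+" intact.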

The last two columns place all cells under one naming schema. We will leverage them for supervised data integration.

First, add the two columns (reannotation and group) to the .obs of the adata.

[29]:
adata.obs[['reannotation', 'group']] = alignment.reannotation[['reannotation', 'group']].loc[adata.obs_names]
[30]:
adata.obs.iloc[:, -2:]
[30]:
                                      reannotation                                       group
AAACCTGCACATTTCT-1-HCATisStab7463846  MAIT = T_CD8_MAIT ∈ cd8-positive, alpha-beta m...  Group19
AAACCTGCACCGCTAG-1-HCATisStab7463846  NK_CD56bright_CD16- = NK_CD160pos = UNRESOLVED...  Group21
AAACCTGCAGTCCTTC-1-HCATisStab7463846  Trm/em_CD8 = T_CD8_activated ∈ cd8-positive, a...  Group19
AAACCTGCATTGGCGC-1-HCATisStab7463846  Naive B cells = B_mantle = naive b cell = B Ce...  Group18
AAACCTGCATTTCACT-1-HCATisStab7463846  NK_CD56bright_CD16- = NK_CD160pos = UNRESOLVED...  Group21
...                                   ...                                                ...
Spleen_cDNA_TTTGGTTTCGTCCAGG-1        Trm/em_CD8 = T_CD8_activated ∈ cd8-positive, a...  Group19
Spleen_cDNA_TTTGTCAAGGAGTTTA-1        Naive B cells = B_mantle = naive b cell = B Ce...  Group18
Spleen_cDNA_TTTGTCACAAGCGATG-1        NK_CD16+ = NK_FCGR3Apos = nk cell ∈ NK/T Cell ...  Group21
Spleen_cDNA_TTTGTCACAGGGATTG-1        NK_CD16+ = NK_FCGR3Apos = nk cell ∈ NK/T Cell ...  Group21
Spleen_cDNA_TTTGTCATCGTTGCCT-1        Teffector/EM_CD4 = T_CD4_conv ∈ naive thymus-d...  Group19

200664 rows × 2 columns
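
The .loc[adata.obs_names] step matters because the reannotation table is not guaranteed to list cells in the same order as the adata. The pandas sketch below shows the alignment on toy tables; the cell names and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs, indexed by cell name.
obs = pd.DataFrame({"donor_id": ["d1", "d2", "d3"]},
                   index=["cell_a", "cell_b", "cell_c"])

# Toy stand-in for alignment.reannotation, deliberately in a different row order.
reannotation = pd.DataFrame(
    {"reannotation": ["type 3", "type 1", "type 2"],
     "group": ["Group2", "Group1", "Group1"]},
    index=["cell_c", "cell_a", "cell_b"],
)

# Reorder rows by cell name so that they line up with obs.
aligned = reannotation.loc[obs.index, ["reannotation", "group"]]
assert aligned.index.equals(obs.index)

obs[["reannotation", "group"]] = aligned
print(obs)
```

Selecting with .loc also raises a KeyError if any cell name is missing from the reannotation table, which surfaces mismatches early.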

Integrate cells using cellhint.integrate, with donor_id as the batch confounder and reannotation as the cell annotation.

[31]:
cellhint.integrate(adata, 'donor_id', 'reannotation')
👀 `use_rep` is not specified, will use `'X_pca'` as the search space
[32]:
sc.tl.umap(adata)

Visualise the batches.

[33]:
sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)
... storing 'reannotation' as categorical
... storing 'group' as categorical
../_images/notebook_cellhint_tutorial_integration_75_1.png

Overlay the ground-truth annotations onto the UMAP.

[34]:
sc.pl.umap(adata, color = 'Curated_annotation')
../_images/notebook_cellhint_tutorial_integration_77_0.png

Alternatively, you can set the group (i.e., high-hierarchy cell types) as the cell annotation.

[35]:
cellhint.integrate(adata, 'donor_id', 'group')
sc.tl.umap(adata)
👀 `use_rep` is not specified, will use `'X_pca'` as the search space

Visualise the batches.

[36]:
sc.pl.umap(adata, color = ['Dataset', 'donor_id'], wspace = 0.5)
../_images/notebook_cellhint_tutorial_integration_81_0.png

Overlay the ground-truth annotations onto the UMAP.

[37]:
sc.pl.umap(adata, color = 'Curated_annotation')
../_images/notebook_cellhint_tutorial_integration_83_0.png

Examine the distribution of high- and low-hierarchy cell types.

[38]:
sc.pl.umap(adata, color = 'group', legend_loc = 'on data')
sc.pl.umap(adata, color = 'reannotation')
../_images/notebook_cellhint_tutorial_integration_85_0.png
../_images/notebook_cellhint_tutorial_integration_85_1.png

Lastly, as an example, we manually inspect some cell types.

Identify the components of the high-hierarchy cell type Group21.

[39]:
alignment.relation[alignment.groups == 'Group21']
[39]:
   Dominguez Conde et al. 2022  relation  Madissoon et al. 2020  relation  Tabula Sapiens 2022                                relation  He et al. 2020
6  NK_CD16+                     =         NK_FCGR3Apos           =         nk cell                                            ∈         NK/T Cell Spleen
7  NK_CD56bright_CD16-          =         NK_CD160pos            =         UNRESOLVED                                         ∈         NK/T Cell Spleen
8  Tem/emra_CD8                 =         T_CD8_CTL              =         naive thymus-derived cd8-positive, alpha-beta ...  ∈         NK/T Cell Spleen

This table shows that three low-hierarchy cell types (corresponding to three rows) collectively constitute the high-hierarchy cell type Group21.

[40]:
cellhint.treeplot(alignment.relation[alignment.groups == 'Group21'], figsize = [15, 4])
../_images/notebook_cellhint_tutorial_integration_90_0.png

List the three low-hierarchy cell types.

[41]:
import numpy as np
low_cell_types = np.unique(adata.obs.reannotation[adata.obs.group == 'Group21'])
low_cell_types
[41]:
array(['NK_CD16+ = NK_FCGR3Apos = nk cell ∈ NK/T Cell Spleen',
       'NK_CD56bright_CD16- = NK_CD160pos = UNRESOLVED ∈ NK/T Cell Spleen',
       'Tem/emra_CD8 = T_CD8_CTL = naive thymus-derived cd8-positive, alpha-beta t cell ∈ NK/T Cell Spleen'],
      dtype=object)
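
The same lookup can be done for every group at once with a pandas groupby. The sketch below uses a toy table in place of adata.obs; the group and cell type names are illustrative only.

```python
import pandas as pd

# Toy stand-in for adata.obs with the two harmonisation columns.
obs = pd.DataFrame({
    "group": ["Group1", "Group1", "Group1", "Group2", "Group2"],
    "reannotation": ["type a", "type b", "type a", "type c", "type c"],
})

# Sorted list of low-hierarchy cell types within each high-hierarchy group.
members = obs.groupby("group")["reannotation"].unique().apply(sorted)
print(members)

# Number of low-hierarchy cell types per group.
n_members = obs.groupby("group")["reannotation"].nunique()
print(n_members)
```

On the real data, members['Group21'] would be expected to list the same three cell types as the np.unique call above.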

Plot the distribution of the three cell types in the UMAP.

[42]:
sc.pl.umap(adata, color = 'group', groups = 'Group21', size = 5)
sc.pl.umap(adata, color = 'reannotation', groups = list(low_cell_types), size = 5)
../_images/notebook_cellhint_tutorial_integration_94_0.png
../_images/notebook_cellhint_tutorial_integration_94_1.png