Harmonization function

PCT-based cell type harmonization across datasets/batches.

Parameters:

adata – An AnnData object containing different datasets/batches and cell types. In most scenarios, the format of the expression .X in the AnnData is flexible (normalized, log-normalized, z-scaled, etc.). However, when use_rep is specified as ‘X’ (or X_pca is not detected in .obsm and no other latent representations are provided), .X should be log-normalized (to a constant total count per cell).
dataset – Column name (key) of cell metadata specifying dataset/batch information.
cell_type – Column name (key) of cell metadata specifying cell type information.
use_rep – Representation used to calculate distances. This can be ‘X’ or any representation stored in .obsm. Defaults to the PCA coordinates if present (if not, the expression matrix X is used).
metric – Metric used to calculate the distance between each cell and each cell type. Can be ‘euclidean’, ‘cosine’, ‘manhattan’ or any metric accepted by sklearn.metrics.pairwise_distances(). Defaults to ‘euclidean’ if a latent representation is used for calculating distances, and to ‘correlation’ if the expression matrix is used.
use_pct – Whether to use a predictive clustering tree to infer cross-dataset cell type distances. Setting to True will calculate distances based on PCT, which is intended for datasets with large batch effects. (Default: False)
filter_cells – Whether to filter out cells whose gene expression profiles do not correlate most with the eigen cell they belong to (i.e., cells that correlate most with other cell types). Setting to True will speed up the run because only a subset of cells is used, but will leave the filtered-out cells unannotated (see the reannotate argument). (Default: False)
normalize – Whether to normalize the distance matrix if use_pct = False (or normalize the predicted distance if use_pct = True). (Default: True)
Gaussian_kernel – Whether to apply the Gaussian kernel to the distance matrix. (Default: False)
F_test_prune – Whether to use an F-test to prune the tree by removing unnecessary splits. (Default: True)
p_thres – p-value threshold for pruning nodes after F-test. (Default: 0.05)
random_state – Random seed for feature shuffling during PCT training. (Default: 2)
dataset_order – Order of datasets to be aligned. If this argument is specified, reorder_dataset is ignored. If it is not specified and reorder_dataset = False, defaults to the order in the distance matrix (alphabetical order in most cases).
reorder_dataset – Whether to reorder datasets based on their pairwise similarities. (Default: True)
minimum_unique_percents – The minimum cell assignment fraction(s) to claim a cell type as uniquely matched to a cell type from the other dataset. By default, five values will be tried (0.4, 0.5, 0.6, 0.7, 0.8) to find the one that produces the fewest alignments in each harmonization iteration.
minimum_divide_percents – The minimum cell assignment fraction(s) to claim a cell type as divisible into two or more cell types from the other dataset. By default, three values will be tried (0.1, 0.15, 0.2) to find the one that produces the fewest alignments in each harmonization iteration.
maximum_novel_percent – The maximum cell assignment fraction to claim a cell type as novel to a given dataset. (Default: 0.05)
reannotate – Whether to reannotate cells into harmonized cell types. (Default: True)
prefix – Column prefix for the reannotation data frame.
**kwargs – Other keyword arguments passed to PredictiveClusteringTree.
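
The default choices described for use_rep and metric can be summarized in a short sketch. The helper name resolve_defaults and the KeyError raised for a missing representation are assumptions made for illustration only; this is not the library's code, just one reading of the documented fallback rules.

```python
from typing import Iterable, Optional, Tuple


def resolve_defaults(obsm_keys: Iterable[str],
                     use_rep: Optional[str] = None,
                     metric: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper mirroring the documented defaults (not library code)."""
    keys = set(obsm_keys)
    if use_rep is None:
        # Prefer PCA coordinates when present, otherwise the expression matrix.
        use_rep = "X_pca" if "X_pca" in keys else "X"
    elif use_rep != "X" and use_rep not in keys:
        # Assumption: an unknown representation is treated as an error here.
        raise KeyError(f"'{use_rep}' not found in .obsm")
    if metric is None:
        # 'correlation' for the expression matrix, 'euclidean' for latent spaces.
        metric = "correlation" if use_rep == "X" else "euclidean"
    return use_rep, metric
```

When this sketch falls back to ‘X’, the requirement stated for adata applies: .X should be log-normalized.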

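The selection of minimum_unique_percents and minimum_divide_percents from their candidate values can be pictured as a small grid search. The sketch below assumes the two lists are searched jointly and that ties go to the first combination tried; neither detail is stated above. pick_thresholds and count_alignments are hypothetical names, with count_alignments standing in for one harmonization iteration that reports how many alignments a given pair of thresholds produces.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def pick_thresholds(count_alignments: Callable[[float, float], int],
                    unique_percents: Sequence[float] = (0.4, 0.5, 0.6, 0.7, 0.8),
                    divide_percents: Sequence[float] = (0.1, 0.15, 0.2)
                    ) -> Tuple[float, float]:
    """Return the (unique, divide) pair that yields the fewest alignments.

    Assumption: every combination of the two candidate lists is evaluated.
    """
    # min() keeps the first pair encountered when several pairs tie.
    return min(product(unique_percents, divide_percents),
               key=lambda pair: count_alignments(*pair))


def toy_count(unique: float, divide: float) -> int:
    # Toy stand-in for a harmonization iteration; smallest at (0.6, 0.15).
    return int(round(10 * abs(unique - 0.6) + 100 * abs(divide - 0.15)))


best_unique, best_divide = pick_thresholds(toy_count)
```

Passing a single value for either list fixes that threshold and skips the search along that axis.
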
Returns:

A DistanceAlignment object. Its four most important attributes are: 1) base_distance, the cross-dataset distances between all cells and all cell types; 2) relation, the harmonization table; 3) groups, the high-hierarchy cell types that categorize rows of the harmonization table; 4) reannotation, the reannotated cell types and cell type groups.

Return type:

DistanceAlignment