Harmonisation result
- class cellhint.align.DistanceAlignment(base_distance: Distance, check: bool = True, dataset_order: list | tuple | ndarray | Series | Index | None = None, row_normalize: bool = True, minimum_unique_percent: float = 0.5, minimum_divide_percent: float = 0.1, maximum_novel_percent: float = 0.05)
Bases: object
Class that performs cell type label harmonization across datasets.
- Parameters:
base_distance – A Distance object.
check – Whether to check that the supplied base_distance is correctly provided. (Default: True)
dataset_order – Order of datasets to be aligned. By default, the order is the same as that in the base distance matrix.
row_normalize – Whether to row normalize the confusion matrix to a sum of 1 in each iteration. (Default: True)
minimum_unique_percent – The minimum cell assignment fraction to claim a cell type as uniquely matched to a cell type from the other dataset. (Default: 0.5)
minimum_divide_percent – The minimum cell assignment fraction to claim a cell type as divisible into two or more cell types from the other dataset. (Default: 0.1)
maximum_novel_percent – The maximum cell assignment fraction to claim a cell type as novel to a given dataset. (Default: 0.05)
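To make the three thresholds concrete, here is a minimal sketch of how they could partition one row of the row-normalized confusion matrix. It is an illustration only, not CellHint's actual implementation: the function name `classify_row`, the order of the checks, and the "unresolved" fallback are all assumptions.

```python
def classify_row(fractions, minimum_unique_percent=0.5,
                 minimum_divide_percent=0.1, maximum_novel_percent=0.05):
    """Classify one cell type from its assignment fractions (hypothetical helper).

    `fractions` maps each cell type of the other dataset to the fraction of
    cells assigned to it; the values sum to 1 when the row is normalized.
    """
    # Novel: no cell type in the other dataset receives more than the cap.
    if all(f <= maximum_novel_percent for f in fractions.values()):
        return "novel", []
    # Unique: a single cell type receives at least the unique threshold.
    best = max(fractions, key=fractions.get)
    if fractions[best] >= minimum_unique_percent:
        return "unique", [best]
    # Divisible: two or more cell types each receive at least the divide threshold.
    parts = sorted(t for t, f in fractions.items() if f >= minimum_divide_percent)
    if len(parts) >= 2:
        return "divisible", parts
    # Assumed fallback for rows that satisfy none of the rules above.
    return "unresolved", parts
```

With the default values, a row such as `{"T cell": 0.9, "B cell": 0.1}` is classified as unique, while an even split across two cell types is classified as divisible.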
- dataset_order
Order of datasets to be aligned.
- row_normalize
Whether to row normalize the confusion matrix to a sum of 1 in each iteration.
- minimum_unique_percent
The minimum cell assignment fraction to claim a cell type as uniquely matched to a cell type from the other dataset.
- minimum_divide_percent
The minimum cell assignment fraction to claim a cell type as divisible into two or more cell types from the other dataset.
- maximum_novel_percent
The maximum cell assignment fraction to claim a cell type as novel to a given dataset.
- aligned_datasets
List of datasets that are already harmonized.
- groups
Cell type groups (high-hierarchy cell types) categorizing the rows of .relation.
- minimum_unique_percents
List of minimum_unique_percent values used across harmonization iterations to get the best alignment. This attribute is obtained through the best_align() function.
- minimum_divide_percents
List of minimum_divide_percent values used across harmonization iterations to get the best alignment. This attribute is obtained through the best_align() function.
- align(datasets: list | tuple | ndarray | Series | Index | None = None) → None
Iterative alignment of cell types across datasets.
- Parameters:
datasets – Datasets to be aligned. Defaults to all available datasets.
- Returns:
A DataFrame with multiple columns, added as the attribute .relation: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the last dataset, cell types from the last dataset.
- Return type:
None
- best_align(dataset_order: list | tuple | ndarray | Series | Index | None = None, minimum_unique_percents: list | tuple | ndarray | Series | Index | float = (0.4, 0.5, 0.6, 0.7, 0.8), minimum_divide_percents: list | tuple | ndarray | Series | Index | float = (0.1, 0.15, 0.2))
Iterative alignment of cell types across datasets, selecting the best parameter combination in each iteration.
- Parameters:
dataset_order – Order of datasets to be aligned. This can also be a subset of datasets. Defaults to the dataset order in the DistanceAlignment object.
minimum_unique_percents – The minimum cell assignment fraction(s) to claim a cell type as uniquely matched to a cell type from the other dataset. By default, five values are tried (0.4, 0.5, 0.6, 0.7, 0.8) to find the one that produces the fewest alignments in each harmonization iteration.
minimum_divide_percents – The minimum cell assignment fraction(s) to claim a cell type as divisible into two or more cell types from the other dataset. By default, three values are tried (0.1, 0.15, 0.2) to find the one that produces the fewest alignments in each harmonization iteration.
- Returns:
A DataFrame with multiple columns, added as the attribute .relation: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the last dataset, cell types from the last dataset.
- Return type:
None
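The parameter search described for best_align() can be pictured as a small grid search over the two candidate lists. The sketch below is an assumption-laden illustration, not the library's code: `pick_best_combo` and the `count_alignments` callback are hypothetical names, and the callback stands in for whatever one harmonization iteration would report as its number of alignments.

```python
from itertools import product

def pick_best_combo(count_alignments,
                    minimum_unique_percents=(0.4, 0.5, 0.6, 0.7, 0.8),
                    minimum_divide_percents=(0.1, 0.15, 0.2)):
    """Return the (unique, divide) pair that yields the fewest alignments.

    `count_alignments(unique, divide)` is a hypothetical callback returning
    the number of alignments produced with those two thresholds.
    """
    # Try every combination of the candidate thresholds (5 x 3 by default).
    combos = product(minimum_unique_percents, minimum_divide_percents)
    # min() keeps the first combination in case of ties.
    return min(combos, key=lambda combo: count_alignments(*combo))


def toy_count(unique, divide):
    """Toy stand-in: pretend (0.6, 0.15) gives 7 alignments and all others 10."""
    return {(0.6, 0.15): 7}.get((unique, divide), 10)


best = pick_best_combo(toy_count)
```

In this toy setting `best` is `(0.6, 0.15)`. The per-iteration winners would correspond to what the class stores in minimum_unique_percents and minimum_divide_percents.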
- multi_align(relation: DataFrame, D: str, check: bool = True) → DataFrame
Multiple alignment of cell types across datasets. Cell types from a new dataset will be integrated into the previous harmonization data frame.
- Parameters:
relation – A DataFrame object representing the cell type harmonization result across multiple datasets.
D – Name of the new dataset to be aligned.
check – Whether to check that the names of the datasets are contained. (Default: True)
- Returns:
A DataFrame with multiple columns: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the new dataset, cell types from the new dataset.
- Return type:
DataFrame
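One way to read the relationship between pairwise and multiple alignment is as a schedule: the first two datasets are aligned pairwise, and each later dataset is integrated into the table built so far. The sketch below encodes that reading only. `alignment_schedule` is a hypothetical helper, and the assumption that align() proceeds in exactly this order is not confirmed by this page.

```python
def alignment_schedule(dataset_order):
    """List the alignment steps implied by an ordered list of dataset names.

    Each step is (kind, already_aligned, new_dataset). This is an assumed
    reading of the documentation, not CellHint's actual control flow.
    """
    if len(dataset_order) < 2:
        raise ValueError("at least two datasets are needed for an alignment")
    first, second, *rest = dataset_order
    # Step 1: pairwise alignment between the first two datasets.
    steps = [("pairwise", [first], second)]
    aligned = [first, second]
    # Later steps: integrate one new dataset into the existing result.
    for new_dataset in rest:
        steps.append(("multi", list(aligned), new_dataset))
        aligned.append(new_dataset)
    return steps
```

For three datasets this yields one pairwise step followed by one multiple-alignment step, and the growing `aligned` list plays the role of the aligned_datasets attribute.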
- pairwise_align(D1: str, D2: str, check: bool = True) → DataFrame
Pairwise alignment of cell types between two datasets.
- Parameters:
D1 – Name of the first dataset.
D2 – Name of the second dataset.
check – Whether to check that the names of the two datasets are contained in the base_distance. (Default: True)
- Returns:
A DataFrame with three columns: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2.
- Return type:
DataFrame
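The three-column layout is easier to picture with a few rows. The sketch below uses made-up cell type names and plain tuples rather than a DataFrame. The reading of the symbols (‘=’ as a one-to-one match, ‘∋’ as "contains", ‘∈’ as "is part of") follows their usual set-theoretic meaning and is an assumption here, as are the names `rows`, `DESCRIPTIONS` and `describe`.

```python
# Hypothetical pairwise result: (cell type in D1, relation, cell type in D2).
rows = [
    ("T cell", "∋", "CD4 T cell"),
    ("T cell", "∋", "CD8 T cell"),
    ("B cell", "=", "B cell"),
    ("cDC1", "∈", "Dendritic cell"),
]

# Assumed plain-language reading of each relation symbol.
DESCRIPTIONS = {"=": "matches", "∋": "contains", "∈": "is part of"}

def describe(row, d1="D1", d2="D2"):
    """Spell out one relation row as a sentence fragment."""
    left, relation, right = row
    return f"{d1}:{left} {DESCRIPTIONS[relation]} {d2}:{right}"

summary = [describe(row) for row in rows]
```

Here a coarse cell type in one dataset spans two rows, one per finer cell type in the other dataset.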
- reannotate(prefix: str = '') → None
Reannotate each cell into the harmonized cell type.
- Parameters:
prefix – Prefix of the harmonization columns. Defaults to no prefix.
- Returns:
A DataFrame with multiple columns, added as the attribute .reannotation: 1) dataset, datasets where the cells are from. 2) cell_type, cell types annotated by the original studies/datasets. 3) reannotation, prefixed with prefix; cell types reannotated by the harmonization process. 4) group, prefixed with prefix; annotated cell type groups.
- Return type:
None
- reorder_dataset(weights: list | tuple | ndarray | Series | Index = (2, 1, -1, -2), return_similarity: bool = False) → None | DataFrame
Reorder the datasets such that similar datasets will be harmonized first. This method can also be used to calculate CellHint-defined inter-dataset similarities.
- Parameters:
weights – Weights assigned to one-to-one, one-to-many or many-to-one, novel, and remaining cell type matches, respectively. Defaults to (2, 1, -1, -2). Inter-cell-type similarities are weighted by these values to derive the weighted sum of similarity between each pair of datasets.
return_similarity – Whether to return the data frame of dataset-dataset similarities. (Default: False)
- Returns:
Reordered datasets as the attribute .dataset_order. If return_similarity = True, also returns a DataFrame of dataset-dataset similarities.
- Return type:
None | DataFrame
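As a rough illustration of the weighted sum described for weights, the sketch below scores one pair of datasets from a list of categorized matches. The category labels, the `dataset_similarity` name, and the input format are hypothetical; the page does not specify how CellHint computes the individual similarities.

```python
# Assumed category labels, in the same order as the weights parameter.
CATEGORIES = ("one_to_one", "one_to_many_or_many_to_one", "novel", "remaining")

def dataset_similarity(matches, weights=(2, 1, -1, -2)):
    """Weighted sum of inter-cell-type similarities for one dataset pair.

    `matches` is a list of (category, similarity) tuples, where category is
    one of CATEGORIES and similarity is a number.
    """
    weight_of = dict(zip(CATEGORIES, weights))
    return sum(weight_of[category] * similarity for category, similarity in matches)


# Toy input: one good one-to-one match, one novel and one remaining cell type.
score = dataset_similarity([("one_to_one", 0.5), ("novel", 0.25), ("remaining", 0.125)])
```

Dataset pairs with higher scores would then be harmonized earlier, which is the ordering idea the method describes.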
- update(datasets: str | list | tuple | ndarray | Series | Index | None = None) → None
Iteratively update the alignment of cell types across datasets.
- Parameters:
datasets – Datasets to be aligned. Defaults to all the remaining datasets.
- Returns:
An updated DataFrame with multiple columns, added as the attribute .relation: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the last dataset, cell types from the last dataset.
- Return type:
None