Harmonisation result

class cellhint.align.DistanceAlignment(base_distance: Distance, check: bool = True, dataset_order: list | tuple | ndarray | Series | Index | None = None, row_normalize: bool = True, minimum_unique_percent: float = 0.5, minimum_divide_percent: float = 0.1, maximum_novel_percent: float = 0.05)

Bases: object

Class that performs cell type label harmonization across datasets.

Parameters:
  • base_distance – A Distance object.

  • check – Whether to check that the supplied base_distance is valid. (Default: True)

  • dataset_order – Order of datasets to be aligned. By default, the order is the same as that in the base distance matrix.

  • row_normalize – Whether to row normalize the confusion matrix to a sum of 1 in each iteration. (Default: True)

  • minimum_unique_percent – The minimum cell assignment fraction to claim a cell type as uniquely matched to a cell type from the other dataset. (Default: 0.5)

  • minimum_divide_percent – The minimum cell assignment fraction to claim a cell type as divisible into two or more cell types from the other dataset. (Default: 0.1)

  • maximum_novel_percent – The maximum cell assignment fraction to claim a cell type as novel to a given dataset. (Default: 0.05)
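The three thresholds can be read as cut-offs applied to one row of the row-normalized confusion matrix. The sketch below is a toy interpretation under stated assumptions, not CellHint's implementation: row_normalize, classify_row, the category labels, and the order in which the rules are tested are all hypothetical.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows are returned unchanged."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total for v in row] if total else list(row))
    return normalized


def classify_row(fractions, minimum_unique_percent=0.5,
                 minimum_divide_percent=0.1, maximum_novel_percent=0.05):
    """Label one cell type from its assignment fractions (assumed rules)."""
    top = max(fractions)
    if top <= maximum_novel_percent:
        return "novel"
    if top >= minimum_unique_percent:
        return "unique"
    # Count partners that each receive a non-trivial share of the cells.
    partners = sum(f >= minimum_divide_percent for f in fractions)
    return "divisible" if partners >= 2 else "unresolved"


# Hypothetical confusion counts: rows are cell types of one dataset,
# columns are cell types of the other dataset.
confusion = [[90, 10, 0], [45, 40, 15], [2, 1, 97]]
labels = [classify_row(row) for row in row_normalize(confusion)]
```

Raising minimum_unique_percent in this toy makes one-to-one matches harder to claim, while raising minimum_divide_percent makes splits harder to claim.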

base_distance

The Distance object.

dataset_order

Order of datasets to be aligned.

row_normalize

Whether to row normalize the confusion matrix to a sum of 1 in each iteration.

minimum_unique_percent

The minimum cell assignment fraction to claim a cell type as uniquely matched to a cell type from the other dataset.

minimum_divide_percent

The minimum cell assignment fraction to claim a cell type as divisible into two or more cell types from the other dataset.

maximum_novel_percent

The maximum cell assignment fraction to claim a cell type as novel to a given dataset.

relation

A DataFrame representing the harmonization result.

aligned_datasets

List of datasets that are already harmonized.

groups

Cell type groups (high-hierarchy cell types) categorizing the rows of .relation.

reannotation

A DataFrame representing the reannotated cell types.

minimum_unique_percents

List of minimum_unique_percent values used across harmonization iterations to obtain the best alignment. This attribute is set by the best_align() method.

minimum_divide_percents

List of minimum_divide_percent values used across harmonization iterations to obtain the best alignment. This attribute is set by the best_align() method.

align(datasets: list | tuple | ndarray | Series | Index | None = None) → None

Iterative alignment of cell types across datasets.

Parameters:

datasets – Datasets to be aligned. Defaults to using all available datasets.

Returns:

Nothing is returned. The result is stored in the attribute .relation as a DataFrame with multiple columns: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the last dataset, cell types from the last dataset.

Return type:

None

property aligned_datasets: ndarray

Get the datasets which are already harmonized.

best_align(dataset_order: list | tuple | ndarray | Series | Index | None = None, minimum_unique_percents: list | tuple | ndarray | Series | Index | float = (0.4, 0.5, 0.6, 0.7, 0.8), minimum_divide_percents: list | tuple | ndarray | Series | Index | float = (0.1, 0.15, 0.2))

Iterative alignment of cell types across datasets by finding the best parameter combo in each iteration.

Parameters:
  • dataset_order – Order of datasets to be aligned. This can also be a subset of datasets. Defaults to the dataset order in the DistanceAlignment object.

  • minimum_unique_percents – The minimum cell assignment fraction(s) to claim a cell type as uniquely matched to a cell type from the other dataset. By default, five values are tried (0.4, 0.5, 0.6, 0.7, 0.8) to find the one that produces the fewest alignments in each harmonization iteration.

  • minimum_divide_percents – The minimum cell assignment fraction(s) to claim a cell type as divisible into two or more cell types from the other dataset. By default, three values are tried (0.1, 0.15, 0.2) to find the one that produces the fewest alignments in each harmonization iteration.

Returns:

Nothing is returned. The result is stored in the attribute .relation as a DataFrame with multiple columns: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the last dataset, cell types from the last dataset.

Return type:

None
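The parameter search in best_align() amounts to a small grid search: with the default values, five times three gives 15 combinations per iteration. A minimal sketch, assuming a hypothetical callable count_alignments that reports how many alignments a given pair of thresholds produces; how CellHint counts alignments and breaks ties is not specified here, and best_combo is an illustrative name only.

```python
from itertools import product


def best_combo(count_alignments,
               minimum_unique_percents=(0.4, 0.5, 0.6, 0.7, 0.8),
               minimum_divide_percents=(0.1, 0.15, 0.2)):
    """Return the (unique, divide) threshold pair with the fewest alignments.

    `count_alignments(unique, divide)` is a hypothetical scoring callable.
    Ties go to the combination that appears first in the grid.
    """
    combos = list(product(minimum_unique_percents, minimum_divide_percents))
    return min(combos, key=lambda combo: count_alignments(*combo))


def toy_count(unique, divide):
    """Made-up score: one combination yields fewer alignments than the rest."""
    return 7 if (unique, divide) == (0.6, 0.15) else 10


chosen = best_combo(toy_count)
```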

property groups: ndarray

Get the cell type groups (high hierarchy) based on the relation table.

static load(alignment_file: str)

Load the DistanceAlignment file.

multi_align(relation: DataFrame, D: str, check: bool = True) → DataFrame

Multiple alignment of cell types across datasets. Cell types from a new dataset will be integrated into the previous harmonization data frame.

Parameters:
  • relation – A DataFrame object representing the cell type harmonization result across multiple datasets.

  • D – Name of the new dataset to be aligned.

  • check – Whether to check that the names of the datasets are contained. (Default: True)

Returns:

A DataFrame with multiple columns: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the new dataset, cell types from the new dataset.

Return type:

DataFrame
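One way to picture how a multi-dataset harmonization could be assembled from pairwise_align() and multi_align() is a simple fold over the dataset order. This is a sketch under the assumption that the first two datasets are aligned pairwise and each later dataset is merged into the running result; iterative_align and the stand-in callables are hypothetical and do not call CellHint.

```python
def iterative_align(dataset_order, pairwise_align, multi_align):
    """Fold datasets into one relation table, one dataset at a time."""
    if len(dataset_order) < 2:
        raise ValueError("at least two datasets are needed")
    first, second, *rest = dataset_order
    relation = pairwise_align(first, second)
    for name in rest:
        # Each new dataset is integrated into the previous result.
        relation = multi_align(relation, name)
    return relation


# Stand-in callables that only record the order in which datasets are merged.
order = iterative_align(
    ["D1", "D2", "D3", "D4"],
    pairwise_align=lambda d1, d2: [d1, d2],
    multi_align=lambda relation, d: relation + [d],
)
```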

pairwise_align(D1: str, D2: str, check: bool = True) → DataFrame

Pairwise alignment of cell types between two datasets.

Parameters:
  • D1 – Name of the first dataset.

  • D2 – Name of the second dataset.

  • check – Whether to check that the names of the two datasets are contained in the base_distance. (Default: True)

Returns:

A DataFrame with three columns: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2.

Return type:

DataFrame
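To make the three-column layout concrete, the sketch below spells out each row of a toy result in words. The relation symbols are read in their usual set-theoretic sense (‘∋’ as "contains", ‘∈’ as "is contained in"), which is an assumption here; the cell type names and the helper describe are made up, and rows are plain tuples rather than DataFrame rows to keep the example dependency-free.

```python
# Assumed reading of the three relation symbols.
RELATION_TEXT = {"=": "matches", "∋": "contains", "∈": "is contained in"}


def describe(rows, d1, d2):
    """Turn (cell type in d1, relation, cell type in d2) rows into sentences."""
    return [f"{d1} '{a}' {RELATION_TEXT[rel]} {d2} '{b}'" for a, rel, b in rows]


# Hypothetical pairwise result for two datasets named D1 and D2.
rows = [
    ("T cell", "=", "T cell"),
    ("Monocyte", "∋", "Classical monocyte"),
    ("Naive B cell", "∈", "B cell"),
]
sentences = describe(rows, "D1", "D2")
```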

reannotate(prefix: str = '') → None

Reannotate each cell into the harmonized cell type.

Parameters:

prefix – Prefix of the harmonization columns. Defaults to no prefix.

Returns:

Nothing is returned. The result is stored in the attribute .reannotation as a DataFrame with multiple columns: 1) dataset, datasets where the cells are from. 2) cell_type, cell types annotated by the original studies/datasets. 3) reannotation, prefixed with prefix; cell types reannotated by the harmonization process. 4) group, prefixed with prefix; annotated cell type groups.

Return type:

None
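The effect of prefix on the column names can be sketched as follows. This reads "prefixed with prefix" as plain string concatenation, which is an assumption; reannotation_columns is a hypothetical helper, not part of CellHint.

```python
def reannotation_columns(prefix=""):
    """Column names of the reannotation table for a given prefix (assumed)."""
    return ["dataset", "cell_type", f"{prefix}reannotation", f"{prefix}group"]


default_columns = reannotation_columns()
prefixed_columns = reannotation_columns("harmonized_")
```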

reorder_dataset(weights: list | tuple | ndarray | Series | Index = (2, 1, -1, -2), return_similarity: bool = False) → None | DataFrame

Reorder the datasets such that similar datasets will be harmonized first. This method can also be used to calculate CellHint-defined inter-dataset similarities.

Parameters:
  • weights – Weights assigned to one-to-one, one-to-many/many-to-one, novel, and remaining cell type matches, respectively. Defaults to (2, 1, -1, -2). Inter-cell-type similarities are weighted by these values to derive the weighted sum of similarity between each pair of datasets.

  • return_similarity – Whether to return the data frame of dataset-dataset similarities. (Default: False)

Returns:

The reordered datasets are stored in the attribute .dataset_order. If return_similarity = True, a DataFrame of dataset-dataset similarities is also returned.
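The weighted sum described for weights can be illustrated with a toy calculation for a single pair of datasets. This is a sketch under stated assumptions: the category names, the input format, and weighted_similarity are hypothetical, and how CellHint derives the per-match similarities is not shown.

```python
def weighted_similarity(matches, weights=(2, 1, -1, -2)):
    """Weighted sum of inter-cell-type similarities for one dataset pair.

    `matches` is a list of (category, similarity) tuples, where category is
    one of the four assumed labels below, in the same order as `weights`.
    """
    categories = ("one_to_one", "one_to_many", "novel", "remaining")
    weight_of = dict(zip(categories, weights))
    return sum(weight_of[category] * similarity
               for category, similarity in matches)


# Made-up matches between two datasets.
matches = [("one_to_one", 0.9), ("one_to_many", 0.5), ("novel", 0.2)]
score = weighted_similarity(matches)
```

With the default weights, one-to-one matches raise the score of a dataset pair the most, while novel and remaining cell types lower it, so pairs with higher scores would be candidates for earlier harmonization.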

update(datasets: str | list | tuple | ndarray | Series | Index | None = None) → None

Iteratively update the alignment of cell types across datasets.

Parameters:

datasets – Datasets to be aligned. Defaults to using all the remaining datasets.

Returns:

Nothing is returned. The updated result is stored in the attribute .relation as a DataFrame with multiple columns: 1) name of dataset 1, cell types from dataset 1. 2) relation, being either ‘=’, ‘∋’ or ‘∈’. 3) name of dataset 2, cell types from dataset 2. 4) … N) name of the last dataset, cell types from the last dataset.

Return type:

None

write(file: str) → None

Write out the DistanceAlignment.