Distance structure

class cellhint.distance.Distance(dist_mat: ndarray, cell: DataFrame, cell_type: DataFrame)

Bases: object

Class that deals with the cross-dataset cell-by-cell-type distance matrix.

Parameters:

dist_mat – Cell-by-cell-type distance matrix.
cell – Cell meta-information including at least ‘dataset’, ‘ID’ and ‘cell_type’.
cell_type – Cell type meta-information including at least ‘dataset’ and ‘cell_type’.

dist_mat: A cell-by-cell-type distance matrix.

cell: Cell meta-information including ‘dataset’, ‘ID’ and ‘cell_type’.

cell_type: Cell type meta-information including ‘dataset’ and ‘cell_type’.

n_cell: Number of cells involved.

n_cell_type: Number of cell types involved.

shape: Tuple of number of cells and cell types.

assignment: Assignment of each cell to the most similar cell type in each dataset (obtained through the assign method).

assign() → None

Assign each cell to its most similar cell type in each dataset.

Returns: Nothing. The object is modified in place, with the result of cell assignment added as .assignment.
Return type: None
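
The idea behind assign can be illustrated with a small NumPy sketch. This is not the library's implementation: it assumes that a smaller distance means a more similar cell type, and the names assign_sketch, col_dataset and col_cell_type are hypothetical.

```python
import numpy as np

# Toy cell-by-cell-type distance matrix: 3 cells (rows) x 4 cell types (columns).
dist_mat = np.array([
    [0.1, 0.9, 0.4, 0.2],
    [0.8, 0.3, 0.7, 0.6],
    [0.5, 0.2, 0.1, 0.9],
])
# Column metadata (hypothetical names): dataset and label of each cell type.
col_dataset = np.array(["D1", "D1", "D2", "D2"])
col_cell_type = np.array(["T", "B", "T cell", "B cell"])


def assign_sketch(dist_mat, col_dataset, col_cell_type):
    """For each dataset, pick the cell type with the smallest distance per cell.

    Assumption: smaller distance means more similar.
    """
    assignment = {}
    for ds in np.unique(col_dataset):
        # Columns that belong to this dataset.
        cols = np.flatnonzero(col_dataset == ds)
        # Index of the nearest cell type within those columns, per cell.
        nearest = cols[np.argmin(dist_mat[:, cols], axis=1)]
        assignment[str(ds)] = col_cell_type[nearest].tolist()
    return assignment


assignment = assign_sketch(dist_mat, col_dataset, col_cell_type)
```

Each cell ends up with one assigned cell type per dataset, which mirrors the shape of information described for the .assignment attribute.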

concatenate(*distances, by: str = 'cell', check: bool = False)

Concatenate by either cells (rows) or cell types (columns).

Parameters:

distances – A Distance object or a list of such objects.
by – The direction of concatenation, joining either cells (‘cell’, rows) or cell types (‘cell_type’, columns). (Default: ‘cell’)
check – Check whether the concatenation is feasible. (Default: False)

Returns:

A Distance object concatenated along cells (by = ‘cell’) or cell types (by = ‘cell_type’).

Return type:

Distance
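
As a rough illustration of the two concatenation directions, the sketch below operates on plain NumPy arrays rather than Distance objects. The function concatenate_sketch and its labels argument are hypothetical, and the feasibility check shown (the shared axis must carry identical labels) is an assumption about what check verifies.

```python
import numpy as np


def concatenate_sketch(mats, labels, by="cell", check=False):
    """Concatenate 2-D arrays along rows (by='cell') or columns (by='cell_type').

    labels holds, for each matrix, the labels of the axis that is NOT being
    concatenated; they must agree for the result to make sense.
    """
    if by not in ("cell", "cell_type"):
        raise ValueError("by must be 'cell' or 'cell_type'")
    if check and any(list(l) != list(labels[0]) for l in labels[1:]):
        raise ValueError("shared axis labels differ; concatenation is not feasible")
    axis = 0 if by == "cell" else 1
    return np.concatenate(mats, axis=axis)


a = np.zeros((2, 3))
b = np.ones((4, 3))
# Joining cells: both matrices share the same three cell types (columns).
rows = concatenate_sketch([a, b], [["x", "y", "z"], ["x", "y", "z"]], by="cell", check=True)

c = np.ones((2, 5))
# Joining cell types: both matrices share the same two cells (rows).
cols = concatenate_sketch([a, c], [["c1", "c2"], ["c1", "c2"]], by="cell_type", check=True)
```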

filter_cells(check_symmetry: bool = True) → None

Filter out cells whose gene expression profiles do not correlate most with the eigen cell they belong to (i.e., correlate most with other cell types).

Parameters: check_symmetry – Whether to check the symmetry of the distance matrix in terms of datasets and cell types. (Default: True)
Returns: Nothing. The Distance object is modified in place, with undesirable cells filtered out.
Return type: None

static from_adata(adata: AnnData, dataset: str, cell_type: str, use_rep: str | None = None, metric: str | None = None, n_jobs: int | None = None, check_params: bool = True, **kwargs)

Generate a Distance object from the AnnData given.

Parameters:

adata – An AnnData object containing different datasets/batches and cell types. In most scenarios, the format of the expression .X in the AnnData is flexible (normalized, log-normalized, z-scaled, etc.). However, when use_rep is specified as ‘X’ (or X_pca is not detected in .obsm and no other latent representations are provided), .X should be log-normalized (to a constant total count per cell).
dataset – Column name (key) of cell metadata specifying dataset information.
cell_type – Column name (key) of cell metadata specifying cell type information.
use_rep – Representation used to calculate distances. This can be ‘X’ or any representation stored in .obsm. Defaults to the PCA coordinates if present (if not, the expression matrix X is used).
metric – Metric used to calculate the distance between each cell and each cell type. Can be ‘euclidean’, ‘cosine’, ‘manhattan’ or any metric applicable to sklearn.metrics.pairwise_distances(). Defaults to ‘euclidean’ if latent representations are used for calculating distances, and to ‘correlation’ if the expression matrix is used.
n_jobs – Number of CPUs used. Defaults to one CPU. -1 means all CPUs are used.
check_params – Whether to check (or set defaults for) dataset, cell_type, use_rep and metric. (Default: True)
**kwargs – Other keyword arguments passed to sklearn.metrics.pairwise_distances().

Returns:

A Distance object representing the cross-dataset cell-by-cell-type distance matrix.

Return type:

Distance
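
To make the notion of a cross-dataset cell-by-cell-type distance matrix concrete, here is a minimal NumPy sketch. It is not how from_adata is implemented: it assumes each cell type is summarised by the centroid of its cells in the chosen representation, and it uses a hand-written Euclidean distance instead of sklearn.metrics.pairwise_distances(). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 cells x 4 latent dimensions (e.g. PCA coordinates)
dataset = np.array(["D1", "D1", "D1", "D2", "D2", "D2"])
cell_type = np.array(["T", "T", "B", "T cell", "B cell", "B cell"])

# Each unique (dataset, cell_type) pair becomes one column of the matrix.
pairs = sorted(set(zip(dataset.tolist(), cell_type.tolist())))

# Assumption: a cell type is represented by the mean of its cells.
centroids = np.stack([
    X[(dataset == d) & (cell_type == c)].mean(axis=0) for d, c in pairs
])

# Euclidean distance from every cell to every cell type centroid.
dist_mat = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
```

Every cell receives a distance to every cell type from every dataset, including datasets other than its own, which is what makes the matrix cross-dataset.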

property n_cell: int: Number of cells.

property n_cell_type: int: Number of cell types.

normalize(Gaussian_kernel: bool = False, rank: bool = True, normalize: bool = True) → None

Normalize the distance matrix, optionally applying a Gaussian kernel, a rank transformation and maximum normalization.

Parameters:

Gaussian_kernel – Whether to apply the Gaussian kernel to the distance matrix. (Default: False)
rank – Whether to turn the matrix into a rank matrix. (Default: True)
normalize – Whether to maximum-normalize the distance matrix. (Default: True)

Returns:

The Distance object modified with a normalized distance matrix.

Return type:

None
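
The rank and maximum-normalization steps can be sketched as follows. This is one plausible reading only: it assumes ranks are computed within each row (per cell) and that each row is divided by its own maximum, neither of which is stated above. The Gaussian kernel step is left out because its bandwidth is not specified here. The function name normalize_sketch is hypothetical.

```python
import numpy as np


def normalize_sketch(dist_mat, rank=True, normalize=True):
    """Rank-transform and max-normalize each row of a distance matrix.

    Assumptions: ranking and scaling are done per row, and ties are broken
    by column position.
    """
    out = np.asarray(dist_mat, dtype=float).copy()
    if rank:
        # argsort applied twice yields 0-based ranks within each row; shift to 1-based.
        out = np.argsort(np.argsort(out, axis=1), axis=1).astype(float) + 1.0
    if normalize:
        # Divide each row by its maximum so values fall in (0, 1].
        out = out / out.max(axis=1, keepdims=True)
    return out


d = np.array([[0.2, 0.9, 0.5],
              [3.0, 1.0, 2.0]])
normed = normalize_sketch(d)
scaled_only = normalize_sketch(d, rank=False)
```

After ranking, rows with very different distance scales become directly comparable, which is the usual motivation for a rank transform.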

property shape: tuple: Numbers of cells and cell types.

symmetric() → bool

Check whether the distance matrix is symmetric in terms of datasets and cell types.

Returns: True or False indicating whether all datasets and cell types are included in the object (thus symmetric).
Return type: bool
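
One way to picture this check is as a comparison between the (dataset, cell type) pairs found among the cells and those found among the cell types. The sketch below follows that reading, which is an assumption rather than the library's documented logic; symmetric_sketch is a hypothetical helper.

```python
def symmetric_sketch(cell_pairs, cell_type_pairs):
    """Return True when rows and columns cover the same (dataset, cell_type) pairs."""
    return set(cell_pairs) == set(cell_type_pairs)


# (dataset, cell_type) of each cell (rows); duplicates are expected.
rows = [("D1", "T"), ("D1", "B"), ("D2", "T cell"), ("D1", "T")]
# (dataset, cell_type) of each column.
cols_full = [("D1", "T"), ("D1", "B"), ("D2", "T cell")]
cols_partial = [("D1", "T"), ("D1", "B")]

is_sym = symmetric_sketch(rows, cols_full)
is_not_sym = symmetric_sketch(rows, cols_partial)
```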

to_binary(check_symmetry: bool = True)

Turn the distance matrix into a binary matrix representing the estimated cell type membership across datasets.

Parameters: check_symmetry – Whether to check the symmetry of the distance matrix in terms of datasets and cell types. (Default: True)
Returns: A Distance object representing the estimated cell type membership across datasets.
Return type: Distance

to_confusion(D1: str, D2: str, check: bool = True) → tuple

Deprecated: use to_pairwise_confusion and to_multi_confusion instead. Extract the dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices. Note that this function is expected to be applied to a binary membership matrix.

Parameters:

D1 – Name of the first dataset.
D2 – Name of the second dataset.
check – Whether to check that the names of the two datasets are present. (Default: True)

Returns:

The dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices.

Return type:

tuple

to_meta(check_symmetry: bool = True, turn_binary: bool = False, return_symmetry: bool = True) → DataFrame

Meta-analysis of cross-dataset cell type dissimilarity or membership.

Parameters:

check_symmetry – Whether to check the symmetry of the distance matrix in terms of datasets and cell types. (Default: True)
turn_binary – Whether to turn the distance matrix into a cell type membership matrix before meta analysis. (Default: False)
return_symmetry – Whether to return a symmetric dissimilarity matrix by averaging with its transposed form. (Default: True)

Returns:

A DataFrame object representing the cell-type-level dissimilarity matrix (turn_binary = False) or membership matrix (turn_binary = True).

Return type:

DataFrame
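
The effect of return_symmetry can be shown with a short NumPy sketch. The symmetrization step (averaging with the transpose) follows the parameter description above; aggregating cells to the cell-type level by taking the mean is an assumption, and to_meta_sketch and row_group are hypothetical names.

```python
import numpy as np


def to_meta_sketch(dist_mat, row_group, n_groups, return_symmetry=True):
    """Collapse a cell-by-cell-type matrix to cell-type-by-cell-type.

    row_group[i] is the column index of cell i's own (dataset, cell_type).
    Assumption: cells are aggregated by their mean distance.
    """
    meta = np.zeros((n_groups, dist_mat.shape[1]))
    for g in range(n_groups):
        meta[g] = dist_mat[row_group == g].mean(axis=0)
    if return_symmetry:
        # Average with the transposed form; requires a square matrix.
        meta = (meta + meta.T) / 2
    return meta


dist_mat = np.array([
    [0.0, 1.0],
    [0.2, 0.8],
    [0.6, 0.0],
])
row_group = np.array([0, 0, 1])  # first two cells belong to cell type 0

meta = to_meta_sketch(dist_mat, row_group, n_groups=2)
raw = to_meta_sketch(dist_mat, row_group, n_groups=2, return_symmetry=False)
```

The averaging only works when rows and columns cover the same cell types, which is consistent with the check_symmetry option.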

to_multi_confusion(relation: DataFrame, D: str, check: bool = True) → tuple

Extract the confusion matrices between previously defined meta-cell-types and cell types from a new dataset.

Parameters:

relation – A DataFrame object representing the cell type harmonization result across multiple datasets.
D – Name of the new dataset to be aligned.
check – Whether to check that the names of the datasets are present. (Default: True)

Returns:

The confusion matrices between previously defined meta-cell-types and cell types from a new dataset.

Return type:

tuple

to_pairwise_confusion(D1: str, D2: str, check: bool = True) → tuple

Extract the dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices.

Parameters:

D1 – Name of the first dataset.
D2 – Name of the second dataset.
check – Whether to check that the names of the two datasets are present. (Default: True)

Returns:

The dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices.

Return type:

tuple
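
A pair of confusion matrices of this kind can be sketched with the standard library alone. The example assumes each matrix counts cells by their own cell type in one dataset against their assigned cell type in the other dataset; the exact contents of the matrices returned by the library may differ. The helper pairwise_confusion_sketch and the toy labels are hypothetical.

```python
from collections import Counter


def pairwise_confusion_sketch(own_labels, assigned_labels):
    """Count cells by (own cell type, cell type assigned in the other dataset)."""
    return Counter(zip(own_labels, assigned_labels))


# Cells from D1: their own labels and their assigned labels in D2.
d1_own = ["T", "T", "B"]
d1_in_d2 = ["T cell", "T cell", "B cell"]

# Cells from D2: their own labels and their assigned labels in D1.
d2_own = ["T cell", "B cell", "B cell"]
d2_in_d1 = ["T", "B", "T"]

d1_by_d2 = pairwise_confusion_sketch(d1_own, d1_in_d2)
d2_by_d1 = pairwise_confusion_sketch(d2_own, d2_in_d1)
```

The two tables need not mirror each other: a match found from D1 to D2 is not guaranteed in the reverse direction, which is why both are returned.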