Distance structure

class cellhint.distance.Distance(dist_mat: ndarray, cell: DataFrame, cell_type: DataFrame)

Bases: object

Class that deals with the cross-dataset cell-by-cell-type distance matrix.

Parameters:
  • dist_mat – Cell-by-cell-type distance matrix.

  • cell – Cell meta-information including at least ‘dataset’, ‘ID’ and ‘cell_type’.

  • cell_type – Cell type meta-information including at least ‘dataset’ and ‘cell_type’.

dist_mat

A cell-by-cell-type distance matrix.

cell

Cell meta-information including ‘dataset’, ‘ID’ and ‘cell_type’.

cell_type

Cell type meta-information including ‘dataset’ and ‘cell_type’.

n_cell

Number of cells involved.

n_cell_type

Number of cell types involved.

shape

Tuple of number of cells and cell types.

assignment

Assignment of each cell to the most similar cell type in each dataset (obtained through the assign method).

assign() → None

Assign each cell to its most similar cell type in each dataset.

Returns:

None. The object is modified in place, with the result of cell assignment stored in .assignment.

Return type:

None
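The idea behind assign can be illustrated with a minimal, library-independent sketch. It assumes that "most similar" means smallest distance and uses plain Python lists instead of the ndarray and DataFrame inputs. The helper name assign_per_dataset and the toy data are hypothetical and not part of cellhint.

```python
def assign_per_dataset(dist_mat, col_dataset, col_cell_type):
    """For every cell (row), pick the closest cell type within each dataset.

    dist_mat: list of rows, one distance per (dataset, cell type) column.
    col_dataset / col_cell_type: column annotations aligned with the columns.
    Returns one dict per cell, mapping dataset -> assigned cell type.
    """
    datasets = sorted(set(col_dataset))
    assignment = []
    for row in dist_mat:
        per_cell = {}
        for d in datasets:
            # Indices of the columns that belong to dataset d
            cols = [j for j, name in enumerate(col_dataset) if name == d]
            # Assumption: the smallest distance marks the most similar cell type
            best = min(cols, key=lambda j: row[j])
            per_cell[d] = col_cell_type[best]
        assignment.append(per_cell)
    return assignment


# Toy data: 2 cells; columns are (D1, T), (D1, B), (D2, T), (D2, B)
col_dataset = ["D1", "D1", "D2", "D2"]
col_cell_type = ["T", "B", "T", "B"]
dist_mat = [
    [0.1, 0.9, 0.2, 0.8],
    [0.7, 0.3, 0.6, 0.4],
]
result = assign_per_dataset(dist_mat, col_dataset, col_cell_type)
print(result)
```

Each cell ends up with one assigned cell type per dataset, which is the shape of information the .assignment attribute is described as holding.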

concatenate(*distances, by: str = 'cell', check: bool = False)

Concatenate by either cells (rows) or cell types (columns).

Parameters:
  • distances – A Distance object or a list of such objects.

  • by – The direction of concatenation, joining either cells (‘cell’, rows) or cell types (‘cell_type’, columns). (Default: ‘cell’)

  • check – Check whether the concatenation is feasible. (Default: False)

Returns:

A Distance object concatenated along cells (by = ‘cell’) or cell types (by = ‘cell_type’).

Return type:

Distance
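The two concatenation directions can be sketched on plain dictionaries. This is only an illustration of the row-wise versus column-wise logic and of what a feasibility check might look like; the dictionary layout ("rows", "columns", "matrix") and the function below are hypothetical stand-ins, not cellhint's implementation.

```python
def concatenate(blocks, by="cell", check=False):
    """Join distance blocks along cells (rows) or cell types (columns)."""
    first = blocks[0]
    if by == "cell":
        # Row-wise join: every block must describe the same cell type columns
        if check and any(b["columns"] != first["columns"] for b in blocks):
            raise ValueError("cell type columns differ between blocks")
        return {
            "rows": [r for b in blocks for r in b["rows"]],
            "columns": list(first["columns"]),
            "matrix": [list(row) for b in blocks for row in b["matrix"]],
        }
    if by == "cell_type":
        # Column-wise join: every block must describe the same cells
        if check and any(b["rows"] != first["rows"] for b in blocks):
            raise ValueError("cells differ between blocks")
        return {
            "rows": list(first["rows"]),
            "columns": [c for b in blocks for c in b["columns"]],
            "matrix": [
                [v for b in blocks for v in b["matrix"][i]]
                for i in range(len(first["rows"]))
            ],
        }
    raise ValueError("by must be 'cell' or 'cell_type'")


a = {"rows": ["c1"], "columns": ["T", "B"], "matrix": [[0.1, 0.9]]}
b = {"rows": ["c2"], "columns": ["T", "B"], "matrix": [[0.8, 0.2]]}
by_cell = concatenate([a, b], by="cell", check=True)

c = {"rows": ["c1"], "columns": ["NK"], "matrix": [[0.5]]}
by_cell_type = concatenate([a, c], by="cell_type", check=True)
print(by_cell["matrix"], by_cell_type["matrix"])
```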

filter_cells(check_symmetry: bool = True) → None

Filter out cells whose gene expression profiles do not correlate most with the eigen cell they belong to (i.e., correlate most with other cell types).

Parameters:

check_symmetry – Whether to check the symmetry of the distance matrix in terms of datasets and cell types. (Default: True)

Returns:

None. The Distance object is modified in place, with undesirable cells filtered out.

Return type:

None
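As a rough illustration of the filtering criterion, the sketch below keeps only cells whose closest cell type within their own dataset matches their annotated label. The record layout (with a hypothetical "assigned" field holding a per-dataset assignment) and the function name are assumptions made for this example; the library's actual criterion is stated in terms of correlation with eigen cells.

```python
def keep_consistent_cells(cells):
    """Keep cells whose closest cell type in their own dataset is their own label."""
    return [c for c in cells if c["assigned"][c["dataset"]] == c["cell_type"]]


cells = [
    {"ID": "c1", "dataset": "D1", "cell_type": "T", "assigned": {"D1": "T", "D2": "T"}},
    # c2 is annotated as B but sits closest to T in its own dataset, so it is dropped
    {"ID": "c2", "dataset": "D1", "cell_type": "B", "assigned": {"D1": "T", "D2": "B"}},
    {"ID": "c3", "dataset": "D2", "cell_type": "B", "assigned": {"D1": "B", "D2": "B"}},
]
kept = keep_consistent_cells(cells)
print([c["ID"] for c in kept])
```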

static from_adata(adata: AnnData, dataset: str, cell_type: str, use_rep: str | None = None, metric: str | None = None, n_jobs: int | None = None, check_params: bool = True, **kwargs)

Generate a Distance object from the AnnData given.

Parameters:
  • adata – An AnnData object containing different datasets/batches and cell types. In most scenarios, the format of the expression .X in the AnnData is flexible (normalized, log-normalized, z-scaled, etc.). However, when use_rep is specified as ‘X’ (or X_pca is not detected in .obsm and no other latent representations are provided), .X should be log-normalized (to a constant total count per cell).

  • dataset – Column name (key) of cell metadata specifying dataset information.

  • cell_type – Column name (key) of cell metadata specifying cell type information.

  • use_rep – Representation used to calculate distances. This can be ‘X’ or any representation stored in .obsm. Defaults to the PCA coordinates if present (if not, the expression matrix X is used).

  • metric – Metric used to calculate the distance between each cell and each cell type. Can be ‘euclidean’, ‘cosine’, ‘manhattan’ or any metric applicable to sklearn.metrics.pairwise_distances(). Defaults to ‘euclidean’ if latent representations are used for calculating distances, and to ‘correlation’ if the expression matrix is used.

  • n_jobs – Number of CPUs used. Defaults to one CPU. -1 means all CPUs are used.

  • check_params – Whether to check (or set the default) for dataset, cell_type, use_rep and metric. (Default: True)

  • **kwargs – Other keyword arguments passed to sklearn.metrics.pairwise_distances().

Returns:

A Distance object representing the cross-dataset cell-by-cell-type distance matrix.

Return type:

Distance
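To make the notion of a cell-by-cell-type distance matrix concrete, here is a minimal sketch that does not use cellhint or AnnData. It assumes each (dataset, cell type) group is summarized by the mean of its cells in a latent space and that the Euclidean metric is used; how the library actually represents a cell type is not stated here, so the centroid choice and the function name are hypothetical.

```python
import math


def cell_by_cell_type_distances(embedding, dataset, cell_type):
    """Return column keys and a cell-by-cell-type Euclidean distance matrix."""
    # Group cell coordinates by their (dataset, cell type) annotation
    groups = {}
    for vec, d, t in zip(embedding, dataset, cell_type):
        groups.setdefault((d, t), []).append(vec)
    keys = sorted(groups)
    # Assumption: each cell type is represented by the centroid of its cells
    centroids = {
        k: [sum(col) / len(col) for col in zip(*groups[k])] for k in keys
    }
    dist = [[math.dist(vec, centroids[k]) for k in keys] for vec in embedding]
    return keys, dist


# Toy latent coordinates for 4 cells from 2 datasets
embedding = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
dataset = ["D1", "D1", "D2", "D2"]
cell_type = ["T", "T", "B", "B"]
keys, dist = cell_by_cell_type_distances(embedding, dataset, cell_type)
print(keys)
print(dist[0])
```

Rows correspond to cells and columns to (dataset, cell type) pairs, matching the layout described for dist_mat.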

property n_cell: int

Number of cells.

property n_cell_type: int

Number of cell types.

normalize(Gaussian_kernel: bool = False, rank: bool = True, normalize: bool = True) → None

Normalize the distance matrix by optionally applying a Gaussian kernel, a rank transformation and maximum normalization.

Parameters:
  • Gaussian_kernel – Whether to apply the Gaussian kernel to the distance matrix. (Default: False)

  • rank – Whether to turn the matrix into a rank matrix. (Default: True)

  • normalize – Whether to maximum-normalize the distance matrix. (Default: True)

Returns:

The Distance object modified with a normalized distance matrix.

Return type:

None
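The default combination (rank = True, normalize = True) can be pictured with the following sketch. It assumes that ranking and maximum normalization are applied per cell (row) and that ties receive ordinal ranks; neither detail is specified above, so treat this as one plausible reading rather than the library's behavior. The helper names are hypothetical.

```python
def rank_row(row):
    """Replace each value by its rank within the row (1 = smallest distance)."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0] * len(row)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def max_normalize_row(row):
    """Divide every value by the row maximum so the largest value becomes 1."""
    m = max(row)
    return [v / m for v in row] if m else list(row)


dist_mat = [
    [0.5, 0.1, 0.9],
    [2.0, 4.0, 1.0],
]
ranked = [rank_row(r) for r in dist_mat]
normalized = [max_normalize_row(r) for r in ranked]
print(ranked)
print(normalized)
```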

property shape: tuple

Numbers of cells and cell types.

symmetric() → bool

Check whether the distance matrix is symmetric in terms of datasets and cell types.

Returns:

True or False indicating whether all datasets and cell types are included in the object (thus symmetric).

Return type:

bool
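One way to read this check: the (dataset, cell type) pairs found among the cells should be exactly the pairs that make up the columns. The sketch below encodes that reading with plain sets; the function name and inputs are hypothetical, and the library may apply a different or stricter test.

```python
def is_symmetric(cell_pairs, cell_type_pairs):
    """True if rows and columns cover the same (dataset, cell type) pairs."""
    return set(cell_pairs) == set(cell_type_pairs)


# (dataset, cell type) annotation of each cell (rows)
cell_pairs = [("D1", "T"), ("D1", "B"), ("D2", "T"), ("D2", "T")]
# (dataset, cell type) annotation of each column
full_columns = [("D1", "T"), ("D1", "B"), ("D2", "T")]
partial_columns = [("D1", "T"), ("D1", "B")]

print(is_symmetric(cell_pairs, full_columns))
print(is_symmetric(cell_pairs, partial_columns))
```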

to_binary(check_symmetry: bool = True)

Turn the distance matrix into a binary matrix representing the estimated cell type membership across datasets.

Parameters:

check_symmetry – Whether to check the symmetry of the distance matrix in terms of datasets and cell types. (Default: True)

Returns:

A Distance object representing the estimated cell type membership across datasets.

Return type:

Distance

to_confusion(D1: str, D2: str, check: bool = True) → tuple

This function is deprecated. Use to_pairwise_confusion and to_multi_confusion instead. Extract the dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices. Note this function is expected to be applied to a binary membership matrix.

Parameters:
  • D1 – Name of the first dataset.

  • D2 – Name of the second dataset.

  • check – Whether to check that the names of the two datasets are contained in the object. (Default: True)

Returns:

The dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices.

Return type:

tuple

to_meta(check_symmetry: bool = True, turn_binary: bool = False, return_symmetry: bool = True) → DataFrame

Meta-analysis of cross-dataset cell type dissimilarity or membership.

Parameters:
  • check_symmetry – Whether to check the symmetry of the distance matrix in terms of datasets and cell types. (Default: True)

  • turn_binary – Whether to turn the distance matrix into a cell type membership matrix before meta analysis. (Default: False)

  • return_symmetry – Whether to return a symmetric dissimilarity matrix by averaging with its transposed form. (Default: True)

Returns:

A DataFrame object representing the cell-type-level dissimilarity matrix (turn_binary = False) or membership matrix (turn_binary = True).

Return type:

DataFrame
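The step from a cell-level matrix to a cell-type-level matrix, followed by the symmetrization described for return_symmetry, can be sketched as below. Averaging the rows of cells that share a label is an assumption made for this example; only the averaging with the transposed form comes from the parameter description. Function names and toy values are hypothetical.

```python
def to_cell_type_level(dist_mat, row_labels, col_labels):
    """Average the rows of cells sharing a label; rows follow col_labels order."""
    meta = []
    for label in col_labels:
        rows = [r for r, l in zip(dist_mat, row_labels) if l == label]
        meta.append([sum(col) / len(rows) for col in zip(*rows)])
    return meta


def symmetrize(m):
    """Average a square matrix with its transpose."""
    n = len(m)
    return [[(m[i][j] + m[j][i]) / 2 for j in range(n)] for i in range(n)]


col_labels = [("D1", "T"), ("D2", "T")]
row_labels = [("D1", "T"), ("D1", "T"), ("D2", "T")]
dist_mat = [
    [0.0, 0.4],
    [0.2, 0.6],
    [0.3, 0.1],
]
meta = to_cell_type_level(dist_mat, row_labels, col_labels)
sym = symmetrize(meta)
print(meta)
print(sym)
```

The sketch requires every column label to have at least one matching cell, which is the situation the symmetry check is meant to guarantee.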

to_multi_confusion(relation: DataFrame, D: str, check: bool = True) → tuple

Extract the confusion matrices between previously defined meta-cell-types and cell types from a new dataset.

Parameters:
  • relation – A DataFrame object representing the cell type harmonization result across multiple datasets.

  • D – Name of the new dataset to be aligned.

  • check – Whether to check that the names of the datasets are contained in the object. (Default: True)

Returns:

The confusion matrices between previously defined meta-cell-types and cell types from a new dataset.

Return type:

tuple

to_pairwise_confusion(D1: str, D2: str, check: bool = True) → tuple

Extract the dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices.

Parameters:
  • D1 – Name of the first dataset.

  • D2 – Name of the second dataset.

  • check – Whether to check that the names of the two datasets are contained in the object. (Default: True)

Returns:

The dataset1-by-dataset2 and dataset2-by-dataset1 confusion matrices.

Return type:

tuple
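A pair of confusion matrices of this kind can be pictured as two cross-tabulations: cells from the first dataset counted by their own label versus the label assigned to them in the second dataset, and the reverse. The sketch below uses a Counter keyed by (own cell type, assigned cell type); the record layout with a hypothetical "assigned" field and the function name are assumptions for illustration only.

```python
from collections import Counter


def pairwise_confusion(cells, d1, d2):
    """Count (own cell type, assigned cell type) pairs in both directions."""
    d1_by_d2 = Counter()
    d2_by_d1 = Counter()
    for c in cells:
        if c["dataset"] == d1:
            # Cells of d1: own label versus the label assigned in d2
            d1_by_d2[(c["cell_type"], c["assigned"][d2])] += 1
        elif c["dataset"] == d2:
            # Cells of d2: own label versus the label assigned in d1
            d2_by_d1[(c["cell_type"], c["assigned"][d1])] += 1
    return d1_by_d2, d2_by_d1


cells = [
    {"dataset": "D1", "cell_type": "T", "assigned": {"D1": "T", "D2": "T"}},
    {"dataset": "D1", "cell_type": "T", "assigned": {"D1": "T", "D2": "NK"}},
    {"dataset": "D2", "cell_type": "NK", "assigned": {"D1": "T", "D2": "NK"}},
]
d1_by_d2, d2_by_d1 = pairwise_confusion(cells, "D1", "D2")
print(dict(d1_by_d2))
print(dict(d2_by_d1))
```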