Tree structure

class cellhint.pct.PredictiveClusteringTree(*, max_depth: int | None = None, min_samples_split: int | float = 20, min_samples_leaf: int | float = 10, min_weight_fraction_leaf: float = 0.0, random_state: int | None = None, max_leaf_nodes: int | None = None, F_test_prune: bool = True, p_thres: float = 0.05)

Bases: DecisionTreeRegressor

Class that uses predictive clustering trees (PCT) for multi-output prediction. Note that this is a specialized PCT whose prototype is the mean vector and whose distance measure is the (summed) intra-cluster variance. Such a PCT is equivalent to a CART regression tree (with specialized parameters and post-pruning).

Parameters:
  • max_depth – Maximum possible depth of the tree, starting from the root node, which has a depth of 0. Defaults to no limit.

  • min_samples_split – The minimum sample size (in absolute number or fraction) of a possible node. (Default: 20)

  • min_samples_leaf – The minimum sample size (in absolute number or fraction) of a possible leaf. (Default: 10)

  • min_weight_fraction_leaf – The minimum fraction out of total sample weights for a possible leaf. (Default: 0.0)

  • random_state – Random seed for column (feature) shuffling before selecting the best feature and threshold.

  • max_leaf_nodes – The maximum number of leaves, achieved by keeping high-quality (i.e., high impurity reduction) nodes. Defaults to no limit.

  • F_test_prune – Whether to use an F-test to prune the tree by removing unnecessary splits. (Default: True)

  • p_thres – p-value threshold for pruning nodes after F-test. (Default: 0.05)
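The int | float convention of min_samples_split and min_samples_leaf can be illustrated with a small sketch. It assumes that a fraction is resolved the way the scikit-learn base class documents it, by taking the ceiling of fraction * n_samples; the helper name resolve_min_samples is hypothetical and not part of cellhint.

```python
import math


def resolve_min_samples(value, n_samples):
    """Turn an absolute count (int) or a fraction (float) into a sample count."""
    if isinstance(value, int):
        return value
    # Assumption: fractions are rounded up, as in scikit-learn's tree estimators.
    return math.ceil(value * n_samples)


print(resolve_min_samples(20, 200))   # absolute count, used as given
print(resolve_min_samples(0.25, 50))  # fraction of 50 samples, rounded up
```

With 50 training samples, a fraction of 0.25 therefore corresponds to 13 samples under this assumption.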

n_features_in_

Number of features.

n_outputs_

Number of outputs.

tree_

A Tree object structured by parallel arrays.

p_value

F-test-based p-value for each node or leaf in the tree.

F_test() → None

F-test for each internal node. For each node, the corresponding F distribution has n_output * (n_sample - 1) and n_output * (n_sample - 2) degrees of freedom, and the test statistic (q) is node_impurity * n_sample / (n_sample - 1) divided by (left_child_impurity * left_n_sample + right_child_impurity * right_n_sample) / (n_sample - 2).

Returns:

Modified tree with F-test p-values. Leaves are always assigned a p-value of 1.

Return type:

None
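The statistic described for F_test can be written out as a short sketch. The function names and the example numbers are hypothetical, and deriving the p-value as the upper tail of the F distribution (via scipy.stats.f.sf) is an assumption about the implementation rather than something this page states.

```python
def f_statistic(node_impurity, n_sample,
                left_impurity, left_n_sample,
                right_impurity, right_n_sample,
                n_output):
    """Return (q, dfn, dfd) for one internal node, following the F_test description."""
    numerator = node_impurity * n_sample / (n_sample - 1)
    denominator = (left_impurity * left_n_sample
                   + right_impurity * right_n_sample) / (n_sample - 2)
    q = numerator / denominator
    dfn = n_output * (n_sample - 1)  # first degrees of freedom
    dfd = n_output * (n_sample - 2)  # second degrees of freedom
    return q, dfn, dfd


def f_test_p_value(q, dfn, dfd):
    """Upper-tail probability of q (assumption: this is how the p-value is obtained)."""
    from scipy.stats import f  # imported lazily; SciPy is only needed for this step
    return f.sf(q, dfn, dfd)


# Made-up node: 10 samples, 3 outputs, split into children of 4 and 6 samples.
q, dfn, dfd = f_statistic(2.0, 10, 0.5, 4, 1.0, 6, n_output=3)
print(q, dfn, dfd)
```

A large q means the split reduces the pooled within-child variance substantially compared with the variance of the parent node.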

fit(X, y, sample_weight=None) → None

Fit a PCT with the training dataset.

Parameters:
  • X – Sample-by-feature array-like matrix.

  • y – Sample-by-output array-like matrix.

  • sample_weight – Sample weights. Defaults to equal sample weights.

Returns:

Fitted and (possibly) pruned tree.

Return type:

None

is_node(index: int) → bool

Check whether a given index is an internal node (as opposed to a leaf).

Parameters:

index – Index of the node/leaf in the arrays of the tree structure.

Returns:

True or False indicating whether the given index is an internal node.

Return type:

bool
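As a rough illustration of the parallel-array layout behind tree_, the sketch below uses plain lists in place of the real Tree object. It relies on the scikit-learn convention that a leaf has -1 as both child indices; the function name is_internal is hypothetical, and whether is_node performs exactly this check is an assumption.

```python
TREE_LEAF = -1  # scikit-learn marks a missing child with -1

# Entry i of each list describes node i:
# node 0 is the root, node 2 is internal, nodes 1, 3 and 4 are leaves.
children_left = [1, TREE_LEAF, 3, TREE_LEAF, TREE_LEAF]
children_right = [2, TREE_LEAF, 4, TREE_LEAF, TREE_LEAF]


def is_internal(children_left, index):
    """Return True if the entry at `index` has children, i.e. is not a leaf."""
    return children_left[index] != TREE_LEAF


flags = [is_internal(children_left, i) for i in range(len(children_left))]
print(flags)
```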

prune_node(index: int) → None

Prune all descendants of a given node. This node then becomes a leaf.

Parameters:

index – Index of the node/leaf in the tree structure.

Returns:

Modified tree with all descendants of the given node pruned.

Return type:

None
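One way to picture this operation on a parallel-array tree is to reset the child pointers of the node and of everything below it. This is a minimal sketch on plain lists, assuming the scikit-learn convention of -1 for a missing child; prune_below is a hypothetical name and the actual implementation may differ.

```python
TREE_LEAF = -1  # scikit-learn marks a missing child with -1


def prune_below(children_left, children_right, index):
    """Detach every descendant of `index` in place, so that `index` becomes a leaf."""
    stack = [index]
    while stack:
        node = stack.pop()
        left, right = children_left[node], children_right[node]
        if left != TREE_LEAF:
            stack.extend([left, right])  # visit the subtree before forgetting it
        children_left[node] = TREE_LEAF
        children_right[node] = TREE_LEAF


children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
prune_below(children_left, children_right, 2)
print(children_left, children_right)
```

In this sketch the entries for nodes 3 and 4 stay in the arrays but can no longer be reached from the root.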

prune_tree(p_thres: float = 0.05) → None

Prune a tree based on F-test p-values.

Parameters:

p_thres – p-value threshold to prune nodes. (Default: 0.05)

Returns:

Modified tree with unnecessary splits removed.

Return type:

None
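Threshold-based pruning can be sketched on the same kind of parallel arrays. Two points here are assumptions, not documented behavior: that a split is dropped when its p-value is greater than p_thres, and that the tree is walked from the root downward. The name prune_by_p_value and the p-values are made up for illustration.

```python
TREE_LEAF = -1  # scikit-learn marks a missing child with -1


def prune_by_p_value(children_left, children_right, p_value, p_thres=0.05):
    """Top-down sketch: collapse internal nodes whose split is not significant."""
    stack = [0]  # start at the root
    while stack:
        node = stack.pop()
        if children_left[node] == TREE_LEAF:
            continue  # already a leaf, nothing to prune
        if p_value[node] > p_thres:
            # Assumption: a p-value above the threshold means the split is removed.
            children_left[node] = TREE_LEAF
            children_right[node] = TREE_LEAF
        else:
            stack.extend([children_left[node], children_right[node]])


children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
p_value = [0.001, 1.0, 0.40, 1.0, 1.0]  # leaves carry a p-value of 1
prune_by_p_value(children_left, children_right, p_value)
print(children_left, children_right)
```

Here the root split (p = 0.001) is kept, while the split at node 2 (p = 0.40) is removed and node 2 becomes a leaf.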

score(X, y, sample_weight=None) → float

Calculate the coefficient of determination (R2) between the predictions and the truth. Unlike the usual multi-output setting, where R2 is calculated separately for each output and then averaged across outputs, the score here treats each sample vector as a single ‘real’ sample.

Parameters:
  • X – Sample-by-feature query matrix.

  • y – Sample-by-output truth matrix.

  • sample_weight – Sample weights applied to squared distance of each sample.

Returns:

Coefficient of determination.

Return type:

float
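The idea of scoring whole sample vectors can be sketched in plain Python. The formula below, one minus the weighted sum of squared distances to the predictions divided by the weighted sum of squared distances to the weighted mean vector, is an assumption based on the description of score; vector_r2 is a hypothetical helper, not a cellhint function.

```python
def vector_r2(y_true, y_pred, sample_weight=None):
    """R2 in which every row (sample vector) counts as one observation."""
    n = len(y_true)
    w = list(sample_weight) if sample_weight is not None else [1.0] * n
    total_w = sum(w)
    n_out = len(y_true[0])

    # Weighted mean vector of the truth, used as the baseline prediction.
    mean = [sum(wi * row[j] for wi, row in zip(w, y_true)) / total_w
            for j in range(n_out)]

    def sq_dist(a, b):
        """Squared Euclidean distance between two vectors."""
        return sum((x - y) ** 2 for x, y in zip(a, b))

    ss_res = sum(wi * sq_dist(t, p) for wi, t, p in zip(w, y_true, y_pred))
    ss_tot = sum(wi * sq_dist(t, mean) for wi, t in zip(w, y_true))
    return 1.0 - ss_res / ss_tot


y_true = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
y_pred = [[0.0, 0.0], [2.0, 2.0], [3.0, 3.0]]
r2 = vector_r2(y_true, y_pred)
print(r2)
```

Because the squared errors of all outputs are pooled before the ratio is taken, outputs with larger variance weigh more heavily in this score than they would in a per-output average.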