Tree structure

class cellhint.pct.PredictiveClusteringTree(*, max_depth: int | None = None, min_samples_split: int | float = 20, min_samples_leaf: int | float = 10, min_weight_fraction_leaf: float = 0.0, random_state: int | None = None, max_leaf_nodes: int | None = None, F_test_prune: bool = True, p_thres: float = 0.05)

Bases: DecisionTreeRegressor

Class that uses a predictive clustering tree (PCT) for multi-output prediction. Note that this is a specialized PCT whose prototype is the mean vector and whose distance measure is the (sum of) intra-cluster variance. Such a PCT is equivalent to a CART regression tree (with specialized parameters and post-pruning).

Parameters:
  • max_depth – Maximum possible depth of the tree, starting from the root node, which has a depth of 0. Defaults to no limit.

  • min_samples_split – The minimum sample size (in absolute number or fraction) of a possible node. (Default: 20)

  • min_samples_leaf – The minimum sample size (in absolute number or fraction) of a possible leaf. (Default: 10)

  • min_weight_fraction_leaf – The minimum fraction out of total sample weights for a possible leaf. (Default: 0.0)

  • random_state – Random seed for column (feature) shuffling before selecting the best feature and threshold.

  • max_leaf_nodes – The maximum number of leaves, achieved by keeping high-quality (i.e., high impurity reduction) nodes. Defaults to no limit.

  • F_test_prune – Whether to use an F-test to prune the tree by removing unnecessary splits. (Default: True)

  • p_thres – p-value threshold for pruning nodes after F-test. (Default: 0.05)

n_features_in_

Number of features.

n_outputs_

Number of outputs.

tree_

A Tree object structured by parallel arrays.

p_value

F-test-based p-value for each node or leaf in the tree.

F_test() → None

Perform an F-test on each internal node. For each node, the corresponding F distribution has n_output * (n_sample - 1) and n_output * (n_sample - 2) degrees of freedom, and the test statistic (q) is node_impurity * n_sample / (n_sample - 1) divided by (left_child_impurity * left_n_sample + right_child_impurity * right_n_sample) / (n_sample - 2).

Returns:

Modified tree with F-test p-values. Leaves are always assigned a p-value of 1.

Return type:

None
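The statistic described for F_test can be worked through by hand. The following is a minimal sketch of that arithmetic only, not the library's implementation; the function name `f_test_statistic` and the example numbers are hypothetical.

```python
def f_test_statistic(node_impurity, n_sample,
                     left_impurity, left_n_sample,
                     right_impurity, right_n_sample,
                     n_output):
    """Return (q, dfn, dfd) for one split, following the documented formula."""
    # Variance estimate before the split.
    numerator = node_impurity * n_sample / (n_sample - 1)
    # Pooled variance estimate after the split.
    denominator = (left_impurity * left_n_sample
                   + right_impurity * right_n_sample) / (n_sample - 2)
    q = numerator / denominator
    dfn = n_output * (n_sample - 1)  # first degrees of freedom
    dfd = n_output * (n_sample - 2)  # second degrees of freedom
    return q, dfn, dfd


# Hypothetical node: 10 samples, 3 outputs, split into 4 and 6 samples.
q, dfn, dfd = f_test_statistic(4.0, 10, 1.0, 4, 2.0, 6, n_output=3)
print(q, dfn, dfd)
```

Given q and the two degrees of freedom, a p-value could then be read from the upper tail of the F distribution, for example with `scipy.stats.f.sf(q, dfn, dfd)`. That this is the tail the library uses is an assumption.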

fit(X, y, sample_weight=None) → None

Fit a PCT with the training dataset.

Parameters:
  • X – Sample-by-feature array-like matrix.

  • y – Sample-by-output array-like matrix.

  • sample_weight – Sample weights. Defaults to equal sample weights.

Returns:

Fitted and (possibly) pruned tree.

Return type:

None

is_node(index: int) → bool

Check whether a given index is an internal node.

Parameters:

index – Index of the node/leaf in the arrays of the tree structure.

Returns:

True or False indicating whether the given index is an internal node.

Return type:

bool
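To illustrate the internal-node check on a tree stored as parallel arrays, here is a small sketch. It assumes the scikit-learn layout, in which `children_left` and `children_right` hold child indices and -1 marks a missing child. The helper `is_internal` is hypothetical and only mirrors the idea behind is_node.

```python
TREE_LEAF = -1  # sentinel scikit-learn uses for "no child"

# Toy tree: root 0 splits into 1 and 4; node 1 splits into leaves 2 and 3.
children_left = [1, 2, TREE_LEAF, TREE_LEAF, TREE_LEAF]
children_right = [4, 3, TREE_LEAF, TREE_LEAF, TREE_LEAF]


def is_internal(children_left, index):
    """Return True if the node at `index` has children, False for a leaf."""
    return children_left[index] != TREE_LEAF


flags = [is_internal(children_left, i) for i in range(len(children_left))]
print(flags)
```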

prune_node(index: int) → None

Prune all descendants of a given node. This node will become a leaf.

Parameters:

index – Index of the node/leaf in the tree structure.

Returns:

Modified tree with all descendants of a given node pruned.

Return type:

None
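One common way to turn a node into a leaf in a parallel-array tree is to overwrite the child pointers of that node and of everything below it with the leaf sentinel. The sketch below shows that technique on plain Python lists. It assumes the scikit-learn convention of -1 for a missing child; the function name `prune_below` is hypothetical, and whether prune_node works exactly this way is not confirmed by this page.

```python
TREE_LEAF = -1  # sentinel for "no child"


def prune_below(children_left, children_right, index):
    """Detach every descendant of `index` in place, so `index` becomes a leaf."""
    stack = [index]
    while stack:
        node = stack.pop()
        left, right = children_left[node], children_right[node]
        if left != TREE_LEAF:
            # Visit the children so their own pointers are cleared too.
            stack.extend((left, right))
        children_left[node] = TREE_LEAF
        children_right[node] = TREE_LEAF


# Toy tree: 0 -> (1, 4), 1 -> (2, 3), 4 -> (5, 6).
children_left = [1, 2, -1, -1, 5, -1, -1]
children_right = [4, 3, -1, -1, 6, -1, -1]
prune_below(children_left, children_right, 1)
print(children_left, children_right)
```

In this sketch the arrays keep their length: the pruned slots (2 and 3) still exist but are no longer reachable from the root.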

prune_tree(p_thres: float = 0.05) → None

Prune a tree based on F-test p-values.

Parameters:

p_thres – p-value threshold to prune nodes. (Default: 0.05)

Returns:

Modified tree with unnecessary splits removed.

Return type:

None
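Threshold-based pruning can be sketched as a top-down walk that collapses any split whose p-value does not fall below the threshold. This is an illustration under assumptions, not the library's code: the traversal order, the strict `<` comparison, and the name `prune_by_p_value` are all hypothetical.

```python
TREE_LEAF = -1  # sentinel for "no child"


def prune_by_p_value(children_left, children_right, p_value, p_thres=0.05):
    """Collapse, top-down, every split whose p-value is not below p_thres.

    Returns the indices of the splits that were kept.
    """
    kept = []
    stack = [0]  # start at the root
    while stack:
        node = stack.pop()
        if children_left[node] == TREE_LEAF:
            continue  # leaves have nothing to prune
        left, right = children_left[node], children_right[node]
        if p_value[node] < p_thres:
            kept.append(node)
            stack.extend((right, left))  # left child is visited first
        else:
            # Cut here: the node becomes a leaf and its subtree is unreachable.
            children_left[node] = TREE_LEAF
            children_right[node] = TREE_LEAF
    return kept


# Toy tree: 0 -> (1, 4), 1 -> (2, 3), 4 -> (5, 6); leaves carry p-value 1.
children_left = [1, 2, -1, -1, 5, -1, -1]
children_right = [4, 3, -1, -1, 6, -1, -1]
p_value = [0.001, 0.30, 1.0, 1.0, 0.01, 1.0, 1.0]
kept = prune_by_p_value(children_left, children_right, p_value, p_thres=0.05)
print(kept, children_left)
```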

score(X, y, sample_weight=None) → float

Calculate the coefficient of determination between the prediction and the truth. Unlike the usual multi-output setting, where R2 is calculated separately for each output and then averaged across outputs, the score here treats each sample vector as a single ‘real’ sample.

Parameters:
  • X – Sample-by-feature query matrix.

  • y – Sample-by-output truth matrix.

  • sample_weight – Sample weights applied to the squared distance of each sample.

Returns:

Coefficient of determination.

Return type:

float
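The difference between the two definitions of R2 shows up on a small example. The sketch below is one reading of the description above and not the library's code: `vector_r2` pools squared Euclidean distances over whole sample vectors, while `averaged_r2` computes R2 per output and averages. Both names are hypothetical, and using the weighted mean of y as the reference point is an assumption.

```python
def vector_r2(y_true, y_pred, sample_weight=None):
    """R2 where each sample vector counts as one observation."""
    n, n_out = len(y_true), len(y_true[0])
    w = [1.0] * n if sample_weight is None else list(sample_weight)
    total_w = sum(w)
    # Weighted mean vector of the truth (assumed reference point).
    mean = [sum(w[i] * y_true[i][j] for i in range(n)) / total_w
            for j in range(n_out)]
    ss_res = sum(w[i] * sum((y_true[i][j] - y_pred[i][j]) ** 2
                            for j in range(n_out)) for i in range(n))
    ss_tot = sum(w[i] * sum((y_true[i][j] - mean[j]) ** 2
                            for j in range(n_out)) for i in range(n))
    return 1.0 - ss_res / ss_tot


def averaged_r2(y_true, y_pred):
    """Unweighted R2 computed per output, then averaged across outputs."""
    n_out = len(y_true[0])
    scores = []
    for j in range(n_out):
        col_true = [[row[j]] for row in y_true]
        col_pred = [[row[j]] for row in y_pred]
        scores.append(vector_r2(col_true, col_pred))
    return sum(scores) / n_out


y_true = [[0.0, 0.0], [2.0, 10.0], [4.0, 20.0]]
y_pred = [[1.0, 0.0], [2.0, 10.0], [3.0, 20.0]]
pooled = vector_r2(y_true, y_pred)
averaged = averaged_r2(y_true, y_pred)
print(pooled, averaged)
```

In this toy data the second output has a much larger spread and is predicted perfectly, so it dominates the pooled score, whereas the averaged score weights both outputs equally.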

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → PredictiveClusteringTree

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, check_input: bool | None | str = '$UNCHANGED$') → PredictiveClusteringTree

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

check_input (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for check_input parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → PredictiveClusteringTree

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object