New structure of tree builders/splitter/criterion
So right now what I've come up with is this:
A new criterion (only for MSE) which will compute the stats for all the nodes. The criterion design will be kept as similar as possible to the old one.
Two important data structures:

- `node_stats` (numpy double array)

  The previous criterion acted on a single node and hence maintained the data structures `sum_total`, `sum_right` and `sum_left` to help compute the MSE for each node as the samples are updated for that node. All 3 arrays were 1D (for MSE) with shape `(n_outputs,)`.

  The difference now is that `sum_total` will be 2D (for MSE) with shape `(n_outputs, n_nodes)`. Another difference is that all those arrays will not be owned (allocated and cleared) by the criterion class but by the tree builder, as a single numpy double array `node_stats`...
- `split_records` (array of structs / numpy array of structs)

  Like the previous `SplitRecord` struct but with more node-specific information added to it. (Previously `best_split` and `cur_split` were used by the splitter to store the best threshold / position and impurity statistics for the node, so the node-specific information was bound to the splitter.) Now, `split_records` will be an array of structs. Each struct is an extended `SplitRecord`-like structure which will store the best / current split and node-specific information like start / stop for that node. Note that this array contains split/node information for all those nodes that are not leaves at the current depth... We will have a `node_id_to_split_idx` map which will help us figure out which struct of the array belongs to which node (both structures are sketched right after this list).
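To make the layout concrete, here is a minimal numpy sketch of the two structures. The shapes follow the description above, but the `SPLIT_RECORD_DTYPE` field names and the `-1` sentinel in `node_id_to_split_idx` are illustrative assumptions, not the actual scikit-learn structs.

```python
import numpy as np

n_outputs, n_nodes = 2, 8  # toy sizes for illustration

# node_stats: per-node sum statistics, owned by the tree builder rather than
# the criterion. For MSE this replaces the old 1D sum_total of shape (n_outputs,).
node_stats = np.zeros((n_outputs, n_nodes), dtype=np.float64)

# split_records: one extended SplitRecord-like struct per non-leaf node at the
# current depth. Field names here are assumptions for illustration only.
SPLIT_RECORD_DTYPE = np.dtype([
    ('node_id', np.intp),       # node this record belongs to
    ('start', np.intp),         # node's range in the sample index array
    ('end', np.intp),
    ('feature', np.intp),       # best split found so far for this node
    ('pos', np.intp),
    ('threshold', np.float64),
    ('improvement', np.float64),
])
split_records = np.zeros(n_nodes, dtype=SPLIT_RECORD_DTYPE)

# node_id_to_split_idx: plain integer array for O(1) lookup from a node id to
# that node's slot in split_records (-1 marks nodes that are already leaves).
node_id_to_split_idx = np.full(n_nodes, -1, dtype=np.intp)
```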
Essentially, what I am trying to do with `node_stats` and `split_records` is decouple the node-specific information from the criterion / splitter and store it in these arrays, so the criterion / splitting function can use these arrays to compute the best split for all nodes at the current depth.
For each feature, one pass is done over the sorted data (using `X_idx_sorted`). We maintain a `sample_to_node_id` map, alongside our `node_id_to_split_idx` map (both are not maps exactly but simply `size_t*` arrays to make the lookup O(1))... These two maps will help the criterion decide which sample belongs to which node / leaf, and using that information it will update the corresponding node's split record (`split_records[node_id_to_split_idx[sample_to_node_id[sample_i]]]`).

Hence, in one pass, we update all the nodes at the current level (for one feature). To compute the best split amongst all features, if this is done serially, it has to make `max_features` passes over the data.
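A rough Python sketch of that single-feature pass (the real code would be Cython); the function name and signature are hypothetical, and the actual impurity bookkeeping is elided:

```python
def update_splits_for_feature(feature, X_idx_sorted, sample_to_node_id,
                              node_id_to_split_idx, node_stats, split_records):
    """Hypothetical single pass over one feature, updating every active node.

    Each sample, visited in the sorted order of `feature`, is routed to the
    split record of the node it currently sits in via the two lookup arrays.
    """
    for sample_i in X_idx_sorted[:, feature]:
        node_id = sample_to_node_id[sample_i]
        split_idx = node_id_to_split_idx[node_id]
        if split_idx < 0:
            continue  # this sample's node is already a leaf at the current depth
        record = split_records[split_idx]
        # ... move sample_i from the right to the left side of this node's
        # candidate split, update node_stats[:, node_id], and overwrite `record`
        # with the new position / threshold if the improvement beats the best
        # one seen so far for this node ...
```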
To make it parallel, we create a helper which will take `X_idx_sorted`, `node_stats`, `split_records` etc... and compute the best split for batches of `max_features / n_jobs` features per thread.
Now the catch is that `node_stats` cannot be a Cython variable if we are to make it joblib compatible, so I'll make it a numpy struct array (like how the tree's node and value attribute arrays are accessed right now).
These per-thread results will be compared and combined into one best split per node at the current depth...
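A rough joblib sketch of that fan-out / combine step, reusing the hypothetical names from the sketches above (`update_splits_for_feature`, the `improvement` field); how the feature batches are formed and how the per-thread records are merged are my assumptions about what the combine step could look like:

```python
import numpy as np
from joblib import Parallel, delayed

def best_splits_for_batch(feature_batch, X_idx_sorted, sample_to_node_id,
                          node_id_to_split_idx, node_stats, split_records):
    """Per-thread helper: run the single-feature pass for a batch of features
    on a private copy of the split records and return that copy."""
    records = split_records.copy()
    for feature in feature_batch:
        update_splits_for_feature(feature, X_idx_sorted, sample_to_node_id,
                                  node_id_to_split_idx, node_stats, records)
    return records

def find_best_splits(features, n_jobs, **shared_arrays):
    # roughly max_features / n_jobs features per worker
    batches = np.array_split(np.asarray(features), n_jobs)
    per_batch = Parallel(n_jobs=n_jobs)(
        delayed(best_splits_for_batch)(batch, **shared_arrays) for batch in batches
    )
    # combine: for every node, keep the record with the largest improvement
    best = per_batch[0].copy()
    for records in per_batch[1:]:
        better = records['improvement'] > best['improvement']
        best[better] = records[better]
    return best
```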