Random Forest Super Greedy Trees

Grow a forest of Super Greedy Trees (SGTs) using lasso.

rfsgt(formula,
      data,
      ntree = 100,
      hcut = 1,
      treesize = NULL,
      nodesize = NULL,
      tune.treesize = FALSE,
      filter = (hcut > 1),
      keep.only = NULL,
      fast = TRUE,
      pure.lasso = FALSE,
      eps = .005,
      maxit = 500,
      nfolds = 10,
      block.size = 10,
      bootstrap = c("by.root", "none", "by.user"),
      samptype = c("swor", "swr"),
      samp = NULL,
      membership = TRUE,
      sampsize = if (samptype == "swor") function(x){x * .632} else function(x){x},
      seed = NULL,
      do.trace = FALSE,
      ...)

Arguments

formula: Formula describing the model to be fit.
data: Data frame containing response and features.
ntree: Number of trees to grow.
hcut: Integer value indexing type of parametric regression model to use for splitting. See details below.
treesize: Function specifying the size of the tree (number of terminal nodes), whose first input is the sample size n and whose second input is hcut. Can also be supplied as an integer value; defaults to an internal function if unspecified. If tune.treesize=TRUE, this is the maximum number of allowable splits. If tune.treesize=FALSE, this is the target number of tree splits.
nodesize: Minimum size of terminal node. Set internally if not specified.
tune.treesize: Adaptively determine the optimal tree size using out-of-bag empirical risk?
filter: Logical value specifying whether dimension reduction (filtering) of features should be performed. Can also be specified using the helper function tune.hcut, which performs dimension reduction prior to fitting. See examples below.
keep.only: Character vector specifying the features of interest. The data is pre-filtered to keep only these requested variables. Ignored if filter is specified using tune.hcut.
fast: Use fast filtering?
pure.lasso: Logical value specifying whether lasso splitting should be strictly adhered to. In general, lasso splits are replaced with CART splits whenever numerical instability occurs (for example, small node sample sizes may make it impossible to obtain the cross-validated lasso parameter). This option will generally produce shallow trees, which may not be appropriate in all settings.
eps: Parameter used by cdlasso.
maxit: Parameter used by cdlasso.
nfolds: Number of cross-validation folds to be used for the lasso.
block.size: Determines how cumulative error rate is calculated. To obtain the cumulative error rate on every nth tree, set the value to an integer between 1 and ntree.
bootstrap: Bootstrap protocol. Default is by.root which bootstraps the data by sampling with or without replacement (without replacement is the default; see the option samptype below). If none, the data is not bootstrapped (it is not possible to return OOB ensembles or prediction error in this case). If by.user, the bootstrap specified by samp is used.
samptype: Type of bootstrap used when by.root is in effect. Choices are swor (sampling without replacement; the default) and swr (sampling with replacement).
samp: Bootstrap specification when by.user is in effect. Array of dim n x ntree specifying how many times each record appears inbag in the bootstrap for each tree.
membership: Should terminal node membership and inbag information be returned?
sampsize: Function specifying bootstrap size when by.root is in effect. For sampling without replacement, it is the requested size of the sample, which by default is .632 times the sample size. For sampling with replacement, it is the sample size. Can also be specified using a number.
seed: Negative integer specifying seed for the random number generator.
do.trace: Number of seconds between updates to the user on approximate time to completion.
...: Further arguments passed to cdlasso and rfsrc.
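
The samp argument described above is an n x ntree array of inbag counts. The following is a minimal base R sketch of how such an array might be built for sampling without replacement; the object names (n, ntree, size) are illustrative, and rounding .632 * n to an integer is an assumption, not documented behavior.

```r
set.seed(1)
n <- 100        # number of records (illustrative)
ntree <- 5      # number of trees (illustrative)

# Requested inbag size for sampling without replacement.
# Rounding is an assumption made for this sketch.
size <- round(0.632 * n)

# samp[i, b] counts how often record i is inbag for tree b.
samp <- matrix(0L, nrow = n, ncol = ntree)
for (b in seq_len(ntree)) {
  inbag <- sample.int(n, size, replace = FALSE)
  samp[inbag, b] <- 1L
}
```

Under sampling without replacement every entry is 0 or 1 and each column sums to the requested size. An array of this shape would then be supplied through samp together with bootstrap = "by.user".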

Details

A flexible class of parametric models is used for tree splitting, fit using the lasso. This class includes CART splits as well as hyperplane, ellipsoid and hyperboloid cuts. Coordinate descent is used for fast calculation of the penalized lasso parametric models. Cross-validation is employed to obtain the lasso regularization parameter.
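
To make the coordinate descent step concrete, here is a generic textbook sketch of cyclic coordinate descent for the lasso objective (1/2n)||y - Xb||^2 + lambda ||b||_1. It is not the package's cdlasso routine. The function names soft and cd_lasso are hypothetical, and treating maxit as an iteration cap and eps as a convergence tolerance is an assumption made for illustration.

```r
# Soft-thresholding operator used by the lasso coordinate update.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for the lasso (illustrative only).
cd_lasso <- function(X, y, lambda, maxit = 500, eps = 0.005) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  r <- y                                   # current residual y - X %*% beta
  for (it in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(ncol(X))) {
      r <- r + X[, j] * beta[j]            # remove coordinate j from the fit
      beta[j] <- soft(sum(X[, j] * r) / n, lambda) / (sum(X[, j]^2) / n)
      r <- r - X[, j] * beta[j]            # add the updated coordinate back
    }
    if (max(abs(beta - beta_old)) < eps) break
  }
  beta
}

# Small orthogonal design: y depends only on the first column.
X <- cbind(c(1, -1, 1, -1), c(1, 1, -1, -1))
y <- 2 * X[, 1]
b0 <- cd_lasso(X, y, lambda = 0)     # no penalty
b1 <- cd_lasso(X, y, lambda = 0.5)   # shrinks the active coefficient
b2 <- cd_lasso(X, y, lambda = 3)     # penalty large enough to zero everything
```

In practice the regularization parameter would be chosen by cross-validation over a grid of lambda values rather than fixed by hand as in this sketch.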

These trees are called super greedy trees (SGTs) and are constructed using best split first (BSF) where cuts are made sequentially in order of greatest empirical risk reduction.
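
The best split first idea can be sketched with a toy example. The code below uses a single feature and ordinary axis-aligned splits scored by reduction in squared error, which stands in for the lasso-based parametric splits of an SGT; the helper names best_split and bsf_tree are hypothetical and this is not the package's implementation.

```r
# Best axis-aligned split of y on one feature x, scored by SSE reduction.
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  cuts <- sort(unique(x))
  if (length(cuts) < 2) return(list(gain = 0, cut = NA))
  mids <- (head(cuts, -1) + tail(cuts, -1)) / 2
  gains <- sapply(mids, function(cc) sse(y) - sse(y[x <= cc]) - sse(y[x > cc]))
  list(gain = max(gains), cut = mids[which.max(gains)])
}

# Grow a tree by always splitting the leaf with the largest risk reduction.
bsf_tree <- function(x, y, treesize) {
  leaves <- list(seq_along(y))             # each leaf is a vector of row indices
  while (length(leaves) < treesize) {
    cand <- lapply(leaves, function(id) best_split(x[id], y[id]))
    gains <- sapply(cand, `[[`, "gain")
    k <- which.max(gains)                  # leaf offering the greatest reduction
    if (gains[k] <= 0) break               # no leaf can be improved further
    id <- leaves[[k]]
    left <- id[x[id] <= cand[[k]]$cut]
    right <- id[x[id] > cand[[k]]$cut]
    leaves <- c(leaves[-k], list(left, right))
  }
  leaves
}

x <- 1:8
y <- c(0, 0, 0, 0, 10, 10, 12, 12)
leaves3 <- bsf_tree(x, y, treesize = 3)
```

Unlike depth-first or level-by-level growth, the order of splits here is driven entirely by empirical risk reduction, so a fixed budget of terminal nodes is spent where it helps most.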

Parametric linear models used for splitting are indexed by parameter hcut corresponding to the following geometric regions:

hcut=1 (hyperplane) linear model using all variables.
hcut=2 (ellipse) plus all quadratic terms.
hcut=3 (oblique ellipse) plus all pairwise interactions.
hcut=4 plus all polynomials of degree 3 of two variables.
hcut=5 plus all polynomials of degree 4 of three variables.
hcut=6 plus all three-way interactions.
hcut=7 plus all four-way interactions.

Setting hcut=0 gives CART splits where cuts are parallel to the coordinate axes (axis-aligned cuts). Thus, hcut=0 is similar to random forests.
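
As a rough illustration of how the feature set grows with hcut, the sketch below builds the terms listed above for hcut = 1, 2 and 3 (linear terms, then squares, then pairwise interactions). The helper expand_hcut is hypothetical and only mirrors the list above; how the package constructs these terms internally is not described here.

```r
# Expand a numeric feature matrix according to the hcut list (hcut 1 to 3 only).
expand_hcut <- function(x, hcut) {
  x <- as.matrix(x)
  out <- x                                   # hcut = 1: all linear terms
  if (hcut >= 2) {                           # hcut = 2: add quadratic terms
    sq <- x^2
    colnames(sq) <- paste0(colnames(x), "^2")
    out <- cbind(out, sq)
  }
  if (hcut >= 3 && ncol(x) >= 2) {           # hcut = 3: add pairwise interactions
    pairs <- combn(ncol(x), 2)
    inter <- apply(pairs, 2, function(j) x[, j[1]] * x[, j[2]])
    colnames(inter) <- apply(pairs, 2,
                             function(j) paste(colnames(x)[j], collapse = ":"))
    out <- cbind(out, inter)
  }
  out
}

x <- cbind(a = c(1, 2, 3, 4), b = c(2, 0, 1, 5), c = c(0, 1, 1, 3))
z1 <- expand_hcut(x, 1)
z2 <- expand_hcut(x, 2)
z3 <- expand_hcut(x, 3)
```

With p features the number of candidate terms grows from p to 2p to 2p + p(p - 1)/2, which is why filtering is switched on by default when hcut > 1.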

Value

A forest of SGTs trained on the learning data which can be used for prediction.

Author

Hemant Ishwaran and Udaya B. Kogalur

References

Ishwaran H. (2023). Super greedy regression trees with coordinate descent. Technical Report.

Examples