rfsgt.RdGrow a forest of Super Greedy Trees (SGTs) using lasso. In addition to prediction, the fitted forest supports local beta-value and partial-contribution summaries that show how predictions are assembled.
rfsgt(formula,
data,
ntree = if (hcut == 0) 500 else 100,
hcut = 1,
treesize = NULL,
nodesize = NULL,
filter = (hcut > 1),
bsf = if (hcut > 0) "oob" else "inbag",
keep.only = NULL,
fast = TRUE,
pure.lasso = FALSE,
eps = .005,
maxit = 500,
nfolds = 10,
block.size = 10,
bootstrap = c("by.root", "none", "by.user"),
samptype = c("swor", "swr"), samp = NULL, membership = TRUE,
sampsize = if (samptype == "swor") function(x){x * .632} else function(x){x},
seed = NULL,
do.trace = FALSE,
...)Formula describing the model to be fit.
Data frame containing response and features.
Number of trees to grow.
Integer value indexing type of parametric regression model to use for splitting. See details below.
Function specifying upper bound for size of tree
(number of terminal nodes) where first input is n sample size
and second input is hcut. Can also be supplied as an integer
value and defaults to an internal function if unspecified.
Minumum size of terminal node. Set internally if not specified.
Logical value specifying whether dimension reduction
(filtering) of features should be performed.Can also be specified
using the helper function tune.hcut which performs dimension
reduction prior to fitting. See examples below.
Best split first (BSF) empirical risk minimization
strategy. Accepted values are "oob" and "inbag".
The default is "oob" when hcut>0 and
"inbag" otherwise. Using OOB (out-of-bag) empirical risk
minimization can improve robustness, but it can also optimistically
bias OOB error estimates. Setting bsf="inbag" gives pure
inbag empirical risk minimization. If filter is supplied
using tune.hcut and bsf is omitted, the stored
tuning value is used.
Character vector specifying the features of interest.
The data is pre-filtered to keep only these requested variables.
Ignored if filter is specified using tune.hcut.
Use fast filtering?
Logical value specifying whether lasso splitting should be strictly adhered to. In general, lasso splits are replaced with CART whenever numerical instability occurs (for example, small node sample sizes may make it impossible to obtain the cross-validated lasso parameter). This option will generally produce shallow trees which may not be appropriate in all settings.
Parameter used by cdlasso.
Parameter used by cdlasso.
Number of cross-validation folds to be used for the lasso.
Determines how cumulative error rate is calculated. To
obtain the cumulative error rate on every nth tree, set the value to
an integer between 1 and ntree.
Bootstrap protocol. Default is by.root which
bootstraps the data by sampling with or without replacement (without
replacement is the default; see the option samptype below).
If none, the data is not bootstrapped (it is not possible to
return OOB ensembles or prediction error in this case). If
by.user, the bootstrap specified by samp is used.
Type of bootstrap used when by.root is in
effect. Choices are swor (sampling without replacement;
the default) and swr (sampling with replacement).
Bootstrap specification when by.user is in
effect. Array of dim n x ntree specifying how many
times each record appears inbag in the bootstrap for each tree.
Should terminal node membership and inbag information be returned?
Function specifying bootstrap size when by.root
is in effect. For sampling without replacement, it is the requested
size of the sample, which by default is .632 times the sample size.
For sampling with replacement, it is the sample size. Can also be
specified using a number.
Negative integer specifying seed for the random number generator.
Number of seconds between updates to the user on approximate time to completion.
Further arguments passed to cdlasso and
rfsrc.
Super Greedy Trees (SGTs) are tree-based models that extend ordinary CART-style splitting in a fundamental way. In a standard tree, a split typically tests one variable at a time, so tilted, curved, or interaction-driven decision boundaries must be approximated by many small axis-aligned cuts. SGTs instead learn a sparse score inside a node and then split the node using that score. This allows a split to depend on several variables at once, so the fitted tree can represent hyperplane, elliptical, hyperboloid, and other higher-order geometric boundaries much more directly.
Operationally, the procedure is organized in stages. First, a family of
candidate score functions is chosen through hcut. Second, within
each node, a lasso model is fit by coordinate descent. Third, the fitted
node-wise score is used to order observations, and an allowable threshold
on that score is searched for the best split. The resulting daughter
nodes are then re-fit and compared using empirical risk. In this way,
the difficult multivariate split problem is converted into a manageable
one-dimensional threshold search along a learned score.
Tree growth uses a best split first (BSF) strategy. Rather than expanding nodes in strict depth-first or breadth-first order, BSF scans the current terminal nodes, measures the reduction in empirical risk for each candidate split, and grows the node giving the largest gain. This makes the search aggressive but focused: computation is directed to the part of the tree that is most promising at the current stage of growth. In a forest, repeating this over bootstrap samples yields an ensemble of trees with the flexibility of multivariate cuts and the stabilizing effect of aggregation.
The main user control for split geometry is hcut. Smaller
values give simpler cut families; larger values allow richer
polynomial and interaction structure. Thus rfsgt is able to
span random-forest-like axis-aligned splits all the way to
genuinely multivariate geometric partitioning. When the signal is
approximately linear or only mildly nonlinear, smaller hcut
values may be sufficient. When the signal involves curvature or
interactions, larger hcut values can reduce bias and often
achieve the same structural fit with fewer splits.
An equally important feature of SGTs is that the fitted forest is not only a prediction device. Because each split score and each terminal-node predictor is a sparse lasso model, the forest also carries local coefficient information. For a given observation, each tree contributes the coefficient vector from the terminal node containing that observation, and averaging across trees yields forest-level beta summaries that are usually more stable than the coefficients from any single tree. These beta values are local and data-adaptive: they can change from one region of the feature space to another, so they should be viewed as coefficient functions rather than a single global regression vector.
This gives SGTs a genuinely hybrid character. The partition of the
feature space is nonparametric, as in a tree or forest, but within each
local region the fitted response is represented by a sparse parametric
expansion. Built-in helpers such as get.beta expose this
structure by returning both beta summaries and corresponding partial
term contributions. For a main effect, the contribution is the local
beta multiplied by the covariate value; for an interaction, it is the
local interaction coefficient multiplied by the associated product term.
In many applications, these summaries can be as informative as the
prediction itself because they show which variables, nonlinear terms,
and interactions are driving the fitted value near the observation of
interest.
Parametric models used for splitting are indexed by hcut
corresponding to the following geometric regions:
hcut=1 (hyperplane) linear model using all variables.
hcut=2 (ellipse) plus all quadratic terms.
hcut=3 (oblique ellipse) plus all pairwise interactions.
hcut=4 plus all polynomials of degree 3 of two variables.
hcut=5 plus all polynomials of degree 4 of three variables.
hcut=6 plus all three-way interactions.
hcut=7 plus all four-way interactions.
Setting hcut=0 gives CART splits where cuts are parallel to the
coordinate axis (axis-aligned cuts). Thus, hcut=0 is similar to
random forests and can be viewed as the baseline case of the SGT
framework.
A major part of the implementation is devoted to regularization and stabilization, because richer cut families can otherwise become unstable in small nodes or in the presence of collinearity. The first safeguard is the lasso itself. At each node, the split-defining score is fit with an \(\ell_1\) penalty, and the penalty is chosen by cross-validation. This induces sparsity, controls local complexity, and lets the procedure adapt to the amount of information available in the node. Near the root, where sample sizes are larger, the fitted score can support richer geometry. Deeper in the tree, lasso sparsity and smaller node sizes tend to simplify the split automatically.
A second safeguard is feature filtering. When the predictor dimension is
moderate or large, the candidate dictionary implied by hcut can be
very large. The implementation can therefore pre-filter variables using
shallow pilot fits and retain only variables that appear with nonzero
lasso coefficients. This reduces runtime and variance before the final
forest is grown. The helper tune.hcut is the intended front-end
for this step: it can pre-filter variables and also choose a suitable
hcut value before the final call to rfsgt. In practice,
this is often the safest workflow when interactions or higher-order terms
are being considered.
A third safeguard is automatic simplification of a branch when the lasso modeling is no longer paying off. The algorithm then replaces the lasso-induced split by the best CART coordinate-threshold split at that node and in place of local lasso node estimators, simple CART-style sample-average predictors are used along that branch. In other words, in the presence of numerical instability, the procedure can simplify both the split geometry and the local node model, which prevents unstable or unnecessary parametric structure from being pushed deeper into the tree.
The argument pure.lasso controls whether this simplification is
allowed. With the default behavior, a branch may switch from the
richer lasso-based representation to a simpler CART-style
representation when the latter is more stable or gives better
empirical risk reduction. Setting pure.lasso=TRUE shuts this
switch off. This is useful when a user wants a fully lasso-defined
tree for methodological reasons, but in difficult data settings it can
also lead to shallower trees because branches that would otherwise be
stabilized by CART are instead left unsplit.
The argument bsf, although related, addresses a different
question. It does not turn CART fallback on or off. Instead, it
decides which data drive empirical risk minimization during BSF
search. With bsf="inbag", the split comparison is based on
in-bag data. With bsf="oob", the same comparison uses
out-of-bag (OOB) data as a held-out check. This indirectly has
implications when deciding on CART fallback, however, because it
allows the algorithm to use out of sample data to decide whether the
current SGT candidate really improves generalization relative to the
best CART candidate at the same node. If the CART candidate wins
under the chosen risk criterion, the branch is simplified exactly as
described above: CART split geometry is used and CART-style node
estimators replace the local lasso fits down that branch. This guard
is especially helpful under potentially numerically unstable models
when hcut > 0. As with any model-selection procedure that
reuses held-out data, however, users should remember that the reported
OOB error can then be mildly optimistic.
From a practical point of view, the main tuning parameters have clear
roles. Use hcut to control cut richness, treesize and
nodesize to control tree complexity, filter or
tune.hcut when the feature space is large, and
pure.lasso only when you explicitly want to over-ride CART
fallback. Users new to the method can usually start with the
defaults. Users working with interaction-heavy or high-dimensional
data will typically benefit from filtering, tune.hcut, and
OOB-based BSF. Users interested mainly in prediction can simply call
predict.rfsgt. Users interested in the local parametric
structure can query the same fitted forest with get.beta to
recover beta summaries and partial contributions without fitting a
separate surrogate model.
An object of class c("rfsgt", "grow", family) containing the
trained super greedy forest and associated training-data summaries. The
object is a list with the following components:
familyModel family.
nNumber of observations in the training data.
bootstrapBootstrap protocol used to grow the forest.
samptypeSampling type used for root-node bootstrap samples.
sampsizeTree sample size used by the bootstrap protocol.
ntreeNumber of trees grown.
hcutFinal hcut value used to construct the
splitting model.
splitruleSplit rule used by the forest.
splitinfoInternal split-rule metadata, including the
processed hcut, split-rule index, number of random split
points, and lasso cross-validation setting.
yvarTraining response values.
yvar.namesResponse variable name.
yvar.factorFactor-level metadata for the response.
yvar.typesInternal response type code.
xvarTraining predictor data after hot encoding, optional filtering, and any user-requested variable restriction.
xvar.augmentAugmented base-learner data used for
higher-order hcut terms, or NULL when no augmented
terms are used.
xvar.namesNames of the retained predictor variables.
xvar.typesInternal predictor type codes.
xvar.infoPredictor-encoding metadata used internally;
this is NULL for standard grow objects.
xvar.augment.namesNames of augmented base-learner
terms, or NULL when no augmented terms are used.
term.mapTerm map describing how generated base-learner terms correspond to original variables and powers.
forestStored forest object used by
predict.rfsgt. This component contains the native node
array, per-tree leaf counts, terminal-node offsets, bootstrap
membership identifiers, fitting options, and package-version
metadata.
nodeStatNode statistics returned when empirical-risk
output is available; otherwise NULL.
empr.riskInbag empirical-risk values by candidate tree
size and tree, or NULL when unavailable.
oob.empr.riskOut-of-bag empirical-risk values by
candidate tree size and tree, or NULL when unavailable.
empr.risk.cartInbag empirical-risk values for CART
fallback splits, or NULL when unavailable.
oob.empr.risk.cartOut-of-bag empirical-risk values for
CART fallback splits, or NULL when unavailable.
bsfBest-split-first empirical-risk strategy used.
bsf.orderBest-split-first node expansion order, or
NULL when unavailable.
predictedTraining-data ensemble predictions.
predicted.oobOut-of-bag training-data predictions when
available; otherwise NULL.
membershipMatrix of terminal-node membership by
observation and tree when membership = TRUE; otherwise
NULL.
inbagMatrix of bootstrap counts by observation and
tree when membership = TRUE; otherwise NULL.
ensembleInternal ensemble type used for prediction summaries.
err.rateCumulative prediction error, typically the
out-of-bag mean squared error for regression when available;
otherwise NULL.
ctime.internalTiming information reported by the native grow routine.
ctime.externalElapsed R-side timing, computed from
proc.time().
Ishwaran H. (2025). Super greedy trees. To appear in Artificial Intelligence Review.
## ------------------------------------------------------------
##
## mtcars: for CRAN testing
##
## ------------------------------------------------------------
print(rfsgt(mpg~., mtcars, ntree=3, treesize=1))
# \donttest{
## ------------------------------------------------------------
##
## boston housing
##
## ------------------------------------------------------------
if (requireNamespace("mlbench", quietly = TRUE)) {
## load the data
data(BostonHousing, package = "mlbench")
## default basic call
print(rfsgt(medv~., BostonHousing))
## variable selection
sort(vimp.rfsgt(medv~.,BostonHousing))
## examples of hcut=0 (similar to random forests ... but using BSF)
print(rfsgt(medv~., BostonHousing, hcut=0))
print(rfsgt(medv~., BostonHousing, hcut=0, nodesize=1))
## hcut=1 with smaller nodesize
print(rfsgt(medv~., BostonHousing, nodesize=1))
## ------------------------------------------------------------
##
## boston housing with factors
##
## ------------------------------------------------------------
## load the data
data(BostonHousing, package = "mlbench")
## make some features into factors
Boston <- BostonHousing
Boston$zn <- factor(Boston$zn)
Boston$chas <- factor(Boston$chas)
Boston$lstat <- factor(round(0.2 * Boston$lstat))
Boston$nox <- factor(round(20 * Boston$nox))
Boston$rm <- factor(round(Boston$rm))
## random forest: hcut=0
print(rfsgt(medv~., Boston, hcut=0, nodesize=1))
## hcut=3
print(rfsgt(medv~., Boston, hcut=3))
## ------------------------------------------------------------
##
## ozone
##
## ------------------------------------------------------------
## load the data
data(Ozone, package = "mlbench")
print(rfsgt(V4~., na.omit(Ozone), hcut=0, nodesize=1))
print(rfsgt(V4~., na.omit(Ozone), hcut=1))
print(rfsgt(V4~., na.omit(Ozone), hcut=2))
print(rfsgt(V4~., na.omit(Ozone), hcut=3))
}
## ------------------------------------------------------------
##
## non-linear boundary illustrates hcut using single tree
##
## ------------------------------------------------------------
## simulate non-linear boundary
n <- 500
p <- 5
signal <- 10
treesize <- 10
ngrid <- 200
## train
x <- matrix(runif(n * p), ncol = p)
fx <- signal * sin(pi * x[, 1] * x[, 2])
nl2d <- data.frame(y = fx, x)
## truth
x1 <- x2 <- seq(0, 1, length = ngrid)
truth <- signal * sin(outer(pi * x1, x2, "*"))
## test
x.tst <- do.call(rbind, lapply(x1, function(x1j) {
cbind(x1j, x2, matrix(runif(length(x2) * (p-2)), ncol=(p-2)))
}))
colnames(x.tst) <- colnames(x)
fx.tst <- signal * sin(pi * x.tst[, 1] * x.tst[, 2])
nl2d.tst <- data.frame(y = fx.tst, x.tst)
## SGT for different hcut values
## Enforce pure lasso
rO <- lapply(0:4, function(hcut) {
cat("hcut", hcut, "\n")
rfsgt(y~., nl2d, ntree=1, hcut=hcut, treesize=treesize, bootstrap="none",
pure.lasso = TRUE, nodesize=1, filter=FALSE)
})
## nice little wrapper for plotting results
if (library("interp", logical.return = TRUE)) {
## nice little wrapper for plotting results
plot.image <- function(x, y, z, linear=TRUE, nlevels=40, points=FALSE) {
xo <- x; yo <- y
so <- interp(x=x, y=y, z=z, linear=linear, nx=nlevels, ny=nlevels)
x <- so$x; y <- so$y; z <- so$z
xlim <- ylim <- range(c(x, y), na.rm = TRUE, finite = TRUE)
z[is.infinite(z)] <- NA
zlim <- q <- quantile(z, c(.01, .99), na.rm = TRUE)
z[z<=q[1]] <- q[1]
z[z>=q[2]] <- q[2]
levels <- pretty(zlim, nlevels)
col <- hcl.colors(length(levels)-1, "YlOrRd", rev = TRUE)
plot.new()
plot.window(xlim, ylim, "", xaxs = "i", yaxs = "i", asp = NA)
.filled.contour(x, y, z, levels, col)
axis(1);axis(2)
if (points)
points(xo,yo ,pch=16, cex=.25)
box()
invisible()
}
oldpar <- par(mfrow=c(3,2))
image(x1, x2, truth, xlab="", ylab="")
contour(x1, x2, truth, nlevels = 15, add = TRUE, drawlabels = FALSE)
mtext(expression(x[1]),1,line=2)
mtext(expression(x[2]),2,line=2)
mtext(expression("truth"),3,line=1)
pO <- lapply(0:4, function(j) {
plot.image(nl2d.tst[,"X1"],nl2d.tst[,"X2"], predict(rO[[j+1]], nl2d.tst)$predicted)
contour(x1, x2, truth, nlevels = 15, add = TRUE, drawlabels = FALSE)
mtext(expression(x[1]),1,line=2)
mtext(expression(x[2]),2,line=2)
mtext(paste0("hcut=",j),3,line=1)
NULL
})
par(oldpar)
}
## ------------------------------------------------------------
##
## friedman illustration of OOB empirical risk
##
## ------------------------------------------------------------
if (requireNamespace("mlbench", quietly = TRUE)) {
## simulate friedman
n <- 500
dta <- data.frame(mlbench::mlbench.friedman1(n, sd=0))
## rf versus rfsgt
o1 <- rfsgt(y~., dta, hcut=0, block.size=1)
o2 <- rfsgt(y~., dta, hcut=3, block.size=1)
## compute running average of OOB empirical risk
runavg <- function(x, lag = 8) {
x <- c(na.omit(x))
lag <- min(lag, length(x))
cx <- c(0,cumsum(x))
rx <- cx[2:lag] / (1:(lag-1))
c(rx, (cx[(lag+1):length(cx)] - cx[1:(length(cx) - lag)]) / lag)
}
risk1 <- lapply(data.frame(o1$oob.empr.risk), runavg)
leaf1 <- o1$forest$leafCount
risk2 <- lapply(data.frame(o2$oob.empr.risk), runavg)
leaf2 <- o2$forest$leafCount
## compare OOB empirical tree risk to OOB forest error
oldpar <- par(mfrow=c(2,2))
plot(c(1,max(leaf1)), range(c(risk1)), type="n",
xlab="Tree size", ylab="RF OOB empirical risk")
l1 <- do.call(rbind, lapply(risk1, function(rsk){
lines(rsk,col=grey(0.8))
cbind(1:length(rsk), rsk)
}))
lines(tapply(l1[,2], l1[,1], mean), lwd=3)
plot(c(1,max(leaf2)), range(c(risk2)), type="n",
xlab="Tree size", ylab="SGF OOB empirical risk")
l2 <- do.call(rbind, lapply(risk2, function(rsk){
lines(rsk,col=grey(0.8))
cbind(1:length(rsk), rsk)
}))
lines(tapply(l2[,2], l2[,1], mean), lwd=3)
plot(1:o1$ntree, o1$err.rate, type="s", xlab="Trees", ylab="RF OOB error")
plot(1:o2$ntree, o2$err.rate, type="s", xlab="Trees", ylab="SGF OOB error")
par(oldpar)
}
## ------------------------------------------------------------
##
## synthetic regression examples with different hcut
##
## ------------------------------------------------------------
if (requireNamespace("mlbench", quietly = TRUE)) {
## simulation functions
sim <- list(
friedman1=function(n){data.frame(mlbench::mlbench.friedman1(n))},
friedman2=function(n){data.frame(mlbench::mlbench.friedman2(n))},
friedman3=function(n){data.frame(mlbench::mlbench.friedman3(n))},
peak=function(n){data.frame(mlbench::mlbench.peak(n, 10))},
linear=function(n, sd=.1){
x=matrix(runif(n*10), n)
y=3*x[,1]^3-2*x[,2]^2+3*x[,3]+rnorm(n,sd=sd)
data.frame(y=y,x)
})
## run rfsgt on the simulations
n <- 500
max.hcut <- 3
results <- setNames(lapply(names(sim), function(nm) {
cat("simulation:", nm, "\n")
d <- sim[[nm]](n=n)
rO <- data.frame(do.call(rbind, lapply(0:max.hcut, function(hcut) {
cat(" hcut:", hcut, "\n")
o <- rfsgt(y~.,d,hcut=hcut)
c(hcut, tail(o$err.rate, 1), tail(o$err.rate, 1) / var(o$yvar))
})))
colnames(rO) <- c("hcut", "mse", "smse")
rO
}), names(sim))
## print results
print(results)
## ------------------------------------------------------------
##
## synthetic regression example showing how to tune hcut
##
## ------------------------------------------------------------
hcut.opt <- setNames(sapply(names(sim), function(nm) {
cat("optimize hcut for simulation:", nm, "\n")
f <- tune.hcut(y~., sim[[nm]](n=n), hcut=4)
attr(f, "hcut")
}), names(sim))
## print the optimal hcut
print(hcut.opt)
}
## ------------------------------------------------------------
##
## iowa housing data
##
## ------------------------------------------------------------
data(housing, package = "randomForestSRC")
## remove PID
housing$PID <- NULL
## rough missing data imputation
d <- randomForestSRC::impute(data = data.frame(data.matrix(housing)))
d$SalePrice <- log(d$SalePrice)
d <- data.frame(data.matrix(d))
print(rfsgt(SalePrice~.,d))
## ------------------------------------------------------------
##
## high-dimensional model with variable selection
##
## ------------------------------------------------------------
## simulate big p small n data
n <- 50
p <- 500
d <- data.frame(y = rnorm(n), x = matrix(rnorm(n * p), n))
## we have a big p small n pure noise setting: let's see how well we do
cat("variables selected by vimp.rfsgt:\n")
vmp <- sort(vimp.rfsgt(y~.,d))
print(vmp[vmp > .05])
## internal filtering function can also be used
cat("variables selected by filter.rfsgt:\n")
print(filter.rfsgt(y~.,d, method="conserve"))
## ------------------------------------------------------------
##
## pre-filtering using keep.only
##
## ------------------------------------------------------------
if (requireNamespace("mlbench", quietly = TRUE)) {
## simulate the data
n <- 100
p <- 50
noise <- matrix(runif(n * p), ncol=p)
dta <- data.frame(mlbench::mlbench.friedman1(n, sd=0), noise=noise)
## filter the variables
f <- filter.rfsgt(y~., dta)
## use keep.only to pre-filter the features
print(rfsgt(y~.,dta, keep.only=f, hcut=1))
print(rfsgt(y~.,dta, keep.only=f, hcut=2))
print(rfsgt(y~.,dta, keep.only=f, hcut=3))
## ------------------------------------------------------------
##
## tuning hcut and pre-filtering using tune.hcut
##
## ------------------------------------------------------------
## simulate the data
n <- 100
p <- 50
noise <- matrix(runif(n * p), ncol=p)
dta <- data.frame(mlbench::mlbench.friedman1(n, sd=0), noise=noise)
## tune hcut
f <- tune.hcut(y~., dta, hcut=3)
## use the optimized hcut
print(rfsgt(y~.,dta, filter=f))
## over-ride the tuned hcut value
print(rfsgt(y~.,dta, filter=use.tune.hcut(f, hcut=1)))
print(rfsgt(y~.,dta, filter=use.tune.hcut(f, hcut=2)))
print(rfsgt(y~.,dta, filter=use.tune.hcut(f, hcut=3)))
## ------------------------------------------------------------
##
## get local beta values and partial contributions
##
## SGTs are not only predictive; the same fit can be queried
## for local parametric summaries. We use friedman 1 for
## illustration.
##
## ------------------------------------------------------------
n <- 100
dta <- data.frame(mlbench::mlbench.friedman1(n))
o <- rfsgt(y~., dta, hcut=3, pure.lasso=TRUE, treesize=10)
bO <- get.beta(o)
print(str(bO$beta))
print(str(bO$partial))
}
# }