
I’ve built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.
When I went to examine an individual estimator, however, I had difficulty interpreting the binary input features, such as f60150 below. The real-valued f60150 in the bottommost chart is easy to interpret: its split criterion is in the expected range of that feature. However, the comparison being made for the binary features, < -9.53674e-07, doesn't make sense. Each of these features is either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underlying plotting libraries, but it doesn't make sense to use that comparison when the feature can never be negative. Can someone help me understand which direction (i.e. yes, missing vs. no) corresponds to which true/false side of these binary feature nodes?

Here is a reproducible example:

import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt

def booleanize_csr_matrix(mat):
    '''Convert a sparse matrix with positive integer elements to 1s'''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
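As a quick self-contained check of the helper (the tiny `counts` matrix here is made up, not taken from the dataset below), `booleanize_csr_matrix` maps every positive count to 1.0 while preserving the sparsity pattern:

```python
import numpy as np
import scipy.sparse

def booleanize_csr_matrix(mat):
    '''Convert a sparse matrix with positive integer elements to 1s'''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    return scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape,
    )

# Made-up word-count matrix: two documents, three vocabulary terms
counts = scipy.sparse.csr_matrix(np.array([[3, 0, 1], [0, 2, 0]]))
boolified = booleanize_csr_matrix(counts)
print(boolified.toarray())  # every positive count is now 1.0
```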

### Setup dataset
res = fetch_20newsgroups()

text = res.data
outcome = res.target

### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)

# Whether to "booleanize" the input matrix
booleanize = True

# Whether to, after "booleanizing", convert the data type to match what's
# returned by `vec.fit_transform(text)`
to_int = True

if booleanize:
    X = booleanize_csr_matrix(X)

if to_int:
    X = X.astype(np.int64)

# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)

# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR')
plt.show()

Running the above with booleanize and to_int set to True yields the following chart:

Running the above with booleanize and to_int set to False yields the following chart:

Heck, even if I do a really simple example, I get the “right” results, regardless of whether X or y are integer or floating types.
X = np.matrix(
    [
        [1, 0],
        [1, 0],
        [0, 1],
        [0, 1],
        [1, 1],
        [1, 0],
        [0, 0],
        [0, 0],
        [1, 1],
        [0, 1],
    ]
)

y = np.array([1,0,0,0,1,1,1,0,1,1])

model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR')
plt.show()
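One observation I can add (my own arithmetic, nothing XGBoost-specific): whatever its floating-point origin, both possible values of a 0/1 feature sit on the same side of the < -9.53674e-07 comparison, so for explicitly stored entries the test always evaluates the same way:

```python
threshold = -9.53674e-07

# Neither 0 nor 1 is less than a negative threshold, so explicitly
# stored 0/1 values always fail the test; in a CSR matrix the unstored
# zeros are the candidates for "missing" and may be routed differently.
for v in (0, 1):
    print(v, v < threshold)  # prints "0 False" then "1 False"
```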

Kuldeep Baberwal Changed status to publish February 17, 2025