Examples¶
Minimal example¶
msitrees follows scikit-learn API style, which allows for fast model iteration. Below is an example, where decision
tree classifier is fitted and scored over 10 fold cross validation using cross_val_score().
from msitrees.tree import MSIDecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
data = load_iris()
clf = MSIDecisionTreeClassifier()
cross_val_score(clf, data['data'], data['target'], cv=10)
# array([1. , 1. , 1. , 0.93333333, 0.93333333,
# 0.8 , 0.93333333, 0.86666667, 0.8 , 1. ])
Model preservation¶
Model preservation is possible with pickle module.
import pickle
with open('model.pkl', 'wb') as file:
pickle.dump(clf, file)
The same can be used to load the model back.
with open('model.pkl', 'rb') as file:
loaded_model = pickle.load(file)
Zero hyperparameter based approach¶
MSI based algorithm should have performance comparable to CART decision tree where best hyperparameters were established with
some sort of search. We are going to compare MSIRandomForestClassifier with scikit-learn implementation of random forest
algorithm with hyperparameters grid searched using optuna. Both algorithms will be limited to 100 estimators, and measured
by comparing accuracy on validation set of MNIST dataset.
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = load_digits()
x_train, x_valid, y_train, y_valid = train_test_split(data['data'], data['target'], random_state=42)
def objective(trial):
params = {
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 10),
'max_depth': trial.suggest_int('max_depth', 8, 20),
'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
'random_state': 42,
'n_estimators': 100
}
clf = RandomForestClassifier(**params)
clf.fit(x_train, y_train)
pred = clf.predict(x_valid)
score = accuracy_score(y_valid, pred)
return score
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_jobs=-1, show_progress_bar=True, n_trials=500)
# fit benchmark model on best params
benchmark = RandomForestClassifier(**study.best_params)
benchmark = benchmark.fit(x_train, y_train)
pred = benchmark.predict(x_valid)
accuracy_score(y_valid, pred)
# 0.9711111111111111
Since MSI based algorithm has no additional hyperparameters, code is sparse.
from msitrees.ensemble import MSIRandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = load_digits()
x_train, x_valid, y_train, y_valid = train_test_split(data['data'], data['target'], random_state=42)
clf = MSIRandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
pred = msiclf.predict(x_valid)
accuracy_score(y_valid, pred)
# 0.9733333333333334
Results for both random forest algorithms are comparable. Furthermore, median depth of a tree estimator is equal for both methods, even though MSI has no explicit parameter controlling tree depth.
np.median([e.get_depth() for e in benchmark.estimators_])
# 12.0
np.median([e.get_depth() for e in clf._estimators])
# 12.0