Comparing two classification models using stambo
V1.1: © Aleksei Tiulpin, PhD, 2024
This notebook shows an end-to-end example of how to take a dataset, train two machine learning models, and conduct a statistical test to assess whether the two models differ. We first use a set of classical metrics (essentially the metrics from sklearn). At the end of the tutorial, we show how to generate a LaTeX report and how to implement a custom metric.
Import of necessary libraries
[1]:
import numpy as np
import stambo
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
SEED = 2024
Loading the UCI breast cancer dataset and creating train-test split
[2]:
X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=SEED, stratify=y)
scaler = StandardScaler()
scaler.fit(Xtr)
Xtr = scaler.transform(Xtr)
Xte = scaler.transform(Xte)
Training the models
We train a kNN classifier and a logistic regression. As the AUC scores below show, the logistic regression outperforms the kNN.
[3]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(Xtr, ytr)
preds_knn = model.predict_proba(Xte)[:, 1]
model = LogisticRegression(C=1e-2, random_state=42)
model.fit(Xtr, ytr)
preds_lr = model.predict_proba(Xte)[:, 1]
auc_knn, auc_lr = roc_auc_score(yte, preds_knn), roc_auc_score(yte, preds_lr)
print(f"kNN AUC: {auc_knn:.4f} / LR AUC: {auc_lr:.4f}")
kNN AUC: 0.9722 / LR AUC: 0.9918
Statistical testing
As stated in the documentation, the testing routine returns a dict of tuples. The keys of the dict are the metric tags, and the values are tuples that store the data in the following order:
1. p-value (\(H_0: \text{model}_1 = \text{model}_2\))
2. Empirical value (model 1)
3. CI low (model 1)
4. CI high (model 1)
5. Empirical value (model 2)
6. CI low (model 2)
7. CI high (model 2)
If you run the code in Binder, decrease the number of bootstrap iterations (10000 by default).
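To build intuition for what the reported p-value measures, the sketch below implements a rough paired-bootstrap comparison of two models' ROC AUC (using the numpy and roc_auc_score imports from the top of the notebook). It is illustrative only, not stambo's exact procedure, and the helper paired_bootstrap_auc_test is our own:
def paired_bootstrap_auc_test(y, preds_a, preds_b, n_boot=2000, seed=2024):
    """Rough two-sided p-value for H0: both models have the same ROC AUC."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample test cases with replacement, keeping the model pairing intact
        idx = rng.integers(0, len(y), len(y))
        if len(np.unique(y[idx])) < 2:  # skip resamples that contain a single class
            continue
        diffs.append(roc_auc_score(y[idx], preds_b[idx]) - roc_auc_score(y[idx], preds_a[idx]))
    diffs = np.asarray(diffs)
    # Two-sided p-value: how often the bootstrapped difference lands on either side of zero
    return min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))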
[4]:
testing_result = stambo.compare_models(yte, preds_knn, preds_lr, metrics=("ROCAUC", "AP", "QKappa", "BACC", "MCC"), seed=SEED)
Bootstrapping: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:17<00:00, 576.63it/s]
If we want to inspect the raw testing results, they are available as a dict in the format described above:
[5]:
testing_result
[5]:
{'ROCAUC': (0.0165983401659834,
0.9721724465057446,
0.9488642065294073,
0.991257028580809,
0.9917782228312428,
0.9796223194281446,
0.9991088801592725),
'AP': (0.018598140185981403,
0.9699899675866734,
0.9431022140624091,
0.9908460876062002,
0.9940360662959732,
0.9843501977589723,
0.9994975413122237),
'QKappa': (0.30716928307169283,
0.8936283657691282,
0.8359946182323871,
0.9445911828990862,
0.8844563366577475,
0.8238384670856083,
0.9383926972823168),
'BACC': (0.17638236176382363,
0.9416570043217034,
0.910371840928929,
0.9699502854178561,
0.9311689680615579,
0.8970587113050348,
0.9627659574468085),
'MCC': (0.4393560643935606,
0.8945584078905953,
0.8380851036550386,
0.9448910810319505,
0.8889244497451684,
0.8329384161226888,
0.9399414434378707)}
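Individual entries can be unpacked directly. As a minimal sketch based on the seven-element format described above, the following pulls out the p-value, point estimates, and confidence intervals for ROC AUC:
p_value, auc_knn_emp, auc_knn_lo, auc_knn_hi, auc_lr_emp, auc_lr_lo, auc_lr_hi = testing_result["ROCAUC"]
print(f"kNN AUC: {auc_knn_emp:.3f} [{auc_knn_lo:.3f}-{auc_knn_hi:.3f}]")
print(f"LR AUC:  {auc_lr_emp:.3f} [{auc_lr_lo:.3f}-{auc_lr_hi:.3f}]")
print(f"p-value: {p_value:.3f}")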
Most commonly, though, we want to present the results in a report, paper, or presentation. For that, we can use the to_latex function and get a copy-and-paste tabular environment. To use it in a LaTeX document, do not forget to import the booktabs package:
[6]:
print(stambo.to_latex(testing_result, m1_name="kNN", m2_name="LR"))
% \usepackage{booktabs} <-- do not for get to have this imported.
\begin{tabular}{llllll} \\
\toprule
\textbf{Model} & \textbf{ROCAUC} & \textbf{AP} & \textbf{QKappa} & \textbf{BACC} & \textbf{MCC} \\
\midrule
kNN & $0.97$ [$0.95$-$0.99$] & $0.97$ [$0.94$-$0.99$] & $0.89$ [$0.84$-$0.94$] & $0.94$ [$0.91$-$0.97$] & $0.89$ [$0.84$-$0.94$] \\
LR & $0.99$ [$0.98$-$1.00$] & $0.99$ [$0.98$-$1.00$] & $0.88$ [$0.82$-$0.94$] & $0.93$ [$0.90$-$0.96$] & $0.89$ [$0.83$-$0.94$] \\
\midrule
$p$-value & $0.02$ & $0.02$ & $0.31$ & $0.18$ & $0.44$ \\
\bottomrule
\end{tabular}
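If you would rather keep the table in a separate file and \input{} it from the manuscript, the generated string can simply be written to disk. A minimal sketch (the file name comparison_table.tex is arbitrary):
with open("comparison_table.tex", "w") as f:
    f.write(stambo.to_latex(testing_result, m1_name="kNN", m2_name="LR"))
# In the LaTeX document: \input{comparison_table.tex}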
Custom metrics
Sometimes the default metrics are not enough, and one may want to add their own. Let us define an F2 score.
[7]:
from sklearn.metrics import fbeta_score
from functools import partial
from stambo.metrics import Metric
[8]:
class F2Score(Metric):
    def __init__(self) -> None:
        Metric.__init__(self, partial(fbeta_score, beta=2), int_input=True)

    def __str__(self) -> str:
        return "F2Score"
[9]:
testing_result = stambo.compare_models(yte, preds_knn, preds_lr,
                                       metrics=("ROCAUC", "AP", F2Score()), seed=SEED)
Bootstrapping: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:12<00:00, 797.50it/s]
[10]:
print(stambo.to_latex(testing_result, m1_name="kNN", m2_name="LR"))
% \usepackage{booktabs} <-- do not for get to have this imported.
\begin{tabular}{llll} \\
\toprule
\textbf{Model} & \textbf{ROCAUC} & \textbf{AP} & \textbf{F2Score} \\
\midrule
kNN & $0.97$ [$0.95$-$0.99$] & $0.97$ [$0.94$-$0.99$] & $0.97$ [$0.95$-$0.99$] \\
LR & $0.99$ [$0.98$-$1.00$] & $0.99$ [$0.98$-$1.00$] & $0.98$ [$0.97$-$0.99$] \\
\midrule
$p$-value & $0.02$ & $0.02$ & $0.18$ \\
\bottomrule
\end{tabular}
[11]:
testing_result
[11]:
{'ROCAUC': (0.0165983401659834,
0.9721724465057446,
0.9488642065294073,
0.991257028580809,
0.9917782228312428,
0.9796223194281446,
0.9991088801592725),
'AP': (0.018598140185981403,
0.9699899675866734,
0.9431022140624091,
0.9908460876062002,
0.9940360662959732,
0.9843501977589723,
0.9994975413122237),
'F2Score': (0.18198180181981802,
0.9711431742508325,
0.9503694283719967,
0.9875846501128666,
0.9801762114537445,
0.9665621734587252,
0.9904153354632587)}