Main functionality

We currently have two main functions: one that runs a two-tailed or one-tailed bootstrap test on two samples, and another that compares two machine learning models evaluated on a test set.

stambo.compare_models(y_test: ndarray[Any, dtype[int64]] | ndarray[Any, dtype[float64]], preds_1: ndarray[Any, dtype[int64]] | ndarray[Any, dtype[float64]], preds_2: ndarray[Any, dtype[int64]] | ndarray[Any, dtype[float64]], metrics: Tuple[str | Metric], alpha: float | None = 0.05, two_tailed: bool = True, n_bootstrap: int = 10000, seed: int | None = None, silent: bool = False) → Dict[str, Tuple[float]]

Compares predictions from two models \(f_1(x)\) and \(f_2(x)\) that yield prediction vectors \(\hat y_{1}\) and \(\hat y_{2}\) with a one-tailed bootstrap hypothesis test. That is, we state the following null and alternative hypotheses:

\[ \begin{align}\begin{aligned}H_0: M(y_{gt}, \hat y_{1}) = M(y_{gt}, \hat y_{2})\\H_1: M(y_{gt}, \hat y_{1}) < M(y_{gt}, \hat y_{2}),\end{aligned}\end{align} \]

where \(M\) is a metric, \(y_{gt}\) is the vector of ground truth labels, and \(\hat y_{i}, i=1,2\) are the vectors of predictions for model 1 and 2, respectively. This test is performed for every specified metric.
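The hypotheses above can be illustrated with a minimal, standard-library sketch of a one-tailed paired bootstrap test. This is one common construction and not necessarily what stambo does internally. The helper names `accuracy` and `one_tailed_bootstrap_pvalue` are hypothetical, accuracy merely stands in for the metric \(M\), and the toy data is synthetic.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def one_tailed_bootstrap_pvalue(y_true, preds_1, preds_2, metric,
                                n_bootstrap=1000, seed=0):
    """Estimate a p-value for H1: metric(model 1) < metric(model 2).

    Illustrative only: test cases are resampled with replacement (paired
    across both models), and the bootstrap differences are shifted to have
    mean zero so that they mimic H0.
    """
    rng = random.Random(seed)
    n = len(y_true)
    observed = metric(y_true, preds_2) - metric(y_true, preds_1)
    diffs = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y_true[i] for i in idx]
        m1 = metric(y_b, [preds_1[i] for i in idx])
        m2 = metric(y_b, [preds_2[i] for i in idx])
        diffs.append(m2 - m1)
    center = sum(diffs) / len(diffs)
    # Count null-centred differences at least as large as the observed one.
    extreme = sum((d - center) >= observed for d in diffs)
    # The +1 terms keep the estimate strictly positive.
    return (extreme + 1) / (n_bootstrap + 1)


# Synthetic toy data: model 2 is always right, model 1 is right 60% of the time.
y = [i % 2 for i in range(200)]
preds_good = list(y)
preds_bad = [1 - v if i < 80 else v for i, v in enumerate(y)]

p_value = one_tailed_bootstrap_pvalue(y, preds_bad, preds_good, accuracy)
p_same = one_tailed_bootstrap_pvalue(y, preds_good, preds_good, accuracy)
print(p_value, p_same)
```

When both prediction vectors are identical, every bootstrap difference is zero and the sketch returns a \(p\)-value of 1, which matches the intuition that there is no evidence against \(H_0\).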

Note that while the test returns a \(p\)-value, one should be careful about its interpretation: the \(p\)-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming that \(H_0\) is true. That is, it is the probability of seeing one model outperform the other by at least this much on the test set, given that the two models would perform the same when evaluated on larger data.

Beyond the hypothesis testing, the function also returns confidence intervals per metric, i.e. \([M(y_{gt}, \hat y)_{(\alpha / 2)}, M(y_{gt}, \hat y)_{(1 - \alpha / 2)}]\).
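As a rough illustration of the interval notation above, the following standard-library sketch computes a percentile bootstrap confidence interval for a single model. It is an assumption that a plain percentile interval is a reasonable stand-in here; the function names `accuracy` and `percentile_ci` are hypothetical and the data is synthetic.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def percentile_ci(y_true, preds, metric, alpha=0.05, n_bootstrap=1000, seed=0):
    """Return the (alpha/2, 1 - alpha/2) percentiles of the bootstrapped metric."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_bootstrap):
        # Resample test cases with replacement and recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx], [preds[i] for i in idx]))
    values.sort()
    lo_idx = int((alpha / 2) * (n_bootstrap - 1))
    hi_idx = int((1 - alpha / 2) * (n_bootstrap - 1))
    return values[lo_idx], values[hi_idx]


# Synthetic toy data: the model is right on 80% of the cases.
y = [i % 2 for i in range(200)]
preds = [1 - v if i < 40 else v for i, v in enumerate(y)]

ci_low, ci_high = percentile_ci(y, preds, accuracy, alpha=0.05)
print(ci_low, accuracy(y, preds), ci_high)
```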

Parameters:

y_test (Union[npt.NDArray[np.int64], npt.NDArray[np.float64]]) – Ground truth
preds_1 (Union[npt.NDArray[np.int64], npt.NDArray[np.float64]]) – Prediction from model 1
preds_2 (Union[npt.NDArray[np.int64], npt.NDArray[np.float64]]) – Prediction from model 2
metrics (Tuple[Union[str, Metric]]) – A set of metrics to call. Here, the user either specifies the metrics available from the stambo library (stambo.metrics), or adds an instance of the custom-defined metrics.
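alpha (float, optional) – A significance level for confidence intervals (from 0 to 1). Defaults to 0.05.
two_tailed (bool, optional) – Whether to run a two-tailed test. Defaults to True.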
n_bootstrap (int, optional) – The number of bootstrap iterations. Defaults to 10000.
seed (int, optional) – Random seed. Defaults to None.
silent (bool, optional) – Whether to execute the function silently, i.e. not showing the progress bar. Defaults to False.

Returns:

A dictionary that maps each metric to a tuple with the p-value, the empirical values of the metric for both models, and their confidence intervals.

Every dict entry has the following format:

\(p\)-value
\(M(y_{gt}, \hat y_{1})\)
\(M(y_{gt}, \hat y_{1})_{(\alpha / 2)}\)
\(M(y_{gt}, \hat y_{1})_{(1 - \alpha / 2)}\)
\(M(y_{gt}, \hat y_{2})\)
\(M(y_{gt}, \hat y_{2})_{(\alpha / 2)}\)
\(M(y_{gt}, \hat y_{2})_{(1 - \alpha / 2)}\)

Return type:

Dict[str, Tuple[float]]
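Since every entry is a flat 7-tuple, it can help to give the positions names before using them. The sketch below does this with a `NamedTuple`. The class `MetricReport`, the helper `unpack_report`, the key `"accuracy"`, and all numbers are hypothetical placeholders that only follow the documented ordering; they are not real results or part of stambo.

```python
from typing import Dict, NamedTuple, Tuple


class MetricReport(NamedTuple):
    """Named view of one report entry, in the documented order."""
    p_value: float
    m1: float
    m1_ci_low: float
    m1_ci_high: float
    m2: float
    m2_ci_low: float
    m2_ci_high: float


def unpack_report(report: Dict[str, Tuple[float, ...]]) -> Dict[str, MetricReport]:
    """Wrap every 7-tuple of a report in a MetricReport."""
    return {name: MetricReport(*values) for name, values in report.items()}


# Placeholder report in the documented format (made-up numbers).
report = {"accuracy": (0.03, 0.81, 0.78, 0.84, 0.85, 0.82, 0.88)}

parsed = unpack_report(report)
entry = parsed["accuracy"]
significant = entry.p_value < 0.05
print(f"M1={entry.m1:.2f} [{entry.m1_ci_low:.2f}, {entry.m1_ci_high:.2f}], "
      f"M2={entry.m2:.2f} [{entry.m2_ci_low:.2f}, {entry.m2_ci_high:.2f}], "
      f"p={entry.p_value:.2f}")
```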

stambo.to_latex(report: Dict[str, Tuple[float]], m1_name: str | None = 'M1', m2_name: str | None = 'M2') → str

Converts a stambo-generated report into a LaTeX table for convenient viewing.

Parameters:

report (Dict[str, Tuple[float]]) – A dictionary with metrics. Use the stambo-generated format.
m1_name (str, optional) – Name to assign to the table row for model 1. Defaults to M1.
m2_name (str, optional) – Name to assign to the table row for model 2. Defaults to M2.

Returns:

A cut-and-paste LaTeX table in tabular environment.

Return type:

str

stambo.two_sample_test(sample_1: ndarray[Any, dtype[int64]] | ndarray[Any, dtype[float64]] | PredSampleWrapper, sample_2: ndarray[Any, dtype[int64]] | ndarray[Any, dtype[float64]] | PredSampleWrapper, statistics: Dict[str, Callable] | None = None, alpha: float = 0.05, two_tailed: bool = True, n_bootstrap: int = 10000, seed: int | None = None, silent: bool = False) → Dict[str, Tuple[float]]

Tests whether the empirical difference of statistics computed on two samples is statistically significant. Note that the statistics are computed independently, and should thus be treated independently.

Parameters:

sample_1 (Union[npt.NDArray[np.int64], npt.NDArray[np.float64], PredSampleWrapper]) – Sample 1 to be compared
sample_2 (Union[npt.NDArray[np.int64], npt.NDArray[np.float64], PredSampleWrapper]) – Sample 2 to be compared
statistics (Dict[str, Callable]) – Statistics to compare the samples by.
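alpha (float, optional) – A significance level for confidence intervals (from 0 to 1). Defaults to 0.05.
two_tailed (bool, optional) – Whether to run a two-tailed test. Defaults to True.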
n_bootstrap (int, optional) – The number of bootstrap iterations. Defaults to 10000.
seed (int, optional) – Random seed. Defaults to None.
silent (bool, optional) – Whether to execute the function silently, i.e. not showing the progress bar. Defaults to False.

Returns:

A dictionary that maps each statistic to a tuple with the p-value, the empirical values of the statistic for both samples, and their confidence intervals.

Every dict entry has the following format:

p-value
empirical value (sample 1)
CI low (sample 1)
CI high (sample 1)
empirical value (sample 2)
CI low (sample 2)
CI high (sample 2)

Return type:

Dict[str, Tuple[float]]
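To make the idea behind a two-sample bootstrap test concrete, here is a minimal standard-library sketch that takes a dictionary of statistics, mirroring the `statistics` argument. It uses a pooled-resampling scheme and a two-tailed comparison, which is an assumption for illustration and may differ from stambo's implementation. The function name `two_sample_bootstrap` is hypothetical, the data is synthetic, and the sketch returns only a \(p\)-value per statistic rather than the full 7-tuple.

```python
import random
import statistics


def two_sample_bootstrap(sample_1, sample_2, stats, n_bootstrap=1000, seed=0):
    """Two-tailed bootstrap p-value per statistic (illustrative sketch).

    Under H0 both samples come from the same distribution, so the samples
    are pooled, resampled with replacement, and split back into two groups
    of the original sizes.
    """
    rng = random.Random(seed)
    pooled = list(sample_1) + list(sample_2)
    n1 = len(sample_1)
    result = {}
    for name, stat in stats.items():
        observed = stat(sample_2) - stat(sample_1)
        extreme = 0
        for _ in range(n_bootstrap):
            resample = rng.choices(pooled, k=len(pooled))
            diff = stat(resample[n1:]) - stat(resample[:n1])
            # Two-tailed: compare absolute differences.
            if abs(diff) >= abs(observed):
                extreme += 1
        result[name] = (extreme + 1) / (n_bootstrap + 1)
    return result


# Synthetic toy data: sample_b is sample_a shifted by 10.
sample_a = [i * 0.1 for i in range(50)]
sample_b = [x + 10 for x in sample_a]
stats = {"mean": statistics.mean, "median": statistics.median}

shifted = two_sample_bootstrap(sample_a, sample_b, stats)
identical = two_sample_bootstrap(sample_a, sample_a, stats)
print(shifted, identical)
```

Each statistic gets its own \(p\)-value, so, as noted above, the results should be read one statistic at a time.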