Model Selection
VBPCApy provides two strategies for choosing the number of latent components \(k\): a sequential sweep and K-fold cross-validation.
select_n_components: sequential sweep
Fits VB-PCA for each candidate \(k\) and selects the best according to a chosen metric.
from vbpca_py import select_n_components, SelectionConfig

cfg = SelectionConfig(metric="cost", patience=2, max_trials=12)
# best_model is None here because return_best_model defaults to False.
best_k, metrics, trace, best_model = select_n_components(
    x, mask=mask, components=range(1, 15), config=cfg, maxiters=200
)
Available metrics
| Metric | Description |
|---|---|
| `"cost"` | Variational free energy (negative ELBO). Lower is better. |
| `"prms"` | Probe-set RMS: reconstruction error on held-out entries. Requires a probe set via `xprobe` or `xprobe_fraction`. |
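As a rough illustration of what a probe-set RMS measures, the sketch below computes the root-mean-square error over a handful of held-out entries. It is plain Python and does not call VBPCApy; the function name `probe_rms` and the dict-of-entries layout are assumptions made for this example, not the library's internals.

```python
import math


def probe_rms(reconstruction, probe):
    """RMS error over held-out (probe) entries.

    reconstruction: 2-D list of model predictions.
    probe: dict mapping (row, col) -> held-out true value (hypothetical layout).
    """
    if not probe:
        raise ValueError("probe set is empty")
    squared = [(reconstruction[i][j] - v) ** 2 for (i, j), v in probe.items()]
    return math.sqrt(sum(squared) / len(squared))


recon = [[1.0, 2.0], [3.0, 4.0]]
held_out = {(0, 1): 2.5, (1, 0): 2.0}  # two entries hidden from the fit
rms = probe_rms(recon, held_out)
```

Only the probe entries contribute to the score, which is why this metric needs a probe set while `"cost"` does not.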
SelectionConfig fields
| Field | Default | Description |
|---|---|---|
| `metric` | `"prms"` | Selection metric |
| `stop_on_metric_reversal` | `True` | Stop sweeping when the metric worsens |
| `patience` | `None` | Consecutive worsening trials before stopping |
| `max_trials` | `None` | Cap on the number of \(k\) values tried |
| `compute_explained_variance` | `True` | Compute explained variance for the best model |
| `return_best_model` | `False` | Include the fitted VBPCA object in the return |
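To make the interaction between `patience` and `max_trials` concrete, here is a minimal sketch of a patience-based sweep over precomputed metric values. It does not call VBPCApy. The helper `sweep_with_patience` is hypothetical, and it assumes that a trial counts as "worsening" when it fails to beat the best value seen so far; the library may define worsening differently (for example, relative to the previous trial).

```python
def sweep_with_patience(metric_by_k, patience=2, max_trials=None):
    """Return (best_k, trace) for a lower-is-better metric.

    metric_by_k: list of (k, metric) pairs in the order they would be fitted.
    Assumption: a trial is "worsening" if it does not improve on the best so far.
    """
    best_k, best_val, bad_streak, trace = None, float("inf"), 0, []
    for trial, (k, val) in enumerate(metric_by_k, start=1):
        trace.append((k, val))
        if val < best_val:
            best_k, best_val, bad_streak = k, val, 0
        else:
            bad_streak += 1
        if bad_streak >= patience:
            break  # too many consecutive non-improving trials
        if max_trials is not None and trial >= max_trials:
            break  # hard cap on the number of candidates tried
    return best_k, trace


scores = [(1, 9.0), (2, 7.5), (3, 7.1), (4, 7.3), (5, 7.2), (6, 6.0)]
best_k, trace = sweep_with_patience(scores, patience=2)
```

In this toy trace the sweep stops after \(k = 5\) and returns \(k = 3\), even though \(k = 6\) would have scored lower. A larger `patience` costs extra fits but lowers the risk of stopping too early.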
Return value
`select_n_components` returns a 4-tuple:
- `best_k`: the selected number of components.
- `best_metrics`: endpoint metrics dict for the winning \(k\).
- `trace`: list of per-\(k\) metric dicts.
- `best_model`: the fitted `VBPCA` instance (if `return_best_model=True`, else `None`).
cross_validate_components: K-fold CV
Partitions the observed entries (not full rows) into folds. For each fold, the model is fitted on the remaining observed entries and evaluated on the held-out ones.
from vbpca_py import cross_validate_components, CVConfig

cfg = CVConfig(n_splits=5, metric="prms", one_se_rule=True)
best_k, results = cross_validate_components(
    x, mask=mask, components=range(1, 10), config=cfg, maxiters=200
)
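The entry-wise partitioning described above can be sketched without the library. The helper `entrywise_folds` below is hypothetical, and it assumes a mask in which truthy values mark observed entries; VBPCApy's actual fold assignment may differ in detail.

```python
import random


def entrywise_folds(mask, n_splits=5, seed=0):
    """Split the observed (row, col) positions of a 0/1 mask into n_splits folds."""
    observed = [(i, j) for i, row in enumerate(mask) for j, m in enumerate(row) if m]
    rng = random.Random(seed)
    rng.shuffle(observed)
    # Deal shuffled entries round-robin so fold sizes differ by at most one.
    return [observed[f::n_splits] for f in range(n_splits)]


mask = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 1]]  # 10 observed entries
folds = entrywise_folds(mask, n_splits=5, seed=0)
for held_out in folds:
    held = set(held_out)
    # Training entries are all observed positions outside the held-out fold.
    train = [e for fold in folds for e in fold if e not in held]
    # A model would be fitted on `train` and scored on `held_out` here.
```

Because single entries rather than whole rows are held out, every row can still contribute observed values to each training fit.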
CVConfig fields
| Field | Default | Description |
|---|---|---|
| `n_splits` | `5` | Number of CV folds |
| `metric` | `"prms"` | Metric to evaluate on held-out entries |
| `one_se_rule` | `True` | Select the simplest model within 1 SE of the best |
| `seed` | `0` | Random seed for fold assignment |
1-SE rule
When `one_se_rule=True`, the selected \(k\) is the smallest value whose mean
CV metric is within one standard error of the best mean. This favours
simpler models (fewer components) when the gain from additional components
is too small to distinguish from fold-to-fold noise.
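A minimal sketch of the one-standard-error rule on per-fold scores follows. It is plain Python, not VBPCApy code. The helper `one_se_select` is hypothetical, and it assumes a lower-is-better metric with the standard error taken as the sample standard deviation of the best candidate's fold scores divided by \(\sqrt{\text{n\_splits}}\).

```python
import math
import statistics


def one_se_select(fold_scores):
    """Pick the smallest k whose mean score is within one SE of the best mean.

    fold_scores: dict mapping k -> list of per-fold scores (lower is better).
    """
    mean = {k: statistics.fmean(v) for k, v in fold_scores.items()}
    se = {k: statistics.stdev(v) / math.sqrt(len(v)) for k, v in fold_scores.items()}
    k_best = min(mean, key=mean.get)          # candidate with the lowest mean
    threshold = mean[k_best] + se[k_best]     # one SE above the best mean
    return min(k for k in mean if mean[k] <= threshold)


fold_scores = {
    1: [1.00, 1.10, 0.90, 1.05, 0.95],
    2: [0.62, 0.70, 0.58, 0.66, 0.64],
    3: [0.60, 0.68, 0.56, 0.66, 0.60],
    4: [0.61, 0.69, 0.57, 0.65, 0.63],
}
selected = one_se_select(fold_scores)
```

In this made-up data the lowest mean is at \(k = 3\), but \(k = 2\) falls inside the one-SE band, so the rule returns the smaller model.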
Choosing between the two
| | `select_n_components` | `cross_validate_components` |
|---|---|---|
| Speed | Faster: one fit per candidate \(k\) | Slower: `n_splits` fits per candidate \(k\) |
| Reliability | Good with probe set | More robust variance estimate |
| Best for | Quick exploration, large data | Publication-quality model selection |
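The speed difference comes down to simple arithmetic on the number of fits. The sketch below counts them for the candidate range and fold count used in the cross-validation example, assuming no early stopping and no additional refit after selection (both are assumptions, not documented behaviour).

```python
components = range(1, 10)
n_splits = 5

sweep_fits = len(components)           # at most one fit per candidate k
cv_fits = len(components) * n_splits   # one fit per (candidate k, fold) pair
```

With these settings the sweep needs at most 9 fits while cross-validation needs 45, so the cost gap grows linearly with `n_splits`.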