Cross-Validation & Diagnostics¶
This tutorial demonstrates K-fold cross-validated model selection and post-fit diagnostic inspection.
Generate data¶
import numpy as np
rng = np.random.default_rng(42)
n_features, n_samples, true_rank = 40, 150, 4
noise_std = 0.5
W = rng.standard_normal((n_features, true_rank))
S = rng.standard_normal((true_rank, n_samples))
X = W @ S + noise_std * rng.standard_normal((n_features, n_samples))
# Hide ~15% of entries; mask is True where an entry is observed
mask = rng.random(X.shape) > 0.15
X_obs = np.where(mask, X, np.nan)
Cross-validate component count¶
cross_validate_components partitions the observed entries (not full rows)
into folds. Each fold holds out a subset of entries, fits on the rest, and
evaluates reconstruction on the held-out entries.
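To make the entry-wise split concrete, here is a plain-NumPy sketch of how one fold could be carved out (illustrative only, not the library's internal code):
# Entry-wise 5-fold split: shuffle the coordinates of observed entries,
# then each fold's probe set is one slice of that shuffle.
split_rng = np.random.default_rng(0)          # separate RNG so the sketch is self-contained
obs_rows, obs_cols = np.nonzero(mask)         # coordinates of observed entries
order = split_rng.permutation(obs_rows.size)
folds = np.array_split(order, 5)
probe = np.zeros_like(mask)                   # boolean, same shape as mask
probe[obs_rows[folds[0]], obs_cols[folds[0]]] = True
train_mask = mask & ~probe                    # fold 0: fit here, score on `probe`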
from vbpca_py import cross_validate_components, CVConfig
cfg = CVConfig(
    n_splits=5,
    metric="prms",      # probe RMS on held-out entries
    one_se_rule=True,   # prefer simpler model within 1 SE of best
    seed=42,
)
best_k, results = cross_validate_components(
    X_obs,
    mask=mask,
    components=range(1, 10),  # sweep k = 1..9
    config=cfg,
    maxiters=200,
)
print(f"Selected k = {best_k} (true rank = {true_rank})")
Inspect CV results¶
for entry in results:
    k = entry["k"]
    mean_metric = entry["mean"]
    se = entry["se"]
    marker = " <-- selected" if k == best_k else ""
    print(f" k={k:2d} prms={mean_metric:.4f} ± {se:.4f}{marker}")
1-SE rule¶
When one_se_rule=True, the selected \(k\) is the smallest value whose mean
metric is within one standard error of the overall best. This guards against
overfitting by favouring parsimony.
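The selection can be reproduced by hand from the results list shown above (a sketch assuming the per-k dicts expose "k", "mean", and "se" as in the inspection loop):
# 1-SE rule, reconstructed from the CV results
best = min(results, key=lambda e: e["mean"])   # entry with the lowest mean probe RMS
threshold = best["mean"] + best["se"]          # one standard error above the best
k_1se = min(e["k"] for e in results if e["mean"] <= threshold)
assert k_1se == best_k                         # matches the library's pick when one_se_rule=True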
Fit the final model¶
from vbpca_py import VBPCA, make_xprobe_mask
# Create a probe set for monitoring
X_train, X_probe = make_xprobe_mask(X_obs, fraction=0.10, rng=rng)
model = VBPCA(n_components=best_k, maxiters=300, verbose=1)
model.fit(X_train, mask=~np.isnan(X_train), xprobe=X_probe)
Convergence diagnostics¶
Learning curves¶
print(f"Final RMS: {model.rms_:.6f}")
print(f"Final probe RMS: {model.prms_:.6f}")
print(f"Final cost: {model.cost_:.2f}")
Explained variance¶
evr = model.explained_variance_ratio_
if evr is not None:
    total = 0.0
    for i, v in enumerate(evr):
        total += v
        print(f" PC{i+1}: {v:.3f} (cumulative: {total:.3f})")
Plotting¶
Install the plotting extra if you haven't already:
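The exact distribution name below is an assumption; adjust it if the package is published under a different name:
pip install "vbpca_py[plotting]"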
from vbpca_py.plotting import scree_plot, loadings_barplot, variance_explained_plot
# Scree plot — explained variance ratio per component
fig = scree_plot(model, cumulative=True)
# Loadings bar plot — feature importance for the first component
fig = loadings_barplot(model, component=0, top_n=15)
# Absolute variance per component
fig = variance_explained_plot(model)
Scree plot¶
The scree plot shows the explained variance ratio for each component. A sharp "elbow" suggests the number of meaningful components. The cumulative line shows the total variance explained.
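If you prefer a numeric hint over eyeballing the elbow, one rough heuristic (not part of the library) is to take the largest drop between consecutive ratios:
# Rough elbow heuristic using the ratios computed earlier (requires evr is not None)
drops = evr[:-1] - evr[1:]             # decrease from each component to the next
elbow_k = int(np.argmax(drops)) + 1    # components to keep before the sharpest drop
print(f"Elbow heuristic suggests k = {elbow_k}")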
Loadings bar plot¶
The loadings bar plot for a given component highlights which features
contribute most. Use top_n to show only the highest-magnitude features.
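The same ranking can be pulled out by hand. The loading-matrix attribute used below (model.W_) is an assumed name for illustration; check what your version actually exposes:
# Top-15 features by absolute loading for the first component
# (model.W_ is an ASSUMED name for the (n_features, k) loading matrix)
w0 = model.W_[:, 0]
top = np.argsort(np.abs(w0))[::-1][:15]
for i in top:
    print(f"feature {i:2d}: loading {w0[i]:+.3f}")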
Next steps¶
- Convergence Tuning — fine-tune stopping criteria for tricky datasets.
- Model Selection concepts — understand the sweep logic and 1-SE rule in depth.