Quick Start

Dense data

VBPCApy expects data in features × samples layout (each column is one observation).

import numpy as np
from vbpca_py import VBPCA

# 50 features, 200 samples
x = np.random.randn(50, 200)

model = VBPCA(n_components=5, maxiters=100)
scores = model.fit_transform(x)
recon = model.inverse_transform()

# Learned attributes
print("Components shape:", model.components_.shape)   # (50, 5)
print("Scores shape:", model.scores_.shape)            # (5, 200)
print("RMS:", model.rms_)
print("Final cost:", model.cost_)
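
Data that arrives in the samples × features layout common to pandas and scikit-learn needs a transpose before fitting. The following is a minimal NumPy-only sketch of that step; the array names are illustrative and not part of VBPCApy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input in samples x features layout: 200 samples, 50 features
x_samples_first = rng.standard_normal((200, 50))

# Transpose so that each column is one observation (features x samples)
x = np.ascontiguousarray(x_samples_first.T)

print(x.shape)  # (50, 200)
```

The resulting `x` has the same layout as the array used in the example above, so column `j` of `x` is sample `j` of the original array.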

With missing entries

Pass a boolean mask where True = observed, False = missing:

mask = np.ones_like(x, dtype=bool)
mask[x < -2] = False  # treat entries below -2 as missing

model = VBPCA(n_components=5, maxiters=100)
scores = model.fit_transform(x, mask=mask)

# Reconstruction and marginal variance
recon = model.reconstruction_
var = model.variance_
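
In practice, missing entries are often encoded as NaN rather than marked by a threshold. A mask following the True = observed convention can be derived from such an array with plain NumPy, as sketched below. Whether VBPCApy tolerates NaN values at masked positions is not stated here, so the sketch fills them with a placeholder as a precaution; that fill step is an assumption, not a documented requirement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw data (features x samples) with roughly 10% NaN entries
x_raw = rng.standard_normal((50, 200))
x_raw[rng.random(x_raw.shape) < 0.1] = np.nan

# True = observed, False = missing
mask = ~np.isnan(x_raw)

# Assumption: replace NaNs with a finite placeholder at the masked positions
x_filled = np.where(mask, x_raw, 0.0)

print("Observed fraction:", mask.mean())
```

`x_filled` and `mask` can then be passed in place of `x` and `mask` in the example above.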

Preprocessing pipeline

For mixed categorical + continuous data, use AutoEncoder to encode before fitting:

from vbpca_py import AutoEncoder

auto = AutoEncoder(cardinality_threshold=10, continuous_scaler="standard")
z = auto.fit_transform(x, mask=mask)

model = VBPCA(n_components=5, maxiters=100)
scores = model.fit_transform(z, mask=np.ones_like(z, dtype=bool))

# Round-trip back to original space
z_recon = model.inverse_transform()
x_recon = auto.inverse_transform(z_recon)

Sparse data

Sparse inputs must be CSR or CSC. The stored entries define the observation set (including stored zeros):

import scipy.sparse as sp
from vbpca_py import VBPCA

x_sparse = sp.csr_matrix([[1.0, 0.0], [0.0, 2.0]])

# The mask must have the same sparsity pattern as X, with every stored
# value set to 1 (i.e. spones(X)); omit it to infer the mask from X
mask = x_sparse.copy()
mask.data[:] = 1.0

model = VBPCA(n_components=2, maxiters=100)
scores = model.fit_transform(x_sparse, mask=mask)
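
The difference between a stored zero and an absent entry is easy to miss, because building a sparse matrix from a dense array drops all zeros. The SciPy-only sketch below builds a matrix from coordinate triplets so that one zero stays stored, then derives the matching pattern of ones; the variable names are illustrative:

```python
import numpy as np
import scipy.sparse as sp

# Three stored entries; the 0.0 at position (1, 0) is an explicit, stored zero
rows = np.array([0, 1, 1])
cols = np.array([0, 0, 1])
vals = np.array([1.0, 0.0, 2.0])
x_sparse = sp.csr_matrix((vals, (rows, cols)), shape=(2, 2))

# Same sparsity pattern with every stored value set to 1 (the spones pattern)
pattern = x_sparse.copy()
pattern.data[:] = 1.0

print(x_sparse.nnz)                     # 3
print(pattern.toarray().astype(bool))   # [[ True False]
                                        #  [ True  True]]
```

Here the entry at (1, 0) counts as an observed value of 0.0, while the entry at (0, 1) is missing. Calling `eliminate_zeros()` on `x_sparse` would remove the stored zero and turn that entry into a missing one.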

Dense vs sparse masks

  • Dense: pass a boolean mask (True = observed, False = missing) with the same shape as X.
  • Sparse: the observation set is the stored entries of X (including stored zeros). If you pass a mask it must match spones(X) exactly.

Key options

| Option | Description | Default |
| --- | --- | --- |
| n_components | Number of latent components | (required) |
| bias | Estimate per-feature mean | True |
| maxiters | Maximum EM iterations | 1000 |
| tol | Convergence tolerance | 1e-4 |
| verbose | Logging verbosity (0, 1, or 2) | 0 |
| xprobe_fraction | Fraction of entries to hold out as probe | 0.0 |
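
To make the idea of a probe set concrete, the sketch below splits the observed entries of a dense mask into a training part and a held-out probe part using NumPy only. It illustrates the general technique; how VBPCApy selects probe entries internally is not described here, and `probe_fraction`, `probe_mask`, and `train_mask` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((50, 200))
mask = np.ones_like(x, dtype=bool)
mask[x < -2] = False  # some entries are already missing

probe_fraction = 0.1  # illustrative value

# Pick a random subset of the observed entries as the probe set
observed_idx = np.flatnonzero(mask)
n_probe = int(round(probe_fraction * observed_idx.size))
probe_idx = rng.choice(observed_idx, size=n_probe, replace=False)

probe_mask = np.zeros_like(mask)
probe_mask.flat[probe_idx] = True

# Entries used for fitting: observed and not held out
train_mask = mask & ~probe_mask

print("Probe entries:", probe_mask.sum())
print("Training entries:", train_mask.sum())
```

The two masks are disjoint and together cover exactly the observed entries, so reconstruction error on the probe entries gives an estimate of how well the model generalizes to values it did not see.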

See VBPCA API reference for the complete list.