Sparse Data Workflows¶

This tutorial shows how to use VBPCApy with sparse matrices — the natural format for high-dimensional data with structural zeros such as genomics count matrices, term-document matrices, or one-hot-encoded survey data.

Construct sparse data¶

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(42)

n_features, n_samples, true_rank = 100, 500, 4

# Low-rank signal
W = rng.standard_normal((n_features, true_rank))
S = rng.standard_normal((true_rank, n_samples))
X_dense = W @ S + 0.3 * rng.standard_normal((n_features, n_samples))

# Zero out 80% of entries to simulate structural sparsity
sparsity_mask = rng.random((n_features, n_samples)) < 0.20
X_dense[~sparsity_mask] = 0.0

X_sparse = sp.csr_matrix(X_dense)

print(f"Shape: {X_sparse.shape}")
print(f"Stored entries (nnz): {X_sparse.nnz}")
print(f"Density: {X_sparse.nnz / np.prod(X_sparse.shape):.2%}")

Key concept: sparsity = observation mask¶

For sparse inputs, the stored entries (including stored zeros) define the observation set. Entries not stored in the CSR/CSC structure are treated as unobserved (missing).

This means you generally do not need to pass a separate mask:

from vbpca_py import VBPCA

model = VBPCA(n_components=4, maxiters=200, verbose=1)
scores = model.fit_transform(X_sparse)

If you do pass an explicit mask, it must match spones(X) exactly:

mask = X_sparse.copy()
mask.data[:] = 1.0

model = VBPCA(n_components=4, maxiters=200)
model.fit(X_sparse, mask=mask)

Sparse preprocessing¶

For categorical data stored in sparse format, use MissingAwareSparseOneHotEncoder:

from vbpca_py import MissingAwareSparseOneHotEncoder

# Single categorical column as a sparse vector
cat_col = sp.csr_matrix(rng.choice([0, 1, 2, 3], size=(100, 1)).astype(float))

enc = MissingAwareSparseOneHotEncoder()
encoded = enc.fit_transform(cat_col)

print(f"Encoded shape: {encoded.shape}")
print(f"Encoded type: {type(encoded)}")  # csr_matrix

Numeric categories only

MissingAwareSparseOneHotEncoder requires numeric category values. String categories cannot survive the CSR round-trip.

Compatibility modes¶

The compat_mode option controls edge-case handling for sparse data:

# Default: matches original MATLAB reference behaviour
model = VBPCA(n_components=4, compat_mode="strict_legacy")

# Updated semantics for empty-row/column detection
model = VBPCA(n_components=4, compat_mode="modern")

The difference affects how entirely-empty rows or columns are detected and removed before fitting. For most datasets the results are identical.

Dense vs sparse comparison¶

# Densify for comparison (only for small matrices!)
X_dense_full = X_sparse.toarray()
mask_dense = (X_dense_full != 0).astype(float)

model_dense = VBPCA(n_components=4, maxiters=200)
model_dense.fit(X_dense_full, mask=mask_dense)

model_sparse = VBPCA(n_components=4, maxiters=200)
model_sparse.fit(X_sparse)

print(f"Dense  RMS: {model_dense.rms_:.6f}")
print(f"Sparse RMS: {model_sparse.rms_:.6f}")

The results should be numerically close. The sparse path avoids materialising the full matrix and uses specialised C++ kernels.

inverse_transform returns dense

Even when the input is sparse, inverse_transform() always returns a dense np.ndarray. This is a known limitation.

Next steps¶

Cross-Validation & Diagnostics — select \(k\) with K-fold CV and inspect convergence.
Data Convention — when to choose sparse vs dense.