Preprocessing¶
AutoEncoder ¶
AutoEncoder(*, cardinality_threshold: int = 20, continuous_scaler: Literal['standard', 'minmax'] = 'standard', handle_unknown: Literal['ignore', 'raise'] = 'ignore', mean_center_ohe: bool = False, column_types: Sequence[Literal['categorical', 'continuous']] | None = None)
Column-wise router that applies missing-aware OHE or scaling.
MissingAwareOneHotEncoder ¶
MissingAwareOneHotEncoder(*, handle_unknown: Literal['ignore', 'raise'] = 'ignore', mean_center: bool = False, dtype: type = float)
One-hot encode categorical columns while respecting missing values.
MissingAwareSparseOneHotEncoder ¶
MissingAwareSparseOneHotEncoder(*, handle_unknown: Literal['ignore', 'raise'] = 'ignore', mean_center: bool = False, dtype: type = float)
Sparse one-hot encoder for categorical columns.
Assumptions:
- Input is sparse (CSR/CSC) with one column.
- Observed entries are stored; missing entries are absent.
- Categories must be numeric to round-trip through sparse matrices.
- mean_center adjusts stored values per category without
densifying.
MissingAwareStandardScaler ¶
Bases: _BaseScaler
Standardize continuous columns ignoring missing entries.
MissingAwareMinMaxScaler ¶
Bases: _BaseScaler
Scale features to [0, 1] range while ignoring missing entries.
MissingAwareLogTransformer ¶
Apply log1p (or log(x + offset)) while preserving NaN entries.
This is a pre-scaling transform: compress right-skewed features before standardisation so that the standard deviation better reflects the bulk of the data rather than extreme tails.
Args:
offset: Additive constant before taking the log. The default
1.0 gives the standard log1p transform.
MissingAwarePowerTransformer ¶
Yeo-Johnson variance-stabilising transform preserving NaN entries.
Supports mixed-sign data. The transform is applied element-wise
with a per-column lmbda parameter estimated by maximum
likelihood on observed values.
Args:
standardize: If True (default), z-score the transformed
output so each feature has zero mean and unit variance.
MissingAwareWinsorizer ¶
Clip features at fitted percentiles while preserving NaN entries.
Winsorization prevents outlier-driven variance inflation before scaling, so that z-scoring better equalises feature scales.
Args:
lower_quantile: Lower clipping quantile (default 0.01).
upper_quantile: Upper clipping quantile (default 0.99).
Note:
This transform is lossy — inverse_transform is a no-op
(returns data unchanged) because the original tail values
cannot be recovered.
check_data ¶
check_data(x: ndarray, *, column_names: Sequence[str] | None = None, skewness_threshold: float = 2.0, outlier_mad_threshold: float = 5.0, near_zero_var_eps: float = 1e-10, missing_fraction_warn: float = 0.5, warn: bool = False) -> DataReport
Run preflight diagnostics on a data matrix before VBPCA fitting.
Checks focus on scale comparability — conditions that cause individual features to dominate the decomposition — rather than distributional shape.
Args:
x: Data matrix of shape (n_samples, n_features).
column_names: Optional feature names for readable messages.
skewness_threshold: Absolute skewness above which a feature is
flagged (default 2.0).
outlier_mad_threshold: Number of MADs from the median beyond
which an entry is considered an outlier (default 5.0).
near_zero_var_eps: Variance threshold below which a feature is
flagged as near-zero-variance (default 1e-10).
missing_fraction_warn: Per-feature missing fraction above which
a warning is emitted (default 0.5).
warn: If True, also emit :func:warnings.warn for each
issue.
Returns:
A :class:DataReport with warnings, per-feature summary, and
suggested pre-transforms.
DataReport
dataclass
¶
DataReport(warnings: list[str] = list(), summary: dict[str, dict[str, float]] = dict(), suggested_pretransforms: dict[str | int, str] = dict(), passed: bool = True)
Result of :func:check_data preflight validation.
Attributes:
warnings: Human-readable diagnostic messages.
summary: Per-feature statistics dictionary.
suggested_pretransforms: Mapping of column index (or name) to a
suggested transform string (e.g. "log1p").
passed: True when no warnings were raised.