Preprocessing

AutoEncoder

AutoEncoder(*, cardinality_threshold: int = 20, continuous_scaler: Literal['standard', 'minmax'] = 'standard', handle_unknown: Literal['ignore', 'raise'] = 'ignore', mean_center_ohe: bool = False, column_types: Sequence[Literal['categorical', 'continuous']] | None = None)

Column-wise router that applies missing-aware OHE or scaling.
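The routing idea can be sketched in a few lines: columns whose number of distinct observed values falls at or below cardinality_threshold are treated as categorical, the rest as continuous. The helper route_columns below is hypothetical (not part of the library API) and ignores the column_types override described above; it only illustrates the cardinality-based decision, assuming NaN marks missing entries.

```python
import numpy as np

def route_columns(x, cardinality_threshold=20):
    """Illustrative sketch of cardinality-based column routing.

    Hypothetical helper, not the library's implementation: counts
    distinct observed (non-NaN) values per column and labels the
    column 'categorical' when that count is at or below the threshold.
    """
    kinds = []
    for j in range(x.shape[1]):
        col = x[:, j]
        observed = col[~np.isnan(col)]
        n_unique = np.unique(observed).size
        # Few distinct values -> one-hot encode; otherwise scale.
        kinds.append("categorical" if n_unique <= cardinality_threshold
                     else "continuous")
    return kinds

x = np.array([[0.0, 1.2], [1.0, 3.4], [0.0, np.nan], [1.0, 5.6]])
print(route_columns(x, cardinality_threshold=2))
# -> ['categorical', 'continuous']
```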

MissingAwareOneHotEncoder

MissingAwareOneHotEncoder(*, handle_unknown: Literal['ignore', 'raise'] = 'ignore', mean_center: bool = False, dtype: type = float)

One-hot encode categorical columns while respecting missing values.

MissingAwareSparseOneHotEncoder

MissingAwareSparseOneHotEncoder(*, handle_unknown: Literal['ignore', 'raise'] = 'ignore', mean_center: bool = False, dtype: type = float)

Sparse one-hot encoder for categorical columns.

Assumptions:

- Input is sparse (CSR/CSC) with one column.
- Observed entries are stored; missing entries are absent.
- Categories must be numeric to round-trip through sparse matrices.
- mean_center adjusts stored values per category without densifying.
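The "missing-aware" part of one-hot encoding can be illustrated with a dense NumPy sketch: observed values map to indicator columns, while NaN entries yield NaN across every indicator column so downstream steps still see them as missing. The function missing_aware_ohe is hypothetical and intentionally simplified (the library's encoders may represent missingness differently, e.g. as absent sparse entries as described above).

```python
import numpy as np

def missing_aware_ohe(col, categories=None):
    """Sketch of missing-aware one-hot encoding (illustrative only).

    Observed values become one-hot rows over the fitted categories;
    NaN inputs produce an all-NaN row so missingness is preserved.
    """
    observed = ~np.isnan(col)
    if categories is None:
        categories = np.unique(col[observed])  # fit step: learn categories
    out = np.full((col.size, categories.size), np.nan)
    for i, v in enumerate(col):
        if observed[i]:
            out[i] = (categories == v).astype(float)
    return out, categories

enc, cats = missing_aware_ohe(np.array([0.0, 1.0, np.nan, 1.0]))
```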

MissingAwareStandardScaler

MissingAwareStandardScaler()

Bases: _BaseScaler

Standardize continuous columns ignoring missing entries.

MissingAwareMinMaxScaler

MissingAwareMinMaxScaler()

Bases: _BaseScaler

Scale features to [0, 1] range while ignoring missing entries.
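Both scalers share the same pattern: fit statistics on observed entries only, and let NaNs pass through the transform untouched. A minimal NumPy sketch of that behaviour (not the library's actual implementation, which also tracks fitted state for inverse_transform):

```python
import numpy as np

def standard_scale_ignoring_nan(x):
    """Sketch: per-column z-scoring using only observed entries.

    NaNs are excluded from the fitted mean/std and stay NaN in the
    output, mirroring the missing-aware behaviour described above.
    """
    mean = np.nanmean(x, axis=0)
    std = np.nanstd(x, axis=0)
    std = np.where(std == 0, 1.0, std)  # guard constant columns
    return (x - mean) / std

def minmax_scale_ignoring_nan(x):
    """Sketch: rescale observed entries of each column to [0, 1]."""
    lo = np.nanmin(x, axis=0)
    hi = np.nanmax(x, axis=0)
    rng = np.where(hi == lo, 1.0, hi - lo)  # guard constant columns
    return (x - lo) / rng
```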

MissingAwareLogTransformer

MissingAwareLogTransformer(*, offset: float = 1.0)

Apply log1p (or log(x + offset)) while preserving NaN entries.

This is a pre-scaling transform: compress right-skewed features before standardisation so that the standard deviation better reflects the bulk of the data rather than extreme tails.

Args:

- offset: Additive constant applied before taking the log. The default 1.0 gives the standard log1p transform.

fit

fit(x: ndarray, mask: Mask | None = None) -> MissingAwareLogTransformer

Record input width (stateless transform).

transform

transform(x: ndarray, mask: Mask | None = None) -> np.ndarray

Apply log(x + offset) to observed entries.

inverse_transform

inverse_transform(z: ndarray, mask: Mask | None = None) -> np.ndarray

Reverse via exp(z) - offset.
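Because np.log propagates NaN, the transform needs no explicit masking to preserve missing entries; a sketch of the forward/inverse pair (illustrative free functions, assuming the described semantics rather than the library's exact code):

```python
import numpy as np

def log_transform(x, offset=1.0):
    # log(x + offset); NaN entries propagate through unchanged.
    return np.log(x + offset)

def inverse_log_transform(z, offset=1.0):
    # Exact inverse of the above: exp(z) - offset.
    return np.exp(z) - offset
```

With the default offset of 1.0 this is exactly log1p/expm1, compressing right-skewed features before standardisation as described above.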

MissingAwarePowerTransformer

MissingAwarePowerTransformer(*, standardize: bool = True)

Yeo-Johnson variance-stabilising transform preserving NaN entries.

Supports mixed-sign data. The transform is applied element-wise with a per-column lambda parameter estimated by maximum likelihood on observed values.

Args:

- standardize: If True (default), z-score the transformed output so each feature has zero mean and unit variance.

fit

fit(x: ndarray, mask: Mask | None = None) -> MissingAwarePowerTransformer

Estimate per-column Yeo-Johnson lambda by profile MLE.

transform

transform(x: ndarray, mask: Mask | None = None) -> np.ndarray

Apply Yeo-Johnson transform and optional standardisation.

inverse_transform

inverse_transform(z: ndarray, mask: Mask | None = None) -> np.ndarray

Reverse transform: un-standardise then invert Yeo-Johnson.
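The fit/transform pattern can be sketched per column with SciPy's scipy.stats.yeojohnson: estimate lambda by MLE on observed values only, apply it element-wise, and optionally z-score. This is a hypothetical single-column illustration of the described behaviour, not the library's implementation:

```python
import numpy as np
from scipy import stats

def yeojohnson_fit_transform(col, standardize=True):
    """Sketch: missing-aware Yeo-Johnson on one column (illustrative).

    Lambda is estimated by maximum likelihood on observed (non-NaN)
    entries; NaNs are preserved in the output.
    """
    observed = ~np.isnan(col)
    # With lmbda=None, scipy returns (transformed, mle_lambda).
    _, lam = stats.yeojohnson(col[observed])
    out = np.full_like(col, np.nan, dtype=float)
    out[observed] = stats.yeojohnson(col[observed], lmbda=lam)
    if standardize:
        obs = out[observed]
        out[observed] = (obs - obs.mean()) / obs.std()
    return out, lam
```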

MissingAwareWinsorizer

MissingAwareWinsorizer(*, lower_quantile: float = 0.01, upper_quantile: float = 0.99)

Clip features at fitted percentiles while preserving NaN entries.

Winsorization prevents outlier-driven variance inflation before scaling, so that z-scoring better equalises feature scales.

Args:

- lower_quantile: Lower clipping quantile (default 0.01).
- upper_quantile: Upper clipping quantile (default 0.99).

Note: This transform is lossy — inverse_transform is a no-op (returns data unchanged) because the original tail values cannot be recovered.

fit

fit(x: ndarray, mask: Mask | None = None) -> MissingAwareWinsorizer

Compute per-column clipping bounds from observed values.

transform

transform(x: ndarray, mask: Mask | None = None) -> np.ndarray

Clip observed entries to fitted bounds.

inverse_transform

inverse_transform(z: ndarray, mask: Mask | None = None) -> np.ndarray

No-op: winsorization is lossy.

Returns a copy of the input unchanged.
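The fit/transform steps reduce to nan-aware quantiles plus a clip; np.clip propagates NaN, so missing entries survive untouched. A compact sketch under those assumptions (illustrative, not the library's code):

```python
import numpy as np

def winsorize(x, lower_quantile=0.01, upper_quantile=0.99):
    """Sketch: clip each column at quantiles of its observed values.

    Bounds are fitted ignoring NaNs (np.nanquantile); NaN entries
    pass through np.clip unchanged. Lossy: clipped tail values
    cannot be recovered, matching the no-op inverse described above.
    """
    lo = np.nanquantile(x, lower_quantile, axis=0)  # fit step
    hi = np.nanquantile(x, upper_quantile, axis=0)
    return np.clip(x, lo, hi)                       # transform step
```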

check_data

check_data(x: ndarray, *, column_names: Sequence[str] | None = None, skewness_threshold: float = 2.0, outlier_mad_threshold: float = 5.0, near_zero_var_eps: float = 1e-10, missing_fraction_warn: float = 0.5, warn: bool = False) -> DataReport

Run preflight diagnostics on a data matrix before VBPCA fitting.

Checks focus on scale comparability — conditions that cause individual features to dominate the decomposition — rather than distributional shape.

Args:

- x: Data matrix of shape (n_samples, n_features).
- column_names: Optional feature names for readable messages.
- skewness_threshold: Absolute skewness above which a feature is flagged (default 2.0).
- outlier_mad_threshold: Number of MADs from the median beyond which an entry is considered an outlier (default 5.0).
- near_zero_var_eps: Variance threshold below which a feature is flagged as near-zero-variance (default 1e-10).
- missing_fraction_warn: Per-feature missing fraction above which a warning is emitted (default 0.5).
- warn: If True, also emit warnings.warn for each issue.

Returns:

A DataReport with warnings, a per-feature summary, and suggested pre-transforms.
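The MAD-based outlier check can be sketched for a single column: flag entries more than outlier_mad_threshold scaled MADs from the median, ignoring NaNs. The helper mad_outlier_fraction is hypothetical and only illustrates the diagnostic described above:

```python
import numpy as np

def mad_outlier_fraction(col, threshold=5.0):
    """Sketch of a MAD-based outlier check (illustrative only).

    Returns the fraction of observed entries lying more than
    `threshold` scaled MADs from the median.
    """
    observed = col[~np.isnan(col)]
    med = np.median(observed)
    # 1.4826 makes the MAD consistent with the std under normality.
    mad = np.median(np.abs(observed - med)) * 1.4826
    if mad == 0:
        return 0.0  # degenerate column: no spread to measure against
    return float(np.mean(np.abs(observed - med) / mad > threshold))
```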

DataReport dataclass

DataReport(warnings: list[str] = list(), summary: dict[str, dict[str, float]] = dict(), suggested_pretransforms: dict[str | int, str] = dict(), passed: bool = True)

Result of check_data preflight validation.

Attributes:

- warnings: Human-readable diagnostic messages.
- summary: Per-feature statistics dictionary.
- suggested_pretransforms: Mapping of column index (or name) to a suggested transform string (e.g. "log1p").
- passed: True when no warnings were raised.