# Data Preprocessing
Make MLOps Easy ships with a composable preprocessing system that turns raw tabular datasets into model-ready feature matrices. The `DataPreprocessor` class (`easy_mlops/preprocessing/preprocessor.py`) is instantiated by the pipeline and can also be reused directly in your code to prepare training or inference data.
## High-Level Workflow
- Configuration ingestion – The pipeline reads the `preprocessing` section from your YAML (or a Python dict) and instantiates `DataPreprocessor` with it. When no explicit step list is supplied, legacy toggles (`handle_missing`, `encode_categorical`, `scale_features`) are honoured for backwards compatibility.
- Data loading – `DataPreprocessor.load_data` accepts CSV, JSON, and Parquet files. In bespoke scripts you can bypass file IO and call `prepare_data` with an in-memory `DataFrame`.
- Target management – `prepare_data(df, target_column, fit=True)` splits off the target series (if provided), keeps feature/target indices aligned, and records the target column name for later reuse.
- Step execution – Configured preprocessing steps are executed sequentially. Each step implements the `PreprocessingStep` contract, so you can mix built-in and custom components. During the initial fit each step learns state (e.g. encoders, scalers) before transforming the dataset.
- State retention for inference – On the first run `DataPreprocessor` snapshots feature columns, dtypes, encoders, and scalers. Subsequent calls with `fit=False` (used during deployment and prediction) reuse the fitted state, automatically realigning columns and filling any missing values so that feature matrices remain compatible with the trained model (see the sketch after this list).
- Artifact sharing – The fitted preprocessor is saved alongside the model artifacts. Deployment and CLI commands load it back, call `prepare_data(..., fit=False)`, and guarantee that prediction-time preprocessing mirrors the training-time pipeline.
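In code, the fit-then-reuse round trip looks roughly like this. A minimal sketch: the file path and column names are illustrative, and it assumes `prepare_data` returns a `(features, target)` pair when a target column is supplied and the feature matrix alone otherwise.

```python
import pandas as pd

from easy_mlops.preprocessing import DataPreprocessor

# Training time: load data, fit the configured steps, split off the target.
preprocessor = DataPreprocessor(
    {"steps": ["missing_values", "categorical_encoder", "feature_scaler"]}
)
train_df = preprocessor.load_data("train.csv")  # CSV, JSON, or Parquet
X_train, y_train = preprocessor.prepare_data(train_df, target_column="churn", fit=True)

# Inference time: fit=False reuses the fitted state and realigns columns.
new_df = pd.DataFrame([{"age": 42, "city": "Berlin", "gender": "f", "segment": "b2c"}])
X_new = preprocessor.prepare_data(new_df, fit=False)
```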
## Step Catalogue and Configuration
### Configuration entry point
All options live under the top-level `preprocessing` key in your YAML configuration. By default the framework applies the following settings:
```yaml
preprocessing:
  handle_missing: drop      # legacy toggles; superseded by an explicit steps list
  encode_categorical: true
  scale_features: true
```
To take full control of the pipeline, declare an explicit `steps` list. Steps run in the order listed. Each item can be a plain string (which uses default parameters) or a mapping with `type` and `params` keys for explicit configuration:
```yaml
preprocessing:
  steps:
    - type: missing_values
      params:
        strategy: median
    - type: categorical_encoder
      params:
        handle_unknown: ignore
    - feature_scaler
```
Omit a step to skip it altogether. If a `steps` list is present, the legacy flags (`handle_missing`, `encode_categorical`, `scale_features`) are ignored.
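Since the preprocessor also accepts a plain Python dict (as noted above), the same configuration can be built in code. A sketch mirroring the YAML:

```python
from easy_mlops.preprocessing import DataPreprocessor

preprocessor = DataPreprocessor(
    {
        "steps": [
            {"type": "missing_values", "params": {"strategy": "median"}},
            {"type": "categorical_encoder", "params": {"handle_unknown": "ignore"}},
            "feature_scaler",  # plain string: default parameters
        ]
    }
)
```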
### Built-in steps
#### Missing Value Handler (`missing_values`)
- Purpose: learns how to handle blanks and applies the same rule during inference.
- Strategies (`strategy`):
  - `drop` (default) – removes rows containing any `NaN`.
  - `mean`, `median` – compute column-wise statistics on numeric columns and fill with the learned values.
  - `mode` – fills each column with its most frequent non-null value.
  - `constant` – uses the `fill_value` (scalar or column mapping) provided in the config.
- YAML example:
```yaml
preprocessing:
  steps:
    - type: missing_values
      params:
        strategy: constant
        fill_value:
          age: 0
          city: "unknown"
```
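To make the fit/transform split concrete, here is a plain-pandas sketch of what the `median` strategy learns and reuses (illustrative only, not the library's internal implementation):

```python
import pandas as pd

train = pd.DataFrame({"age": [20.0, 30.0, None, 40.0]})
medians = train.median(numeric_only=True)  # learned during fit: age -> 30.0

new = pd.DataFrame({"age": [None, 25.0]})
filled = new.fillna(medians)  # inference fills with the *training* medians
```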
#### Categorical Encoder (`categorical_encoder`)
- Purpose: label-encodes categorical features with `sklearn.preprocessing.LabelEncoder`.
- Parameters:
  - `columns` – optional list restricting encoding to specific column names. When omitted, all object/category columns are encoded.
  - `handle_unknown` – controls unseen categories at inference time:
    - `use_first` (default) – swaps unknown values with the first known class before transforming.
    - `ignore` – keeps unknown values as `NaN` after transformation so that downstream logic can handle them explicitly.
- YAML example:
```yaml
preprocessing:
  steps:
    - type: categorical_encoder
      params:
        columns: ["gender", "segment"]
        handle_unknown: ignore
```
During training the encoder stores one `LabelEncoder` per fitted column. They are exposed through `DataPreprocessor.encoders` for advanced inspection or custom persistence.
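For example, a quick inspection of the fitted encoders. A sketch with hypothetical sample data, assuming `encoders` maps column names to fitted `LabelEncoder` instances as described above:

```python
import pandas as pd

from easy_mlops.preprocessing import DataPreprocessor

df = pd.DataFrame({"gender": ["f", "m", "f"], "label": [0, 1, 0]})
preprocessor = DataPreprocessor({"steps": ["categorical_encoder"]})
preprocessor.prepare_data(df, target_column="label", fit=True)

# Each fitted column exposes its learned classes.
for column, encoder in preprocessor.encoders.items():
    print(column, "->", list(encoder.classes_))  # e.g. gender -> ['f', 'm']
```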
#### Feature Scaler (`feature_scaler`)
- Purpose: standardises numeric columns using `sklearn.preprocessing.StandardScaler`.
- Behaviour:
  - Automatically discovers numeric columns during `fit`.
  - Reuses the fitted scaler during inference and leaves non-numeric columns untouched.
- Configuration:
  - In YAML you typically toggle the step on or off by adding it to or omitting it from the `steps` list (or set `scale_features: false` when relying on legacy flags).
  - For advanced use cases you can pass a preconfigured scaler when constructing the preprocessor from Python:
```python
from sklearn.preprocessing import MinMaxScaler

from easy_mlops.preprocessing import DataPreprocessor

preprocessor = DataPreprocessor(
    {"steps": [{"type": "feature_scaler", "params": {"scaler": MinMaxScaler()}}]}
)
```
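Because the `scaler` parameter here takes a live Python object, this option is only practical when the configuration is built as a dict in code; YAML-driven pipelines use the default `StandardScaler`.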
### Custom steps and extensions
The registry pattern lets you add new preprocessing capabilities without modifying the core. Create a subclass that implements `fit` and `transform`, set a unique `name`, and register it:
```python
from easy_mlops.preprocessing import DataPreprocessor, PreprocessingStep


class TextCleaner(PreprocessingStep):
    name = "text_cleaner"

    def fit(self, df):
        return self  # stateless: nothing to learn from the data

    def transform(self, df):
        df = df.copy()
        df["description"] = (
            df["description"].str.lower().str.replace(r"[^a-z0-9 ]", "", regex=True)
        )
        return df


DataPreprocessor.register_step(TextCleaner)
```
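Before wiring the step into a pipeline, you can sanity-check it on a toy frame (sample data is hypothetical):

```python
import pandas as pd

cleaner = TextCleaner()
sample = pd.DataFrame({"description": ["Hello, World!", "42 Things."]})
print(cleaner.fit(sample).transform(sample)["description"].tolist())
# ['hello world', '42 things']
```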
Once registered you can reference the new step in your YAML:
```yaml
preprocessing:
  steps:
    - text_cleaner
    - categorical_encoder
```
### Putting it all together
Below is a realistic pipeline configuration that combines multiple built-in options:
```yaml
preprocessing:
  steps:
    - type: missing_values
      params:
        strategy: mode
    - type: categorical_encoder
      params:
        handle_unknown: use_first
    - feature_scaler

training:
  backend: sklearn
  model_type: random_forest_classifier
```
This configuration first imputes missing values with per-column modes, encodes categorical columns, and standardises numeric features before handing the dataset to the training backend. During deployment the saved preprocessor reproduces the exact transformations so your model continues to receive the feature layout it was trained on.