# Pipeline Architecture
`easy_mlops.pipeline.MLOpsPipeline` is the backbone of Make MLOps Easy. It strings together configuration loading, data preprocessing, model training, deployment, and observability into a deterministic workflow. The pipeline can be executed directly from Python or indirectly through the distributed runtime (CLI → master → worker → pipeline).
```
config.yaml ─┐
             ▼
       ┌──────────────────┐     ┌──────────────┐
data ─▶│ DataPreprocessor │ ──▶ │ ModelTrainer │
       └──────────────────┘     └──────┬───────┘
                                       │
                                       ▼
                               ┌────────────────┐
                               │ ModelDeployer  │
                               └───────┬────────┘
                                       │
                                       ▼
                               ┌────────────────┐
                               │ ModelMonitor   │
                               └────────────────┘
```
Each component is configurable, testable in isolation, and extensible through registries.
## Execution flow
- Configuration – `Config` merges defaults with an optional YAML file. The resulting dictionary is passed to every subsystem so defaults stay centralised.
- Preprocessing – `DataPreprocessor` loads the dataset, applies configured `PreprocessingStep` instances, records feature metadata, and returns `(X, y)`. During inference it reuses the fitted state and reorders columns to match training time.
- Training – `ModelTrainer` delegates to a registered `BaseTrainingBackend`. The default scikit-learn backend detects the problem type, builds an estimator, runs train/test splits, computes metrics, and (optionally) cross-validates. The backend yields a `TrainingRunResult` containing the fitted model and metadata.
- Deployment – `ModelDeployer` assembles `DeploymentStep` instances (create directory, save model, persist preprocessor, write metadata, optional endpoint script). Steps share a mutable `DeploymentContext`, making the pipeline easy to extend.
- Observability – `ModelMonitor` triggers `ObservabilityStep` hooks to log metrics and predictions, manage thresholds, and generate summaries. Logs are written to JSON files under the deployment directory.
The pipeline returns a dictionary with preprocessing, training, deployment (if enabled), and logs metadata so callers can chain additional automation.
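For in-process use, the flow above boils down to a few calls. The sketch below is illustrative only: the constructor keyword, the `run` arguments (data path and target column), and the result keys are assumptions rather than the authoritative API, so check `tests/test_pipeline.py` and the CLI for the exact signatures.

```python
# Minimal in-process sketch. The constructor and run() arguments shown here
# (config path, data path, target column) are assumptions, not the
# authoritative API -- see tests/test_pipeline.py for real usage.
from easy_mlops.pipeline import MLOpsPipeline

pipeline = MLOpsPipeline(config_path="config.yaml")  # assumed keyword; defaults apply if omitted
result = pipeline.run("data/customers.csv", target_column="churn")  # hypothetical dataset and column

# The returned dictionary carries preprocessing, training, deployment (if
# enabled), and log metadata; the exact key names below are assumed.
print(result["training"])
print(result.get("deployment"))
```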
## Distributed runtime integration
Workers inside `easy_mlops.distributed.worker` call `MLOpsPipeline` through `TaskRunner`:

- `train` tasks instantiate the pipeline, call `run`, and stream the printed progress back to the master.
- `predict`, `status`, and `observe` tasks hydrate the same pipeline to load artifacts from a deployment directory.
This design keeps the CLI, distributed runtime, and in-process usage aligned—one implementation powers every entry point.
## Configuration surface
Every stage is configured via the YAML file consumed by `Config`. The top-level keys map directly to pipeline components:

- `preprocessing` – toggles legacy options (`handle_missing`, `encode_categorical`, `scale_features`) or declares an explicit `steps` list. Steps refer to names registered in `DataPreprocessor.STEP_REGISTRY`.
- `training` – defines the backend (`sklearn`, `neural_network`, `callable`, `deep_learning`, `nlp`) and its parameters (`model_type`, `test_size`, `cv_folds`, `random_state`, etc.). Custom backends read arbitrary keys from this section.
- `deployment` – controls artifact destinations, filenames, and optional endpoint generation. Advanced users can supply a custom `steps` list to inject additional deployment logic (for example, uploading to S3).
- `observability` – toggles metric/prediction logging and configures thresholds. Like the other components it supports an explicit `steps` list to fine-tune monitoring.
`make-mlops-easy init` builds a skeleton file that mirrors the defaults in `easy_mlops/config/config.py`.
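For orientation, a trimmed-down config might look like the sketch below. Only the keys named above are documented; everything marked "assumed" is illustrative, so prefer the skeleton generated by `make-mlops-easy init` as your starting point.

```yaml
# Illustrative sketch only: keys marked "assumed" are not confirmed by the
# documentation; prefer the skeleton from `make-mlops-easy init`.
preprocessing:
  handle_missing: true
  encode_categorical: true
  scale_features: true

training:
  backend: sklearn            # assumed key name; documented backends include sklearn, neural_network, callable
  model_type: random_forest   # assumed value
  test_size: 0.2
  cv_folds: 5
  random_state: 42

deployment:
  # artifact destination, filenames, and the optional endpoint toggle live here;
  # see the generated skeleton for the exact key names.

observability:
  # metric/prediction logging toggles and thresholds live here.
```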
## Component deep dive
- DataPreprocessor (`easy_mlops/preprocessing/preprocessor.py`) wraps step classes defined in `easy_mlops/preprocessing/steps.py`. Steps follow a minimal contract (`fit`, `transform`, optional `save`/`load` hooks). Custom steps can be registered globally via `DataPreprocessor.register_step`.
- ModelTrainer (`easy_mlops/training/trainer.py`) wraps the backend registry in `easy_mlops/training/backends.py`. The built-in scikit-learn backend covers random forests, logistic/linear regression, XGBoost, and MLPs. Callable-style backends let you plug in frameworks such as PyTorch or TensorFlow.
- ModelDeployer (`easy_mlops/deployment/deployer.py`) executes a list of `DeploymentStep` instances. The default stack creates a timestamped directory (`deployment_YYYYMMDD_HHMMSS`), serialises artifacts with `joblib`, writes `metadata.json`, and optionally generates an executable prediction helper.
- ModelMonitor (`easy_mlops/observability/monitor.py`) manages `ObservabilityStep` implementations such as `MetricsLoggerStep`, `PredictionsLoggerStep`, and `MetricThresholdStep`. Steps persist their own state and can expose summaries for CLI rendering.
Each subsystem exposes registration helpers so you can inject new behaviour without forking the project.
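As an example of those registration helpers, here is a rough sketch of a custom preprocessing step. The `fit`/`transform` contract and `DataPreprocessor.register_step` come from the description above; the base-class import path, the method signatures, and the registration arguments are assumptions, so mirror an existing step in `easy_mlops/preprocessing/steps.py` when writing a real one.

```python
# Sketch of a custom step (signatures are assumptions; copy a built-in step
# from easy_mlops/preprocessing/steps.py for the authoritative contract).
import pandas as pd

from easy_mlops.preprocessing.preprocessor import DataPreprocessor
from easy_mlops.preprocessing.steps import PreprocessingStep  # assumed import path


class ClipOutliersStep(PreprocessingStep):
    """Clip numeric columns to the 1st-99th percentile range seen at fit time."""

    def fit(self, df: pd.DataFrame) -> "ClipOutliersStep":
        numeric = df.select_dtypes(include="number")
        self.bounds_ = {
            col: (numeric[col].quantile(0.01), numeric[col].quantile(0.99))
            for col in numeric.columns
        }
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        for col, (low, high) in self.bounds_.items():
            if col in df.columns:
                df[col] = df[col].clip(low, high)
        return df


# Register globally so configs can refer to the step by name (argument order assumed).
DataPreprocessor.register_step("clip_outliers", ClipOutliersStep)
```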
## Testing the pipeline
`tests/test_pipeline.py` exercises the end-to-end flow with sample data. Additional unit tests cover the preprocessor, trainer backends, deployment steps, observability pipeline, and distributed state store. Use these tests as a blueprint when contributing new components.
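A smoke test for a new component can follow the same shape. The sketch below reuses the hypothetical `run` signature from earlier and an assumed `training` result key; the assertions in `tests/test_pipeline.py` are the real reference.

```python
# Rough pytest-style sketch (run() arguments and result keys are assumptions;
# see tests/test_pipeline.py for the real end-to-end test).
import numpy as np
import pandas as pd

from easy_mlops.pipeline import MLOpsPipeline


def test_pipeline_end_to_end(tmp_path):
    # Build a tiny synthetic classification dataset and write it to CSV.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "feature_a": rng.normal(size=200),
        "feature_b": rng.normal(size=200),
    })
    df["label"] = (df["feature_a"] + df["feature_b"] > 0).astype(int)
    data_path = tmp_path / "train.csv"
    df.to_csv(data_path, index=False)

    pipeline = MLOpsPipeline()  # default config; constructor arguments are assumed
    result = pipeline.run(str(data_path), target_column="label")

    # The pipeline should report training metadata for the fitted model.
    assert "training" in result
```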
## Extension ideas
- Add a LightGBM training backend that consumes hyperparameters from `training:`.
- Register a preprocessing step that enriches features from an external service.
- Extend deployment with a step that pushes artifacts to an object store and records the URI in `DeploymentContext.artifacts` (see the sketch below).
- Register an observability step that forwards metrics to Prometheus or Slack.
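For the object-store idea above, a custom deployment step might look roughly like this. `DeploymentStep`, the shared `DeploymentContext`, and its `artifacts` mapping come from the sections above; the import paths, the hook method name, and the context attributes used here are assumptions, so mirror a built-in step in `easy_mlops/deployment/deployer.py`.

```python
# Sketch only: method names and context attributes are assumptions; the
# built-in steps in easy_mlops/deployment/deployer.py define the real contract.
import shutil
from pathlib import Path

from easy_mlops.deployment.deployer import DeploymentStep, DeploymentContext  # assumed imports


class CopyToSharedStoreStep(DeploymentStep):
    """Copy the deployment directory to a shared location and record the URI."""

    def __init__(self, store_root: str = "/mnt/model-store"):
        self.store_root = Path(store_root)

    def run(self, context: DeploymentContext) -> None:  # assumed hook name
        source = Path(context.deployment_dir)           # assumed attribute on the context
        target = self.store_root / source.name
        shutil.copytree(source, target, dirs_exist_ok=True)

        # Record the artifact location so later steps and metadata can see it.
        context.artifacts["shared_store_uri"] = str(target)
```

Wiring a step like this into the `deployment.steps` list (or whatever registration helper `ModelDeployer` exposes) keeps the upload inside the normal pipeline run instead of a separate post-processing script.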
See development guidelines for tips on structure, testing, and documentation updates when adding new pieces.