This section covers the structured abacus.pipeline runner: how it loads a
config and dataset, executes the retained stage sequence, and writes
reproducible run artefacts to disk.
Pages
Runner Overview - How run_pipeline(...) works, which
stages run, and when the optimisation stage is skipped.
YAML Configuration - Which YAML keys the runner
consumes and how they map to model build, data loading, holidays, and
optimisation.
CLI Reference - The thin python -m abacus.pipeline.runner
interface and its supported flags.
Output Directory Schema - The run directory
layout, manifest schema, stage statuses, and main artefacts.
Extending the Runner - How to add a stage or wire
in reporting without bypassing the manifest and artifact helpers.
Subsections of Pipeline Runner
Runner Overview
Use the pipeline runner when you want a full disk-backed PanelMMM run instead
of only an in-memory fit.
The runner loads a YAML config and a CSV dataset, builds the model, executes a
fixed stage sequence, and writes each stage’s artefacts into a structured run
directory. When validation is enabled, the runner performs a second train-window
fit for the blocked holdout stage, so the run takes longer than a pure
full-sample fit.
The path to run_manifest.json inside that directory
What the runner does
run_pipeline(...) performs these steps:
1. Load the YAML config with load_yaml_config(...).
2. Load X and y from CSV using load_pipeline_data(...).
3. Merge CLI sampler overrides with YAML fit through build_model_kwargs(...).
4. Create the output directory tree and initialise run_manifest.json.
5. Run the retained stages in order, updating the manifest after every stage.
The model is built in Stage 00 by build_mmm_from_yaml(...), then stored in the
shared PipelineContext for the remaining stages. Runner-only roots such as
diagnostics and validation stay on the pipeline context and are stripped
before the public MMM builder validates the model YAML.
Stage order
The runner uses a fixed stage list.
| Stage key | Directory | Purpose | Optional |
|---|---|---|---|
| metadata | 00_run_metadata | Build the model and write resolved config and dataset metadata | No |
| preflight | 10_pre_diagnostics | Prior predictive draws and plot | No |
| fit | 20_model_fit | Fit the model, save InferenceData, write trace and summary | |
| | 50_diagnostics | Raw input screening, MCMC, predictive, and residual diagnostics | No |
| curves | 60_response_curves | Saturation-only, forward-pass direct contribution, and adstock curve artefacts | No |
| optimisation | 70_optimisation | Budget optimisation artefacts | Yes |
The validation stage is marked skipped when the YAML config does not contain a
validation block or when that block is disabled. The optimisation stage is also
optional; it returns None and is marked skipped when the YAML config does not
contain an optimization block.
PipelineRunConfig controls runtime settings that sit outside the YAML model
specification.
| Field | Purpose |
|---|---|
| config_path | YAML file to load |
| output_dir | Root directory under which the run directory is created |
| run_name | Optional run-name override; otherwise the config filename stem |
| dataset_path | Optional combined dataset CSV override |
| x_path, y_path | Optional feature and target CSV overrides |
| holidays_path | Optional holiday CSV override |
| target_column | Target column name used during CSV loading |
| prior_samples | Number of prior predictive samples for Stage 10 |
| draws, tune, chains, cores, random_seed | Sampler overrides merged onto YAML fit |
| curve_samples, curve_points | Curve sampling settings for Stage 60 |
Only sampler settings are merged into model construction. Other overrides are
used by the runner itself during data loading, holiday resolution, diagnostics
reporting, and output setup.
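The merge of sampler overrides onto the YAML fit block can be pictured with a minimal sketch. This is not the real build_model_kwargs(...) implementation: the helper name merge_sampler_overrides and the rule that None means "no override supplied" are assumptions made for illustration only.

```python
from typing import Any, Mapping, Optional

# Sampler fields listed for PipelineRunConfig in this section.
SAMPLER_KEYS = ("draws", "tune", "chains", "cores", "random_seed")


def merge_sampler_overrides(
    yaml_fit: Optional[Mapping[str, Any]],
    overrides: Mapping[str, Any],
) -> dict[str, Any]:
    """Return the YAML fit settings with non-None runtime overrides on top."""
    merged = dict(yaml_fit or {})
    for key in SAMPLER_KEYS:
        value = overrides.get(key)
        if value is not None:  # assumption: None means "no override supplied"
            merged[key] = value
    return merged


yaml_fit = {"draws": 1000, "tune": 1000, "chains": 4}
cli_overrides = {"draws": 200, "tune": None, "chains": 2, "random_seed": 42}
effective = merge_sampler_overrides(yaml_fit, cli_overrides)
print(effective)  # {'draws': 200, 'tune': 1000, 'chains': 2, 'random_seed': 42}
```

In this sketch the runtime value wins whenever it is set, and the YAML value survives otherwise, which matches the precedence described for sampler settings.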
YAML Configuration
The pipeline runner reads the same YAML model specification used by
build_mmm_from_yaml(...), then adds a small set of runner-specific conventions
for data loading, optional blocked holdout validation, and Stage 70
optimisation.
This page documents the keys that the runner actually consumes.
Root keys
| Key | Required | Used for |
|---|---|---|
| data | Usually | Resolve dataset paths when you do not pass dataset_path, x_path, or y_path through PipelineRunConfig |
| target | Yes | Define the target column and business target type |
| dimensions | No | Declare panel-dimension columns such as geo or brand |
| media | Yes | Define channel/control columns and transform types |
| scaling | No | Configure target/channel scaling rules |
| effects | No | Append additive effects in YAML order before build_model(...) |
| priors | No | Override model-level priors and prefixed transform priors |
| fit | No | Default sampler settings for Stage 20 fitting |
| holidays | No | Add holiday events before model build |
| original_scale_vars | No | Add original-scale contribution variables before fitting |
| inference_data | No | Attach existing InferenceData when the file exists |
The builder appends each effect to model.mu_effects in YAML order before
calling build_model(...).
holidays
The holidays block is optional.
Supported keys used by the builder include:
| Key | Meaning |
|---|---|
| path | Holiday CSV path |
| enabled | Set to false to disable holiday loading |
| prefix | Prefix for generated holiday effect coordinates |
| countries | Optional country filter for catalogue-style holiday CSV input |
Example:
    holidays:
      path: "holidays.csv"
      prefix: "holiday"
The CLI or PipelineRunConfig.holidays_path overrides holidays.path.
If you omit both path and the override but still configure holidays,
Abacus falls back to the bundled abacus.data:holidays.csv.
original_scale_vars
Use original_scale_vars when you want specific contribution variables to be
available on the original target scale:
    original_scale_vars:
      - channel_contribution
      - y
The builder applies these through
model.add_original_scale_contribution_variable(...) before fitting.
inference_data
inference_data.path is passed through to the YAML builder. If the file exists, Abacus
attaches that InferenceData to the built model during Stage 00.
Important: the structured runner still executes Stage 20 and fits the model
again. inference_data.path does not currently skip fitting.
optimization
Add an optimization block when you want Stage 70 to run. If this block is
absent, Stage 70 is marked skipped.
The YAML builder validates this block when the config is loaded. The required
scalar fields below must be present, and unknown top-level optimization keys
are rejected.
| Key | Meaning |
|---|---|
| | Set to false to skip Stage 35 while keeping the stage in the manifest |
| holdout_observations | Number of unique dates to reserve for the blocked holdout window |
| include_last_observations | Keep lag history for carryover-sensitive holdout scoring |
| coverage_levels | Coverage levels reported in Phase 10; use the fixed 50, 80, and 94 percent defaults |
| sampler | Optional validation-only sampler overrides for the train-window refit |
Phase 10 reports coverage as coverage_50, coverage_80, and
coverage_94. Keep those defaults unless the implementation and tests are
updated together.
The validation stage builds a clean train-window model for holdout scoring and
ignores inference_data.path so the refit does not inherit attached posterior
state from Stage 00.
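As a rough illustration of a blocked holdout window, the sketch below reserves the last holdout_observations unique dates of a panel for scoring and keeps everything earlier for the train-window refit. The function name split_blocked_holdout is hypothetical, and the sketch ignores include_last_observations and any other handling the real validation stage may apply.

```python
from datetime import date


def split_blocked_holdout(dates: list[date], holdout_observations: int):
    """Split row indices into train/holdout using the last N unique dates."""
    unique_dates = sorted(set(dates))
    if not 0 < holdout_observations < len(unique_dates):
        raise ValueError("holdout_observations must leave at least one training date")
    holdout_dates = set(unique_dates[-holdout_observations:])
    train_idx = [i for i, d in enumerate(dates) if d not in holdout_dates]
    holdout_idx = [i for i, d in enumerate(dates) if d in holdout_dates]
    return train_idx, holdout_idx


# Two geos observed on the same four weekly dates (a small panel).
weeks = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 15), date(2024, 1, 22)]
panel_dates = weeks + weeks
train_idx, holdout_idx = split_blocked_holdout(panel_dates, holdout_observations=1)
print(train_idx, holdout_idx)  # [0, 1, 2, 4, 5, 6] [3, 7]
```

Counting unique dates rather than rows means every panel member contributes the same calendar window to the holdout set.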
Override precedence
For the runner, precedence is:
| Setting | Higher precedence | Lower precedence |
|---|---|---|
| Combined dataset path | dataset_path / --dataset-path | data.dataset_path |
| Split CSV paths | x_path, y_path / --x-path, --y-path | data.x_path, data.y_path |
| Holiday CSV path | holidays_path / --holidays-path | holidays.path |
| Sampler settings | PipelineRunConfig or CLI overrides | fit |
| Target column for CSV loading | target_column / --target-column | target.column, then "y" |
| Diagnostics thresholds | diagnostics.thresholds | retained Stage 50 defaults |
Common pitfalls
- Using Parquet paths in the pipeline data block. The runner data loader reads
  CSV only.
- Providing only one of data.x_path or data.y_path.
- Treating optimization.total_budget as total horizon spend instead of
  per-period spend.
- Assuming diagnostics is part of the public MMM builder schema. It is a
  runner-only block.
- Assuming inference_data.path skips Stage 20 fitting. It does not.
- Forgetting that relative paths are resolved from the YAML file directory, not
  from the shell working directory.
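The last pitfall is easy to check by hand. The sketch below shows what "resolved from the YAML file directory" means in practice; resolve_from_config is a hypothetical helper written for this illustration, not a function exported by Abacus, and the example paths are made up.

```python
from pathlib import Path


def resolve_from_config(config_path: str, value: str) -> Path:
    """Resolve a path from the YAML file's directory unless it is already absolute."""
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate
    # Anchor on the directory holding the YAML file, not on the current shell directory.
    return (Path(config_path).resolve().parent / candidate).resolve()


resolved = resolve_from_config("/projects/mmm/configs/run.yml", "../data/train.csv")
print(resolved)
```

Running the same call from a different working directory yields the same result, because only the location of the config file matters.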
Output Directory Schema
Each pipeline run creates a timestamped directory under the configured
output_dir:
<output_dir>/<run_name>_<YYYYMMDD_HHMMSS>
The timestamp is generated in UTC. The runner creates every stage directory up
front, then updates run_manifest.json as stages start, complete, skip, or
fail.
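The naming pattern above can be reproduced with the standard library. This is a minimal sketch of the documented <run_name>_<YYYYMMDD_HHMMSS> convention; build_run_dir and its now parameter are hypothetical names used here so the example is deterministic.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def build_run_dir(output_dir: str, run_name: str, now: Optional[datetime] = None) -> Path:
    """Return <output_dir>/<run_name>_<YYYYMMDD_HHMMSS> using a UTC timestamp."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return Path(output_dir) / f"{run_name}_{moment.strftime('%Y%m%d_%H%M%S')}"


fixed = datetime(2024, 3, 5, 14, 7, 9, tzinfo=timezone.utc)
run_dir = build_run_dir("results", "weekly_mmm", now=fixed)
print(run_dir.name)  # weekly_mmm_20240305_140709
```

Because the timestamp is in UTC, directory names sort chronologically regardless of the time zone of the machine that produced them.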
00_run_metadata
Main files:
a copy of the original config under its source filename
config.original.yaml
config.resolved.yaml
session_info.txt
dataset_metadata.json
model_metadata.json
data_dictionary.csv
design_matrix_manifest.csv
spec_summary.csv
holiday_feature_manifest.csv when holidays are configured
config.resolved.yaml normalises configured data and holiday paths to absolute
paths and records the effective sampler configuration on the model.
10_pre_diagnostics
Main files:
prior_predictive.nc
prior_predictive.png
20_model_fit
Main files:
model.nc
trace.png
posterior_summary.csv
30_model_assessment
Main files:
posterior_predictive.nc
posterior_predictive.png
posterior_predictive_summary.csv
observed.csv
fitted.csv
fit_timeseries.png
fit_scatter.png
residuals.csv
residuals_timeseries.png
residuals_hist.png
residuals_vs_fitted.png
This stage is the in-sample or training-fit assessment. It uses the same data
the model was fit on and should not be read as the pipeline’s out-of-sample
validation layer.
35_holdout_validation
Main files:
validation_metadata.json
holdout_posterior_predictive.nc
holdout_predictive_summary.csv
holdout_predictive_report.json
holdout_observed.csv
holdout_fitted.csv
holdout_residuals.csv
holdout_timeseries.png
holdout_residuals_acf.png
The holdout summary and report include uncertainty-aware metrics such as
crps, bias, and fixed coverage columns for coverage_50, coverage_80,
and coverage_94.
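To make the coverage columns concrete, the sketch below computes the share of holdout observations that fall inside a central predictive interval at the 50, 80, and 94 percent levels. It assumes equal-tailed quantile intervals over posterior predictive draws; the actual stage may define its intervals differently (for example with highest-density intervals), and the helper names are hypothetical.

```python
def quantile(sorted_values: list[float], q: float) -> float:
    """Linear-interpolation quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def interval_coverage(observed, draws_per_obs, level: float) -> float:
    """Share of observations inside the central `level` predictive interval."""
    tail = (1.0 - level) / 2.0
    hits = 0
    for y, draws in zip(observed, draws_per_obs):
        ordered = sorted(draws)
        if quantile(ordered, tail) <= y <= quantile(ordered, 1.0 - tail):
            hits += 1
    return hits / len(observed)


# Toy data: four holdout points, each with the same 101 predictive draws (0..100).
observed = [50.0, 20.0, 95.0, 99.0]
draws_per_obs = [[float(v) for v in range(101)] for _ in observed]
coverage = {
    f"coverage_{int(level * 100)}": interval_coverage(observed, draws_per_obs, level)
    for level in (0.50, 0.80, 0.94)
}
print(coverage)  # {'coverage_50': 0.25, 'coverage_80': 0.5, 'coverage_94': 0.75}
```

For a well-calibrated model, each coverage value should sit close to its nominal level on a reasonably sized holdout window; the toy numbers here only demonstrate the arithmetic.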
This stage is optional. When validation is absent or disabled in YAML, the
directory still exists and the stage is marked skipped.
40_decomposition
Main files:
waterfall_components_decomposition.png
weekly_media_contribution.png
channel_contributions.csv
baseline_contributions.csv
mean_contributions_over_time.csv
50_diagnostics
Main files:
design_summary.csv
design_report.json
vif_report.csv
mcmc_summary.csv
mcmc_report.json
predictive_summary.csv
predictive_report.json
residual_diagnostics.csv
residuals_acf.png
diagnostics_report.csv
diagnostics_summary.txt
chain_diagnostics.txt
The design-oriented files are raw input screening outputs. In particular,
diagnostics_report.csv labels the corresponding phase as
raw_input_screening rather than design.
60_response_curves
Main files:
saturation_curve.nc
saturation_curve_summary.csv
saturation_curve.png
forward_pass_contribution_curve.nc
forward_pass_contribution_curve_summary.csv
forward_pass_contribution_curve.png
adstock_curve.nc
adstock_curve_summary.csv
adstock_curve.png
These artefacts are intentionally different:
saturation_curve.* is the sampled saturation transformation on the scaled
channel axis, exported with original-scale contribution values for easier
reading. The PNG overlays that saturation-only curve against posterior mean
realised contributions.
forward_pass_contribution_curve.* is a full-model direct contribution
artefact. It rescales the observed historical spend path from 0% to 200%,
runs that spend through the fitted adstock and saturation path, and records
the resulting total channel contribution in original target units.
adstock_curve.* is the sampled carryover-weight profile for one impulse.
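The difference between a saturation-only curve and a forward-pass contribution curve is easier to see in code. The sketch below rescales a toy spend history from 0% to 200%, pushes each rescaled path through an adstock step and then a saturation step, and sums the result. The geometric adstock, logistic saturation, and parameter values (alpha, lam, beta) are assumptions chosen for illustration; the runner uses the fitted model's own transforms and posterior draws and reports results in original target units.

```python
import math


def geometric_adstock(spend: list[float], alpha: float) -> list[float]:
    """Carry a share `alpha` of each period's accumulated effect into the next period."""
    carried, out = 0.0, []
    for x in spend:
        carried = x + alpha * carried
        out.append(carried)
    return out


def logistic_saturation(x: float, lam: float) -> float:
    """Map adstocked spend onto a saturating response between 0 and 1."""
    return (1.0 - math.exp(-lam * x)) / (1.0 + math.exp(-lam * x))


def forward_pass_curve(spend, multipliers, alpha, lam, beta):
    """Total contribution when the whole historical spend path is rescaled."""
    curve = []
    for m in multipliers:
        adstocked = geometric_adstock([m * x for x in spend], alpha)
        curve.append(beta * sum(logistic_saturation(x, lam) for x in adstocked))
    return curve


history = [1.0, 0.0, 2.0, 0.5]
multipliers = [i / 4 for i in range(9)]  # 0%, 25%, ..., 200% of observed spend
curve = forward_pass_curve(history, multipliers, alpha=0.5, lam=1.0, beta=100.0)
for m, total in zip(multipliers, curve):
    print(f"{m:>4.0%}  {total:8.2f}")
```

Unlike a saturation-only curve, each point here depends on the timing of the historical spend, because carryover is applied before saturation.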
70_optimisation
This directory is present for every run, but the stage is skipped unless the
YAML config contains an optimization block.
Main files when the stage runs:
optimized_allocation.nc
optimized_allocation.csv
response_distribution.nc
optimize_result.json
budget_summary.csv
budget_response_points.csv
budget_impact.csv
budget_bounds_audit.csv
budget_roi_cpa.csv
budget_response_curves.csv
budget_mroi.csv
budget_optimisation.json
several PNG plots for allocation, contribution over time, response curves,
impact, bounds audit, and ROI or CPA
run_manifest.json
The manifest is the machine-readable index for the whole run.
Top-level fields include:
| Field | Meaning |
|---|---|
| run_name | Effective run name |
| timestamp | UTC run timestamp |
| config_path | Original config path |
| output_dir | Run directory path |
| status | Overall run status |
| model_class | Set after Stage 00 builds the model |
| data | Basic dataset metadata |
| stages | Per-stage manifest records |
| warnings | Run-level warnings |
| error | Run-level failure payload when the pipeline aborts |
data includes:
x_shape
y_length
target_column
x_columns
Stage records
Each stage record contains:
| Field | Meaning |
|---|---|
| directory | Stage directory name |
| status | Current stage status |
| started_at | ISO timestamp when the stage started |
| finished_at | ISO timestamp when the stage finished |
| artifacts | Mapping of artefact labels to root-relative paths |
| warnings | Stage warnings |
| error | Error string when the stage fails |
The artifacts mapping uses root-relative paths such as
20_model_fit/model.nc.
Stage statuses
| Status | Meaning |
|---|---|
| pending | Stage has not started yet |
| running | Stage is currently running |
| completed | Stage finished successfully |
| skipped | Stage returned None intentionally |
| failed | Stage raised an exception |
| not_reached | A previous stage failed before this one ran |
Common cases:
Stage 35 is skipped when validation is missing from YAML or disabled.
Stage 70 is skipped when optimization is missing from YAML.
Later stages become not_reached after the first failure.
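The status transitions above can be summarised in a small in-memory sketch. It is not the runner's implementation: run_stages and the toy stage functions are hypothetical, and the real runner also records timestamps and rewrites run_manifest.json on disk after every stage.

```python
from typing import Callable, Optional

StageFn = Callable[[dict], Optional[dict]]


def run_stages(stages: list[tuple[str, StageFn]], context: dict) -> dict:
    """Run stages in order and record a status for each one."""
    manifest = {name: {"status": "pending", "artifacts": {}, "error": None} for name, _ in stages}
    failed = False
    for name, stage_fn in stages:
        record = manifest[name]
        if failed:
            record["status"] = "not_reached"  # an earlier stage already failed
            continue
        record["status"] = "running"
        try:
            artifacts = stage_fn(context)
        except Exception as exc:  # a raised exception aborts the remaining stages
            record["status"] = "failed"
            record["error"] = str(exc)
            failed = True
            continue
        if artifacts is None:
            record["status"] = "skipped"  # the stage opted out intentionally
        else:
            record["status"] = "completed"
            record["artifacts"] = artifacts
    return manifest


def fit(context):
    return {"model": "20_model_fit/model.nc"}


def optimisation(context):
    if "optimization" not in context["raw_cfg"]:
        return None
    return {"allocation": "70_optimisation/optimized_allocation.csv"}


def broken(context):
    raise RuntimeError("boom")


manifest = run_stages(
    [("fit", fit), ("optimisation", optimisation), ("broken", broken), ("late", fit)],
    {"raw_cfg": {}},
)
statuses = {name: record["status"] for name, record in manifest.items()}
print(statuses)
# {'fit': 'completed', 'optimisation': 'skipped', 'broken': 'failed', 'late': 'not_reached'}
```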
Practical use
Use the run directory when you want:
a stable folder for downstream reporting
a machine-readable audit trail through run_manifest.json
stage-level links to artefacts without hard-coding filenames
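A downstream report can look up artefacts through the manifest instead of hard-coding filenames. The sketch below assumes that stages is a mapping keyed by stage key and that each record carries the status and artifacts fields described above; the exact JSON shape should be checked against a real run_manifest.json. The helper name completed_artifacts is hypothetical, and the manifest written here is a toy stand-in.

```python
import json
import tempfile
from pathlib import Path


def completed_artifacts(run_dir: Path) -> dict[str, Path]:
    """Map artefact labels to paths under run_dir for every completed stage."""
    manifest = json.loads((run_dir / "run_manifest.json").read_text(encoding="utf-8"))
    found = {}
    for stage in manifest["stages"].values():  # assumption: stages is a mapping
        if stage["status"] != "completed":
            continue
        for label, relative_path in stage["artifacts"].items():
            found[label] = run_dir / relative_path  # manifest paths are root-relative
    return found


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    toy_manifest = {
        "status": "completed",
        "stages": {
            "fit": {"status": "completed", "artifacts": {"model": "20_model_fit/model.nc"}},
            "optimisation": {"status": "skipped", "artifacts": {}},
        },
    }
    (run_dir / "run_manifest.json").write_text(json.dumps(toy_manifest), encoding="utf-8")
    artefacts = completed_artifacts(run_dir)
    model_relative = artefacts["model"].relative_to(run_dir).as_posix()
    print(model_relative)  # 20_model_fit/model.nc
```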
Extending the Runner
The retained runner is static, not plugin-based. To add a stage or integrate
custom status reporting, extend the existing runner surfaces instead of
bypassing them.
return a dict[str, str] of artefact labels to root-relative paths when the
stage succeeds
return None when the stage is intentionally skipped
raise an exception when the stage fails and should abort the run
The runner handles manifest updates around the stage call. Do not update
context.manifest directly from a normal stage implementation unless you are
changing core runner behaviour.
What is available in PipelineContext
PipelineContext gives each stage access to:
| Field | Use it for |
|---|---|
| run_config | Runtime settings such as output root, seeds, and curve sample counts |
| raw_cfg | The loaded YAML config as a mutable mapping |
| X, y | Loaded dataset inputs |
| paths | Stage directories and manifest path |
| manifest | Current run manifest |
| model_kwargs | Effective sampler overrides passed into model build |
Use context.paths.relative(path) when building the artefact mapping that the
stage returns. The manifest expects root-relative paths, not absolute paths.
    from abacus.pipeline.artifacts import write_dataframe


    def run_custom_stage(context):
        if context.model is None:
            raise ValueError("Model has not been initialized before the custom stage.")

        stage_dir = context.paths.stage_dirs["custom"]
        output_path = stage_dir / "custom_summary.csv"
        frame = context.model.summary.total_contribution(output_format="pandas")
        write_dataframe(output_path, frame)

        return {
            "custom_summary": context.paths.relative(output_path),
        }
Optional stage pattern
If a stage should only run when a config block is present, follow the same
pattern as Stage 70: