Input Data Requirements

Use this page together with Panel Data Layout and Scaling and Preprocessing when you prepare a dataset for PanelMMM.

Core contract

For direct Python use, PanelMMM expects:

  • X as a pandas.DataFrame
  • y as a pandas.Series named target_column, or a one-dimensional NumPy array of the same length as X

X must contain the date column, all media columns, and any configured control_columns or dims columns. y carries only the target values.

| Role | Where it must be present | Required | Notes |
|---|---|---|---|
| date_column | X | Yes | Normalise to datetimes or parseable date strings. |
| channel_columns | X | Yes | Every listed channel column must exist in X. |
| target_column | y | Yes | If y is a Series, y.name must match target_column. |
| control_columns | X | No | If configured, every listed control column must exist in X. |
| dims | X | No | One column per configured panel dimension, such as geo or brand. |
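
The column requirements above can be checked before you construct a model. The following is a minimal sketch, not part of Abacus: missing_columns is a hypothetical helper, and the column names are illustrative only.

```python
import pandas as pd


def missing_columns(X, date_column, channel_columns, control_columns=None, dims=None):
    """Return configured column names that are absent from X (hypothetical helper)."""
    required = [date_column, *channel_columns, *(control_columns or []), *(dims or [])]
    return [name for name in required if name not in X.columns]


# Illustrative frame that lacks one channel and the configured control.
X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-06", "2025-01-13"]),
        "geo": ["UK", "UK"],
        "tv": [120.0, 125.0],
    }
)

missing = missing_columns(
    X,
    date_column="date",
    channel_columns=["tv", "search"],
    control_columns=["price_index"],
    dims=("geo",),
)
print(missing)  # ['search', 'price_index']
```

The target is deliberately not in the required list, because it travels in y rather than in X.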

X and y

When you call fit(X, y) or build_model(X, y):

  • Keep the target out of X.
  • Keep X and y row-aligned.
  • If both are pandas objects, keep the same index on both. The shared regression builder checks index equality before fitting.
  • If you pass y as a NumPy array, its length must match len(X).
  • For panel models, each date_column + dims combination must appear exactly once. Duplicate rows are rejected.

Abacus uses target_column as the target name throughout the panel reshape. If y is a Series, its name must match target_column.
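
As a rough illustration of the rules above, the sketch below collects alignment problems before fitting. check_alignment is a hypothetical pre-flight helper that mirrors the listed rules; it is not Abacus's own validation, and the messages it returns are made up for this example.

```python
import numpy as np
import pandas as pd


def check_alignment(X, y, date_column, dims=(), target_column="y"):
    """Collect alignment problems between X and y (hypothetical pre-flight check)."""
    problems = []
    if target_column in X.columns:
        problems.append("target column is present in X")
    if len(y) != len(X):
        problems.append("y and X have different lengths")
    if isinstance(y, pd.Series):
        if not X.index.equals(y.index):
            problems.append("X and y have different indexes")
        if y.name != target_column:
            problems.append("y.name does not match target_column")
    # Each date + dims combination may appear only once.
    if X.duplicated(subset=[date_column, *dims]).any():
        problems.append("duplicate date + dims rows")
    return problems


X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]),
        "geo": ["UK", "US", "UK", "UK"],  # the last row repeats 2025-01-13 / UK
        "tv": [120.0, 150.0, 125.0, 152.0],
    }
)
y_series = pd.Series([820.0, 910.0, 835.0, 925.0], name="sales")
y_short = np.array([820.0, 910.0, 835.0])

problems_series = check_alignment(X, y_series, "date", dims=("geo",), target_column="sales")
problems_array = check_alignment(X, y_short, "date", dims=("geo",), target_column="sales")
print(problems_series)
print(problems_array)
```

Returning a list rather than raising on the first failure makes it easier to fix several data issues in one pass.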

Date column

date_column is required in X.

Abacus expects calendar dates, not integer date codes. In practice:

  • Use datetime64[ns] where possible.
  • Parse string dates with pd.to_datetime(...) before fitting when you use the Python API.
  • Do not rely on numeric date values such as 0, 1, 2. By default, pd.to_datetime(...) reads integers as nanoseconds since the Unix epoch, so such codes all collapse onto 1970-01-01, which is usually not what you want.

The YAML builder normalises X[date_column] with pd.to_datetime(...) after loading the dataset. Direct Python use does not add an equivalent preprocessing step for you.
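
For direct Python use, a short sketch of the difference between numeric codes and parsed date strings may help. The frame and column names are illustrative; only standard pandas calls are used.

```python
import pandas as pd

# Integer codes are read as nanoseconds since 1970-01-01, so every value
# lands on the epoch instead of becoming a sequence of calendar dates.
codes = pd.Series([0, 1, 2])
as_epoch = pd.to_datetime(codes)
print(as_epoch.dt.year.unique())

# ISO date strings parse to the intended calendar dates.
X = pd.DataFrame({"date": ["2025-01-06", "2025-01-13"], "tv": [120.0, 125.0]})
X["date"] = pd.to_datetime(X["date"])
print(X["date"].dtype)
```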

Channel columns

channel_columns is a required constructor argument and must be a non-empty list.

Each listed channel:

  • must be present in X
  • must be fully observed for every row you pass into fit or posterior prediction; Abacus does not silently convert missing channel values to zero
  • should represent the raw media variable that you want the adstock and saturation transformations to consume
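
Because missing channel values are not zero-filled for you, it can be worth counting them before fitting. This is a minimal sketch with illustrative data; whether a gap means zero spend or an unknown value is a modelling decision that the code does not make for you.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-06", "2025-01-13", "2025-01-20"]),
        "tv": [120.0, np.nan, 125.0],  # one unobserved week
        "search": [40.0, 55.0, 42.0],
    }
)
channel_columns = ["tv", "search"]

# Count missing values per channel and keep only the channels with gaps.
missing_counts = X[channel_columns].isna().sum()
incomplete = missing_counts[missing_counts > 0]
print(incomplete.to_dict())  # {'tv': 1}
```

The same check applies unchanged to a list of control columns.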

Target column

target_column names the dependent variable. It defaults to "y", but you can set a different name such as "sales" or "conversions".

For direct Python use:

  • pass the target as y
  • name the Series with target_column
  • keep the target fully observed; missing target values are rejected rather than zero-filled

For combined-file YAML or pipeline flows:

  • keep the target column in the source dataset
  • Abacus splits it out of the combined dataset before fitting

Control columns

control_columns is optional.

If you configure it, every listed control column must be present in X. Controls stay in the design matrix as separate regressors; they are not part of y.

Like channels, configured controls must be fully observed for every row passed into fit or posterior prediction.

Abacus does not automatically scale controls. See Scaling and Preprocessing.

Panel dimensions with dims

dims is optional. Use it when you want a panel model, for example by geo, brand, or market.

If you set dims=("geo", "brand"):

  • X must contain geo and brand columns
  • each row in X represents one date + geo + brand observation
  • when you later call posterior-predictive methods with new data, every new date must include every fitted panel slice
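
The last requirement above can be checked on new data before prediction. The sketch below assumes you know the fitted panel slices as a list of tuples; dates_missing_slices and fitted_slices are hypothetical names, not Abacus API.

```python
import pandas as pd


def dates_missing_slices(X_new, date_column, dims, fitted_slices):
    """Return dates in X_new that lack at least one fitted panel slice (hypothetical helper)."""
    expected = set(fitted_slices)
    incomplete = []
    for date, group in X_new.groupby(date_column):
        # Plain tuples of the dims values observed on this date.
        present = set(group[list(dims)].itertuples(index=False, name=None))
        if not expected <= present:
            incomplete.append(date)
    return incomplete


fitted_slices = [("UK", "A"), ("US", "A")]  # assumed to come from the training data
X_new = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-20", "2025-01-20", "2025-01-27"]),
        "geo": ["UK", "US", "UK"],
        "brand": ["A", "A", "A"],
        "tv": [130.0, 155.0, 128.0],
    }
)

incomplete = dates_missing_slices(X_new, "date", ("geo", "brand"), fitted_slices)
print(incomplete)  # only 2025-01-27 lacks the US / A slice
```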

Do not use reserved internal names in dims:

  • date
  • channel
  • control
  • fourier_mode
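
A quick guard against the reserved names listed above might look like the following. The set is copied from this page; reserved_dims is a hypothetical helper, and Abacus may report the problem differently.

```python
# Reserved internal names, as listed on this page.
RESERVED_DIM_NAMES = {"date", "channel", "control", "fourier_mode"}


def reserved_dims(dims):
    """Return the entries of dims that clash with reserved names (hypothetical helper)."""
    return [name for name in dims if name in RESERVED_DIM_NAMES]


clashes = reserved_dims(("geo", "channel"))
print(clashes)  # ['channel']
```

If a source column uses one of these names, renaming it in the DataFrame (for example channel to sales_channel) before configuring dims avoids the clash.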

For row layout and rectangularity guidance, see Panel Data Layout.

Supported shapes and alignment

| Workflow | Supported shape |
|---|---|
| Direct PanelMMM.fit() / build_model() | X: DataFrame; y: Series or 1D ndarray |
| YAML builder with data.dataset_path | One tabular file containing both predictors and the target column |
| Pipeline runner with dataset_path | Same as above |
| Pipeline runner with x_path and y_path | Separate feature and target files; the runner extracts target_column from the target file |

Abacus also has an internal alignment helper that can work with a MultiIndex target Series indexed by [date_column, *dims], but that is mainly used in fit-data rebuild and load flows. For normal fitting, keep y row-aligned with X.
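
If your target happens to arrive as a MultiIndex Series, you can row-align it with X yourself using plain pandas. This sketch does not reproduce the internal helper; it only shows one way to obtain a y that shares X's row order and index, using illustrative data.

```python
import pandas as pd

dates = pd.to_datetime(["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"])
X = pd.DataFrame(
    {
        "date": dates,
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
    }
)

# Target indexed by [date, geo], deliberately in a different row order from X.
y_multi = pd.Series(
    [910.0, 820.0, 925.0, 835.0],
    index=pd.MultiIndex.from_arrays([dates, ["US", "UK", "US", "UK"]], names=["date", "geo"]),
    name="sales",
)

# Look up each X row's (date, geo) key, then adopt X's index so both are row-aligned.
keys = pd.MultiIndex.from_frame(X[["date", "geo"]])
y = y_multi.reindex(keys)
y.index = X.index
print(y.tolist())  # [820.0, 910.0, 835.0, 925.0]
```

Any key missing from the target would show up as NaN after reindex, which is worth checking because missing target values are rejected.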

Python example

import pandas as pd

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM

dataset = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]
        ),
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
        "search": [40.0, 55.0, 42.0, 58.0],
        "price_index": [1.02, 0.99, 1.01, 1.00],
        "sales": [820.0, 910.0, 835.0, 925.0],
    }
)

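# Keep the target out of X and pass it separately as y.
X = dataset.drop(columns=["sales"])
# Name the Series after target_column so the names match.
y = dataset["sales"].rename("sales")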

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    control_columns=["price_index"],
    dims=("geo",),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

mmm.fit(X, y)

YAML note

If you use a combined dataset in YAML, the file at data.dataset_path must contain every configured column:

  • date_column
  • every entry in channel_columns
  • every entry in control_columns, if any
  • every entry in dims, if any
  • target_column

Example:

data:
  dataset_path: panel_dataset.csv
  date_column: date

target:
  column: sales
  type: revenue

dimensions:
  panel: [geo]

media:
  channels: [tv, search]
  controls: [price_index]
  adstock:
    type: geometric
    l_max: 8
  saturation:
    type: logistic
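
Before running a YAML build, you can confirm that the combined file carries every configured column by reading only its header. This is a sketch in Python: the in-memory CSV stands in for panel_dataset.csv, and the configured list is typed out by hand rather than read from the YAML.

```python
import io

import pandas as pd

# Stand-in for the file at data.dataset_path; note that price_index is absent.
csv_text = "date,geo,tv,search,sales\n2025-01-06,UK,120.0,40.0,820.0\n"

# nrows=0 reads the header row only, which is enough for a column check.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# date_column, channels, controls, dims, and target_column from the config above.
configured = ["date", "tv", "search", "price_index", "geo", "sales"]
missing = [name for name in configured if name not in header]
print(missing)  # ['price_index']
```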

Common pitfalls

  • Missing date_column, channel, control, or dimension columns in X
  • Passing a y Series whose name does not match target_column
  • Passing pandas X and y with different indexes
  • Passing a NumPy y with a different length from X
  • Passing duplicate panel rows or incomplete panel slices for a given date
  • Passing missing observed channel, control, or target values and expecting Abacus to treat them as structural zeroes
  • Expecting the YAML builder or pipeline to find a target column that is not present in the combined dataset
  • Leaving date values as numeric codes instead of normalising them first