Data Schema & Validation

Before You Begin

Because BidOptic is a zero-egress system, we cannot look at your data to tell you if it is formatted correctly. Before initiating an Evaluation Agreement, we ask prospective Design Partners to validate a sample of their historical logs locally.

👉 Go to the BidOptic Local Schema Validator Repository

The validator is a standalone, open-source Python script. It requires no BidOptic licence, runs entirely locally on your machine, and produces a pass/fail report in under 60 seconds. Fix any [BLOCKER] errors it raises before scheduling your calibration call.


Required Schema

Your extract must be a single flat CSV or Parquet file containing exactly these 11 columns. Column names are case-sensitive.

| Column | Type | Description |
| --- | --- | --- |
| timestamp | ISO 8601 Datetime | UTC timestamp of the bid request. Used to reconstruct temporal patterns, hourly behaviour profiles, and recency decay curves. |
| user_id | String / Integer | Your internal user identifier. BidOptic remaps these to anonymous sequential integers during ingestion; the original IDs are never stored. |
| publisher_id | String | Your internal publisher or supply-source identifier. Used to train publisher-level floor price models and quality signals. |
| ad_size | String | Creative dimension or format (e.g. 300x250, 728x90, native, video). Used as a feature in CTR and floor modelling. |
| bid_price | Float (USD) | The price your DSP submitted for this auction. Must be greater than zero for won auctions. |
| clearing_price | Float (USD) | The price you actually paid on won auctions; 0.0 on lost auctions. Used to calibrate second-price dynamics and margin. |
| is_won | Integer (0 / 1) | Whether your bid won the auction. Used to train the win-rate model and derive publisher floor estimates. |
| is_clicked | Integer (0 / 1) | Whether the impression resulted in a click. The primary training signal for the CTR model. |
| is_converted | Integer (0 / 1) | Whether the impression chain resulted in a conversion event. The primary training signal for the CVR and LTV models. |
| conversion_timestamp | ISO 8601 Datetime (nullable) | UTC timestamp of the conversion event; null on non-converting rows. Required for the conversion delay model. |
| conversion_value | Float (USD) | Revenue attributed to this conversion; 0.0 on non-converting rows. Used by the LTV (Tweedie) model. Note: if the column has no revenue variance (e.g. all values are 0 or 1), the system automatically enters Binary Conversion Mode: the LTV model is disabled and every conversion is fixed at $1.00. This is ideal for CPA-focused campaigns. |

Optional column: bid_latency_ms (Float, milliseconds) — round-trip time from bid request receipt to bid response submission, as observed by your DSP. If present, it trains the Latency Twin directly from your infrastructure data. If absent, a lognormal distribution is synthesized from market priors and flagged in the calibration audit output.
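As a quick local sanity check ahead of running the official validator, the required columns can be verified with pandas. The column list below is taken directly from the table above; the helper name `check_schema` and its report format are illustrative, not part of the BidOptic tooling:

```python
import pandas as pd

# The 11 required columns from the schema table above. Names are case-sensitive.
REQUIRED_COLUMNS = [
    "timestamp", "user_id", "publisher_id", "ad_size",
    "bid_price", "clearing_price", "is_won", "is_clicked",
    "is_converted", "conversion_timestamp", "conversion_value",
]

def check_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of schema problems found in the extract (empty list = pass)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # bid_latency_ms is the only permitted extra column.
    extra = [c for c in df.columns if c not in REQUIRED_COLUMNS + ["bid_latency_ms"]]
    if extra:
        problems.append(f"unexpected columns: {extra}")
    # The three outcome flags must contain only 0 or 1.
    for col in ("is_won", "is_clicked", "is_converted"):
        if col in df.columns and not df[col].isin([0, 1]).all():
            problems.append(f"{col} contains values other than 0/1")
    return problems
```

This catches only structural issues (missing, extra, or non-binary columns); the official validator also checks the row-count and null-rate limits described below.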


Hard Limits

Minimum 100,000 rows required. Datasets below this threshold do not provide sufficient statistical coverage for the CTR, CVR, and floor price models to produce reliable calibrations. The calibration pipeline will abort with a hard error if this threshold is not met.

| Constraint | Value | Impact if violated |
| --- | --- | --- |
| Minimum row count | 100,000 rows | Calibration aborted |
| Minimum conversion count | 50 conversions | Calibration aborted |
| ML CVR/LTV threshold | 100 conversions | With 50 to 99 conversions, ML models are disabled and the simulation falls back to tabular segment-mean estimates |
| Maximum null rate on critical columns | 5% | Calibration aborted for the affected column |
| Maximum null rate on non-critical columns | 20% | Warning issued; affected model accuracy degrades |
| Minimum win rate | 0.1% | Calibration aborted (likely a pre-filtered dataset) |
| Maximum conversion_timestamp null rate (among converted rows) | 20% | Warning issued; conversion delay model accuracy degrades |
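The count-based limits above reduce to simple gating logic. The thresholds below are taken from the table; the function name and the mode labels it returns are illustrative, not the pipeline's actual output strings:

```python
def calibration_mode(n_rows: int, n_conversions: int, win_rate: float) -> str:
    """Return the calibration outcome implied by the hard-limits table."""
    if n_rows < 100_000 or n_conversions < 50 or win_rate < 0.001:
        return "ABORT"            # hard limits: rows, conversions, win rate
    if n_conversions < 100:
        return "TABULAR_FALLBACK" # ML CVR/LTV disabled; segment-mean estimates
    return "ML"                   # full ML CVR/LTV calibration
```

For example, an extract with 200,000 rows but only 75 conversions calibrates, yet falls back to tabular segment-mean estimates rather than ML models.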

Critical columns (5% null hard limit): bid_price, publisher_id, timestamp, is_won.

Non-critical columns (20% null soft limit): clearing_price, is_clicked, conversion_value, ad_size. Null ad_size values are expected for native and video inventory and are relabelled as unknown internally — no action is required if they reflect your inventory mix.
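The per-column null-rate limits can be checked locally with pandas. The column lists and thresholds come from the text above; the helper name `null_rate_report` and the `BLOCKER`/`WARNING` labels mirror the validator's report style but are an illustrative sketch:

```python
import pandas as pd

CRITICAL = ["bid_price", "publisher_id", "timestamp", "is_won"]                 # 5% hard limit
NON_CRITICAL = ["clearing_price", "is_clicked", "conversion_value", "ad_size"]  # 20% soft limit

def null_rate_report(df: pd.DataFrame) -> dict[str, str]:
    """Map each offending column to 'BLOCKER' or 'WARNING' based on its null rate."""
    report = {}
    for col in CRITICAL:
        if df[col].isna().mean() > 0.05:
            report[col] = "BLOCKER"   # calibration would abort
    for col in NON_CRITICAL:
        if df[col].isna().mean() > 0.20:
            report[col] = "WARNING"   # calibration proceeds with degraded accuracy
    return report
```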


Notes on Data Preparation

Time range. We recommend a minimum of 14 days of data and a maximum of 90 days. Data older than 90 days may reflect market conditions that no longer apply. The calibration report includes a freshness warning if the dataset end date is more than 30 days before the calibration run date.
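The 30-day freshness rule can be expressed with the standard library alone; the function name is illustrative:

```python
from datetime import datetime, timedelta

def freshness_warning(dataset_end: datetime, run_date: datetime) -> bool:
    """True if the extract's last timestamp is more than 30 days before the calibration run date."""
    return (run_date - dataset_end) > timedelta(days=30)
```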

Sampling. Do not downsample by outcome. If you subsample to reduce file size, use random sampling across all rows. Outcome-stratified samples (e.g. keeping only won impressions) will produce a miscalibrated win-rate model.
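Uniform random sampling with pandas looks like the sketch below; the sampling fraction and seed are illustrative, and the commented-out line shows the outcome-stratified pattern to avoid:

```python
import pandas as pd

def downsample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Uniform random sample across ALL rows, preserving the won/lost mix."""
    # WRONG: df[df["is_won"] == 1] -- filtering by outcome miscalibrates the win-rate model.
    return df.sample(frac=frac, random_state=seed)
```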

User IDs. You may hash or otherwise pseudonymise user IDs before providing the extract. BidOptic remaps all IDs to sequential integers during ingestion regardless.
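One common pseudonymisation approach is a salted one-way hash, sketched below with the standard library. The function name and salt are illustrative; any consistent scheme works, since BidOptic remaps the IDs again during ingestion:

```python
import hashlib

def pseudonymise(user_id: str, salt: str = "replace-with-private-salt") -> str:
    """One-way SHA-256 hash of an internal user ID; keep the salt private."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
```

The same input always maps to the same output, so temporal patterns per user are preserved while the original ID is not recoverable.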

Currency. All price columns must be denominated in USD. If your DSP logs in CPM, divide by 1,000 before export.
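The CPM conversion is a single division; the function name is illustrative:

```python
def cpm_to_usd(cpm_price: float) -> float:
    """Convert a CPM-denominated price to a per-impression USD price."""
    return cpm_price / 1000.0
```

For example, a logged CPM of 2.50 becomes a per-impression bid_price of 0.0025.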