Research Memo 001 v1.0.1 | Inflation Arena Canada

Thesis

The benchmark only becomes interesting if the agent allocates context better than public forecasters over time.

Inflation Agent should not publish a different number just to look active. It should allocate attention better.

The forecast can remain at 2.7% for several days if the evidence is stable. The goal is to decide how much attention belongs to history, the current state, the future path, and new research. v1.0.1 makes that context allocation explicit through forecast state, evidence records, decisions, skill governance, and methodology releases.

Before: Daily publication check

A fallback script can keep the site current, but it cannot prove that a real research pass happened.

v1.0.1: Evidence-state-decision loop

Daily entries can now be traced to structured evidence, public forecast state, and a hold or revise decision.

Target: Scored public benchmark

Locked calls, external forecaster captures, official CPI actuals, postmortems, and curator releases create the track record.

Diagnosis

The prior automation was useful but too shallow; v1.0.1 changes the operating model.

The site had a clean public arena: a current forecast, a release clock, a leaderboard shell, source links, validation tests, and a daily fallback publisher. That was enough for public continuity. It was not enough for a self-improving forecasting agent. v1.0.1 adds the missing public structure around state, evidence, decisions, skills, and methodology releases.

What is working

Public forecast artifacts are simple and inspectable.
The release lock is explicit before the Statistics Canada bell.
The validation layer blocks malformed public JSON and unsafe fields.
The fallback publisher prevents the site from silently going stale.

What is missing

No immutable lock file or post-release scoring artifact yet.
No historical backtest or challenger promotion gate yet.
No automated competitor forecast capture yet.
No fully autonomous release curator proven over multiple rounds yet.

Sources

Research inputs and how they changed the project direction.

Nof1 / Alpha Arena

Live AI trading benchmark

Outcome benchmarks are more credible when models act under common rules, common inputs, public outputs, and real scoring.

Inflation Agent needs locked calls, common scoring rules, public decision summaries, and a leaderboard against consensus and banks.

Hermes Agent

Closed learning loop

Hermes treats skills as procedural memory, runs scheduled work as fresh jobs, supports subagents, and keeps a persistent agent surface.

Inflation Agent v1.0.1 adds public skill-family governance, bounded context packets, and recurring loops that can evolve without exposing private prompts.

Hermes Curator

Skill maintenance

The Curator pattern grades, consolidates, prunes, and reports on skills on a schedule, with scoped permissions and archives rather than uncontrolled mutation.

Inflation Agent adopts the principle as a methodology curator: review misses, user feedback, stale evidence, and skill performance before publishing versioned changes.

Forecast Hub

Timestamped forecast repository

Forecasts should be timestamped, formatted consistently, paired with metadata, and evaluated from the version that was actually public at the time.

Add immutable lock artifacts, methodology versions, and late-submission rules before the first scored CPI round.
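One way to sketch "evaluated from the version that was actually public at the time" is to pick the latest forecast version published at or before the lock time and ignore anything later. The function name, the `published_at` key, and the version-list shape are assumptions for illustration, not part of the shipped schema.

```python
from datetime import datetime
from typing import Optional


def forecast_public_at(versions: list, lock_time: datetime) -> Optional[dict]:
    """Return the latest version published at or before lock_time, else None.

    Each version is assumed to be a dict with a timezone-aware
    `published_at` datetime (hypothetical field name).
    """
    eligible = [v for v in versions if v["published_at"] <= lock_time]
    # Late submissions are filtered out above, so they can never be scored.
    return max(eligible, key=lambda v: v["published_at"], default=None)
```

Returning `None` when nothing was public before the lock keeps a late submission from being scored silently.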

Statistics Canada

Official scoring authority

The official CPI release defines the target, and the June 2026 basket update is an unusually important source of forecast uncertainty.

Anchor scorekeeping to StatCan, track basket-weight changes explicitly, and keep public calls tied to the correct reference month.

Bank of Canada

Inflation indicator practice

Inflation pressure is assessed through a broad indicator set, not one CPI release in isolation.

The daily scout should include official data, public forecaster calls, labour, energy, shelter, demand, and credible alternative public signals.

Forecast evaluation literature

Rolling origins and proper scoring

Forecast evaluation must avoid hindsight leakage and should use scoring rules that reward honest forecasts.

Start with absolute miss for readability, then add calibration and probabilistic scores once the forecast distribution is mature.
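A minimal sketch of the readable first-stage score, assuming calls and actuals are both expressed as year-over-year CPI rates in percent. The function names and the two-decimal rounding are illustrative assumptions.

```python
def absolute_miss_pp(forecast_pct: float, actual_pct: float) -> float:
    """Absolute miss, in percentage points, between a locked call and the official print."""
    return round(abs(forecast_pct - actual_pct), 2)


def rank_by_miss(calls: dict, actual_pct: float) -> list:
    """Rank forecasters from smallest to largest absolute miss; ties sort by name."""
    scored = [(name, absolute_miss_pp(call, actual_pct)) for name, call in calls.items()]
    return sorted(scored, key=lambda item: (item[1], item[0]))
```

Absolute miss is not a probabilistic score: it says nothing about whether the stated uncertainty range was well calibrated, which is why calibration and proper scoring rules come later.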

Agent benchmark papers

Architecture matters

Live agent results depend on context management, risk posture, and decision architecture, not only the base model.

Inflation Agent needs a harness with context allocation, specialist passes, abstention discipline, and postmortem-driven versioning.

Architecture

The v1.0.1 harness separates evidence, state, decisions, skills, releases, locks, and scores.

Scout

Gather official CPI sources, public consensus, bank forecasts, labour, energy, shelter, food, mortgage, FX, shipping, and other sourceable inflation signals.

Ledger

Normalize every observation into an append-only evidence record with source, timestamp, category, freshness, reliability, direction, magnitude, and forecast implication.
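The record described above could be sketched as a frozen dataclass behind an append-only wrapper. The field names follow the list in the text; the types, the value vocabularies, and the `EvidenceLedger` class are assumptions, not the shipped `data/evidence_log.json` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    source: str                 # URL or publisher name
    timestamp: str              # ISO 8601 string, assumed UTC
    category: str               # e.g. "labour", "energy", "shelter"
    freshness: str              # assumed vocabulary: "new" or "stale"
    reliability: float          # assumed 0..1 scale
    direction: str              # assumed vocabulary: "up", "down", "neutral"
    magnitude_pp: float         # estimated effect in percentage points
    forecast_implication: str   # short public summary


class EvidenceLedger:
    """Append-only store: records can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._records = []

    def append(self, record: EvidenceRecord) -> int:
        self._records.append(record)
        return len(self._records) - 1  # the index doubles as a stable record id

    def records(self) -> tuple:
        # A tuple copy keeps callers from mutating the underlying list.
        return tuple(self._records)
```

Freezing the dataclass means a correction has to be appended as a new record rather than written over an old one, which keeps the ledger history intact.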

State

Maintain the accumulated forecast state: current point call, uncertainty range, context allocation, active hypotheses, open questions, stale signals, revision threshold, and model version.

Decision

Compare today's evidence against yesterday's state and output a structured decision: hold, revise, widen range, narrow range, lock, score, or postmortem.
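The decision vocabulary above maps naturally onto an enum. This is a hedged sketch: the string spellings, the `DecisionRecord` fields, and the consistency checks are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    HOLD = "hold"
    REVISE = "revise"
    WIDEN_RANGE = "widen_range"
    NARROW_RANGE = "narrow_range"
    LOCK = "lock"
    SCORE = "score"
    POSTMORTEM = "postmortem"


@dataclass(frozen=True)
class DecisionRecord:
    decision: Decision
    prior_call_pct: float   # yesterday's point call
    new_call_pct: float     # today's point call
    rationale: str          # compact public rationale

    def __post_init__(self) -> None:
        moved = self.new_call_pct != self.prior_call_pct
        # Reject records whose label contradicts the numbers.
        if self.decision is Decision.HOLD and moved:
            raise ValueError("a hold decision cannot change the point call")
        if self.decision is Decision.REVISE and not moved:
            raise ValueError("a revise decision must change the point call")
```

Checking the label against the numbers at construction time stops an entry from saying "hold" while quietly publishing a different call.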

Publish

Publish a compact public rationale. Do not publish private prompts, scratchpads, or internal chain-of-thought. Do publish the sources, drivers, decision, and changed evidence.

Improve

After official CPI releases or significant user feedback, score the round, classify the miss, review skill performance, and promote only changes that survive validation and carry a public rationale.

Loop

Daily research should compound, but it should not overfit to noise.

Research pass

The daily run should check whether new evidence exists, not merely summarize old links. If nothing material changed, the entry should say that in a precise way: no new official release, no fresh consensus move, no component signal strong enough to revise.

Revision discipline

The forecast should move only when the estimated evidence delta exceeds the revision threshold. A labour signal worth 0.03 percentage points should not create a public revision of 0.1 percentage points unless it compounds with other signals.
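The threshold rule can be sketched as follows, assuming each signal has already been converted to an estimated effect in percentage points and that effects simply add. Both the additive combination and the 0.1 threshold in the usage lines are illustrative assumptions, not the agent's actual method.

```python
import math


def net_evidence_delta_pp(signal_deltas_pp: list) -> float:
    """Sum the estimated per-signal effects, in percentage points."""
    return math.fsum(signal_deltas_pp)


def should_revise(signal_deltas_pp: list, revision_threshold_pp: float) -> bool:
    """Revise only when the net evidence delta strictly exceeds the threshold."""
    return abs(net_evidence_delta_pp(signal_deltas_pp)) > revision_threshold_pp


# A lone 0.03 pp labour signal stays below a hypothetical 0.1 pp threshold.
print(should_revise([0.03], revision_threshold_pp=0.1))              # False
# The same signal compounding with two others crosses it.
print(should_revise([0.03, 0.04, 0.05], revision_threshold_pp=0.1))  # True
```

Offsetting signals cancel under this rule, so a day with large but opposing moves still produces a hold.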

Context allocation

Prior evidence should remain available in the forecast state so the model is not re-litigating the launch call every morning. New information should update the state, contradict it, expire it, or change the research budget.

Fallback honesty

Fallback entries should be labeled as fallback. They are operational continuity, not research. This protects the public record from confusing freshness with forecasting work.

Boundary

Open the benchmark surface, not every private skill.

Public by default

Forecast numbers and timestamps.
Source links and public evidence summaries.
Lock times and scorekeeping rules.
Round results, misses, and postmortems.
Methodology version summaries.

Closed while early

Exact private prompts and research skills.
Scratchpads and internal chain-of-thought.
Experimental challenger loops before promotion.
Private automation credentials or implementation details.
Licensed data that cannot be redistributed.

The right transparency target is reproducible scorekeeping, not full prompt disclosure. People should be able to judge whether the calls were made on time, what public evidence was cited, and how the score was computed.

Build

v1.0.1 ships the public schema; the next step is proving it through autonomous operation.

Shipped: Make fallback explicit

Add `entry_type`, `decision`, `new_evidence_count`, and `revision_threshold_pp` to daily public entries.
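A validation check for the four fields might look like the sketch below. The field names come from the text; the allowed values for `entry_type` and `decision` and the numeric constraints are assumptions.

```python
ENTRY_TYPES = {"research", "fallback"}  # assumed vocabulary
DECISIONS = {"hold", "revise"}          # assumed vocabulary for daily entries


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means the entry passes."""
    errors = []
    if entry.get("entry_type") not in ENTRY_TYPES:
        errors.append("entry_type must be 'research' or 'fallback'")
    if entry.get("decision") not in DECISIONS:
        errors.append("decision must be 'hold' or 'revise'")
    count = entry.get("new_evidence_count")
    if not isinstance(count, int) or isinstance(count, bool) or count < 0:
        errors.append("new_evidence_count must be a non-negative integer")
    threshold = entry.get("revision_threshold_pp")
    if not _is_number(threshold) or threshold <= 0:
        errors.append("revision_threshold_pp must be a positive number")
    return errors
```

Collecting every problem instead of failing on the first one makes a rejected entry easier to fix in a single pass.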

Shipped: Add forecast state

Introduce `data/forecast_state.json` with active hypotheses, open questions, stale signals, and uncertainty range.

Shipped: Add evidence ledger

Introduce `data/evidence_log.json` and require real research entries to cite evidence records.

Next: Lock and score

Create immutable `data/rounds/<period>/locked_forecast.json` and post-release score artifacts.
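A write-once lock could be sketched like this, assuming a plain filesystem layout under a repository root. The `write_lock` name, the `root` parameter, and the use of a SHA-256 digest are illustrative choices; exclusive-create mode only prevents this script from overwriting the file and is not a filesystem-level immutability guarantee.

```python
import hashlib
import json
from pathlib import Path


def write_lock(root: Path, period: str, forecast: dict) -> str:
    """Write the locked forecast exactly once and return its SHA-256 digest."""
    # Canonical JSON (sorted keys, fixed separators) gives a reproducible digest.
    payload = json.dumps(forecast, sort_keys=True, separators=(",", ":"))
    lock_path = root / "data" / "rounds" / period / "locked_forecast.json"
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if a lock for this period already exists.
    with open(lock_path, "x", encoding="utf-8") as handle:
        handle.write(payload)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Publishing the digest alongside the lock time would let outside readers confirm that the scored file matches what was locked.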

Next: Run challengers

Operate challenger methods privately or separately, then promote only if postmortem evidence supports the change.

v1.0.1: Publish seasons

Version methodology changes like benchmark seasons so long-run performance can be compared fairly. v1.0.1 adds the first methodology release record.

Refs

Primary and methodological references used for this revision.

Agent self-improvement: Hermes Agent README

Used for the closed learning loop pattern: skills, memory nudges, cron jobs, subagents, and persistent agent operation.

Agent self-improvement: Hermes v0.12 Curator release

Used for the curator model: scheduled skill grading, consolidation, pruning, archived reports, and scoped self-improvement.

Agent surface: Hermes v0.16 Surface release

Used for the release-surface lesson: autonomous systems need versioned public surfaces, not only private internal changes.

Skill evolution: Hermes Agent Self-Evolution

Used cautiously as an offline optimization reference for skills, prompts, and code. It is not treated as a model for live self-modification in production.

Benchmark design: Alpha Arena

Used for the live benchmark pattern: identical inputs, public model outputs, autonomous decisions, and leaderboard scoring.

Benchmark design: Nof1 technical post

Used for the repeated-decision harness idea and the distinction between model capability and benchmark behavior.

Official CPI target: Statistics Canada April 2026 CPI

Used to anchor the current launch round, the June 22 release date for May CPI, and the 2026 basket-weight update risk.

Inflation signals: Bank of Canada capacity and inflation indicators

Used to justify a broad evidence set rather than a CPI-only source routine.

Forecast repositories: US COVID-19 Forecast Hub dataset

Used for timestamped forecasts, model metadata, baseline/ensemble comparisons, validation, and public evaluation discipline.

Backtesting discipline: Hyndman on time-series cross-validation

Used for rolling-origin evaluation and avoiding hindsight leakage when replaying historical forecasts.

Scoring discipline: Gneiting and Raftery on proper scoring rules

Used to frame the future move from a point-score leaderboard to calibrated probabilistic forecasts.

Agent architecture: When Agents Trade

Used for the architecture lesson: context management, risk posture, and agent design can matter as much as the model backbone.