Context allocation for forecast alpha
A chief research office memo on why Inflation Agent is moving beyond simple publication automation toward a forecasting system for Canadian CPI built on explicit context allocation, skill evolution, locked and scored calls, and postmortems.
The benchmark only becomes interesting if the agent allocates context better than public forecasters over time.
The forecast can remain at 2.7% for several days if the evidence is stable. The goal is not to move the number every day; it is to decide how much attention belongs to history, the current state, the future path, and new research. v1.0.1 makes that context allocation explicit through forecast state, evidence records, decisions, skill governance, and methodology releases.
A fallback script can keep the site current, but it cannot prove that a real research pass happened.
Daily entries can now be traced to structured evidence, public forecast state, and a hold or revise decision.
Locked calls, external forecaster captures, official CPI actuals, postmortems, and curator releases create the track record.
The prior automation was useful but too shallow; v1.0.1 changes the operating model.
The site had a clean public arena: a current forecast, a release clock, a leaderboard shell, source links, validation tests, and a daily fallback publisher. That was enough for public continuity. It was not enough for a self-improving forecasting agent. v1.0.1 adds the missing public structure around state, evidence, decisions, skills, and methodology releases.
What the prior version already did well:
- Public forecast artifacts are simple and inspectable.
- The release lock is explicit before the Statistics Canada bell.
- The validation layer blocks malformed public JSON and unsafe fields.
- The fallback publisher prevents the site from silently going stale.
What it still lacked:
- No immutable lock file or post-release scoring artifact yet.
- No historical backtest or challenger promotion gate yet.
- No automated competitor forecast capture yet.
- No fully autonomous release curator proven over multiple rounds yet.
Research inputs and how they changed the project direction.
Live AI trading benchmark
Outcome benchmarks are more credible when models act under common rules, common inputs, public outputs, and real scoring.
Inflation Agent needs locked calls, common scoring rules, public decision summaries, and a leaderboard against consensus and banks.
Closed learning loop
Hermes treats skills as procedural memory, runs scheduled work as fresh jobs, supports subagents, and keeps a persistent agent surface.
Inflation Agent v1.0.1 adds public skill-family governance, bounded context packets, and recurring loops that can evolve without exposing private prompts.
Skill maintenance
The Curator pattern grades, consolidates, prunes, and reports on skills on a schedule, with scoped permissions and archives rather than uncontrolled mutation.
Inflation Agent adopts the principle as a methodology curator: review misses, user feedback, stale evidence, and skill performance before publishing versioned changes.
Timestamped forecast repository
Forecasts should be timestamped, formatted consistently, paired with metadata, and evaluated from the version that was actually public at the time.
Add immutable lock artifacts, methodology versions, and late-submission rules before the first scored CPI round.
Official scoring authority
The official CPI release defines the target, and the June 2026 basket update creates an unusually important source of forecast uncertainty.
Anchor scorekeeping to StatCan, track basket-weight changes explicitly, and keep public calls tied to the correct reference month.
Inflation indicator practice
Inflation pressure is assessed through a broad indicator set, not one CPI release in isolation.
The daily scout should include official data, public forecaster calls, labour, energy, shelter, demand, and credible alternative public signals.
Rolling origins and proper scoring
Forecast evaluation must avoid hindsight leakage and should use scoring rules that reward honest forecasts.
Start with absolute miss for readability, then add calibration and probabilistic scores once the forecast distribution is mature.
Architecture matters
Live agent results depend on context management, risk posture, and decision architecture, not only the base model.
Inflation Agent needs a harness with context allocation, specialist passes, abstention discipline, and postmortem-driven versioning.
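The scoring progression noted in the research inputs above (absolute miss first, richer scores later) can be sketched in a few lines. This is a minimal illustration, not the project's scorer: the function names, the forecaster labels, and every number below are hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def absolute_miss_pp(forecast_pct: float, actual_pct: float) -> float:
    """Absolute miss in percentage points, rounded to suppress float noise."""
    return round(abs(forecast_pct - actual_pct), 2)


def rank_round(calls: Dict[str, float], actual_pct: float) -> List[Tuple[str, float]]:
    """Rank locked calls by absolute miss; ties break alphabetically by name."""
    scored = [(name, absolute_miss_pp(call, actual_pct)) for name, call in calls.items()]
    return sorted(scored, key=lambda item: (item[1], item[0]))


# Hypothetical locked calls and a hypothetical official print (not real data).
locked_calls = {"agent": 2.7, "consensus": 2.6, "bank_a": 2.9}
leaderboard = rank_round(locked_calls, actual_pct=2.8)
```

Absolute miss is easy to read on a leaderboard but says nothing about stated uncertainty, which is why the memo treats calibration and probabilistic scores as a later step.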
The v1.0.1 harness separates evidence, state, decisions, skills, releases, locks, and scores.
Gather official CPI sources, public consensus, bank forecasts, labour, energy, shelter, food, mortgage, FX, shipping, and other sourceable inflation signals.
Normalize every observation into an append-only evidence record with source, timestamp, category, freshness, reliability, direction, magnitude, and forecast implication.
Maintain the accumulated forecast state: current point call, uncertainty range, context allocation, active hypotheses, open questions, stale signals, revision threshold, and model version.
Compare today's evidence against yesterday's state and output a structured decision: hold, revise, widen range, narrow range, lock, score, or postmortem.
Publish a compact public rationale. Do not publish private prompts, scratchpads, or internal chain-of-thought. Do publish the sources, drivers, decision, and changed evidence.
After official CPI releases or significant user feedback, score the round, classify the miss, review skill performance, and promote only changes that pass validation and carry a public rationale.
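The normalization step in the pipeline above could look roughly like the sketch below. The field list comes from the memo; the types, the encodings (a 0 to 1 reliability scale, a signed direction, ISO 8601 UTC timestamps), the `EvidenceLog` class, and the sample records are all assumptions made for illustration, not the published schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvidenceRecord:
    """One normalized observation; frozen so a stored record cannot be edited."""
    source: str
    timestamp: str             # assumed: ISO 8601, UTC, uniform format
    category: str
    freshness: str
    reliability: float         # assumed scale: 0.0 to 1.0
    direction: int             # assumed encoding: +1 up, -1 down, 0 neutral
    magnitude_pp: float        # assumed unit: percentage points
    forecast_implication: str


class EvidenceLog:
    """Append-only log: records can be added and read, never changed or removed."""

    def __init__(self) -> None:
        self._records: List[EvidenceRecord] = []

    def append(self, record: EvidenceRecord) -> None:
        self._records.append(record)

    def records(self) -> Tuple[EvidenceRecord, ...]:
        # Return an immutable copy so callers cannot alter the log in place.
        return tuple(self._records)

    def since(self, timestamp: str) -> List[EvidenceRecord]:
        # String comparison is only valid because timestamps share one format.
        return [r for r in self._records if r.timestamp > timestamp]


# Hypothetical records, for illustration only.
log = EvidenceLog()
old = EvidenceRecord("example-labour-source", "2026-05-30T13:00:00Z", "labour",
                     "stale", 0.8, 1, 0.03, "slight upside")
new = EvidenceRecord("example-energy-source", "2026-06-02T13:00:00Z", "energy",
                     "fresh", 0.7, -1, 0.02, "slight downside")
log.append(old)
log.append(new)
new_evidence = log.since("2026-06-01T00:00:00Z")
```

Here `len(new_evidence)` plays the role of a daily new-evidence count: the run compares the log against the timestamp of yesterday's state instead of re-reading every old link.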
Daily research should compound, but it should not overfit to noise.
The daily run should check whether new evidence exists, not merely summarize old links. If nothing material changed, the entry should say that in a precise way: no new official release, no fresh consensus move, no component signal strong enough to revise.
The forecast should move only when the estimated evidence delta exceeds the revision threshold. A 0.03 percentage point labour signal should not create a public 0.1 percentage point revision unless it compounds with other signals.
Prior evidence should remain available in the forecast state so the model is not re-litigating the launch call every morning. New information should update the state, contradict it, expire it, or change the research budget.
Fallback entries should be labeled as fallback. They are operational continuity, not research. This protects the public record from confusing freshness with forecasting work.
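The revision-threshold rule described in this section can be made concrete with a small sketch. Two things here are assumptions rather than the memo's method: signals are combined by simple addition, and the threshold of 0.05 percentage points is a made-up value chosen only to make the arithmetic visible.

```python
from typing import Iterable


def net_delta_pp(signal_deltas_pp: Iterable[float]) -> float:
    """Sum signed signal estimates (percentage points); additive by assumption."""
    return round(sum(signal_deltas_pp), 4)


def hold_or_revise(signal_deltas_pp: Iterable[float],
                   revision_threshold_pp: float = 0.05) -> str:
    """Return 'revise' only when the net delta strictly exceeds the threshold."""
    net = net_delta_pp(signal_deltas_pp)
    return "revise" if abs(net) > revision_threshold_pp else "hold"


# A lone 0.03 pp labour signal stays below the hypothetical threshold.
lone_signal = hold_or_revise([0.03])

# The same signal compounding with two others crosses it.
compounded = hold_or_revise([0.03, 0.02, 0.02])

# Offsetting signals cancel, so the public number does not move.
offsetting = hold_or_revise([0.03, -0.03])
```

A real harness would likely weight signals by reliability and freshness before summing; the point of the sketch is only that the hold decision is computed, not improvised.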
Open the benchmark surface, not every private skill.
Publish:
- Forecast numbers and timestamps.
- Source links and public evidence summaries.
- Lock times and scorekeeping rules.
- Round results, misses, and postmortems.
- Methodology version summaries.
Keep private:
- Exact private prompts and research skills.
- Scratchpads and internal chain-of-thought.
- Experimental challenger loops before promotion.
- Private automation credentials or implementation details.
- Licensed data that cannot be redistributed.
The right transparency target is reproducible scorekeeping, not full prompt disclosure. People should be able to judge whether the calls were made on time, what public evidence was cited, and how the score was computed.
v1.0.1 ships the public schema; the next step is proving it through autonomous operation.
Add `entry_type`, `decision`, `new_evidence_count`, and `revision_threshold_pp` to daily public entries.
Introduce `data/forecast_state.json` with active hypotheses, open questions, stale signals, and uncertainty range.
Introduce `data/evidence_log.json` and require real research entries to cite evidence records.
Create immutable `data/rounds/<period>/locked_forecast.json` and post-release score artifacts.
Operate challenger methods privately or separately, then promote only if postmortem evidence supports the change.
Version methodology changes like benchmark seasons so long-run performance can be compared fairly. v1.0.1 adds the first methodology release record.
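One way to make a locked forecast checkable after the fact is to publish a content hash alongside it. The sketch below is an assumption about how that could work, not a description of the project's lock format: the field names inside the payload, the period, the lock time, and the use of SHA-256 over canonical JSON are all illustrative choices.

```python
import hashlib
import json
from typing import Any, Dict


def canonical_json(payload: Dict[str, Any]) -> str:
    """Serialize with sorted keys and fixed separators so the hash is stable."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def make_lock(forecast: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a forecast with the SHA-256 digest of its canonical form."""
    digest = hashlib.sha256(canonical_json(forecast).encode("utf-8")).hexdigest()
    return {"forecast": forecast, "sha256": digest}


def verify_lock(lock: Dict[str, Any]) -> bool:
    """Recompute the digest and compare it with the published one."""
    expected = hashlib.sha256(
        canonical_json(lock["forecast"]).encode("utf-8")
    ).hexdigest()
    return lock["sha256"] == expected


# Hypothetical payload; every value here is a placeholder.
lock = make_lock({
    "period": "2026-05",
    "point_call_pct": 2.7,
    "locked_at": "2026-06-22T12:00:00Z",
    "methodology_version": "v1.0.1",
})
is_intact = verify_lock(lock)
```

A hash detects edits but does not prevent them; immutability still depends on where the artifact is stored and on the digest being published before the release.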
Primary and methodological references used for this revision.
Used for the closed learning loop pattern: skills, memory nudges, cron jobs, subagents, and persistent agent operation.
Used for the curator model: scheduled skill grading, consolidation, pruning, archived reports, and scoped self-improvement.
Used for the release-surface lesson: autonomous systems need versioned public surfaces, not only private internal changes.
Used cautiously as an offline optimization reference for skills, prompts, and code. It is not treated as production live self-modification.
Used for the live benchmark pattern: identical inputs, public model outputs, autonomous decisions, and leaderboard scoring.
Used for the repeated-decision harness idea and the distinction between model capability and benchmark behavior.
Used to anchor the current launch round, the June 22 release date for May CPI, and the 2026 basket-weight update risk.
Used to justify a broad evidence set rather than a CPI-only source routine.
Used for timestamped forecasts, model metadata, baseline/ensemble comparisons, validation, and public evaluation discipline.
Used for rolling-origin evaluation and avoiding hindsight leakage when replaying historical forecasts.
Used to frame the future move from point-score leaderboard to calibrated probabilistic forecasts.
Used for the architecture lesson: context management, risk posture, and agent design can matter as much as the model backbone.