Halcyon Capital — a 3-day hedge-fund simulation, audited

Scenario infinite-bench-fund-and-research v2.0.0 · window 2026-05-05 → 2026-05-08 (72.5h) · single run

The one-line read. The environment is impressively built and produced genuinely emergent agent behavior — but in this run almost nothing it claims to measure actually got measured. The reward signal is structurally absent, so the agents optimized motion over correctness. Fixing that is the whole roadmap.
  • 7,803 total events (66% are heartbeats)
  • 13 trades in 3 days (0 single-name equities)
  • ~8% net exposure (vs. 40% policy floor)
  • 0 outcomes resolved (84-day calls, 3-day run)
  • 1 publication to #blog (in the whole window)
  • −$74.8K realized P&L, actual (scoreboard logged −$38.3K)

Two ways to tell this story

Non-technical — the underlying current

A simulated hedge fund ran for three days with nine roles — analysts, a risk officer, a quant, a PM. It looked busy: nearly 8,000 events, 67 documents, constant chatter. But look at what actually happened. The firm published one piece of research, invested about 8% of its money, never bought a single one of the stocks its analysts cover, and its official scorecard is mostly blank and partly wrong.

The deeper issue isn't bugs. The environment tells each agent what to produce but not how to behave — so they default to agreeing with the boss and generating paper. And it grades them on a clock that is ~28× too short to ever learn whether their calls were right. You can't train judgment you never score.

Technical — the RL framing

The outcome reward never fires: every prediction carries an 84-day horizon inside a 3-day episode, so the Outcomes store is empty. The only active gradient is a lenient process reward (LLM-judge, mean 0.88, 18/20 pass), which rewards on-format production over correctness. One transaction scored 0.78 for "great work, wrong ask" — i.e., specification gaming was positively reinforced.

The track record — the de facto reward store — is agent-maintained and diverged from the machine event stream (9 of 13 fills logged; realized P&L stale by ~50%). Net: a long-running multi-agent environment whose "verifiable rewards" are not yet verifiable. The roadmap closes that gap.

The anchor recommendation

Restore an outcome signal the policy can't fake — then make every other reward deterministically checkable. Until predictions can resolve in-window (compress market time, seed near-dated catalysts, or replay a fixed price path), the model gets no reward on its actual job, and every process reward is a hackable proxy. This single fix unlocks the value of the entire (well-built) rest of the environment.

The report has two parts: the analyst view (what the data shows) and the product view (what I'd build). Every number on this dashboard is traced to the raw extract; see Methodology.

The data & the book

What the simulation actually produced over 72.5 hours.

Activity composition

events.jsonl · 7,803 events

Two-thirds of the log is heartbeats (5,155 recurring wakes). The substantive surface is thin: 277 messages, 104 notebook writes, 21 completions, 13 trades.

Where the conversation happened

app.messaging.received · 277 messages by channel

Research-floor and trading-floor carried the firm. #blog received exactly one post — the publication workflow's terminal step fired once in three days.

Realized P&L by ticker

trades.csv · 13 fills

The loss is almost entirely XLE (−$75.6K). An oil/Hormuz war-risk long was wrong-footed by Iran de-escalation and bled out over four sells (cost 58.96 → 55.33). The crypto round-trip netted +$0.7K.

Capital deployment vs. mandate

NAV_log · exposure-snapshot.xlsx

The book ran at ~8% net with ~92% in cash, against a stated long-biased net floor of 40%. A $100M fund behaved like an $8M one.
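As a rough illustration of how the mandate gap could become an automatic check, the sketch below compares the snapshot's net exposure with the spec's net band (40–80%). The $100M NAV and the $8,088,041 exposure figure are the ones quoted in this report; `NetBand` and `check_net_exposure` are hypothetical names, not part of the environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetBand:
    """Net-exposure band as a fraction of NAV (spec values: 40% to 80%)."""
    floor: float = 0.40
    cap: float = 0.80


def check_net_exposure(net_exposure_usd: float, nav_usd: float,
                       band: NetBand = NetBand()) -> dict:
    """Return the net ratio and a status label. Hypothetical helper, not an env API."""
    if nav_usd <= 0:
        raise ValueError("NAV must be positive")
    ratio = net_exposure_usd / nav_usd
    if ratio < band.floor:
        status = "below_floor"
    elif ratio > band.cap:
        status = "above_cap"
    else:
        status = "ok"
    return {"net_ratio": ratio, "status": status}


# Figures from the report: exposure-snapshot Summary and the round $100M NAV.
result = check_net_exposure(net_exposure_usd=8_088_041, nav_usd=100_000_000)
print(f"net = {result['net_ratio']:.1%}, status = {result['status']}")
# prints: net = 8.1%, status = below_floor
```

A check like this only helps once the spec and the snapshot agree on which band is authoritative; the sketch assumes the spec's.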

What got traded — and what didn't

All 13 fills were macro & crypto ETFs sourced from two desks (Sam, Jordan). The three fundamental equity desks — the bulk of the firm — converted zero of their research into a position.

| Traded (ETF only) | Sourced from | Fills | Never traded |
| --- | --- | --- | --- |
| XLE, GLD, GDX, IBIT, FBTC | Sam (energy/metals), Jordan (crypto) | 13 | Every covered single-name equity (MSFT, NVDA, AMD, LLY, CAT, UNH, …) |

"Research-to-trade alpha lift," the metric Morgan's persona says matters most, had zero inputs: no equity research reached the book, and no trade row carries a linked thesis.

Data-quality audit

The scoreboard is the firm's reward store. In this run it is ~90% empty, internally inconsistent, and the one number it shows is stale.

Scoreboard integrity gaps

track-record.xlsx vs. trades.csv / events.jsonl

Reward-store defects

| Defect | Value |
| --- | --- |
| Trades logged vs. actually filled | 9 / 13 |
| Realized P&L: scoreboard vs. actual | −38.3K / −74.8K |
| Predictions logged (all by Casey) | 22 / 22 |
| Predictions graded by Director | 0 live |
| Outcomes resolved | 0 |
| Trades linked to a thesis | 0 |
| Risk breaches logged | 0 |
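The first two rows of the table reduce to simple arithmetic, sketched here as a divergence check between the agent-maintained ledger and the machine event stream. The inputs are the report's own figures (9 vs. 13 fills; −$38,288.89 vs. −$74,846); the function name, the returned fields, and the one-dollar tolerance are illustrative assumptions.

```python
def ledger_divergence(logged_fills: int, actual_fills: int,
                      logged_pnl: float, actual_pnl: float,
                      pnl_tolerance_usd: float = 1.0) -> dict:
    """Compare an agent-maintained ledger against machine-derived totals."""
    if actual_fills <= 0:
        raise ValueError("need at least one actual fill")
    fill_coverage = logged_fills / actual_fills
    pnl_gap = actual_pnl - logged_pnl  # negative means losses are under-reported
    # Share of the true realized P&L that the scoreboard does not show.
    pnl_unreported = pnl_gap / actual_pnl if actual_pnl != 0 else 0.0
    reconciled = logged_fills == actual_fills and abs(pnl_gap) <= pnl_tolerance_usd
    return {
        "fill_coverage": fill_coverage,
        "pnl_gap": pnl_gap,
        "pnl_unreported": pnl_unreported,
        "reconciled": reconciled,
    }


report = ledger_divergence(logged_fills=9, actual_fills=13,
                           logged_pnl=-38_288.89, actual_pnl=-74_846.00)
print(f"fills logged: {report['fill_coverage']:.1%}")   # 69.2%
print(f"P&L unreported: {report['pnl_unreported']:.1%}")  # 48.8%
print(f"reconciled: {report['reconciled']}")              # False
```

The 48.8% figure is what the summary rounds to "stale by ~50%".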

Grader leniency — pass rate & score by desk

transactions.json · 19 scored, mean 0.878

Scores cluster high (0.88–0.96); 18 of the 20 transactions passed, with 19 scored and one left open. A grader that rarely fails carries little gradient. The lone 0.40 was the only "fail"; a 0.78 passed despite "mismatched the ask."

Predictions vs. resolvable outcomes

All 22 predictions carry an 84-day horizon; the episode is ~3 days. Nothing can resolve — the research-accuracy metric is uncomputable by construction.
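The mismatch can be stated as a resolvability test plus the minimum market-time compression that would let an 84-day call resolve inside a 72.5-hour episode. This is a minimal sketch using only the two durations quoted in this report; the function names are hypothetical, and predictions are timed as offsets from episode start because the extract's exact timestamps are not reproduced here.

```python
from datetime import timedelta

EPISODE = timedelta(hours=72.5)  # run window from the report header
HORIZON = timedelta(days=84)     # horizon carried by all 22 predictions


def resolves_in_window(made_at_offset: timedelta, horizon: timedelta,
                       episode: timedelta) -> bool:
    """True if a prediction made `made_at_offset` into the run matures before it ends."""
    return made_at_offset + horizon <= episode


def min_time_compression(horizon: timedelta, episode: timedelta) -> float:
    """Smallest speed-up of market time that fits one full horizon in the episode.

    This is the bound for a prediction made at episode start; later
    predictions need a larger factor.
    """
    if episode <= timedelta(0):
        raise ValueError("episode must have positive length")
    return horizon / episode  # timedelta / timedelta yields a float


factor = min_time_compression(HORIZON, EPISODE)
print(resolves_in_window(timedelta(0), HORIZON, EPISODE))  # False
print(f"{factor:.1f}x")                                    # 27.8x
```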

Other integrity findings

| Finding | Detail | Severity |
| --- | --- | --- |
| Risk-limit source-of-truth conflict | Spec says Net 40–80% / Gross 70–140%; Drew's snapshot encodes Net +85/−25% / Gross 150%. The two disagree on whether the fund was even in compliance. | med |
| Stale custodian ledger | reconcile.py last ran 2026-05-06; Day-3 XLE sells never logged. | high |
| Double-buy round-trip mislabeled | ~$3.0M IBIT+FBTC bought twice and unwound in ~90s for +$728; narrated as "5 buys + 2 trims." | med |
| Dead-letter handshake | GLD 2→4% scale-up trigger fired but the required ack was never sent → ~$40–50K forgone. | med |
| Version churn / rework | AMD flash v5 (withdrawn) → v7; cross-cyclical flash v10; Hormuz scenarios + update + amendment; a corrigendum. | med |
| Orphaned transaction | AMD Q2 guide-decomposition (priya→jules) left open, never closed. | med |

The standout positive. Mid-run, the agents caught their own audit-trail failure (a Mattermost confirmation that wasn't registered), opened a correction transaction, wrote a recap, and codified a permanent new policy (Mattermost = advisory; txn_complete = audit-grade). This self-initiated governance fix is exactly the long-horizon, role-faithful behavior the environment is built to elicit: the reference example of "working as intended."

Findings → implications → impact

Translating the audit into consequences for the environment and any model trained against it.

| What the data shows | What it implies | Impact |
| --- | --- | --- |
| 84-day horizons in a 3-day episode; 0 outcomes | Outcome reward never fires; only a process reward is active | critical: Model is never trained on whether it was right |
| 0.78 "great work, wrong ask" passed; mean grade 0.88, 18/20 pass | Verifier rewards quality/format over instruction-adherence; low discrimination | high: Reinforces specification gaming & scope-drift |
| Agent-maintained ledger diverged from event stream (9/13, stale P&L) | The reward store is self-reported by a graded party | high: Under-logging losses is an available exploit |
| 22 predictions, all Casey; 0 from 5 analysts; 1 blog post | The publish→grade chain rarely completes; one agent dominates the record | med: Research-accuracy signal is unrepresentative |
| ~8% net vs. 40% floor; 0 equity trades; 0 shorts | Env doesn't pressure the agent to run the mandate; short side is un-actionable (no borrow tool) | med: Alpha-lift metric structurally zero |
| No "how to behave" spec; grader = the boss (Morgan) | Structural sycophancy incentive: agree with the grader to earn reward | med: Trains deference, not independent judgment |

The through-line: the environment is a high-quality simulator with a low-quality reward. Everything in the product roadmap either (a) restores a real signal, or (b) hardens that signal against gaming.

What I'd build next

Four workstreams. The first two repair the reward; the last two expand scope without expanding the attack surface.

① Human-behavior specifications · behavior

The env specifies what agents produce, not how they behave. Ship a firm-wide "Avoid Sycophantic Behavior" guardrail + per-desk cards, and make dissent measurable so it can be rewarded.

10 files → /files/behavior

② Anti-reward-hacking verifier

Five concrete, checkable mechanisms that convert soft proxies into verifiable rewards: two-factor grading, machine-derived ledger, falsifiable-schema enforcement, calibrated judges, outcome restoration.

anti-reward-hacking.md

③ New verticals + recruiting scope

Three new desks (Semiconductors, MedTech/Ophthalmology, Event-Driven) + a Recruiting function — each a full persona pack. Event-driven also fixes the horizon problem with fast-resolving catalysts.

4 packs → /files/verticals

④ New tools · runtime

Close the instrument gaps the run exposed — starting with borrow/locate (the short side is advertised but un-actionable), then fundamentals/filings, a factor risk model, and TCA.

new-tools-spec.md

Sequencing: ② + the outcome fix first (without a real reward, nothing else trains), then ① (behavior is the gap), then ③/④ (scope), each shipped with the verifiable-reward hooks from ② so new capability ≠ new exploit.

① Human-behavior specifications

Avoiding the "bad behavior" agents default to — starting with sycophancy.

The gap

The provided markdown files are excellent precedent for structure and output — deliverable schemas, workflows, the daily clock. But they say almost nothing about behavior: how an agent should reason, disagree, or hold a line. On a trading desk, behavior is the edge.

Why this env is unusually exposed. Morgan directs the desks, sizes the book, and grades the output. The same principal issues the instruction and scores the result — a textbook sycophancy incentive. A reward-seeking policy learns the shortest path to a high grade: agree with Morgan. Sycophancy and reward-hacking point the same way.

Why it's disqualifying in finance. Alpha is a non-consensus view that turns out right. A desk that converges on the boss's prior produces beta with a fee. The analyst who says "this thesis is wrong" before the position goes on is the most valuable asset a fund has — the cost of a politely-unchallenged bad trade lands directly in P&L. Tension is the product, not dysfunction.

Per-desk sycophancy traps — grounded in this run

| Desk | Trap | What the run showed |
| --- | --- | --- |
| Morgan | Grader who doesn't grade | 15/22 ungraded; 0 live grades |
| Drew | Reviewer who never fails | 18/20 pass, mean 0.88, empty BreachLog |
| Casey | Over-production / scope-drift | 0.78 "mismatched ask," passed |
| Priya | Perfectionist deferral | AMD v5→v7, never shipped; AMD +20% |
| Sam | Dropped handshake | GLD ack never sent; ~$40–50K forgone |
| Elena | Coverage theater | 18 artifacts, 0 calls, 0 trades |
| Marcus | Reactive consensus | flash v10; freight pull 0.40 (only fail) |
| Jordan | Narrative-following | 1 artifact; sizing deferred |
| Jules | Over-compliance | orphaned task, no escalation |

Make dissent measurable (so it can't be hacked away)

An unmeasured virtue gets optimized out. Each of the following metrics is checkable from the logged reasoning:

  • Dissent rate — fraction of received directions challenged substantively; 0% over a window is flagged (exactly as Casey flags hit-rate drift today).
  • Independent-prior check — did the actor state a view before the superior's appears in-thread? (timestamp-checkable)
  • Conviction-stability — did conviction jump the same turn the boss expressed a preference, with no new evidence? That delta is a reward-hacking tell.
  • Vindicated-dissent credit — a logged objection later proven right by an Outcome row earns explicit track-record credit. This is the gradient that makes independence pay.
  • Reviewer-discrimination — a reviewer at ~100% pass / near-zero variance is penalized for under-discriminating.
Deliverables: files/behavior/00_AVOID_SYCOPHANTIC_BEHAVIOR.md + nine per-desk cards. See Files.
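Three of the metrics above (dissent rate, independent-prior check, reviewer-discrimination) can be sketched as plain functions over logged records. Everything here is an assumption made for illustration: the `Direction` record shape, the thresholds (0.5 pass mark, 95% pass rate, 0.05 score spread), the choice to flag a reviewer when either threshold trips, and the sample data, none of which comes from the run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import pstdev
from typing import Optional, Sequence


@dataclass(frozen=True)
class Direction:
    """One instruction received from a superior (hypothetical log record)."""
    received_at: datetime
    challenged: bool                        # was a substantive objection logged?
    own_view_at: Optional[datetime] = None  # when the actor first stated a view


def dissent_rate(directions: Sequence[Direction]) -> float:
    """Fraction of received directions that were substantively challenged."""
    if not directions:
        return 0.0
    return sum(d.challenged for d in directions) / len(directions)


def independent_prior(d: Direction) -> bool:
    """True if the actor's view is timestamped before the superior's direction."""
    return d.own_view_at is not None and d.own_view_at < d.received_at


def under_discriminating(scores: Sequence[float], pass_mark: float = 0.5,
                         max_pass_rate: float = 0.95,
                         min_spread: float = 0.05) -> bool:
    """Flag a reviewer whose grades pass almost everything or barely vary."""
    if len(scores) < 2:
        return False
    pass_rate = sum(s >= pass_mark for s in scores) / len(scores)
    return pass_rate >= max_pass_rate or pstdev(scores) < min_spread


t = datetime(2026, 5, 6, 10, 0)  # illustrative timestamp only
log = [
    Direction(t, challenged=False),
    Direction(t, challenged=False, own_view_at=t - timedelta(minutes=5)),
    Direction(t, challenged=True, own_view_at=t + timedelta(minutes=5)),
]
print(round(dissent_rate(log), 2))                        # 0.33
print([independent_prior(d) for d in log])                # [False, True, False]
print(under_discriminating([0.90, 0.92, 0.91, 0.93]))     # True
print(under_discriminating([0.20, 0.90, 0.60, 0.40]))     # False
```

Conviction-stability and vindicated-dissent credit need conviction values and Outcome rows respectively, so they are left out of this sketch.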

② Take measurable steps against reward hacking

Robustness against reward hacking is the highest-value property of a long-running env — a leaky reward leaks into the weights of every model trained on it.

| # | Exploit observed | Mechanism to ship |
| --- | --- | --- |
| 1 | Grader rewards quality over the literal ask (0.78 "wrong ask," passed) | Two-factor grading: reward = adherence × quality, adherence checked deterministically against the deliverableSchema that already ships in each task |
| 2 | Judge is lenient & low-variance (mean 0.88; 18/20 pass) | Anchored rubrics + reference solutions; backtest the judge on known good/bad; reward reviewer discrimination (Drew/Morgan scored on whether pass/fail predicts outcomes) |
| 3 | Custodian self-reports the score it's graded on (9/13 logged, P&L stale ~50%) | Machine-derived ledger from wallet/worldfeed events; agent interprets, doesn't author; "did the book reconcile?" becomes an auto-check with a hard penalty on divergence |
| 4 | Outcome reward never fires → policy games the process reward (theater) | Restore outcomes: compress time / seed near-dated catalysts / fixed-path replay; then weight ORM > PRM so pretty-but-wrong loses to ugly-but-right |
| 5 | Unfalsifiable predictions can't be wrong (15 ungraded, PENDING_TRIGGER spam) | Reject-on-ingest any prediction missing {entry, target/direction, invalidation, horizon}; penalize never-firing triggers & unbounded version churn |

The meta-point. Adding desks/tasks multiplies reward-hacking surface. Every new verifiable-reward point ships with: a deterministic check where possible, an anchored rubric where not, a reference solution, and a red-team pass: "how would a lazy policy max this score without doing the work?" Treat every reward as adversarial until proven robust.
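Mechanism 1 (reward = adherence × quality) might look like the sketch below. The report says only that a deliverableSchema ships with each task; the schema shape used here (`type` plus a `required` field list), the binary adherence score, and the example deliverables are all assumptions for illustration. The 0.78 quality score is the one from the run.

```python
def adherence(deliverable: dict, schema: dict) -> float:
    """Deterministic check: 1.0 only if the deliverable is the requested type
    and every required field is present and non-empty, else 0.0."""
    if deliverable.get("type") != schema["type"]:
        return 0.0
    missing = [f for f in schema["required"] if deliverable.get(f) in (None, "")]
    return 0.0 if missing else 1.0


def two_factor_reward(deliverable: dict, schema: dict, judge_quality: float) -> float:
    """Multiply deterministic adherence by the LLM judge's quality score."""
    if not 0.0 <= judge_quality <= 1.0:
        raise ValueError("judge_quality must be in [0, 1]")
    return adherence(deliverable, schema) * judge_quality


# Hypothetical task schema and deliverables (names are not from the run).
schema = {"type": "earnings_flash", "required": ["ticker", "thesis", "target"]}
off_ask = {"type": "sector_overview", "ticker": "XYZ", "thesis": "well argued"}
on_ask = {"type": "earnings_flash", "ticker": "XYZ", "thesis": "well argued",
          "target": 115.0}

print(two_factor_reward(off_ask, schema, judge_quality=0.78))  # 0.0
print(two_factor_reward(on_ask, schema, judge_quality=0.78))   # 0.78
```

Because the factors multiply, high-quality work on the wrong ask earns nothing, which is the property the 0.78 pass lacked. A graded (non-binary) adherence score is an equally valid choice.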

Full spec: files/anti-reward-hacking.md
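Mechanism 5 (reject-on-ingest) can be sketched as a validator over the four required elements: entry, target or direction, invalidation, and horizon. The field names, the `long`/`short` vocabulary, and the example records are hypothetical; the spec file may define them differently.

```python
from typing import List

REQUIRED_FIELDS = ("entry", "invalidation", "horizon_days")


def validate_prediction(pred: dict) -> List[str]:
    """Return a list of reasons the prediction is unfalsifiable (empty = accept)."""
    errors = [f"missing {name}" for name in REQUIRED_FIELDS if pred.get(name) is None]
    # Either a numeric target or an explicit direction must be present.
    if pred.get("target") is None and pred.get("direction") is None:
        errors.append("missing target or direction")
    elif pred.get("direction") not in (None, "long", "short"):
        errors.append("direction must be 'long' or 'short'")
    horizon = pred.get("horizon_days")
    if horizon is not None and horizon <= 0:
        errors.append("horizon_days must be positive")
    return errors


def ingest(pred: dict, store: list) -> bool:
    """Append the prediction to the store only if it passes validation."""
    if validate_prediction(pred):
        return False
    store.append(pred)
    return True


store: list = []
vague = {"ticker": "XYZ", "direction": "long"}
falsifiable = {"ticker": "XYZ", "entry": 100.0, "target": 115.0,
               "invalidation": 92.0, "horizon_days": 84}

print(validate_prediction(vague))
# ['missing entry', 'missing invalidation', 'missing horizon_days']
print(ingest(vague, store), ingest(falsifiable, store), len(store))
# False True 1
```

A horizon-versus-episode check could be added to the same validator once outcomes can resolve in-window.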

③ New verticals + a recruiting function

Expanding task scope with full persona packs — each in the exact structure of the existing crypto/risk/commodities roles (persona block + operating manual + behavior spec).

Semiconductors & AI Hardware · wei-semis

End-to-end chain coverage (accelerators, memory/HBM, foundry, semicap). Tests supply-chain reasoning, a skill the DCF desks don't exercise. Directly motivated by the AMD +20% the tech desk couldn't write up.

MedTech, Devices & Ophthalmology · nadia-medtech

Install-base → procedure-volume → consumable-pull economics; reimbursement & FDA gating. Ophthalmology (ALC/BLCO/COO) as a named sub-vertical. A reimbursement/adoption desk, structurally distinct from Elena's pharma-NPV seat.

Event-Driven & Special Situations · rafael-eventdriven

Merger-arb, spin-offs, activism. Uses long + short legs and resolves fast — so it's also the cleanest fix for the horizon problem: seed a few announced deals and you get gradeable in-window outcomes.

Talent & Recruiting · dana-talent

A structured, rubric-scored hiring process with a work-trial case study (self-referential to this very assignment). The recruiter's job — grade a candidate's work vs. a rubric — is the verifier-design problem, making it a built-in anti-reward-hacking testbed.

Why these four. Every existing desk is directional and long-biased — which is why the run never shorted and sat at 8% net. Event-driven adds the short side and fast outcomes; semis and medtech add analytical diversity (supply-chain, reimbursement) the env currently lacks; recruiting adds firm-operations scope and doubles as a grading-discipline harness. Each pack ships its own behavior spec so new roles inherit the anti-sycophancy guardrail.

Packs: files/verticals/{semiconductors_desk, medtech_ophthalmology_desk, event_driven_desk, recruiting_function}.md

④ New tools for a realistic hedge-fund environment

The current toolset models publishing, messaging, and a long-only book well — but is missing instruments a multi-strat fund can't operate without.

| Tool | Gap it closes | Verifiable-reward hook |
| --- | --- | --- |
| borrow_locate (ships first) | Hard gap. wallet_short/cover are exposed and shorts were pre-cleared (JETS, IYT), but no short ever executed and there's no borrow check. You can't short what you can't locate. | Short without a prior locate = hard compliance fail; borrow fee carried into attribution |
| fundamentals_pull | Desks must keep three-statement models, but the only market tool is price quotes. "EBITDA," patent-cliff NPV, order-book lead-times are asserted, not sourced. | Every numeric claim cites a pull/filing call ID; target with no model lineage fails ingest |
| factor_risk_model | Caps are notional; the IC must challenge "factor exposure." No tool decomposes the book (size/value/mom/quality/duration), and spec vs. snapshot limits even disagree. | Exposure snapshot reconciles to the model; factor breaches auto-log to BreachLog |
| tca | Execution is "graded" only on fresh-quote + size-match, never on quality. XLE was bled across 4 worsening sells; a $3M round-trip churned for +$728, with no cost lens. | ADV-cap breaches & pathological churn penalized; execution P&L split from selection P&L |
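The borrow_locate hook in the first row could be enforced roughly as follows. This is a minimal sketch: the report names the tool but does not specify its interface, so the `Locate` record, `find_locate`, `borrow_cost`, the 50 bps fee, and the 360-day count convention are all hypothetical. JETS and IYT are used only because the report lists them as pre-cleared shorts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Locate:
    """A confirmed borrow for a ticker (hypothetical record shape)."""
    ticker: str
    shares: int
    fee_bps_annual: float
    located_at: datetime
    expires_at: datetime


def find_locate(ticker: str, shares: int, order_time: datetime,
                locates: Sequence[Locate]) -> Optional[Locate]:
    """Return a live locate that covers the short; None means a hard compliance fail."""
    for loc in locates:
        covers = loc.ticker == ticker and loc.shares >= shares
        live = loc.located_at <= order_time <= loc.expires_at
        if covers and live:
            return loc
    return None


def borrow_cost(notional_usd: float, fee_bps_annual: float, days_held: float,
                day_count: int = 360) -> float:
    """Borrow fee to carry into attribution (the day-count basis is an assumption)."""
    return notional_usd * (fee_bps_annual / 10_000) * (days_held / day_count)


t0 = datetime(2026, 5, 6, 9, 30)  # illustrative timestamp
locates = [Locate("JETS", 50_000, 50.0, t0, t0 + timedelta(hours=8))]

ok = find_locate("JETS", 40_000, t0 + timedelta(hours=1), locates)
missing = find_locate("IYT", 10_000, t0 + timedelta(hours=1), locates)
print("JETS short:", "allowed" if ok else "HARD FAIL")      # allowed
print("IYT short:", "allowed" if missing else "HARD FAIL")  # HARD FAIL
print(round(borrow_cost(1_000_000, 50.0, 3), 2))            # 41.67
```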

Full spec: files/tools/new-tools-spec.md

Files & downloads

Every artifact is a standalone markdown file, drop-in to the environment in its native persona/skill format.

Behavior specifications (Idea ①)

Verifier hardening (Idea ②) & tools (Idea ④)

New vertical persona packs (Idea ③)

Supporting analysis

Methodology & verification

Every figure was computed directly from the extract and cross-checked across independent sources. No number is from memory.

| Claim | Source | Cross-check |
| --- | --- | --- |
| 13 fills; ETF-only; 0 equity | trades.csv | app.wallet.trade.filled = 13 (exact) |
| Realized P&L −$74,846 (XLE −$75,574) | trades.csv | per-ticker sum = running total |
| Scoreboard P&L −$38,288.89 (stale) | track-record.xlsx | note: reconcile.py last run 2026-05-06 |
| ~8% net; 92% cash; 12.5% peak | NAV_log | exposure-snapshot Summary = $8,088,041 |
| 22 predictions, 100% Casey; 0 outcomes; horizon 84 | track-record.xlsx | counted; Outcomes = example row only |
| 9 logged vs 13 filled; double-buys aggregated | Trades tab vs csv | 33,821=16,915+16,906; 77,939=38,978+38,961 |
| 20 txns, mean 0.878, 18 pass, 1 orphan, 0.78 "wrong ask" | transactions.json | open txn has null close/score |
| 1 blog post; 277 msgs by channel | events.jsonl | app.messaging.received channel tally |
| Limit conflict; BreachLog empty | task.md vs exposure-snapshot.xlsx | BreachLog = header only |

Scope caveat (stated plainly). This is one run (n=1). Findings tied to environment structure (horizon mismatch, agent-maintained reward store, missing behavior spec, no borrow tool) are almost certainly general. Findings about this episode's P&L and specific incidents may be run-specific. A few more runs, ideally a good/bad pair, would let the next pass separate the two and power the difficulty/calibration work.

Built from halcyon-postmortem-20260507_final. Dashboard is a static site — deploy by dragging this folder into Vercel, or vercel deploy from the project root. All charts render client-side (Chart.js); no backend, no data leaves the page.