Halcyon Capital — a 3-day hedge-fund simulation, audited

Scenario infinite-bench-fund-and-research v2.0.0 · window 2026-05-05 → 2026-05-08 (72.5h) · single run

The bottom line. Halluminate built a genuinely realistic hedge-fund simulation and had an AI run every role for three days. The catch: the simulation can't yet tell whether the AI did a good job — so the AI ended up rewarded for looking busy rather than for making money. Everything in this roadmap is about fixing how the AI is scored.

7,803 total events (66% are heartbeats)

13 trades in 3 days (0 single-name equities)

~8% net exposure (vs. 40% policy floor)

0 outcomes resolved (84-day calls, 3-day run)

1 publication to #blog (in the whole window)

−$74.8K realized P&L, actual (scoreboard logged −$38.3K)

Two ways to tell this story

In plain terms

A hedge fund is judged on one question: did your calls make money? Over three days this firm generated nearly 8,000 events and 67 documents — but it published only one piece of research, made just 13 trades, and never once bought a stock its own analysts cover. It looked extremely busy and did almost no real investing.

Why? Because nothing it did could be scored as right or wrong (see "For the ML / infra side" below), the only thing left to reward was activity, so activity is what it produced.

For the ML / infra side

The simulation is supposed to train and grade an AI by rewarding good behavior. But the "was the call right?" reward can never pay out here: every prediction the AI logged was a 3-month (84-day) bet, and the run lasted 3 days — so nothing resolves. The only signal left is "did you produce a tidy memo," which is exactly what the AI optimized for.

On top of that, the AI kept the firm's own scorecard and under-recorded it: 9 of 13 trades logged, and reported losses about half the real figure. So even the numbers it's graded on can't be trusted.

The anchor recommendation

Make the AI's calls actually resolve, and take the scorecard out of its hands. Run the simulation long enough, or seed it with near-term events like earnings dates, so that the AI's predictions come true or fail within the run. Then you can finally score whether it was right, not just whether it was busy. And compute the firm's books automatically from the trade log instead of letting the AI grade its own homework. Every other recommendation builds on this one.
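A minimal sketch of what "compute the books from the trade log" could look like, assuming an average-cost convention on a long-only book. The `Fill` type, its field names, and the share quantities are hypothetical and are not the schema of trades.csv; only the two XLE prices are taken from the run, and the GLD price is a placeholder.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fill:
    ticker: str
    side: str     # "buy" or "sell" (hypothetical encoding)
    qty: float
    price: float


def realized_pnl(fills):
    """Realized P&L per ticker, average-cost method, long-only positions."""
    position = defaultdict(float)   # shares currently held
    avg_cost = defaultdict(float)   # average cost per held share
    realized = defaultdict(float)   # only tickers with at least one sell appear
    for f in fills:
        if f.side == "buy":
            total_cost = avg_cost[f.ticker] * position[f.ticker] + f.price * f.qty
            position[f.ticker] += f.qty
            avg_cost[f.ticker] = total_cost / position[f.ticker]
        elif f.side == "sell":
            if f.qty > position[f.ticker]:
                raise ValueError(f"sell exceeds position in {f.ticker}")
            realized[f.ticker] += (f.price - avg_cost[f.ticker]) * f.qty
            position[f.ticker] -= f.qty
        else:
            raise ValueError(f"unknown side: {f.side}")
    return dict(realized)


# Quantities are made up; 58.96 and 57.64 are the XLE entry and first exit prices.
fills = [
    Fill("XLE", "buy", 1000, 58.96),
    Fill("XLE", "sell", 400, 57.64),
    Fill("GLD", "buy", 100, 300.00),   # never sold, so it has no realized P&L
]
pnl = realized_pnl(fills)
```

Because the ledger is derived from fills alone, a position that was bought and never sold (GLD here) simply has no realized entry, which matches how the report treats GLD and GDX.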

Key reference files

verified_stats.json, anti-reward-hacking.md, 00_AVOID_SYCOPHANTIC_BEHAVIOR.md

The report has two parts: the analyst view (what the data shows) and the product view (what I'd build). Every number is traced to the raw extract; see Methodology & verification.

The data & the book

What the simulation actually produced over 72.5 hours.

Activity composition

events.jsonl · 7,803 events

Two-thirds of the log is heartbeats (5,155 recurring wakes). The substantive surface is thin: 277 messages, 104 notebook writes, 21 completions, 13 trades.

Where the conversation happened

app.messaging.received · 277 messages by channel

Research-floor and trading-floor carried the firm. #blog received exactly one post — the publication workflow's terminal step fired once in three days.

Realized P&L by ticker

trades.csv · 13 fills

This is realized P&L — money actually locked in by selling. The loss is almost entirely XLE: bought at $58.96, then sold down at $57.64 → $57.27 → $55.82 → $55.33, locking in −$75.6K as oil fell. The crypto trades netted +$0.7K. GLD and GDX read $0 because they were never sold — their gains are unrealized, so they don't appear here. Every figure ties to a row in trades.csv.

Capital deployment vs. mandate

NAV_log · exposure-snapshot.xlsx

The book ran at ~8% net with ~92% in cash, against a stated long-biased net floor of 40%. A $100M fund behaved like an $8M one.
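The arithmetic behind the ~8% figure can be checked in a few lines. This sketch assumes a round $100M NAV and uses the exposure-snapshot Summary value of $8,088,041 together with the 40–80% net band from the master spec; the function name is illustrative.

```python
def net_exposure_check(net_exposure_usd, nav_usd, floor=0.40, cap=0.80):
    """Return (net exposure as a fraction of NAV, whether it is inside the band)."""
    pct = net_exposure_usd / nav_usd
    return pct, floor <= pct <= cap


# NAV of $100M is assumed as a round figure; exposure is from the snapshot Summary.
pct, within_band = net_exposure_check(8_088_041, 100_000_000)
print(f"net exposure {pct:.1%}, within 40-80% band: {within_band}")
# prints: net exposure 8.1%, within 40-80% band: False
```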

What got traded — and what didn't

All 13 fills were macro & crypto ETFs sourced from two desks (Sam, Jordan). The three fundamental equity desks — the bulk of the firm — converted zero of their research into a position.

Traded (ETF only)	Sourced from	Fills	Never traded
XLE, GLD, GDX, IBIT, FBTC	Sam (energy/metals), Jordan (crypto)	13	Every covered single-name equity (MSFT, NVDA, AMD, LLY, CAT, UNH, …)

"Research-to-trade alpha lift" — the metric Morgan's persona says matters most — had zero inputs: no equity research reached the book, and no trade row carries a linked thesis.

Reference files

Supporting data: verified_stats.json

Roadmap specs: anti-reward-hacking.md, new-tools-spec.md, 00_AVOID_SYCOPHANTIC_BEHAVIOR.md

New vertical persona packs: semiconductors_desk.md, medtech_ophthalmology_desk.md, event_driven_desk.md, recruiting_function.md

Data-quality audit

The scoreboard is the firm's official record of how it's doing. In this run it's mostly empty, contradicts itself, and the one number it does show is wrong.

Scoreboard integrity gaps

track-record.xlsx vs. trades.csv / events.jsonl

What's missing from the scorecard

Defect	Value
Trades logged vs. actually filled	9 / 13
Realized P&L: scoreboard vs. actual	−38.3K / −74.8K
Predictions logged (all by Casey)	22 / 22
Predictions graded by Director	0 live
Outcomes resolved	0
Trades linked to a thesis	0
Risk breaches logged	0

Grader leniency — pass rate & score by desk

transactions.json · 19 scored, mean 0.878

Scores cluster high (0.88–0.96), and 18 of the 20 transactions passed. Of the other two, one was the lone fail at 0.40 and one was left open and never scored, which is why only 19 scores exist. A grader that almost never fails gives almost no useful signal: one piece scored 0.78 and passed even though the review note itself said it "mismatched the ask."

Predictions vs. resolvable outcomes

All 22 predictions are 3-month (84-day) calls; the run lasted ~3 days. None can come true or fail in time, so "how accurate was the research?" can never be calculated here.
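The horizon mismatch can be made concrete with a small check. This is a sketch under stated assumptions: the run is taken to start at midnight on 2026-05-05, and every prediction is treated as logged at the very first moment of the run, which is the most favorable case for resolution. The function and variable names are illustrative.

```python
from datetime import datetime, timedelta

RUN_START = datetime(2026, 5, 5)      # window start date; midnight is an assumption
RUN_LENGTH = timedelta(hours=72.5)    # run length from the scenario header
HORIZON = timedelta(days=84)          # horizon on every logged prediction


def resolves_in_window(logged_at, horizon, run_end):
    """True if the prediction's resolution time falls inside the run."""
    return logged_at + horizon <= run_end


run_end = RUN_START + RUN_LENGTH
# Most favorable case: all 22 predictions logged at the start of the run.
logged_times = [RUN_START] * 22
n_resolvable = sum(resolves_in_window(t, HORIZON, run_end) for t in logged_times)
horizon_to_run_ratio = HORIZON / RUN_LENGTH

print(n_resolvable)                     # 0
print(round(horizon_to_run_ratio, 1))   # 27.8
```

Even in the best case the horizon is roughly 28 times the run length, so a call would need a horizon of about 3 days or less to be gradeable in this window.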

Other things that went wrong (all verified in the raw files)

Each row is checked against timestamps, spreadsheet cells, or the trade log — these are real events from the run, in plain English.

What happened	The detail	Severity
The firm's own books understate the loss by half	The official P&L spreadsheet says it lost $38.3K. The real loss was $74.8K. The script that updates the spreadsheet last ran on Day 2, so the Day-3 XLE sales (down to $55) were never recorded.	high
A trade was placed twice and unwound	At 13:35 it bought IBIT + FBTC; one minute later it bought the identical amounts again, then sold the duplicate two minutes later for a $728 gain. The note described this as "5 buys + 2 trims," which hides the duplicate.	med
A pre-approved trade never happened	Standing rule: if gold rose >1.5%, lift the gold position from 2% to 4%. Gold rose +3.03%, but the required confirmation message was never sent — so the increase never fired. The risk file itself estimates ~$40–50K of missed gains.	med
The same documents were rewritten over and over	The AMD note went from v5 (withdrawn) to v7; a sector note reached "v10"; the oil-scenario note had a base version plus an "update" and an "amendment" — lots of redoing, little finishing.	med
A task was assigned and never finished	An analyst asked the junior for an "AMD earnings breakdown." It was opened and left open — never completed by the end of the run.	med
Two rulebooks disagree on the risk limits	The master spec says the fund's net exposure should be 40–80%. The risk officer's own spreadsheet says the limit is +85% / −25%. There's no single source of truth for "are we within the rules?"	med

The standout positive. Mid-run, the agents caught their own audit-trail failure (a Mattermost confirmation that wasn't registered), opened a correction transaction, wrote a recap, and codified a permanent new policy (Mattermost = advisory; txn_complete = audit-grade). This self-initiated governance fix is exactly the long-horizon, role-faithful behavior the environment is built to elicit — the reference example of "working as intended."

So what — what we saw, why it matters, what to do

The same findings as the audit, said simply, with the action each one points to.

What we saw	Why it matters	Severity	What to do
Every prediction was a 3-month call, but the run lasted 3 days, so none could resolve.	The AI is never told whether its calls were right.	critical	Lengthen or structure the run so calls resolve in time.
A memo that ignored the request still scored 0.78 and passed; the average grade was 0.88 and 18 of 20 passed.	The grader rewards polished work even when it's off-target, and almost never fails anything.	high	Grade "did it do what was asked?" first; let weak work fail.
The AI kept the firm's scorecard and got it wrong (9 of 13 trades logged; half the losses).	The numbers used to judge the AI are produced by the AI.	high	Compute the books automatically from the trade log.
All 22 predictions came from one role (the quant); the other desks published almost nothing.	The "research quality" record reflects one desk, not the whole firm.	med	Make every desk produce calls that can be graded.
The fund invested only ~8% of its money and never shorted, though it was built to do both.	Most of the firm's intended activity simply never happened.	med	Add the missing tools (e.g. stock-borrow) and pressure to actually invest.
The boss both directs the analysts and grades their work.	The AI learns that agreeing with the boss is what earns a good grade.	med	Reward independent pushback, and measure it.

The one sentence to remember: the simulation is a great stage with no working scoreboard — so it rewards looking busy over being right. Fix the scoreboard first; everything else follows.

What I'd build next

Four workstreams. The first two repair the reward; the last two expand scope without expanding the attack surface.

I · Human-behavior specifications (behavior)

The env specifies what agents produce, not how they behave. Ship a firm-wide "Avoid Sycophantic Behavior" guardrail + per-desk cards, and make dissent measurable so it can be rewarded.

10 files → /files/behavior

II · Anti-reward-hacking (verifier)

Five concrete, checkable mechanisms that convert soft proxies into verifiable rewards: two-factor grading, machine-derived ledger, falsifiable-schema enforcement, calibrated judges, outcome restoration.

anti-reward-hacking.md

III · New verticals + recruiting (scope)

Three new desks (Semiconductors, MedTech/Ophthalmology, Event-Driven) + a Recruiting function — each a full persona pack. Event-driven also fixes the horizon problem with fast-resolving catalysts.

4 packs → /files/verticals

IV · New tools (runtime)

Close the instrument gaps the run exposed — starting with borrow/locate (the short side is advertised but un-actionable), then fundamentals/filings, a factor risk model, and TCA.

new-tools-spec.md

Sequencing: II + the outcome fix first (without a real reward, nothing else trains), then I (behavior is the gap), then III/IV (scope), each shipped with the verifiable-reward hooks from II so new capability ≠ new exploit.

I · Human-behavior specifications

Avoiding the "bad behavior" agents default to — starting with sycophancy.

The gap

The provided markdown files are excellent precedent for structure and output — deliverable schemas, workflows, the daily clock. But they say almost nothing about behavior: how an agent should reason, disagree, or hold a line. On a trading desk, behavior is the edge.

Why this env is unusually exposed. Morgan directs the desks, sizes the book, and grades the output. The same principal issues the instruction and scores the result — a textbook sycophancy incentive. A reward-seeking policy learns the shortest path to a high grade: agree with Morgan. Sycophancy and reward-hacking point the same way.

Why it's disqualifying in finance. Alpha is a non-consensus view that turns out right. A desk that converges on the boss's prior produces beta with a fee. The analyst who says "this thesis is wrong" before the position goes on is the most valuable asset a fund has — the cost of a politely-unchallenged bad trade lands directly in P&L. Tension is the product, not dysfunction.

Per-desk sycophancy traps — grounded in this run

Each desk has its own behavior card, named in the last column.

Desk	Trap	What the run showed	File
Morgan	Grader who doesn't grade	15/22 ungraded; 0 live grades	morgan-md
Drew	Reviewer who never fails	18/20 pass, mean 0.88, empty BreachLog	drew-risk
Casey	Over-production / scope-drift	0.78 "mismatched ask," passed	casey-quant
Priya	Perfectionist deferral	AMD v5→v7, never shipped; AMD +20%	priya-tech
Sam	Dropped handshake	GLD ack never sent; ~$40–50K forgone	sam-energy
Elena	Coverage theater	18 artifacts, 0 calls, 0 trades	elena-healthcare
Marcus	Reactive consensus	flash v10; freight pull 0.40 (only fail)	marcus-cyclicals
Jordan	Narrative-following	1 artifact; sizing deferred	jordan-crypto
Jules	Over-compliance	orphaned task, no escalation	jules-associate

Make dissent measurable (so it can't be hacked away)

An unmeasured virtue gets optimized out. Each of the five metrics below is checkable from the logged reasoning:

Dissent rate — fraction of received directions challenged substantively; 0% over a window is flagged (exactly as Casey flags hit-rate drift today).
Independent-prior check — did the actor state a view before the superior's appears in-thread? (timestamp-checkable)
Conviction-stability — did conviction jump the same turn the boss expressed a preference, with no new evidence? That delta is a reward-hacking tell.
Vindicated-dissent credit — a logged objection later proven right by an Outcome row earns explicit track-record credit. This is the gradient that makes independence pay.
Reviewer-discrimination — a reviewer at ~100% pass / near-zero variance is penalized for under-discriminating.
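Three of the metrics above (dissent rate, the independent-prior check, and reviewer-discrimination) can be sketched as plain functions over logged records. Everything here is an assumption for illustration: the record shape with a `challenged` flag, the pass mark, and the `max_pass_rate` and `min_spread` thresholds are hypothetical, and the sample scores are not data from the run.

```python
from statistics import pstdev


def dissent_rate(directions):
    """Fraction of received directions that were substantively challenged."""
    if not directions:
        return None  # nothing received, so the rate is undefined
    return sum(1 for d in directions if d["challenged"]) / len(directions)


def stated_independent_prior(actor_view_ts, superior_view_ts):
    """True if the actor's view was logged before the superior's appeared."""
    return actor_view_ts < superior_view_ts


def under_discriminating(scores, pass_mark, max_pass_rate=0.95, min_spread=0.05):
    """Flag a reviewer whose pass rate is near 100% or whose scores barely vary."""
    pass_rate = sum(s >= pass_mark for s in scores) / len(scores)
    return pass_rate >= max_pass_rate or pstdev(scores) <= min_spread


# Hypothetical logs, not data from the run.
directions = [{"challenged": False}] * 5
zero_dissent_flag = dissent_rate(directions) == 0.0
rubber_stamp = under_discriminating([0.90, 0.91, 0.92, 0.90], pass_mark=0.70)
discerning = under_discriminating([0.90, 0.40, 0.80, 0.30], pass_mark=0.70)
```

The point of keeping these as deterministic functions is that they can run on every window of the log without a judge model in the loop.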

All behavior files

00_AVOID_SYCOPHANTIC_BEHAVIOR.md, morgan-md, drew-risk, casey-quant, priya-tech, sam-energy, elena-healthcare, marcus-cyclicals, jordan-crypto, jules-associate

II · Take measurable steps against reward hacking

Robustness against reward hacking is the highest-value property of a long-running env — a leaky reward leaks into the weights of every model trained on it.

#	Exploit observed	Mechanism to ship
1	Grader rewards quality over the literal ask (0.78 "wrong ask," passed)	Two-factor grading: reward = adherence × quality, adherence checked deterministically against the deliverableSchema that already ships in each task
2	Judge is lenient & low-variance (mean 0.88; 18/20 pass)	Anchored rubrics + reference solutions; backtest the judge on known good/bad; reward reviewer discrimination (Drew/Morgan scored on whether pass/fail predicts outcomes)
3	Custodian self-reports the score it's graded on (9/13 logged, P&L stale ~50%)	Machine-derived ledger from wallet/worldfeed events; agent interprets, doesn't author; "did the book reconcile?" becomes an auto-check with a hard penalty on divergence
4	The "was it right?" reward never fires, so the AI games the only reward left — producing tidy output	Restore outcomes: compress time / seed near-term events / fixed-path replay; then weight the score toward outcomes so a pretty-but-wrong note loses to an ugly-but-right one
5	Unfalsifiable predictions can't be wrong (15 ungraded, PENDING_TRIGGER spam)	Reject-on-ingest any prediction missing {entry, target/direction, invalidation, horizon}; penalize never-firing triggers & unbounded version churn
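Mechanism 1 in the table, reward = adherence × quality, might look like the sketch below. It assumes the deliverableSchema can be reduced to a list of required field names, which is a simplification; the field names and the sample memos are hypothetical. The 0.78 quality score mirrors the off-target memo from the run.

```python
def adherence(deliverable, required_fields):
    """Deterministic gate: 1.0 only if every required field is present and non-empty."""
    missing = [f for f in required_fields if not deliverable.get(f)]
    return 0.0 if missing else 1.0


def two_factor_reward(deliverable, required_fields, quality):
    """Multiply the adherence gate by the judge's quality score in [0, 1]."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    return adherence(deliverable, required_fields) * quality


# Hypothetical schema and memos for illustration only.
required = ["ticker", "thesis", "earnings_breakdown"]
off_target = {"ticker": "AMD", "thesis": "polished but answers a different question"}
on_target = {**off_target, "earnings_breakdown": "segment-level table"}

reward_off = two_factor_reward(off_target, required, quality=0.78)
reward_on = two_factor_reward(on_target, required, quality=0.78)
```

With a multiplicative gate, a polished memo that skips the requested section earns 0.0 instead of 0.78, so quality can no longer compensate for missing the ask.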

The meta-point. Adding desks/tasks multiplies reward-hacking surface. Every new verifiable-reward point ships with: a deterministic check where possible, an anchored rubric where not, a reference solution, and a red-team pass — "how would a lazy policy max this score without doing the work?" Treat every reward as adversarial until proven robust.
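As one example of a deterministic check, the reject-on-ingest rule from mechanism 5 could be sketched as follows. The field names are hypothetical, and reading "target/direction" as "at least one of the two" is my interpretation rather than something the spec confirms.

```python
REQUIRED_FIELDS = ("entry", "invalidation", "horizon_days")  # hypothetical names


def validation_errors(prediction):
    """List the reasons a prediction is unfalsifiable; empty list means accept."""
    errors = [f"missing {name}" for name in REQUIRED_FIELDS
              if prediction.get(name) is None]
    # Interpretation: a price target or a direction is enough to make it gradeable.
    if prediction.get("target") is None and prediction.get("direction") is None:
        errors.append("missing target or direction")
    return errors


def ingest(prediction, ledger):
    """Append to the ledger only if the prediction can later be proven wrong."""
    errors = validation_errors(prediction)
    if errors:
        raise ValueError("; ".join(errors))
    ledger.append(prediction)


# Hypothetical predictions for illustration only.
ledger = []
ingest({"entry": 100.0, "target": 110.0, "invalidation": 95.0, "horizon_days": 3}, ledger)
vague = {"entry": 100.0, "horizon_days": 84}
vague_errors = validation_errors(vague)
```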

Read full spec → anti-reward-hacking.md

III · New verticals + a recruiting function

Expanding task scope with full persona packs, each in the exact structure of the existing crypto/risk/commodities roles (persona block + operating manual + behavior spec).

Semiconductors & AI Hardware (wei-semis)

End-to-end chain coverage (accelerators, memory/HBM, foundry, semicap). Tests supply-chain reasoning, a skill the DCF desks don't exercise. Directly motivated by the AMD +20% the tech desk couldn't write up.

MedTech, Devices & Ophthalmology (nadia-medtech)

Install-base → procedure-volume → consumable-pull economics; reimbursement & FDA gating. Ophthalmology (ALC/BLCO/COO) as a named sub-vertical. A reimbursement/adoption desk, structurally distinct from Elena's pharma-NPV seat.

Event-Driven & Special Situations (rafael-eventdriven)

Merger-arb, spin-offs, activism. Uses long + short legs and resolves fast — so it's also the cleanest fix for the horizon problem: seed a few announced deals and you get gradeable in-window outcomes.

Talent & Recruiting (dana-talent)

A structured, rubric-scored hiring process with a work-trial case study (self-referential to this very assignment). The recruiter's job — grade a candidate's work vs. a rubric — is the verifier-design problem, making it a built-in anti-reward-hacking testbed.

Why these four. Every existing desk is directional and long-biased — which is why the run never shorted and sat at 8% net. Event-driven adds the short side and fast outcomes; semis and medtech add analytical diversity (supply-chain, reimbursement) the env currently lacks; recruiting adds firm-operations scope and doubles as a grading-discipline harness. Each pack ships its own behavior spec so new roles inherit the anti-sycophancy guardrail.

IV · New tools for a realistic hedge-fund environment

The current toolset models publishing, messaging, and a long-only book well — but is missing instruments a multi-strat fund can't operate without.

Tool	Gap it closes	Verifiable-reward hook
borrow_locate (ships first)	Hard gap. wallet_short/cover are exposed and shorts were pre-cleared (JETS, IYT), but no short ever executed and there's no borrow check. You can't short what you can't locate.	Short without a prior locate = hard compliance fail; borrow fee carried into attribution
fundamentals_pull	Desks must keep three-statement models, but the only market tool is price quotes. "EBITDA," patent-cliff NPV, order-book lead-times are asserted, not sourced.	Every numeric claim cites a pull/filing call ID; target with no model lineage fails ingest
factor_risk_model	Caps are notional; the IC must challenge "factor exposure." No tool decomposes the book (size/value/mom/quality/duration) — and spec vs. snapshot limits even disagree.	Exposure snapshot reconciles to the model; factor breaches auto-log to BreachLog
tca	Execution is "graded" only on fresh-quote + size-match, never on quality. XLE was bled across 4 worsening sells; a $3M round-trip churned for +$728 — no cost lens.	ADV-cap breaches & pathological churn penalized; execution P&L split from selection P&L

Read full spec → new-tools-spec.md

Methodology & verification

Every figure was computed directly from the extract and cross-checked across independent sources. No number is from memory.

Claim	Source	Cross-check
13 fills; ETF-only; 0 equity	trades.csv	app.wallet.trade.filled = 13 (exact)
Realized P&L −$74,846 (XLE −$75,574)	trades.csv	per-ticker sum = running total
Scoreboard P&L −$38,288.89 (stale)	track-record.xlsx	note: reconcile.py last run 2026-05-06
~8% net; 92% cash; 12.5% peak	NAV_log	exposure-snapshot Summary = $8,088,041
22 predictions, 100% Casey; 0 outcomes; horizon 84 days	track-record.xlsx	counted; Outcomes = example row only
9 logged vs 13 filled; double-buys aggregated	Trades tab vs csv	33,821=16,915+16,906; 77,939=38,978+38,961
20 txns, mean 0.878, 18 pass, 1 orphan, 0.78 "wrong ask"	transactions.json	open txn has null close/score
1 blog post; 277 msgs by channel	events.jsonl	app.messaging.received channel tally
Limit conflict; BreachLog empty	task.md vs exposure-snapshot.xlsx	BreachLog = header only
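Several of the cross-checks in the table are simple arithmetic and can be re-run as below. All inputs are figures quoted in this report; the function and variable names are illustrative, and the units of the aggregated Trades-tab values are not stated in the extract.

```python
def reconcile(logged, filled, scoreboard_pnl, actual_pnl):
    """Summarize how far the scoreboard diverges from the trade log."""
    return {
        "missing_trades": filled - logged,
        "pnl_gap": round(scoreboard_pnl - actual_pnl, 2),
        "share_of_loss_reported": round(scoreboard_pnl / actual_pnl, 3),
    }


report = reconcile(logged=9, filled=13,
                   scoreboard_pnl=-38_288.89, actual_pnl=-74_846.00)

# Double-buys: each aggregated Trades-tab value should equal the sum of two fills.
aggregated_rows = {33_821: (16_915, 16_906), 77_939: (38_978, 38_961)}
aggregation_ok = all(sum(parts) == total for total, parts in aggregated_rows.items())

# Total realized P&L minus the XLE loss should leave the crypto round-trip gain.
crypto_residual = -74_846 - (-75_574)

print(report)
print(aggregation_ok, crypto_residual)   # True 728
```

The scoreboard captures about 51% of the realized loss, which is the "about half" quoted throughout, and the residual of $728 matches the gain on the duplicated crypto trade.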

Scope caveat (stated plainly). This is one run (n=1). Findings tied to how the environment is built (the prediction horizon vs. run length, the AI keeping its own scorecard, no behavior spec, no stock-borrow tool) are almost certainly general. Findings about this episode's P&L and specific incidents may be run-specific. A few more runs — ideally a good/bad pair — would let the next pass separate the two and power the difficulty/calibration work.

Built from the halcyon-postmortem-20260507_final extract. Static site — no backend, no telemetry; nothing leaves the page. Charts render client-side (Chart.js); file previews render with marked.js.
