
leakage-safe finance llm eval

2026-06-01 17:30 gmt · ~5 min

most published finance llm benchmarks are dead within 6 months. not because the models get better. because the questions, the schemas, and the gold answers leak into the next training run. a benchmark you can't trust is worse than no benchmark. it's an alpha generator for whoever owns the leak channel.

where leakage comes from

- public release. the moment you publish question + answer pairs, they end up in next-gen pre-training data. even if the dataset is paywalled, scraped copies surface within months.
- schema theft. even if the answers aren't leaked, the schema is. models can learn to recognise “this is a finagent question” and switch to benchmark-specific behaviour.
- shared pre-history. when the evaluator is itself a model, it and the candidate often share the same upstream tokens. you're measuring overlap in training, not capability.
- web-scraped grounding. agents fetch live data. yesterday's evaluation answer is today's search result.

what survival-grade looks like

- private hold-outs. never publish the evaluation set. publish methodology + score, not data.
- rolling generation. rotate questions quarterly. retire any item that surfaces in known training corpora.
- agent-level, not token-level. score the full agent trajectory, not just the final answer. a trajectory is harder to overfit than a single output.
- classical ground truth. for quant tasks, the answer is derivable from a closed-form classical calculation. you don't need a human grader, and the model can't guess past the math.
- shared-pre-history correction. report candidate score alongside an estimated overlap with the evaluator's training distribution. if both grew from the same corpus, that's a confound, not a signal.
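
a minimal sketch of the "retire any item that surfaces in known training corpora" step, assuming you can get text samples of those corpora at all. it flags a question when enough of its word 8-grams appear verbatim in a corpus sample. the function names, the 8-gram window, and the 0.5 threshold are my assumptions for illustration, not a standard. exact-match n-grams miss paraphrased leaks, so treat this as a lower bound on contamination.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of `text` (punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that appear in any corpus document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def items_to_retire(
    items: dict[str, str],
    corpus_docs: list[str],
    n: int = 8,
    threshold: float = 0.5,  # hypothetical cut-off, tune per benchmark
) -> list[str]:
    """Ids of hold-out items whose overlap meets or exceeds the threshold."""
    return sorted(
        item_id
        for item_id, text in items.items()
        if overlap_fraction(text, corpus_docs, n) >= threshold
    )


if __name__ == "__main__":
    holdout = {
        "q1": "price a european call with spot 100 strike 105 "
              "and vol 20 percent over one year",
        "q2": "compute the ten day value at risk of a delta hedged "
              "short straddle book",
    }
    scraped = [
        "forum dump: price a european call with spot 100 strike 105 "
        "and vol 20 percent over one year please help"
    ]
    print(items_to_retire(holdout, scraped))
```

the same overlap fraction, computed against whatever is known of the evaluator's training data, is one crude way to produce the number that the shared-pre-history correction asks you to report next to the score.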

what i'm building

quantum finance gives a natural ground truth: classical pricing + risk libraries give the right answer for any well-posed question, exactly in closed form or to a known tolerance via verified, seeded monte carlo. ai becomes the candidate, classical quant becomes the grader. private hold-outs survive longer because the grading function isn't an answer key that can leak, it's a running calculation.
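
a hedged sketch of what "classical quant becomes the grader" could look like for one well-posed question type, a european call under black-scholes. the closed form is the grader, and a seeded monte carlo run is the independent check on the grader itself. the function names and the relative tolerance are assumptions for illustration, not an existing library's api.

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(spot: float, strike: float, rate: float, vol: float, t: float) -> float:
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol**2) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


def mc_call(spot, strike, rate, vol, t, n_paths=50_000, seed=0):
    """Seeded Monte Carlo estimate of the same price.

    Returns (price, standard_error). A fixed seed makes the run repeatable.
    """
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol**2) * t
    diffusion = vol * math.sqrt(t)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_paths):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        payoff = max(terminal - strike, 0.0)
        total += payoff
        total_sq += payoff * payoff
    mean = total / n_paths
    variance = total_sq / n_paths - mean * mean
    discount = math.exp(-rate * t)
    return discount * mean, discount * math.sqrt(variance / n_paths)


def grade(candidate_answer: float, params: dict, rel_tol: float = 1e-3) -> bool:
    """Pass if the candidate's number matches the closed form within rel_tol."""
    return math.isclose(candidate_answer, bs_call(**params), rel_tol=rel_tol)


if __name__ == "__main__":
    params = {"spot": 100.0, "strike": 100.0, "rate": 0.05, "vol": 0.2, "t": 1.0}
    truth = bs_call(**params)
    estimate, stderr = mc_call(**params)
    # Verify the grader against an independent method before trusting it.
    assert abs(estimate - truth) < 5 * stderr
    print(round(truth, 4), grade(10.45, params), grade(10.0, params))
```

because the hold-out only needs to store `params`, fresh questions can be generated by sampling new parameters each quarter. there is no fixed answer key to leak, only the calculation.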