$cd .. && cat leakage-safe-finance-llm-eval.md
leakage-safe finance llm eval
2026-06-01 17:30 gmt · ~5 min
most published finance llm benchmarks are dead within 6 months. not because the models get better. because the questions, the schemas, and the gold answers leak into the next training run. a benchmark you can't trust is worse than no benchmark. it's an alpha generator for whoever owns the leak channel.
where leakage comes from
- - public release. the moment you publish question + answer pairs, they end up in next gen pre-training. even if the dataset is paywalled, scraped copies surface in months.
- - schema theft. even if the answers aren't leaked, the schema is. models learn to recognise “this is a finagent question” and route to specialised heads.
- - shared pre-history. evaluator and candidate model often share the same upstream tokens. you're measuring overlap in training, not capability.
- - web-scraped grounding. agents fetch live data. yesterday's evaluation answer is today's search result.
what survival-grade looks like
- - private hold-outs. never publish the evaluation set. publish methodology + score, not data.
- - rolling generation. rotate questions quarterly. retire any item that surfaces in known training corpora.
- - agent-level, not token-level. score the full agent trajectory, not just final answer. trajectory is harder to overfit on.
- - classical ground truth. for quant tasks, the answer is derivable from a closed-form classical calculation. you don't need a human grader, and the model can't guess past the math.
- - shared-pre-history correction. report candidate score alongside an estimated overlap with the evaluator's training distribution. if both grew from the same corpus, that's a confound, not a signal.
what i'm building
quantum finance gives a natural ground truth: classical pricing + risk libraries give the right answer for any well-posed question, deterministically, in closed form or via verified monte carlo. ai becomes the candidate, classical quant becomes the grader. private hold-outs survive longer because the grading function isn't a leaked answer key, it's a running calculation.