MARKET-BENCH

Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

Market Bench Podium

Top-performing models, ranked by lowest Mean MAE across all three trading strategies

1. Gemini 3 Pro (MAE: 437.51)
2. GPT 5.1 Codex-Max (MAE: 3,102.74)
3. Claude Sonnet 4.5 (MAE: 5,459.68)

Overall Methodology

We group the tasks into three strategies of increasing difficulty: easy, medium, and hard. Each strategy has an associated prompt that details the strategy specifics and tasks the model with building a backtester that correctly implements the strategy and outputs the required metrics in CSV format.

This CSV is then compared against the outputs of our verifiable backtester to calculate the mean absolute error (MAE). We also test how reliable each model is using pass@1 and pass@3 metrics.
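As a rough illustration of the scoring step, the sketch below computes an MAE between a model's CSV and a reference CSV, plus an empirical pass@k. It rests on several assumptions that the text above does not confirm: the column names are invented, the error is averaged uniformly over all rows and chosen columns, and pass@k is taken to mean "at least one passing run among the first k attempts". The benchmark's actual aggregation may differ.

```python
import csv
import io


def mean_absolute_error(model_csv, reference_csv, columns):
    """Average |model - reference| over the chosen numeric columns, row by row."""
    model_rows = list(csv.DictReader(io.StringIO(model_csv)))
    reference_rows = list(csv.DictReader(io.StringIO(reference_csv)))
    if len(model_rows) != len(reference_rows):
        raise ValueError("model and reference CSVs have different row counts")
    errors = [
        abs(float(m[col]) - float(r[col]))
        for m, r in zip(model_rows, reference_rows)
        for col in columns
    ]
    return sum(errors) / len(errors)


def pass_at_k(attempts_per_task, k):
    """Fraction of tasks with at least one passing run among the first k attempts."""
    passed = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return passed / len(attempts_per_task)


# Hypothetical data: column names and values are invented for illustration.
reference = "equity,max_drawdown\n100.0,0.0\n110.0,5.0\n"
model = "equity,max_drawdown\n101.0,0.0\n108.0,5.5\n"
mae = mean_absolute_error(model, reference, ["equity", "max_drawdown"])
print(mae)  # 0.875

# One inner list per task; each entry records whether that attempt passed.
runs = [[True], [False, True], [False, False, False]]
print(pass_at_k(runs, 1), pass_at_k(runs, 3))
```

In this toy example, one of three tasks passes on the first attempt and two of three pass within three attempts, which is why pass@3 can be well above pass@1 for the same model.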

All three strategies require models to track and reserve the liquidity they remove from the book through simulated trades. We create a synthetic book that nets the raw order book data against the liquidity already consumed. Strategies 2 and 3 also include a delay between submitting an order and hearing back from the exchange, mirroring the real world.
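One way to picture the synthetic book is as the raw displayed size at each price level minus whatever the simulation has already taken there. The sketch below is a minimal version under stated assumptions: the class and method names are hypothetical, levels are given as best-first (price, size) pairs, and reservations never expire when the raw book updates. The benchmark's reference backtester may handle these details differently.

```python
from collections import defaultdict


class SyntheticBook:
    """Raw displayed sizes minus liquidity already taken by simulated fills."""

    def __init__(self):
        # (side, price) -> shares consumed so far. Assumption: never released.
        self.consumed = defaultdict(int)

    def available(self, raw_levels, side):
        """Net each raw (price, size) level against the reserved liquidity."""
        return [
            (price, max(size - self.consumed[(side, price)], 0))
            for price, size in raw_levels
        ]

    def take(self, raw_levels, side, quantity):
        """Walk levels best-first, fill up to `quantity`, and record what was used."""
        fills = []
        remaining = quantity
        for price, size in self.available(raw_levels, side):
            if remaining == 0:
                break
            qty = min(size, remaining)
            if qty > 0:
                self.consumed[(side, price)] += qty
                fills.append((price, qty))
                remaining -= qty
        return fills


# Hypothetical ask levels, best price first.
asks = [(100.01, 5), (100.02, 3), (100.03, 10)]
book = SyntheticBook()
print(book.take(asks, "ask", 6))    # [(100.01, 5), (100.02, 1)]
print(book.available(asks, "ask"))  # [(100.01, 0), (100.02, 2), (100.03, 10)]
```

A second order against the same raw snapshot sees only the leftover size, which is the behavior the reservation requirement is meant to enforce.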

Take a look at each strategy

Strategy 1 (easy): A simple trading strategy in which the model buys and sells MSFT stock at predetermined times using Databento's L10 order book data. Volume is randomized to enforce correct book reservation: if only 8 shares are available but 10 are requested, the model should take only 8. The backtester tracks cash, position, realized/unrealized P&L, the equity curve, and maximum drawdown.
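The metrics listed above can be sketched with a small account object. This is an illustration only and rests on assumptions the description does not state: realized P&L uses average-cost accounting (FIFO would give different intermediate numbers), maximum drawdown is reported in currency units and not as a percentage, and the `Account` class and its method names are hypothetical.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough drop in equity, in currency units."""
    peak = equity_curve[0]
    worst = 0.0
    for equity in equity_curve:
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


class Account:
    """Cash, position, and P&L under average-cost accounting (an assumption)."""

    def __init__(self, cash):
        self.cash = cash
        self.position = 0    # signed share count
        self.avg_cost = 0.0  # average entry price of the open position
        self.realized = 0.0

    def apply_fill(self, signed_qty, price):
        """Book a fill; signed_qty > 0 is a buy, signed_qty < 0 is a sell."""
        if signed_qty == 0:
            return
        self.cash -= signed_qty * price
        same_side = self.position == 0 or (self.position > 0) == (signed_qty > 0)
        if same_side:
            # Opening or adding: blend the new price into the average cost.
            total = abs(self.position) + abs(signed_qty)
            blended = abs(self.position) * self.avg_cost + abs(signed_qty) * price
            self.avg_cost = blended / total
        else:
            # Reducing: realize P&L on the shares that are closed.
            closed = min(abs(signed_qty), abs(self.position))
            direction = 1 if self.position > 0 else -1
            self.realized += closed * (price - self.avg_cost) * direction
            if abs(signed_qty) > abs(self.position):
                self.avg_cost = price  # position flipped; the rest opens here
        self.position += signed_qty
        if self.position == 0:
            self.avg_cost = 0.0

    def unrealized(self, mark):
        return self.position * (mark - self.avg_cost)

    def equity(self, mark):
        return self.cash + self.position * mark


# Hypothetical prices; the 8-of-10 fill mirrors the example in the text.
account = Account(cash=10_000.0)
filled = min(10, 8)                     # requested 10, only 8 available
account.apply_fill(filled, 100.0)
curve = [10_000.0, account.equity(99.0)]
account.apply_fill(-filled, 101.0)
curve.append(account.equity(101.0))
print(account.realized, max_drawdown(curve))  # 8.0 8.0
```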

Strategy 2 (medium): A pairs-trading mean-reversion strategy on Coke and Pepsi. The model calculates mid prices, builds a spread, and uses rolling z-scores to trigger entries and exits. When the z-score exceeds the entry threshold, it buys one leg and sells the other; positions are flattened at the exit threshold. The strategy includes cooldown mechanisms and shared capital accounting.
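The entry and exit logic described here can be sketched as a small state machine over a rolling window. Several details are assumptions, since the description leaves them open: the spread is taken as the simple difference of mid prices (a ratio or hedge-ratio spread is equally plausible), the window includes the current observation, the population standard deviation is used, and the cooldown and capital accounting are omitted. The `PairsSignal` name and its parameters are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def mid_price(bid, ask):
    """Midpoint of the best bid and best ask."""
    return (bid + ask) / 2


class PairsSignal:
    """Rolling z-score signal: +1 long spread, -1 short spread, 0 flat."""

    def __init__(self, window, entry_z, exit_z):
        self.spreads = deque(maxlen=window)
        self.entry_z = entry_z
        self.exit_z = exit_z
        self.state = 0

    def update(self, mid_a, mid_b):
        spread = mid_a - mid_b  # assumption: plain price difference
        self.spreads.append(spread)
        if len(self.spreads) < self.spreads.maxlen:
            return self.state   # not enough history yet
        sigma = pstdev(self.spreads)
        if sigma == 0:
            return self.state   # z-score undefined on a flat window
        z = (spread - mean(self.spreads)) / sigma
        if self.state == 0:
            if z > self.entry_z:
                self.state = -1  # spread rich: sell leg A, buy leg B
            elif z < -self.entry_z:
                self.state = 1   # spread cheap: buy leg A, sell leg B
        elif abs(z) < self.exit_z:
            self.state = 0       # spread reverted: flatten both legs
        return self.state


# Hypothetical mid prices for the two legs.
signal = PairsSignal(window=3, entry_z=1.0, exit_z=0.5)
mids = [(51.0, 50.0), (51.0, 50.0), (54.0, 50.0), (52.0, 50.0)]
states = [signal.update(a, b) for a, b in mids]
print(states)  # [0, 0, -1, 0]
```

The signal stays flat until the window fills, goes short the spread when the third observation jumps, and flattens once the z-score falls back inside the exit band.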

Strategy 3 (hard): A realistic delta-hedging strategy using random-walk deltas and MSFT order book data. At regular intervals, the model evaluates its net delta and trades to get flat. It uses fill-or-kill limit orders with fixed exchange delays and tracks stock position, options delta, net delta, realized/unrealized P&L, and maximum drawdown.
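A hedged sketch of one hedging step follows: compute the trade that flattens net delta, look up the book as it stands when the order reaches the exchange, and apply fill-or-kill semantics. The assumptions are significant: options delta is taken to be in share equivalents, the hedge is rounded to whole shares, the delay is modeled by matching against the latest snapshot at or before the arrival time, and all function names, timestamps, and prices are hypothetical.

```python
from bisect import bisect_right


def hedge_quantity(stock_position, options_delta):
    """Signed share quantity that brings net delta back to roughly zero."""
    net_delta = stock_position + options_delta
    return -round(net_delta)


def book_at(snapshots, timestamp):
    """Levels from the most recent (time, levels) snapshot at or before `timestamp`."""
    times = [t for t, _ in snapshots]
    index = bisect_right(times, timestamp) - 1
    if index < 0:
        raise ValueError("no snapshot at or before the requested time")
    return snapshots[index][1]


def fill_or_kill(levels, quantity, limit_price, is_buy):
    """Fill the whole quantity within the limit price, or return no fills at all."""
    fills = []
    remaining = quantity
    for price, size in levels:  # levels are best-first for the side being hit
        within_limit = price <= limit_price if is_buy else price >= limit_price
        if not within_limit or remaining == 0:
            break
        qty = min(size, remaining)
        if qty > 0:
            fills.append((price, qty))
            remaining -= qty
    return fills if remaining == 0 else []


# Hypothetical ask-side snapshots keyed by time; units are arbitrary.
snapshots = [
    (0, [(100.0, 5), (100.1, 5)]),
    (10, [(100.0, 2), (100.1, 5)]),
]
submit_time, exchange_delay = 5, 5
levels = book_at(snapshots, submit_time + exchange_delay)

qty = hedge_quantity(stock_position=10, options_delta=-16.2)
print(qty)                                       # 6
print(fill_or_kill(levels, qty, 100.05, True))   # []
print(fill_or_kill(levels, qty, 100.1, True))    # [(100.0, 2), (100.1, 4)]
```

The first order is killed because only 2 shares sit within its limit when it arrives, even though 5 were displayed at submission time. This is the kind of discrepancy the exchange delay introduces.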

Overall MARKET-BENCH performance across all three strategies

Model                      pass@3   pass@1   Mean MAE          Avg. attempts
gemini-3-pro-preview       1.00     0.93     437.51            1.07
gpt-5.1-codex-max          1.00     0.93     3,102.74          1.07
claude-sonnet-4.5          1.00     0.93     5,459.68          1.07
qwen3-max                  1.00     0.80     190,973,360.19    1.20
mistral-large-2512         0.93     0.73     46,134.31         1.47
deepseek-v3.2              0.87     0.67     4,113.84          1.27
grok-4                     0.87     0.47     705.16            1.67
llama-4-maverick           0.53     0.40     9,252.77          1.67
claude-opus-4.5            0.47     0.27     2,765.33          1.61
llama-3.1-nemotron-ultra   0.33     0.20     11,404.92         1.83
command-a                  0.33     0.07     9,817.97          2.50
nova-premier-v1            0.13     0.07     688.57            1.50