What is it and why now?
This is a quick and dirty memo aimed at getting you going with quant trading through AI agents.
TL;DR: Maximize the Sharpe ratio, minimize the p-value. Loop and repeat.
As some of you folks know, we’ll be pushing ahead with the GRIDNET OS Exchange UI dApp, a trading-protocol-independent exchange which from day one is going to be backed by Hyperliquid’s HyperCore.
Why? To provide a truly decentralized trading environment where all aspects are decentralized. Even the front-end. Even Exchange UI dApp discovery. So that one does not need to pray and bow to Hyperliquid’s admins to get one’s assets openly visible in their UI.
Also, one of the main features we will be introducing later this year is automated trading in either a self-custodial or a managed way. Anyone will be able to deploy and manage a crypto ‘quant fund’ of their own, with everything being 100% open source and self-managed.
More news to come.
We know many of you are hungry to get started with quant trading deployments of your own, so below we outline the key constructs one needs to ask … Anthropic’s Claude to optimize.
Good to share, is it not?
P-value — the significance test we’ve been optimising since Day 1
The p-value in our trade-level metrics is the output of a permutation test on the trade-return series. Given T trade returns r₁, r₂, … r_T, we test the null hypothesis that the strategy has no edge — that the sign of each return is random. We shuffle the signs many times, recompute the Sharpe for each shuffle, and count how often the shuffled Sharpe exceeds the observed one. That fraction is our empirical p-value.
Concretely, a p-value of 0.0001 on a Sharpe of 2.33 means: in 10,000 random sign-permutations of the same trade magnitudes, fewer than one produced a Sharpe that high. The edge is extremely unlikely to be a coincidence of the sign pattern. A p-value of 0.30 means roughly a third of random sign-shufflings look this good — indistinguishable from noise.
P-value pre-dates PSR/DSR in the trading literature. It’s the classical significance test applied to backtests. It says nothing about selection bias across strategies (DSR’s job), nothing about non-normality adjustments (PSR’s job) — it answers the narrower, older question: “Could this Sharpe have arisen from a random permutation of the same trades?”
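The shuffle-and-count procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple per-trade Sharpe with no annualisation; `permutation_p_value` and its defaults are names invented here, not our production code:

```python
import numpy as np

def permutation_p_value(trade_returns, n_perms=10_000, seed=0):
    """Empirical p-value of the observed Sharpe under the null that
    trade signs are random: shuffle the signs of the same magnitudes,
    recompute Sharpe each time, count how often it beats the observed."""
    rng = np.random.default_rng(seed)
    r = np.asarray(trade_returns, dtype=float)

    def sharpe(x):
        return x.mean() / x.std(ddof=1)

    observed = sharpe(r)
    magnitudes = np.abs(r)
    exceed = 0
    for _ in range(n_perms):
        signs = rng.choice([-1.0, 1.0], size=r.size)
        if sharpe(signs * magnitudes) >= observed:
            exceed += 1
    return exceed / n_perms
```

A series of many small wins and a few small losses should come out highly significant; a symmetric noise series should not.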
Why p-value is our primary optimisation target
There’s a subtle but important reason p-value sits higher than Sharpe in our optimisation priority:
- Sharpe can be inflated by a few outlier trades. A strategy that wins 1000 small + loses 5 big can have Sharpe 2.0 with p=0.25. Not robust.
- Sharpe gains that don’t shrink p are suspicious. If a researcher tunes parameters and Sharpe rises from 1.5 to 2.0 but the p-value stays at 0.08, the “improvement” didn’t come from a more reliable signal — it likely came from overfitting the noise. We treat p-stagnant Sharpe gains as a red flag and usually revert.
- p-value responds to reliability, not just magnitude. A strategy with Sharpe 1.5 and p=0.001 has a more trustworthy edge than one with Sharpe 2.5 and p=0.08.
Our standing directive for the research pipeline is: every round must push at least one strategy toward p < 0.05. Rounds that produce Sharpe improvements without significance improvements are flagged for scrutiny.
Walk-forward p-values — the harder bar
The single main-run p-value is not enough. Our walk-forward validation computes a p-value per fold — six non-overlapping 9-month windows. A strategy can game a single main run by accidentally aligning with one regime; it cannot game six independent regimes.
- Rule 17 enforcement uses the walk-forward fold p-values: any fold with Sharpe significantly negative at p < 0.05 against the strategy’s own signal direction → permanent Production-Quality ineligibility. That’s a strategy that actively works backwards in some regime.
- WF-quick gate (pre-screening for new candidates): we want median fold p < 0.30 across folds and avg fold Sharpe > 0. Candidates that don’t clear this never get a full backtest run.
- Full WF gate: 4 out of 6 folds profitable (Sharpe > 0) AND at least one fold reaching p < 0.05 with positive Sharpe. This is what upgrades a candidate to “near-Production-Quality”.
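The full WF gate reduces to a small predicate. A sketch with invented names (`passes_full_wf_gate` is ours, for illustration), assuming the per-fold Sharpes and p-values have already been computed:

```python
def passes_full_wf_gate(fold_sharpes, fold_p_values,
                        min_profitable=4, sig_level=0.05):
    """Full walk-forward gate: at least `min_profitable` folds with
    Sharpe > 0 AND at least one fold significant (p < sig_level)
    with positive Sharpe."""
    profitable = sum(s > 0 for s in fold_sharpes)
    significant = any(s > 0 and p < sig_level
                      for s, p in zip(fold_sharpes, fold_p_values))
    return profitable >= min_profitable and significant
```

For example, five profitable folds with one at p = 0.04 passes; three profitable folds fails regardless of significance.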
The problem with “just use Sharpe”
When you evaluate a trading strategy you run a backtest and compute the Sharpe ratio — mean return divided by standard deviation of returns. A high Sharpe looks compelling and for a long time was the industry-standard single number. It has three fatal weaknesses that matter enormously in practice:
- It assumes returns are Gaussian. Crypto returns are not. Distributions have fat tails (high kurtosis) and asymmetry (skewness). A strategy that “wins small, wins small, wins small, loses huge” produces an inflated Sharpe right up until the drawdown arrives.
- It has no statistical significance attached. Sharpe 2.0 from 80 trades is not the same as Sharpe 2.0 from 3000 trades; classical Sharpe treats them identically. Without a sample size adjustment, every in-sample overfit looks great.
- It ignores multiple testing. If you try 25 strategies and keep the best, that best Sharpe is a biased estimator of its own future performance. The probability that at least one of 25 random walks looks great at Sharpe 2.0 is not negligible. This is the same selection-bias problem that plagues academic p-hacking, applied to backtests.
Our statistical stack addresses all three, grounded in two key papers from Bailey & López de Prado — the researchers who formalised this problem for systematic trading.
Step 1 — Probabilistic Sharpe Ratio (PSR)
Bailey & López de Prado (2012, Journal of Risk) gave us the Probabilistic Sharpe Ratio. Given an observed Sharpe SR, the number of trade returns T, the skewness γ₃ and the raw kurtosis γ₄ of those returns, and a benchmark SR*, the PSR is:
PSR(SR*) = Φ[ (SR − SR*) · √(T − 1) / √(1 − γ₃·SR + (γ₄ − 1)/4 · SR²) ]
Φ is the standard normal CDF. The output is a probability in [0, 1]: the probability that the strategy’s true Sharpe exceeds the benchmark, accounting for the finite sample size and the non-normality of the return distribution.
For SR* = 0 this answers the question: “given what we observed, how confident can we be that this strategy has any genuine edge at all?”
A strategy with high Sharpe and low trade count will see its PSR drop — the correction for small samples. A strategy with extreme skewness or kurtosis will see the denominator penalty bite — a high observed Sharpe from a few outlier wins no longer looks special.
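The formula transcribes directly into Python. A sketch assuming `skew` and `kurt` are the sample skewness and raw (non-excess) kurtosis of the trade returns; the function name is ours, for illustration:

```python
from math import sqrt
from statistics import NormalDist

def probabilistic_sharpe_ratio(sr, sr_star, n, skew, kurt):
    """PSR(SR*) per Bailey & Lopez de Prado (2012): probability that
    the true Sharpe exceeds the benchmark SR*, given n trade returns
    with the given skewness and raw kurtosis."""
    denom = sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_star) * sqrt(n - 1) / denom
    return NormalDist().cdf(z)
```

Sanity checks: when SR equals the benchmark the PSR is exactly 0.5, and a higher observed Sharpe (everything else fixed) always raises it.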
Step 2 — Deflated Sharpe Ratio (DSR)
The 2014 follow-up paper addresses the selection-bias problem. If you test N strategies, the expected maximum Sharpe under the null hypothesis (no real edge anywhere) is not zero — it grows with N. Testing more strategies means a higher bar for “significant”.
The Deflated Sharpe Ratio first computes the expected maximum Sharpe a cohort of N random strategies would produce under the null:
SR*_DSR = √V · [ (1 − γ_EM)·Φ⁻¹(1 − 1/N) + γ_EM·Φ⁻¹(1 − 1/(N·e)) ]
where V is the sample variance of the Sharpes across the N strategies actually tested, γ_EM ≈ 0.5772 is the Euler-Mascheroni constant, and Φ⁻¹ is the inverse normal CDF. Then:
DSR = PSR(SR*_DSR)
The DSR is the probability that a strategy beats what the best of N random candidates would achieve by chance. If DSR is 0.99, you can be about 99% confident the edge is real given how much searching went into finding it. If DSR is 0.10, the strategy’s observed performance is fully consistent with being the luckiest of a cohort of null-hypothesis strategies.
This is a direct, principled penalty on our own research activity. Every new strategy we test raises the bar for every existing strategy.
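The two-step construction (expected max Sharpe of N null strategies, then PSR at that benchmark) can be sketched as one self-contained function. Names are invented here for illustration; the PSR step is inlined so the sketch stands alone:

```python
from math import e, sqrt
from statistics import NormalDist

_EM = 0.5772156649015329  # Euler-Mascheroni constant

def deflated_sharpe_ratio(sr, n, skew, kurt, sharpe_var, n_trials):
    """DSR per Bailey & Lopez de Prado (2014): PSR evaluated at the
    expected maximum Sharpe that n_trials skill-less strategies with
    cross-sectional Sharpe variance `sharpe_var` would produce."""
    nd = NormalDist()
    # deflation benchmark SR*_DSR
    sr_star = sqrt(sharpe_var) * (
        (1.0 - _EM) * nd.inv_cdf(1.0 - 1.0 / n_trials)
        + _EM * nd.inv_cdf(1.0 - 1.0 / (n_trials * e))
    )
    # PSR(SR*_DSR) with the non-normality correction in the denominator
    denom = sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return nd.cdf((sr - sr_star) * sqrt(n - 1) / denom)
```

Note the behaviour this encodes: raising the observed Sharpe raises the DSR, while testing more strategies (larger `n_trials`) raises the benchmark and lowers it.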
Step 3 — Three adaptations we make for crypto
The textbook formulas need three adjustments for a crypto context.
Kurtosis cap at γ₄ = 10. A single liquidation cascade on a crypto perp can produce a trade-return kurtosis in the range 50-200. The raw γ₄ plugged into the PSR denominator blows up (γ₄ − 1)/4 · SR² out of proportion to actual tail risk the strategy will face out-of-sample — it’s one bar’s noise dominating a multi-year statistic. We cap γ₄ at 10 (above Student-t₅ which is ~9), document every clamp in the compute log, and accept the slightly-conservative PSR that results.
Pooled-returns DSR as the canonical skill metric. A strategy that trades multiple assets realises a single concatenated return stream in production. Averaging per-asset DSRs would double-count correlated skill and is also outside the theoretical footing of the paper — Bailey’s formula was derived for one return series. Instead, we concatenate every asset’s trade-return series into a single pool, compute a single Sharpe, single skewness, single kurtosis over the pool, and one DSR. That pooled DSR is the number that drives the ranking. It evaluates the exact series the operator will realise live, and benefits from a larger T for tighter confidence intervals.
Credible-candidate cohort for N. For the deflation step we don’t use the raw count of strategies ever tried — many never produced a trade. We restrict N to the cohort of agents whose pooled PSR(0) > 0.5, i.e. the ones that at least plausibly have an edge. This is the right reference class for “how many serious candidates was the winner drawn from”. V is computed from that cohort’s pooled Sharpes.
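The first two adaptations (the kurtosis cap and the pooled return stream) can be sketched together. `KURTOSIS_CAP` and `pooled_moments` are illustrative names, not a library API, and the moment formulas are plain sample moments:

```python
import numpy as np

KURTOSIS_CAP = 10.0  # clamp on raw kurtosis (Student-t5 is ~9)

def pooled_moments(per_asset_returns):
    """Concatenate every asset's trade-return series into one pool and
    compute the single Sharpe / skewness / (capped) raw kurtosis that
    the pooled DSR consumes."""
    pool = np.concatenate([np.asarray(r, dtype=float)
                           for r in per_asset_returns])
    mu, sd = pool.mean(), pool.std(ddof=1)
    z = (pool - mu) / sd
    skew = (z ** 3).mean()
    kurt = min((z ** 4).mean(), KURTOSIS_CAP)  # clamp, log in practice
    return pool.size, mu / sd, skew, kurt
```

The larger pooled T is what tightens the confidence interval relative to any single per-asset series.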
Step 4 — Gates before the ranking matters
DSR by itself is not enough. We layer three hard gates in front of it, all classical out-of-sample techniques:
- Walk-forward validation. Six non-overlapping 9-month windows (2021-launch, bear, recovery, bull-start, bull-continue, recent). A strategy must be profitable in at least 4/6 windows or it is deemed regime-specific and disqualified from Production Quality regardless of its pooled DSR.
- Cross-asset validation. Models (HMM, GARCH, Kalman, etc.) are refit on a second asset (ETH, BTC, others) using the same parameter set optimised on SOL. The strategy logic must generalise; if an agent’s edge evaporates on ETH with ETH-fitted models, its SOL edge was likely asset-specific noise.
- Rule 17 — permanent ineligibility. Any strategy that produces a walk-forward fold with Sharpe significantly negative at p < 0.05 against its own signal is permanently blocked from Production Quality. A strategy that actively loses money in one regime is not a strategy, it is an anti-strategy.
Pooled DSR is only applied to agents that have already cleared these gates. It sorts winners among candidates; it does not promote borderline candidates on its own.
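The fold split and the Rule 17 check can be sketched as follows. Index-based splitting stands in for our calendar 9-month windows, and the p-values are assumed to come from a one-sided test of each fold being significantly negative; all names here are invented for illustration:

```python
import numpy as np

def walk_forward_folds(returns, n_folds=6):
    """Split a return series into non-overlapping contiguous folds
    (index-based stand-in for calendar windows)."""
    return np.array_split(np.asarray(returns, dtype=float), n_folds)

def rule_17_violated(fold_sharpes, fold_neg_p_values, sig_level=0.05):
    """Rule 17: any fold with negative Sharpe that is significant
    against the strategy's own direction means permanent
    Production-Quality ineligibility."""
    return any(s < 0 and p < sig_level
               for s, p in zip(fold_sharpes, fold_neg_p_values))
```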
Step 5 — How a strategy climbs to live deployment
The chain is:
- Researcher proposes a new mechanism.
- Backtest runs over 4.5 years of minute/15-minute data.
- Walk-forward 4/6 gate + cross-asset cross-pair gate.
- Rule 17 permanent-disqualification check.
- Compute skewness, kurtosis (capped), Sharpe, p-value, PSR, DSR per asset and pooled.
- Sort by pooled DSR. Candidates with pooled DSR ≥ 0.80 are eligible for Production Quality.
- Human review of the mechanism (no opaque black boxes — we need the theoretical story).
- Export as a snapshot-frozen bundle with SHA-256 and fitted models.
- Deploy to a slot. Bar-aligned kline closes (same timestamp on primary asset and cross-asset features) feed the decision every 15 minutes.
Step 6 — Reading the numbers in plain English
- Pooled DSR ≈ 1.00 — the strategy is effectively indistinguishable from “this is genuinely skilled”. Interpret as “almost certainly a real edge”.
- Pooled DSR ≈ 0.80–0.95 — strong evidence of skill after multi-test correction. Safe to deploy. Monitor in live.
- Pooled DSR ≈ 0.50 — coin-flip. The observed performance is fully consistent with being the luckiest draw from a cohort of null strategies.
- Pooled DSR ≈ 0.05 — the strategy underperforms what a random selection of N candidates would deliver. Do not deploy.
What this approach cannot tell you
DSR is not a forecast. It is a statement about what you observed in the backtest, adjusted for how hard you looked. A DSR of 0.99 does not promise live Sharpe 2.0; it promises that if the underlying return-generating process stays what the backtest sampled, the strategy probably has real skill. Regime changes, structural breaks, venue-specific dynamics — those are separate risks that DSR is silent about, which is why we also run cross-asset validation, walk-forward partitioning, and enforce Rule 17.
There is also a venue-specific drift we accept as known-bounded noise: training may source primary and cross-asset data from Kraken or Binance depending on availability, while live trades on Hyperliquid. Prices differ by a handful of basis points across venues at the same timestamp. We have verified the timestamp-alignment contract holds across all three venues and accept the small distributional shift.
How p-value, PSR, and DSR fit together
They answer three different questions, and all three matter:
| Metric | Question answered | Accounts for |
|---|---|---|
| p-value | “Could this Sharpe have arisen from a random sign-permutation of the same trades?” | Sample size only |
| PSR | “What is the probability the true Sharpe is greater than 0?” | Sample size + non-normality (skew, kurtosis) |
| DSR | “Given we tested N strategies, what is the probability this is the real one?” | Sample size + non-normality + selection bias |
The relationship at a high level: PSR(0) ≈ 1 − p-value under certain distributional assumptions (Gaussian returns, no multiple testing). The further returns depart from Gaussian, the more PSR and (1 − p) diverge. DSR then further deflates PSR by the cohort-size correction.
In practice, we compute all three and require all three to look strong:
- Main-run p-value < 0.05 — the Sharpe is not a permutation artefact.
- Per-fold p-values — no catastrophic regime; Rule 17 clean.
- Pooled DSR ≥ 0.80 — the edge survives the selection-bias correction across our 25-agent roster.
A strategy with Sharpe 2.3 and main-run p = 0.0001 but pooled DSR 0.45 is not considered production-quality. The first two numbers say “this backtest wasn’t a fluke”; the third says “but we tested many strategies, and a few this good were expected by chance — we need more evidence before deploying”.
The hierarchy in practice
- p-value is the trigger condition — a research round that doesn’t improve at least one agent’s p-value is considered a null round, even if Sharpe moved around.
- DSR is the selection filter — among the cohort of p-significant candidates, DSR sorts who actually earns a Production-Quality slot.
- Walk-forward p-values are the regime sieve — both a prerequisite (positive Sharpe in ≥4/6 folds with ≥1 fold significant) and a disqualifier (Rule 17 — no fold significantly negative).
References
- Bailey, D. H. & López de Prado, M. (2012). “The Sharpe Ratio Efficient Frontier.” Journal of Risk 15(2).
- Bailey, D. H. & López de Prado, M. (2014). “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” Journal of Portfolio Management 40(5).
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapters 5 (fractional differentiation) and 14 (backtest overfitting) for the broader context.