Multiple Testing

When many strategy variants are evaluated and the best selected, the probability of false discovery rises sharply, inflating backtested Sharpe Ratios.

Multiple testing (also called data snooping or data mining) is the statistical problem that arises when a researcher evaluates many strategy variants or parameter combinations and reports only the best result. The best-performing variant looks strong in-sample by construction, even if no true signal exists: it was chosen because it did well on that data, so its apparent strength may reflect nothing more than a lucky fit to the in-sample period.

If 100 independent strategies with no true edge are each tested at the 5% significance level, about 5 are expected to show 'significant' performance purely by chance, and the probability that at least one does is 1 - 0.95^100, roughly 99.4%. The more trials that are run, the larger the expected inflation of the winner's performance metric.
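The inflation of the winner's metric can be illustrated with a small simulation. The sketch below is not taken from any particular library: the function names, the 252-day sample, the 1% daily volatility, and the zero risk-free rate are all assumptions chosen for illustration. Every simulated strategy is pure noise, so any Sharpe Ratio above zero is luck.

```python
import random
import statistics


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe Ratio of a return series (risk-free rate assumed 0)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return mean / stdev * periods_per_year ** 0.5


def best_noise_sharpe(n_trials, n_days=252, seed=0):
    """Best annualized Sharpe Ratio among n_trials strategies with no true edge."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_trials):
        # Daily returns are pure noise: mean 0, 1% daily volatility (assumed).
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(returns))
    return max(sharpes)


# With a fixed seed, a larger trial count extends the same set of noise
# strategies, so the reported maximum can only stay the same or grow.
for n in (1, 10, 100, 1000):
    print(f"trials={n:5d}  best Sharpe={best_noise_sharpe(n):.2f}")
```

Because each trial has zero expected return, the printed maximum is entirely a selection effect; it grows with the number of trials even though no strategy has improved.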

Corrections and safeguards

  • Bonferroni correction: controls the family-wise error rate by dividing the significance threshold α by the number of tests. Conservative and often too strict for correlated tests.
  • Benjamini-Hochberg-Yekutieli (BHY): controls the False Discovery Rate, including under dependence between tests, and is generally less conservative than Bonferroni.
  • Deflated Sharpe Ratio: adjusts the reported Sharpe Ratio (SR) directly for the number of trials, skewness, and kurtosis.
  • Probability of Backtest Overfitting: uses combinatorial cross-validation to estimate the probability that the strategy ranked best in-sample (IS) underperforms out-of-sample.
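The first two corrections in the list above can be sketched in a few lines. This is a minimal illustration using only the standard library, not a reference implementation: the function names and the example p-values are hypothetical, and the BHY version shown is the standard step-up procedure with the harmonic-sum correction factor.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag each test whose p-value is at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def bhy_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg-Yekutieli step-up procedure for FDR control."""
    m = len(p_values)
    # Harmonic-sum factor that makes the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at or below
    # k * alpha / (m * c_m).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values from eight strategy variants.
p_values = [0.2, 0.0001, 0.74, 0.008, 0.0008, 0.9, 0.005, 0.5]
print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))  # 3
print("BHY rejections:", sum(bhy_reject(p_values)))  # 4
```

On these example values Bonferroni keeps three variants and BHY keeps four, because the BHY threshold loosens as the rank increases. That ordering is typical but not guaranteed for every set of p-values.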
