Multiple testing (also called data snooping or data mining) is the statistical problem that arises when a researcher evaluates many strategy variants or parameter combinations and reports only the best result. The best-performing variant will look strong in-sample by construction, even if no true signal exists: it was selected precisely because it fit the chance patterns in the training data best, so its apparent strength says little about future performance.
If 100 uncorrelated random strategies are tested, about 5 are expected to show 'significant' performance at the 5% significance level purely by chance, and the probability that at least one does so is 1 − 0.95^100, or roughly 99.4%. The more trials that are run, the larger the expected inflation of the winner's performance metric.
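The arithmetic above can be checked with a small simulation. This is a minimal sketch using only the Python standard library; the backtest length (five years of daily data), the 1% daily volatility, the one-sided t-test, and all variable names are assumptions made for illustration and do not come from the text.

```python
import random
import statistics
from math import sqrt

random.seed(0)

N_STRATEGIES = 100   # independent strategies with zero true edge
N_DAYS = 1260        # assumed backtest length: about five years of daily returns
ALPHA = 0.05         # one-sided significance level (assumption)

# Critical value from the normal approximation, reasonable for large N_DAYS.
critical = statistics.NormalDist().inv_cdf(1 - ALPHA)


def t_stat(returns):
    """t-statistic for the hypothesis that the mean return is zero."""
    std_err = statistics.stdev(returns) / sqrt(len(returns))
    return statistics.fmean(returns) / std_err


# Every strategy is pure noise: zero-mean daily returns with 1% volatility.
t_stats = []
for _ in range(N_STRATEGIES):
    returns = [random.gauss(0.0, 0.01) for _ in range(N_DAYS)]
    t_stats.append(t_stat(returns))

# Strategies that look 'significant' even though none has any signal.
false_positives = sum(t > critical for t in t_stats)

# Analytical values under independence.
expected_false_positives = ALPHA * N_STRATEGIES
prob_at_least_one = 1 - (1 - ALPHA) ** N_STRATEGIES

# Annualised Sharpe ratio of the luckiest strategy (mean/std = t / sqrt(n)).
best_annualised_sharpe = max(t_stats) / sqrt(N_DAYS) * sqrt(252)

print(f"false positives in this run: {false_positives}")
print(f"expected false positives:    {expected_false_positives:.1f}")
print(f"P(at least one):             {prob_at_least_one:.3f}")
print(f"best annualised Sharpe:      {best_annualised_sharpe:.2f}")
```

The count of false positives varies from seed to seed around the expected value of 5. The Sharpe ratio of the luckiest strategy is positive despite a true Sharpe ratio of zero for every strategy, which is the inflation of the winner's metric described above.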
Corrections and safeguards
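- Bonferroni correction: divide the significance threshold α by the number of tests, which controls the family-wise error rate. Conservative, and often too strict when the tests are correlated.
- Benjamini-Hochberg-Yekutieli (BHY): controls the False Discovery Rate (the expected share of false positives among the reported discoveries) and remains valid under arbitrary dependence between tests. Generally less conservative than Bonferroni when several variants carry real signal.
- Deflated Sharpe Ratio: adjusts the reported Sharpe ratio (SR) directly for the number of trials, the length of the return series, and the skewness and kurtosis of returns.
- Probability of Backtest Overfitting: uses combinatorially symmetric cross-validation to estimate the probability that the strategy chosen as best in-sample (IS) ranks below the median out-of-sample.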
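
The first two corrections in the list can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the function names and the ten p-values are hypothetical, and the second function implements the Benjamini-Yekutieli step-up rule, which is assumed here to be what the BHY label refers to.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_yekutieli(p_values, alpha=0.05):
    """Step-up FDR procedure that is valid under arbitrary dependence.

    Find the largest rank k whose sorted p-value satisfies
    p_(k) <= k * alpha / (m * c(m)), with c(m) = 1 + 1/2 + ... + 1/m,
    then reject the k smallest p-values.
    """
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])

    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank

    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values for ten strategy variants, in no particular order.
p_values = [0.20, 0.0001, 0.006, 0.90, 0.001, 0.35, 0.004, 0.70, 0.0065, 0.50]

bonf = bonferroni(p_values)
by = benjamini_yekutieli(p_values)

print("Bonferroni rejections:", sum(bonf))
print("BY rejections:        ", sum(by))
```

On these made-up inputs Bonferroni keeps three variants and the Benjamini-Yekutieli rule keeps five. The ordering is not guaranteed: the first threshold of the step-up rule, α / (m · c(m)), is stricter than Bonferroni's α / m, so with very few small p-values it can reject fewer tests.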