A question we often receive from our readers is: “How do you evaluate the effectiveness of a trading strategy?” In this post, we’ll explore two fundamental techniques used in quantitative research to assess whether a trading strategy genuinely offers an advantage or whether its performance is likely due to random chance. These techniques are p-values from statistical tests and bootstrapping methods. We’ll break them down with simple examples to make them easy to understand, even if you’re not a math or coding expert.
Understanding the Basics: What Are We Evaluating?
Imagine you’ve backtested two trading strategies, Strategy A and Strategy B. After backtesting over 500 days, here’s what you find:
- Both strategies have the same average daily return of 0.10%.
- Strategy A has low volatility (0.90% per day).
- Strategy B has higher volatility (2.50% per day).
Because the average daily returns are the same, the two strategies reach the same total return of 50% (500 × 0.10%) after 500 days (note: we are using log returns here for simplicity). However, the path each strategy takes to reach that return can be quite different due to their different volatilities.
The chart above plots Strategy A in blue and Strategy B in red. Strategy A shows steady growth with minimal fluctuations, while Strategy B exhibits more significant ups and downs but ultimately reaches the same cumulative return as Strategy A after 500 days.
Despite achieving the same overall return, Strategy B’s higher volatility leads to more extreme movements—both gains and losses—compared to the more stable Strategy A.
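To make the comparison concrete, here is a minimal Python sketch of two return streams with the same mean but different volatility. It is not the backtest behind the chart: the returns are synthetic, both strategies share the same random shocks (an assumption made so that only the volatility differs), and the names `build_path` and `max_drawdown` are illustrative.

```python
import random
import statistics

N_DAYS = 500
MEAN = 0.0010  # 0.10% average daily log return for both strategies

rng = random.Random(42)
shocks = [rng.gauss(0.0, 1.0) for _ in range(N_DAYS)]
# Standardize so the sample mean is exactly 0 and the sample std is exactly 1
m, s = statistics.fmean(shocks), statistics.stdev(shocks)
shocks = [(x - m) / s for x in shocks]

def build_path(vol):
    """Cumulative log-return path for a given daily volatility."""
    total, path = 0.0, []
    for z in shocks:
        total += MEAN + vol * z  # log returns simply add up
        path.append(total)
    return path

def max_drawdown(path):
    """Largest drop from a running peak (starting capital counts as a peak)."""
    peak, worst = 0.0, 0.0
    for value in path:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst

path_a = build_path(0.0090)  # Strategy A: 0.90% daily volatility
path_b = build_path(0.0250)  # Strategy B: 2.50% daily volatility

print(f"Final cumulative return A: {path_a[-1]:.2%}, B: {path_b[-1]:.2%}")
print(f"Max drawdown A: {max_drawdown(path_a):.2%}, B: {max_drawdown(path_b):.2%}")
```

Both paths end at the same cumulative return, but the higher-volatility path goes through deeper drawdowns along the way.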
Statistical Testing: What Does the p-Value Tell Us?
To determine whether the observed returns are statistically significant or could have occurred by chance, we use a common statistical testing tool called a one-sided t-test.
The idea behind a one-sided t-test is to compute analytically the probability of observing a mean (expected return) at least as large as the one in the backtest, even in the absence of a real trading edge. This probability is called the p-value.
More formally, we are testing the hypothesis of the absence of a trading edge against the alternative hypothesis of a positive edge.
- Null Hypothesis (H₀): The strategy has no positive edge (average return ≤ 0).
- Alternative Hypothesis (H₁): The strategy does have a positive edge (average return > 0).
After performing the t-test on both strategies, we get:
- Strategy A: p-value of 0.90%
- Strategy B: p-value of 18.50%
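As a rough illustration of where such p-values come from, the sketch below computes the test statistic from the rounded summary figures quoted earlier (mean 0.10%, volatilities 0.90% and 2.50%, 500 days). Two assumptions apply: it uses the normal approximation to Student's t distribution, which is close for 500 observations, and it works from rounded summary statistics rather than the actual backtest series, so the output need not match the figures above exactly. The function name `one_sided_p_value` is illustrative.

```python
import math
from statistics import NormalDist

def one_sided_p_value(mean, vol, n_days):
    """p-value for H0: mean <= 0 against H1: mean > 0.

    Assumption of this sketch: the normal approximation replaces
    Student's t distribution (reasonable for n_days = 500).
    """
    standard_error = vol / math.sqrt(n_days)
    t_stat = mean / standard_error
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, p_value

t_a, p_a = one_sided_p_value(0.0010, 0.0090, 500)  # Strategy A
t_b, p_b = one_sided_p_value(0.0010, 0.0250, 500)  # Strategy B

print(f"Strategy A: t = {t_a:.2f}, p-value = {p_a:.2%}")
print(f"Strategy B: t = {t_b:.2f}, p-value = {p_b:.2%}")
```

With the full return series available, a library routine such as SciPy's `scipy.stats.ttest_1samp(returns, 0.0, alternative="greater")` would give the exact t-based p-value.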
Interpreting the p-Values:
- Lower p-value (Strategy A): Returns as strong as Strategy A’s would be very unlikely (0.90%) if the strategy had no edge. This strengthens our confidence that Strategy A truly has a positive edge.
- Higher p-value (Strategy B): There is an 18.5% probability of seeing returns at least as good as Strategy B’s even if the strategy had no real advantage. This gives us much less confidence than we have in Strategy A.
As noted above, the p-value tells us how likely it is to observe such returns if the strategy had no real advantage. A low p-value (we like to consider only p-values below 1%) means returns this strong would be unlikely under pure chance, suggesting the strategy has a genuine edge. Conversely, a higher p-value means the returns could easily be the result of randomness, so we should have less confidence in the strategy’s effectiveness.
Connection to False Positives:
The p-value is directly related to the concept of False Positive events, also known as Type I errors. A False Positive occurs when you consider a strategy profitable that in reality has no edge and whose profitability was simply due to luck. Specifically, a low p-value indicates that the observed returns are highly unlikely under the null hypothesis. By setting a significance level (we use 1%), we control the maximum probability of committing a Type I error.
Therefore, when the p-value is below this threshold, we reject the null hypothesis, accepting the risk (set by the significance level) of mistakenly identifying a strategy as profitable when it’s not.
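The link between the significance level and the False Positive rate can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not part of the original analysis: it generates 2,000 hypothetical strategies with no edge (zero mean, 2.50% daily volatility, normally distributed returns), tests each one at the 1% level using the normal approximation to the t distribution, and counts how many are wrongly flagged as profitable.

```python
import math
import random
from statistics import NormalDist

def p_value_of(returns):
    """One-sided p-value for H1: mean > 0 (normal approximation)."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    t_stat = mean / math.sqrt(variance / n)
    return 1.0 - NormalDist().cdf(t_stat)

rng = random.Random(7)
n_strategies, n_days, alpha = 2000, 500, 0.01

false_positives = 0
for _ in range(n_strategies):
    # A strategy with no edge: zero-mean daily returns
    returns = [rng.gauss(0.0, 0.025) for _ in range(n_days)]
    if p_value_of(returns) < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_strategies
print(f"Edge-less strategies flagged as profitable: {false_positive_rate:.2%}")
```

The share of edge-less strategies that pass the test should land close to the chosen significance level, which is what "controlling the Type I error" means in practice.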
Monte Carlo Simulation: Visualizing p-Values
To better grasp what a p-value of 18.5% for Strategy B means, we can use a Monte Carlo simulation to visualize this probability. A Monte Carlo simulation involves generating numerous random return paths based on specific assumptions—in this case, assuming the strategy has no real advantage (average return = 0) but maintains the same volatility as Strategy B (2.50%).
Steps:
Simulate 1,000 Return Streams: Assume daily returns are normally distributed with zero mean (no edge) and a daily volatility of 2.50%.
Calculate Cumulative Returns: For each simulated path, compute the cumulative return.
Compare to Strategy B’s Returns: Determine how many simulated paths achieve a cumulative return equal to or greater than Strategy B.
The left chart shows the cumulative return trajectory of each simulated path. The red line is the trajectory of the backtested Strategy B.
On the right chart, we plot a histogram that allows us to better visualize the probability of observing a cumulative return that is greater than or equal to the cumulative return of Strategy B (green bars). Out of 1,000 simulated paths, 18.7% realized a total return at least as high as Strategy B’s. This probability is just 0.2 percentage points higher than the p-value obtained analytically, a gap that is simply due to the inherent randomness of Monte Carlo simulations.
This p-value tells us there’s an 18.5% chance that Strategy B’s performance could be achieved (or exceeded) purely by luck, even if the strategy had no real edge. This quantifies the likelihood of False Positive events—where we mistakenly believe a strategy is profitable when it’s not.
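The simulation steps described in this section can be sketched in a few lines of Python. This is a minimal version under the same assumptions stated in the steps (normally distributed daily returns, zero mean, 2.50% volatility, 500 days); the seed and variable names are arbitrary, and Strategy B's total return is taken as 500 × 0.10% = 50% in log-return terms.

```python
import random

rng = random.Random(123)
n_paths, n_days = 1000, 500
daily_vol = 0.025
strategy_b_total = n_days * 0.0010  # 500 days at 0.10% per day (log returns)

hits = 0
for _ in range(n_paths):
    # One simulated path with no edge: sum of zero-mean daily log returns
    total = sum(rng.gauss(0.0, daily_vol) for _ in range(n_days))
    if total >= strategy_b_total:
        hits += 1

mc_p_value = hits / n_paths
print(f"Share of no-edge paths matching or beating Strategy B: {mc_p_value:.1%}")
```

Because the estimate comes from a finite number of random paths, rerunning with a different seed moves it by a percentage point or so around the analytical p-value.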
Bootstrapping Techniques: Building Confidence in Your Strategy
Another powerful method to evaluate a trading strategy is bootstrapping, which involves resampling the original return data to create new simulated return paths. Here, the goal is to simulate new return paths that share the characteristics of the original returns (for Strategy B, a daily mean of 0.10% and volatility of 2.50%) and compute the probability of obtaining a negative total return after N days.
Here are the steps:
- Resample with Replacement: Take your original daily returns for Strategy B and randomly pick returns to create new time-series of 252 days (approximately 1 year). This means some days’ returns may be repeated while others might be omitted.
- Generate 1,000 Bootstrapped Paths: Create 1,000 new return streams through this resampling process.
- Analyze Outcomes: Calculate the cumulative return for each bootstrapped path and determine how many result in a profit (cumulative return > 0) versus a loss.
The above figure shows the simulated bootstrapped paths (on the left) and the distribution of the cumulative returns of all the paths (on the right). As depicted with green bars, 74.4% of the bootstrapped paths were profitable, while 25.6% resulted in a loss.
The 25.6% probability reflects the risk of a False Negative: believing that a strategy has lost its edge when it actually still has one. Essentially, it is the chance that the strategy’s inherent variability (return volatility) produces a losing year even though the edge has not disappeared.
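The bootstrap procedure above can be sketched as follows. Since the actual backtested returns are not reproduced here, the series `backtest_returns` is a hypothetical stand-in: 500 synthetic daily returns rescaled to a mean of 0.10% and a volatility of 2.50%. With real data you would replace that series and keep the resampling loop unchanged.

```python
import random
import statistics

rng = random.Random(2024)

# Hypothetical stand-in for Strategy B's 500 backtested daily returns
raw = [rng.gauss(0.0, 1.0) for _ in range(500)]
m, s = statistics.fmean(raw), statistics.stdev(raw)
backtest_returns = [0.0010 + 0.0250 * (x - m) / s for x in raw]

n_paths, horizon = 1000, 252  # 1,000 paths of roughly one trading year

profitable = 0
for _ in range(n_paths):
    # Resample with replacement: some days repeat, others are left out
    sample = rng.choices(backtest_returns, k=horizon)
    if sum(sample) > 0:
        profitable += 1

share_profitable = profitable / n_paths
print(f"Profitable bootstrapped paths: {share_profitable:.1%}")
print(f"Losing bootstrapped paths:     {1 - share_profitable:.1%}")
```

Note that drawing individual days independently ignores any serial dependence in the returns; that is a simplification of this sketch.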
Projection for the Future:
Bootstrapping doesn’t just validate past performance—it helps project what we should expect going forward from a strategy whose returns are similar to those observed during the backtest. By generating new simulated paths based on historical data, bootstrapping allows us to identify when an out-of-sample trajectory deviates substantially from what was implied during the in-sample backtested period. This means you can spot unusual performance trends early and make more informed decisions about whether to continue using or adjust the strategy.
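One possible way to turn this idea into a monitoring rule is to build a percentile band from the bootstrapped paths and flag days on which live performance falls outside it. The sketch below is only an illustration: the 5%-95% band, the stand-in return series, the hypothetical live path that loses 0.5% per day, and the helper names `percentile` and `days_outside_band` are all assumptions, not something prescribed by the method itself.

```python
import random
import statistics

rng = random.Random(99)

# Hypothetical stand-in for the backtested daily returns (mean 0.10%, vol 2.50%)
raw = [rng.gauss(0.0, 1.0) for _ in range(500)]
m, s = statistics.fmean(raw), statistics.stdev(raw)
backtest_returns = [0.0010 + 0.0250 * (x - m) / s for x in raw]

n_paths, horizon = 1000, 252
values_by_day = [[] for _ in range(horizon)]
for _ in range(n_paths):
    total = 0.0
    for day, r in enumerate(rng.choices(backtest_returns, k=horizon)):
        total += r
        values_by_day[day].append(total)  # cumulative return on that day

def percentile(values, q):
    """Simple percentile: element at the rounded rank, q in [0, 100]."""
    ordered = sorted(values)
    return ordered[round(q / 100 * (len(ordered) - 1))]

lower_band = [percentile(v, 5) for v in values_by_day]
upper_band = [percentile(v, 95) for v in values_by_day]

def days_outside_band(live_cumulative):
    """Indices of days on which a live path leaves the 5%-95% band."""
    return [d for d, v in enumerate(live_cumulative)
            if v < lower_band[d] or v > upper_band[d]]

# Hypothetical live path that loses 0.5% every day
live_path = [-0.005 * (d + 1) for d in range(horizon)]
flagged = days_outside_band(live_path)
print(f"First day outside the band: {flagged[0] if flagged else None}")
```

A path that stays inside the band is consistent with the backtest; a path that drifts outside it for an extended period is the kind of deviation worth investigating.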
A Non-Parametric Approach
One of the standout features of bootstrapping is that it does not require assumptions about the underlying distribution of returns (such as normal distribution). This non-parametric approach makes bootstrapping a versatile tool, especially in cases where the return distributions may be skewed or exhibit fat tails that deviate from normality. By relying solely on the empirical data from the backtest, bootstrapping provides a robust method for assessing strategy performance without being constrained by predefined distributional shapes.
Key Takeaways
When evaluating the effectiveness of a trading strategy, we may use these statistical methods:
- One-Sided t-Test and p-Value (Type I Error Risk):
Definition: The probability of mistakenly identifying a strategy as profitable when it’s not.
Goal: Aim for a low p-value (ideally < 1%) to minimize this risk.
- Bootstrapping Probability (Type II Error Risk):
Definition: The probability of experiencing negative overall returns not because of a lack of edge but due to variability in returns.
Goal: Aim for a high probability of achieving profitable paths to reduce this risk.
In Summary:
Low p-value: Increases confidence that the strategy’s positive returns aren’t just due to random chance.
High Bootstrapping Success Rate: Reinforces the strategy’s robustness against volatility.
Strategy A would be a strong candidate given its low p-value (0.90%) and its small probability (3.5%) of a negative total return after 1 year, using the bootstrapping methodology described above.
In future discussions, we’ll explore the Bonferroni correction, a statistical method used to reduce the risk of False Positives when evaluating multiple trading strategies simultaneously. This technique ensures that the more strategies you test, the more cautious you become in declaring any single strategy as truly effective.
Understanding and applying these statistical techniques can significantly enhance the reliability of your trading strategies, ensuring that your decisions are backed by robust data analysis rather than mere chance.
If you have any questions or comments, do not hesitate to send a DM or reach out to me at carlo@concretumgroup.com.