When Direction Prediction Fails, Trade Relationships Instead
I spent months building an ML pipeline to predict the direction of a commodity futures contract. 65 features, purged cross-validation, meta-labeling, the full stack from Lopez de Prado's textbooks. The model achieved 55% accuracy and lost 89% while buy-and-hold gained 258%.
I have written about why that happened and about the data engineering we built for it. This post is about what we did after the failure.
The short version: we stopped predicting direction entirely. We started trading relationships between correlated assets. It was the only approach that beat buy-and-hold out of sample.
Why Direction Prediction Is So Hard
The core problem is not the model. It is the question. "Will price go up tomorrow?" requires the model to predict the net effect of every force acting on the market: macro policy, geopolitics, institutional flows, retail sentiment, algorithmic activity, option hedging, seasonal patterns. Some of these forces are observable. Most are not.
Worse, the relative importance of these forces changes over time. In our training data (2014-2023), real yield was the dominant driver. When real yields rose, the asset fell. The model learned this relationship correctly. Then in 2024-2026, the relationship broke. Real yields stayed positive but prices rallied anyway, driven by sovereign buying that the model had never seen in training.
This is not a bug in the model. It is a structural limitation of direction prediction in non-stationary systems. The relationships your model learns are conditional on the regime that produced the training data. When the regime changes, the model is confidently wrong.
The Pivot: Relative Value
Instead of asking "will price go up?", we started asking a different question: "is this asset outperforming or underperforming a correlated asset?"
This question is fundamentally different for two reasons:
- Ratios between correlated assets are more stationary than raw prices. The ratio fluctuates around a long-term mean. Even when both assets are trending, their ratio tends to be range-bound.
- Momentum in the ratio is persistent. When one asset starts outperforming, it tends to continue for weeks or months. This is not a prediction about the future. It is an observation that trends in relative strength have inertia.
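The first property can be illustrated with synthetic data. This is a toy sketch, not the post's actual data: two assets share a common trend plus small idiosyncratic noise, and every parameter below is invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2500  # roughly ten years of daily closes

# Shared trend component plus per-asset idiosyncratic noise (all illustrative).
common = np.cumsum(rng.normal(0.001, 0.01, n))
noise_a = np.cumsum(rng.normal(0.0, 0.002, n))
noise_b = np.cumsum(rng.normal(0.0, 0.002, n))

asset_a = 100 * np.exp(common + noise_a)
asset_b = 50 * np.exp(common + noise_b)
ratio = asset_a / asset_b

# Both prices trend far from their starting levels, but the ratio's
# high/low span is much narrower than either price's span.
price_span = asset_a.max() / asset_a.min()
ratio_span = ratio.max() / ratio.min()
```

The common trend cancels in the ratio, leaving only the (small) idiosyncratic difference, which is why the ratio stays range-bound while both prices trend.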
We tested this using the ratio between our primary asset and a closely correlated one in the same sector. Over 25 years of data, the ratio ranged from roughly 32 to 127, with a mean around 70.
The signal: when momentum in the ratio is rising (primary asset gaining strength relative to the correlated one), the trend is likely to continue. When momentum is falling, reduce exposure.
The Strategy
The implementation is almost embarrassingly simple. Two inputs: the daily close of the primary asset and the daily close of the correlated asset.
```python
ratio = primary_close / correlated_close
ratio_momentum = ratio.rolling(20).mean() > ratio.rolling(60).mean()

ema_50 = primary_close.ewm(span=50).mean()
ema_200 = primary_close.ewm(span=200).mean()
trend_up = ema_50 > ema_200
```

Position sizing:
- Both signals bullish (ratio momentum + trend up): full position (1.0x)
- One signal bearish: reduced position (0.3x)
- Both signals bearish: minimal position (0.09x)
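The three position states above map to a small function. A minimal sketch (the function name is mine; the sizes are the post's):

```python
def position_size(ratio_momentum_up: bool, trend_up: bool) -> float:
    """Map the two boolean signals to a position multiplier.

    1.0x when both signals are bullish, 0.3x when exactly one is
    bearish, 0.09x when both are bearish.
    """
    if ratio_momentum_up and trend_up:
        return 1.0
    if ratio_momentum_up or trend_up:
        return 0.3
    return 0.09
```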
That is the entire strategy. No ML model. No feature engineering. No meta-labeling. Two inputs, two signals, three position states.
The Results
We tested this on the full 20-year history (2006-2026), using walk-forward validation with 5-year training windows and 1-year test windows, rolling quarterly.
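The walk-forward scheme can be sketched as a split generator. Window lengths here assume 252 trading days per year and 63 per quarter; the exact calendar handling in the actual backtest may differ.

```python
def walk_forward_splits(n_days, train_days=252 * 5, test_days=252, step_days=63):
    """Yield (train_start, train_end, test_end) index triples.

    Defaults match the post's scheme: ~5-year training windows,
    ~1-year test windows, rolled forward quarterly.
    """
    start = 0
    while start + train_days + test_days <= n_days:
        yield start, start + train_days, start + train_days + test_days
        start += step_days
```

Each model (or here, each signal evaluation) sees only data before its test window, so every out-of-sample year is scored by parameters that never saw it.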
| Metric | Buy & Hold | Relative Value Strategy |
|---|---|---|
| Sharpe Ratio (OOS) | 0.96 | 0.97 |
| Max Drawdown | -21.3% | -13.2% |
| Sharpe / Max DD | 4.50 | 7.35 |
| Annual Return | ~18% | ~16% |
| Trades per Year | 0 | ~45 |
Raw Sharpe is nearly identical. But maximum drawdown dropped from 21.3% to 13.2%. The risk-adjusted return (Sharpe per unit of max drawdown) improved by 63%.
You give up roughly 2% of annual return. You cut maximum drawdown by roughly 40%. For most capital allocators, that is a trade worth making.
Robustness Testing
The results need to survive more than one test.
Parameter sweep (25 combinations): We varied the momentum lookback windows and EMA periods across 25 parameter combinations. 92% of combinations beat buy-and-hold on risk-adjusted metrics. If only the optimal parameter set works, you have overfit. When 92% work, the signal is real.
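The sweep structure looks like this. The specific window values are illustrative (the post only names the 20/60 baseline); `backtest_sharpe` stands in for whatever backtest you run per combination.

```python
import itertools

# Illustrative 5x5 grid around the 20/60 baseline.
fast_windows = [10, 15, 20, 25, 30]   # short momentum lookback
slow_windows = [40, 50, 60, 70, 80]   # long momentum lookback
grid = list(itertools.product(fast_windows, slow_windows))

def sweep(backtest_sharpe, benchmark_sharpe):
    """Run the backtest once per combination; return the share of
    combinations whose Sharpe beats the benchmark."""
    results = {params: backtest_sharpe(*params) for params in grid}
    beat = sum(s > benchmark_sharpe for s in results.values())
    return beat / len(results)
```

The output to look at is the fraction, not the best cell: a high fraction means the signal survives parameter perturbation, which is the overfitting check the post describes.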
Monte Carlo bootstrap (10,000 iterations): Random resampling of returns with replacement. P(strategy Sharpe > 0.5) = 92.4%. P(strategy beats buy-and-hold) = 61.0%.
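The bootstrap itself is a few lines. A sketch of the P(Sharpe > threshold) estimate, assuming daily returns and a 252-day annualization:

```python
import numpy as np

def bootstrap_sharpe_prob(daily_returns, threshold=0.5, n_iter=10_000, seed=0):
    """Resample daily returns with replacement; estimate the probability
    that the annualized Sharpe exceeds `threshold`."""
    rng = np.random.default_rng(seed)
    daily_returns = np.asarray(daily_returns)
    n = len(daily_returns)
    sharpes = np.empty(n_iter)
    for i in range(n_iter):
        sample = rng.choice(daily_returns, size=n, replace=True)
        sharpes[i] = np.sqrt(252) * sample.mean() / sample.std()
    return (sharpes > threshold).mean()
```

Note that i.i.d. resampling destroys autocorrelation in the return series; a block bootstrap is the usual refinement when serial dependence matters.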
Cost sensitivity: Profitable across spread assumptions from 1 to 10 points. At 3 points (realistic for our asset), the results were essentially unchanged.
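A cost-sensitivity check reduces to charging a cost on every position change. A minimal sketch, with cost expressed as a return fraction per unit of turnover (how you convert spread points to that fraction depends on the contract):

```python
import numpy as np

def strategy_returns(asset_returns, positions, cost_per_turn):
    """Strategy daily returns net of transaction costs.

    positions[i] is the size held during day i; each change in size
    pays `cost_per_turn` (a return fraction) per unit of turnover.
    """
    positions = np.asarray(positions, dtype=float)
    asset_returns = np.asarray(asset_returns, dtype=float)
    turnover = np.abs(np.diff(positions, prepend=0.0))
    return asset_returns * positions - turnover * cost_per_turn
```

Re-running the backtest over a range of `cost_per_turn` values gives the 1-to-10-point sensitivity table the post refers to.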
Walk-forward: Sharpe 0.61 vs buy-and-hold 0.59 in the most conservative walk-forward test. Tighter than the full-period numbers, but still positive.
Why This Works When Direction Prediction Does Not
The ratio momentum signal sidesteps the regime shift problem. It does not need to know why prices are rising. It only needs to observe that the primary asset is gaining relative strength. This observation is valid regardless of whether the driver is real yield, geopolitical risk, or central bank buying.
When the fundamental relationships change, the ratio often signals it before the individual asset does. If the primary asset starts underperforming its peer while still nominally trending up, that divergence is information. The 65-feature ML model could not see this because it was looking at the primary asset in isolation.
Relative value also naturally hedges against broad market moves. If both assets fall during a risk-off event but the primary one falls less, the ratio actually improves. A pure direction model sees "price went down" and panics.
What We Keep ML For
We did not abandon ML entirely. We just moved it to where it actually helps:
Regime classification: A 3-state Hidden Markov Model on returns and volatility correctly identifies bull, range, and bear periods. We do not use this for prediction. We use it for position sizing. Full size in confirmed trends, reduced size in range-bound periods.
Parameter optimization: Bayesian optimization runs as a daily batch job, tuning the lookback windows and EMA periods based on recent performance. Zero inference latency because it runs offline. The model adjusts parameters, not signals.
Anomaly detection: An autoencoder on the feature distribution flags when current market conditions look unlike anything in the training set. When the reconstruction error spikes, we reduce position size. This is the "I don't know" detector. It catches regime shifts before the signal model does.
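The post uses an autoencoder; the same "I don't know" logic can be sketched with a lightweight stand-in, PCA reconstruction error. Fit on training features, then flag days whose reconstruction error sits far above the training norm. The class and thresholds below are mine, not the post's implementation.

```python
import numpy as np

class ReconstructionAnomalyDetector:
    """PCA stand-in for the autoencoder: large reconstruction error
    means current conditions look unlike the training distribution."""

    def __init__(self, n_components=5, z_threshold=4.0):
        self.n_components = n_components
        self.z_threshold = z_threshold

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        errs = self._errors(X)
        self.err_mean_, self.err_std_ = errs.mean(), errs.std()
        return self

    def _errors(self, X):
        Xc = X - self.mean_
        recon = Xc @ self.components_.T @ self.components_
        return np.linalg.norm(Xc - recon, axis=1)

    def is_anomalous(self, X):
        z = (self._errors(X) - self.err_mean_) / self.err_std_
        return z > self.z_threshold
```

When `is_anomalous` fires, reduce position size; the detector does not say what changed, only that the inputs have drifted out of distribution.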
All three of these applications share a property: they do not predict direction. They classify, optimize, or detect. These are problems where ML has a genuine advantage over simple heuristics. Direction prediction in trending markets is not.
The Meta-Lesson
After building 65 features, implementing every advanced technique from the quantitative finance literature, and watching it lose 89%, the thing that actually worked was a ratio between two correlated assets with a 20/60-day moving average crossover.
I do not think this means ML is useless for trading. I think it means the field has a question-selection problem. The default question everyone asks ("will price go up?") is the hardest possible question to answer with historical data. Regime shifts, non-stationarity, and adversarial market participants make it a moving target.
Easier questions exist. "Is this asset outperforming its peers?" "Is the current regime similar to past regimes?" "Are the model's input distributions drifting?" These questions are answerable, stable, and genuinely useful for risk management and position sizing.
The 65-feature pipeline was not wasted work. We still use the data infrastructure, the dollar bars, the FFD features, and the purged CV framework. But we use them for classification and anomaly detection, not for the prediction that destroyed our backtest.
The best trading systems I have built are not the most sophisticated. They are the ones that ask questions the data can actually answer.