Can a machine learning model beat buy-and-hold on Indonesian stocks?
This project builds, tests, and rigorously evaluates a trading strategy powered by XGBoost — one of the most battle-tested ML algorithms in quantitative finance.

Imagine you could teach a computer to study 15 years of stock price history — every wiggle, every trend, every indicator traders use — and learn when a stock is most likely to rise in the next few days. That’s exactly what this framework does.
It trains an XGBoost machine learning model on historical stock data, then simulates trading based on that model’s signals, and finally presents a full performance dashboard so you can judge for yourself: is this strategy worth anything in the real world?
The framework is built with a trader’s mindset: it accounts for realistic trade execution, transaction costs, and — critically — makes sure the model never “cheats” by peeking at future data during training.
| Goal | How It’s Addressed |
|---|---|
| Predict short-term price direction | XGBoost classifier on 80+ technical features |
| Avoid overfitting (the #1 failure of ML in trading) | Walk-forward retraining + purged cross-validation + feature selection |
| Simulate real-world trading | Execution at next day’s open price + transaction costs |
| Measure true out-of-sample performance | Strict train / validation / hold-out data splits |
| Understand why the model trades | Feature importance chart |
Stock price data (OHLCV: Open, High, Low, Close, Volume) is downloaded automatically via yfinance for any ticker — Indonesian stocks (.JK), US stocks, or any market supported by Yahoo Finance. The default lookback is 15 years.
Raw prices are transformed into features that describe the current market state. These are the same signals technical traders use, but fed to a machine instead of human eyes:
All features are normalized relative to price (e.g., (Close - SMA) / SMA) so they stay comparable across different stocks and time periods.
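As a rough illustration of the price-relative normalization described above, here is a minimal sketch using pandas. The helper name `sma_deviation` and the sample prices are made up for this example and are not taken from the framework's code.

```python
import pandas as pd

def sma_deviation(close: pd.Series, window: int) -> pd.Series:
    """Relative distance of the close from its simple moving average."""
    sma = close.rolling(window).mean()
    # Dividing by the SMA makes the feature unit-free, so a stock trading
    # at 100 and one trading at 10,000 produce comparable values.
    return (close - sma) / sma

# Illustrative prices only.
close = pd.Series([100.0, 102.0, 104.0, 103.0, 108.0])
sma3_d = sma_deviation(close, window=3)
scaled = sma_deviation(close * 10, window=3)  # same stock quoted at 10x the price
```

Because the feature is a ratio, `sma3_d` and `scaled` hold the same values, which is the property that keeps features comparable across tickers and time periods.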
The model learns to predict: “Will this stock rise by at least X% over the next N days?”
- `1` (buy signal) if the stock gains > `label_pct`% in `forward_days` trading days, else `0` (stay flat).
- `-1` label (sell/short signal) for stocks expected to drop.

The default is a 3-day horizon with a 1% threshold, which asks the model to predict meaningful near-term moves.
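The buy label can be sketched as follows. This assumes the forward return is measured close-to-close, which the text does not state explicitly; `make_labels` is a hypothetical helper name, and only the `1`/`0` case is shown.

```python
import numpy as np
import pandas as pd

def make_labels(close: pd.Series, forward_days: int = 3, label_pct: float = 1.0) -> pd.Series:
    """1.0 if the close is more than label_pct% higher forward_days later, else 0.0."""
    # Assumption: return from today's close to the close forward_days ahead.
    fwd_ret = close.shift(-forward_days) / close - 1.0
    labels = (fwd_ret > label_pct / 100.0).astype(float)
    # The last forward_days rows have no future price yet, so they stay unlabeled.
    labels[fwd_ret.isna()] = np.nan
    return labels

# Illustrative prices only.
close = pd.Series([100.0, 100.5, 101.0, 102.0, 100.0, 99.0, 101.0])
labels = make_labels(close)
```

Rows without a label should be dropped before training; keeping them as `0` would quietly teach the model that the most recent days were all "stay flat".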
The dataset is split chronologically — never randomly — into three non-overlapping periods:
|──────────── In-Sample (60%) ────────────|── Validation (20%) ──|── Hold-Out (20%) ──|
| Model is trained here | Hyperparameter tuning | True test of reality |
Why this matters: Most “backtests” you see online are overfit — the model was implicitly tuned on the test data. This framework enforces a strict wall between learning and evaluation.
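A chronological 60/20/20 split needs nothing more than index arithmetic. The sketch below assumes rows are already sorted by date; `chrono_split` is a name invented for this example.

```python
def chrono_split(n_rows: int, is_frac: float = 0.6, val_frac: float = 0.2):
    """Return slices for in-sample, validation, and hold-out, in time order."""
    is_end = int(n_rows * is_frac)
    val_end = is_end + int(n_rows * val_frac)
    # No shuffling: each block starts exactly where the previous one ends.
    return slice(0, is_end), slice(is_end, val_end), slice(val_end, n_rows)

is_idx, val_idx, ho_idx = chrono_split(1000)
rows = list(range(1000))
in_sample, validation, hold_out = rows[is_idx], rows[val_idx], rows[ho_idx]
```

The same slices can be passed to `DataFrame.iloc` to cut a price table into the three periods.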
XGBoost (eXtreme Gradient Boosting) is an ensemble of decision trees built iteratively: each new tree is fit to correct the errors of the trees before it. It is widely used in winning solutions to machine learning competitions on tabular data.
Key anti-overfitting measures baked in:
| Parameter | Setting | Effect |
|---|---|---|
| `max_depth = 3` | Shallow trees | Can’t memorize noise |
| `learning_rate = 0.02` | Small steps | Generalizes better |
| `colsample_bytree = 0.5` | Use 50% of features per tree | Forces diversity |
| `min_child_weight = 8` | Need 8+ samples per leaf | Avoids spurious splits |
| `reg_alpha/lambda` | L1 + L2 regularization | Penalizes complexity |
| `early_stopping_rounds = 40` | Stop if val loss doesn’t improve | Prevents overtraining |
| `scale_pos_weight` | Auto class balancing | Handles rare buy signals |
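In code, these settings could be collected into a parameter dict along the following lines. Two things here are assumptions: the regularization strengths are left out because their values are not stated, and the "auto" class weight is computed as the negative-to-positive ratio, which is a common convention but not confirmed as this framework's formula.

```python
import numpy as np

def balanced_pos_weight(y) -> float:
    """Ratio of negative to positive labels (assumed weighting rule)."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    return n_neg / n_pos if n_pos else 1.0

# Values taken from the table above; reg_alpha / reg_lambda omitted on purpose.
params = dict(
    max_depth=3,
    learning_rate=0.02,
    colsample_bytree=0.5,
    min_child_weight=8,
    early_stopping_rounds=40,
)

y_train = [0, 0, 0, 1, 0, 0, 1, 0]  # illustrative labels: 6 negatives, 2 positives
params["scale_pos_weight"] = balanced_pos_weight(y_train)
```

A dict like this would typically be unpacked into `xgboost.XGBClassifier(**params)`, with the validation period supplied as the evaluation set so early stopping has something to monitor.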
Standard k-fold cross-validation is broken for time series — it lets the model train on data after the test period, leaking future information.
This framework uses Purged + Embargoed Walk-Forward CV:
The purge gap removes training rows whose labels could share information with the test period (a 3-day forward label computed on day T depends on prices on days T+1 through T+3).
The output is a CV AUC score — a leakage-free estimate of how well the model distinguishes buy opportunities from non-opportunities.
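One possible way to generate such splits is sketched below. It is not the framework's own splitter: the fold layout, the function name, and the choice to treat the embargo as extra rows dropped from the end of each training window are all assumptions made for illustration.

```python
import numpy as np

def purged_walk_forward_splits(n_rows, n_folds=5, purge=3, embargo=0):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    fold_size = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold_size
        test_end = n_rows if k == n_folds else test_start + fold_size
        # Drop the last purge + embargo rows before the test fold: their
        # forward-looking labels would reach into the test prices.
        train_end = max(test_start - purge - embargo, 0)
        yield np.arange(train_end), np.arange(test_start, test_end)

splits = list(purged_walk_forward_splits(120, n_folds=5, purge=3))
```

With `purge` set to the label horizon (3 days by default), the last training label is computed entirely from prices that come before the first test row.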
Training on all 80+ features often hurts performance — noise overwhelms signal. After initial training, the framework ranks features by XGBoost importance and keeps only the top-K (default: 15). The model is then retrained on this curated feature set.
In the example dashboard (BBCA.JK), the top features were SMA deviations, EMA distances, rate-of-change, MACD, and Bollinger Band position — classic trend-following and mean-reversion signals.
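The top-K step itself is a simple ranking. In this sketch the importance scores are invented and, apart from `sma10_d`, so are the feature names; with XGBoost's scikit-learn wrapper, the real scores would come from the fitted model's `feature_importances_` attribute.

```python
def select_top_k(importances: dict, k: int = 15) -> list:
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Made-up scores for illustration only.
importances = {"sma10_d": 0.30, "roc5": 0.10, "macd": 0.20, "bb_pos": 0.05}
kept = select_top_k(importances, k=2)
```

The model is then refit on the `kept` columns only.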
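Markets change. A model trained in 2015 may be completely wrong in 2023. To handle this, the model is retrained periodically on an expanding window (every 6 months by default, set by `retrain_months`), and each retrained model only predicts the period that follows its training data.
Periodic retraining is common practice in systematic trading. It’s the difference between a static snapshot and a living, adapting system.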
|─ IS train ─|── retrain ──|── retrain ──|── retrain ──|
↓ ↓ ↓
predict → predict → predict →
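The schedule in the diagram can be expressed as a generator of train/predict masks. This is only a sketch under stated assumptions: retraining happens on calendar-month boundaries counted from a chosen start date, and the function names are hypothetical.

```python
import pandas as pd

def retrain_dates(index: pd.DatetimeIndex, start, retrain_months: int = 6):
    """Dates on which the model is refit, spaced retrain_months apart."""
    dates, current = [], pd.Timestamp(start)
    while current <= index[-1]:
        dates.append(current)
        current = current + pd.DateOffset(months=retrain_months)
    return dates

def walk_forward_windows(index, start, retrain_months=6):
    """Yield (train_mask, predict_mask) for each retraining step."""
    cuts = retrain_dates(index, start, retrain_months)
    cuts.append(index[-1] + pd.Timedelta(days=1))
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        train_mask = index < lo                      # expanding: all history so far
        predict_mask = (index >= lo) & (index < hi)  # only the next block
        yield train_mask, predict_mask

idx = pd.bdate_range("2020-01-01", "2021-12-31")
windows = list(walk_forward_windows(idx, "2021-01-01"))
```

Each training mask ends strictly before its prediction block begins, so no prediction is ever made by a model that has seen that day.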
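Raw probability outputs from the model are converted to trading signals using a dual filter: a prediction must rank in the top `buy_pct`% of model confidence (default: top 25%), and its raw probability must be at least `min_prob_floor` (default: 0.40).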
An optional SMA regime filter (e.g., SMA-200) can restrict longs to bull markets only, significantly reducing drawdown.
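A percentile filter, a probability floor, and an optional regime mask could be combined roughly like this. How the framework picks the reference distribution for the percentile is not described, so `reference_probs` (for example, earlier out-of-sample predictions) is an assumption, as is the function name.

```python
import numpy as np

def buy_signals(probs, reference_probs, buy_pct=25, min_prob_floor=0.40, regime_ok=None):
    """Return 1 where a prediction passes every filter, else 0."""
    probs = np.asarray(probs, dtype=float)
    # Top buy_pct% of the reference distribution, e.g. the 75th percentile for 25.
    cutoff = np.percentile(reference_probs, 100 - buy_pct)
    signal = (probs >= cutoff) & (probs >= min_prob_floor)
    if regime_ok is not None:
        # Optional regime filter, e.g. close above SMA(200).
        signal &= np.asarray(regime_ok, dtype=bool)
    return signal.astype(int)

# Illustrative numbers only.
reference = [0.10, 0.20, 0.30, 0.35, 0.45, 0.50, 0.60, 0.70]
probs = [0.30, 0.55, 0.70, 0.52]
plain = buy_signals(probs, reference)
with_regime = buy_signals(probs, reference, regime_ok=[True, True, False, True])
low_conf = buy_signals([0.34, 0.36], [0.10, 0.20, 0.30, 0.35])
```

The floor matters when the model is unsure across the board: in `low_conf` both predictions clear the percentile cutoff, yet neither becomes a trade.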
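The backtest is designed to be as close to real trading as possible: signals are executed at the next day’s open price, and transaction costs are deducted (default: 5 basis points round-trip).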
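Next-open execution with costs can be sketched in a few lines of pandas. The accounting here is an assumption rather than the framework's exact logic: returns are measured open-to-open, and half of the round-trip cost is charged on each entry or exit.

```python
import pandas as pd

def backtest_next_open(open_px: pd.Series, signal: pd.Series, round_trip_bps: float = 5.0) -> pd.Series:
    """Daily strategy returns when a signal from day t is traded at the open of day t+1."""
    held = signal.shift(1).fillna(0)             # position in force from today's open
    oo_ret = open_px.shift(-1) / open_px - 1.0   # today's open to tomorrow's open
    per_side = round_trip_bps / 2 / 10_000       # assumed: half the round trip per trade
    trades = held.diff().abs().fillna(held.abs())
    return (held * oo_ret - trades * per_side).dropna()

# Illustrative prices and signals only.
open_px = pd.Series([100.0, 101.0, 103.0, 102.0, 104.0])
signal = pd.Series([1, 1, 0, 0, 0])  # known at each day's close
returns = backtest_next_open(open_px, signal)
```

The one-day `shift` is what removes look-ahead bias: a signal computed from day 0's close cannot earn anything before the open of day 1.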
The dashboard produced by the script has 6 panels:

You want the blue line to stay above orange, especially in the Hold-Out region — that’s the only part that counts.
sma10_d (deviation from 10-day SMA) dominated: the model learned to buy pullbacks from trend.

| 🔺 Green triangles = BUY entries | 🔻 Red triangles = SELL entries | ✕ = exits |
| Metric | What It Means | Good Value |
|---|---|---|
| Total Return | Cumulative gain over the period | Higher than B&H |
| B&H Return | What you’d earn just holding | The benchmark to beat |
| Ann. Return | Annualized compound return | > 15% is strong |
| Ann. Vol | Annualized daily volatility | Lower = smoother ride |
| Sharpe | Return per unit of risk | > 1.0 is good, > 1.5 is great |
| Sortino | Like Sharpe but only penalizes downside volatility | > 1.0 is solid |
| Max DD | Worst peak-to-trough drawdown | Closer to 0% is better |
| Calmar | Ann. Return / Max Drawdown | > 1.0 means returns justify the risk |
| Win Rate | % of trading days that were profitable | Often 50–60% is fine with good Sharpe |
| # Trades | Total position changes (entries + exits) | Lower = less friction |
| CV AUC | Cross-validation discrimination score | > 0.55 is meaningful signal |
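For reference, most of these metrics can be computed from a series of daily returns as below. The conventions are assumptions and may differ from the script's: 252 trading days per year, a zero risk-free rate, and downside deviation measured against zero.

```python
import numpy as np

def performance_metrics(daily_returns, periods_per_year=252):
    """Compute headline metrics from daily strategy returns."""
    r = np.asarray(daily_returns, dtype=float)
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])  # equity curve starting at 1.0
    ann_return = equity[-1] ** (periods_per_year / len(r)) - 1.0
    ann_mean = r.mean() * periods_per_year
    ann_vol = r.std(ddof=1) * np.sqrt(periods_per_year)
    downside_vol = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2) * periods_per_year)
    max_dd = (equity / np.maximum.accumulate(equity) - 1.0).min()
    return {
        "total_return": equity[-1] - 1.0,
        "ann_return": ann_return,
        "ann_vol": ann_vol,
        "sharpe": ann_mean / ann_vol if ann_vol > 0 else float("nan"),
        "sortino": ann_mean / downside_vol if downside_vol > 0 else float("nan"),
        "max_dd": max_dd,
        "calmar": ann_return / abs(max_dd) if max_dd < 0 else float("nan"),
        "win_rate": float((r > 0).mean()),
    }

# Three made-up daily returns, chosen so the drawdown is easy to check by hand.
m = performance_metrics([0.10, -0.50, 0.20])
```

In this toy series the equity curve goes 1.0, 1.1, 0.55, 0.66, so the worst peak-to-trough loss is the fall from 1.1 to 0.55.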

pip install yfinance xgboost scikit-learn pandas numpy matplotlib joblib python-dateutil
python ML_Stock_Backtest.py
The output will be:
- xgb_backtest_dashboard.png
- xgb_{TICKER}_{MODE}_{N}d.pkl (for live use)

Edit the CONFIG dict at the top of the file:
CONFIG = dict(
    ticker               = "BBCA.JK",   # Any Yahoo Finance ticker
    years                = 15,          # Years of history to download
    mode                 = "buy_only",  # "buy_only" or "buy_sell"
    forward_days         = 3,           # Prediction horizon (trading days)
    label_pct            = 1.0,         # % move threshold to label as "buy"
    retrain_months       = 6,           # Walk-forward retraining frequency
    top_k_features       = 15,          # Feature selection cutoff
    regime_sma           = None,        # e.g. 200 → only buy above SMA(200)
    buy_pct              = 25,          # Top N% confidence percentile for BUY
    min_prob_floor       = 0.40,        # Minimum raw probability to trade
    transaction_cost_bps = 5,           # Round-trip cost in basis points
)
This is a research and educational framework, not a live trading system. Before drawing conclusions:
| Decision | Rationale |
|---|---|
| XGBoost over deep learning | More robust on tabular data with limited samples; interpretable |
| Expanding window (not rolling) | Maximizes training data; avoids discarding early regime information |
| Percentile-rank signals (not raw probs) | Self-adjusting threshold; robust to distribution shift |
| Open-price execution | Removes look-ahead bias; more realistic than close-to-close |
| RobustScaler | Less sensitive to extreme outliers than StandardScaler |
| `scale_pos_weight` | Handles class imbalance without undersampling |
├── ML_Stock_Backtest.py # Main framework
├── README.md # This file
├── BBCA_JK_xgb_backtest_dashboard.png # Example dashboard output
└── xgb_BBCA.JK_buy_only_3d.pkl # Saved model bundle (after running)
MIT — use freely, contribute back if you improve it.
Built with Python · XGBoost · scikit-learn · yfinance · matplotlib