Metrics
Every metric Forven computes on a backtest, what it means, the reliability flags that suppress false confidence, and why you trust out-of-sample.
Every backtest Forven runs produces a block of metrics describing how a strategy behaved on historical data. This page explains each metric, the flags that mark a metric untrustworthy, and the one distinction that matters most: in-sample versus out-of-sample.
The short version: a strategy is split into a training window and an unseen window. Metrics computed on the unseen window are the ones the promotion gates read, and the ones you should read too.
Forven is a research tool. The numbers on these pages describe past behaviour on historical data. They are illustrative, they are not predictive, and nothing here is financial advice.
In-sample vs out-of-sample
Forven splits every backtest's data 70/30 at a fixed index:
- In-sample (IS) — the first 70% of the data. Signals are generated and tuned here. IS Sharpe is optimistic and does not predict live behaviour.
- Out-of-sample (OOS) — the last 30%, unseen during signal generation. OOS metrics are the ground truth for fitness gates and any deployment decision.
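As a rough illustration of the fixed-index split described above, the sketch below cuts a chronologically ordered series at 70% without shuffling. The function name `split_is_oos` and the placeholder data are assumptions for illustration, not Forven's actual code.

```python
from typing import List, Sequence, Tuple


def split_is_oos(bars: Sequence[float], train_ratio: float = 0.70) -> Tuple[List[float], List[float]]:
    """Split chronologically ordered bars at one fixed index (no shuffling)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    split_index = int(round(len(bars) * train_ratio))
    # Everything before the index is in-sample; everything after is unseen.
    return list(bars[:split_index]), list(bars[split_index:])


closes = [float(i) for i in range(1000)]  # placeholder price series
in_sample, out_of_sample = split_is_oos(closes)
print(len(in_sample), len(out_of_sample))
```

Because the cut is by position rather than by random sampling, no OOS bar ever precedes an IS bar, which is what keeps the unseen window genuinely unseen.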
The lab renders IS and OOS side by side. A strategy that looks excellent in-sample and poor out-of-sample is overfit. A strategy that holds up across both is worth promoting to paper (the public site calls this a "candidate").
A backtest result returns three blocks: in_sample, out_of_sample, and robustness. When you read a result, start with out_of_sample.
The core metrics
compute_metrics() derives 30+ values from each trade ledger. The ones you will read most often:
| Metric | What it measures | Read it as |
|---|---|---|
| Sharpe (annualized) | Risk-adjusted return, scaled to a year | Higher is steadier; treat with care on short windows (see flags) |
| Sortino | Like Sharpe but penalizes only downside volatility | Higher rewards strategies that are choppy on the upside, smooth on the downside |
| Max drawdown (%) | Largest peak-to-trough equity decline | Lower is calmer; the pain you would have had to sit through |
| Profit factor | Gross profit ÷ gross loss | Above 1.0 means wins outweighed losses (see infinite-PF caveat below) |
| Win rate | Fraction of trades that closed positive | Pairs with payoff; a low win rate can still be fine |
| Payoff / avg win vs avg loss | Average winner size relative to average loser | High payoff offsets a low win rate |
| Monthly / annualized return | Return scaled to a month or year | Illustrative only; suppressed on short windows |
| Avg bars held | Mean trade duration in bars | Tells you the holding horizon the metrics describe |
| Gross profit / gross loss | Raw totals before netting | The components behind profit factor |
These are computed identically for the IS and OOS blocks, so you can compare the same strategy across the train/test split directly.
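To make a few of the table's definitions concrete, here is a minimal sketch that derives profit factor, win rate, payoff, and max drawdown from a toy ledger. It is not the real `compute_metrics()`; the function names, dictionary keys, and sample numbers are assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def core_metrics(trade_pnls: List[float]) -> Dict[str, float]:
    """Derive a few headline ratios from per-trade net PnL values."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [p for p in trade_pnls if p < 0]
    gross_profit = sum(wins)
    gross_loss = abs(sum(losses))
    avg_win = gross_profit / len(wins) if wins else 0.0
    avg_loss = gross_loss / len(losses) if losses else 0.0
    return {
        "gross_profit": gross_profit,
        "gross_loss": gross_loss,
        "win_rate": len(wins) / len(trade_pnls) if trade_pnls else 0.0,
        # Wins with zero losses give an infinite profit factor.
        "profit_factor": gross_profit / gross_loss if gross_loss else (math.inf if gross_profit else 0.0),
        "payoff": avg_win / avg_loss if avg_loss else (math.inf if avg_win else 0.0),
    }


def max_drawdown_pct(equity: List[float]) -> float:
    """Largest peak-to-trough decline of an equity curve, in percent."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak * 100.0)
    return worst


metrics = core_metrics([100.0, -50.0, 200.0, -50.0])
drawdown = max_drawdown_pct([1000.0, 1100.0, 1050.0, 1250.0, 1200.0])
print(metrics["profit_factor"], metrics["win_rate"], round(drawdown, 2))
```

In this toy ledger the win rate is only 50%, yet the average winner is three times the average loser, which is the "high payoff offsets a low win rate" case from the table.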
Breakdowns
Two slices help you see where performance came from:
- by_side: long vs short. A strategy that only makes money on one side is really half a strategy.
- by_regime: performance split across the four market regimes (trend_up, trend_down, range_bound, high_vol). This shows whether an edge is regime-specific or broad-based.
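A breakdown of this kind amounts to grouping the ledger by a label and recomputing metrics per group. The sketch below assumes each trade is a dict with hypothetical `side`, `regime`, and `pnl` fields; the real ledger schema may differ.

```python
from collections import defaultdict
from typing import Dict, List


def breakdown(trades: List[dict], key: str) -> Dict[str, Dict[str, float]]:
    """Group trades by a label and summarize each group."""
    groups = defaultdict(list)
    for trade in trades:
        groups[trade[key]].append(trade["pnl"])
    return {
        name: {
            "trades": len(pnls),
            "net_pnl": sum(pnls),
            "win_rate": sum(p > 0 for p in pnls) / len(pnls),
        }
        for name, pnls in groups.items()
    }


trades = [
    {"side": "long", "regime": "trend_up", "pnl": 120.0},
    {"side": "long", "regime": "range_bound", "pnl": -40.0},
    {"side": "short", "regime": "trend_down", "pnl": -30.0},
    {"side": "short", "regime": "high_vol", "pnl": -25.0},
]
by_side = breakdown(trades, "side")
by_regime = breakdown(trades, "regime")
print(by_side)
```

Here the long side nets a profit while the short side only loses, the "half a strategy" pattern the by_side slice is meant to expose.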
Robustness and walk-forward
A single 70/30 split is one test. The gauntlet goes further with walk-forward analysis (WFA) — N-fold cross-validation that re-runs the IS/OOS split across multiple windows.
Two robustness figures matter:
- Robustness score: defined as 1.0 - max(IS→OOS Sharpe degradation, 0). A high score means performance held up from training to test; a low score means the edge decayed out-of-sample.
- Degradation %: 1 - (avg_oos_sharpe / avg_is_sharpe), aggregated across folds. The WFA verdict is PASS when degradation is below 50% and there are at least 5 OOS trades. Degradation above 50% is the signature of overfitting.
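The two formulas above can be sketched in a few lines. The function names, the "FAIL" label, and the handling of a non-positive IS Sharpe are assumptions; the source only defines the formulas and the PASS condition.

```python
def sharpe_degradation(avg_is_sharpe: float, avg_oos_sharpe: float) -> float:
    """1 - (avg_oos_sharpe / avg_is_sharpe), as a fraction (0.30 means 30%)."""
    if avg_is_sharpe <= 0:
        # Assumption: without a positive IS Sharpe the ratio is meaningless,
        # so this sketch treats the run as fully degraded.
        return 1.0
    return 1.0 - (avg_oos_sharpe / avg_is_sharpe)


def robustness_score(degradation: float) -> float:
    """1.0 - max(degradation, 0): OOS beating IS does not push the score above 1."""
    return 1.0 - max(degradation, 0.0)


def wfa_verdict(degradation: float, oos_trades: int) -> str:
    """PASS when degradation is below 50% and there are at least 5 OOS trades."""
    return "PASS" if degradation < 0.50 and oos_trades >= 5 else "FAIL"


held_up = sharpe_degradation(avg_is_sharpe=2.0, avg_oos_sharpe=1.4)
decayed = sharpe_degradation(avg_is_sharpe=2.0, avg_oos_sharpe=0.6)
print(round(held_up, 2), round(robustness_score(held_up), 2), wfa_verdict(held_up, oos_trades=12))
print(round(decayed, 2), round(robustness_score(decayed), 2), wfa_verdict(decayed, oos_trades=12))
```

The first case loses 30% of its Sharpe out-of-sample and still passes; the second loses 70% and does not, regardless of how many OOS trades it has.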
Defaults: 5 folds, 70% IS / 30% OOS per fold (walkforward_folds, walkforward_train_ratio). WFA needs at least 420 bars of data; each fold needs enough OOS bars and warmup to be meaningful. Adequacy warnings are logged when an OOS window is shorter than 30 days or a fold has too few bars — the numbers still compute, but treat them as weak.
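The source does not say how folds are laid out, so the following is only one plausible arrangement: contiguous, non-overlapping folds of equal size, each split 70/30, with the 420-bar minimum enforced up front. The function name and the layout itself are assumptions; warmup handling is omitted.

```python
from typing import List, Tuple


def walkforward_windows(
    n_bars: int,
    folds: int = 5,            # mirrors walkforward_folds
    train_ratio: float = 0.70,  # mirrors walkforward_train_ratio
    min_bars: int = 420,
) -> List[Tuple[int, int, int]]:
    """Return (is_start, oos_start, oos_end) bar indices for each fold."""
    if n_bars < min_bars:
        raise ValueError(f"walk-forward needs at least {min_bars} bars, got {n_bars}")
    fold_size = n_bars // folds
    windows = []
    for i in range(folds):
        start = i * fold_size
        # The last fold absorbs any remainder bars.
        end = n_bars if i == folds - 1 else start + fold_size
        split = start + int(round((end - start) * train_ratio))
        windows.append((start, split, end))
    return windows


windows = walkforward_windows(1000)
for is_start, oos_start, oos_end in windows:
    print(f"IS bars {is_start}..{oos_start - 1}, OOS bars {oos_start}..{oos_end - 1}")
```

With 1000 bars each fold gets 140 IS bars and 60 OOS bars, which shows why short histories trigger adequacy warnings: the OOS slice per fold shrinks quickly.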
Reliability flags — read these first
Short windows produce metrics that look spectacular and mean nothing. Forven attaches boolean flags so the gates and the UI can suppress numbers that aren't yet trustworthy.
| Flag | True when | Effect |
|---|---|---|
| sharpe_is_reliable | At least 20 trades | Sharpe is suppressed in displays/gates when false |
| annualized_return_reliable | At least 3 months of data | Annualized return suppressed when false |
| profit_factor_is_infinite | A strategy had wins and zero losses | PF returned as infinity; downstream must handle it explicitly |
| funding_applied | Any trade had funding costs applied | Provenance: were funding costs in the picture at all |
| funding_complete | All trades had complete funding data | Gates may reject a backtest run in a funding-blind window |
Two reasons these exist:
- Sharpe annualization scales by sqrt(trades_per_year), which inflates wildly on short runs. Under 20 trades, the figure is flagged unreliable.
- Annualized return is displayed only for runs covering at least 3 months of data. A 25-day run can report a return in the thousands of percent; that is an artifact, not a result. The flag suppresses the display; it does not change the underlying fitness math.
When sharpe_is_reliable or annualized_return_reliable is false, the corresponding number is hidden rather than shown with a false air of confidence. That is deliberate: transparency over confidence.
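A short worked example shows why short windows need suppressing. The sketch below assumes geometric compounding for annualized return, a per-trade Sharpe scaled by sqrt(trades_per_year), and 90 days as the meaning of "3 months"; the function names and sample returns are illustrative, and Forven's exact formulas may differ.

```python
import math
import statistics
from typing import Dict, List


def annualized_sharpe(trade_returns: List[float], trades_per_year: float) -> float:
    """Per-trade mean over standard deviation, scaled by sqrt(trades_per_year)."""
    if len(trade_returns) < 2:
        return 0.0
    spread = statistics.stdev(trade_returns)
    if spread == 0:
        return 0.0
    return statistics.mean(trade_returns) / spread * math.sqrt(trades_per_year)


def annualized_return(period_return: float, days: float) -> float:
    """Compound a period return out to 365 days (assumed convention)."""
    return (1.0 + period_return) ** (365.0 / days) - 1.0


def reliability_flags(n_trades: int, days_of_data: float) -> Dict[str, bool]:
    return {
        "sharpe_is_reliable": n_trades >= 20,
        "annualized_return_reliable": days_of_data >= 90,  # assumption: 3 months = 90 days
    }


# Five trades over a 25-day run.
returns = [0.02, 0.01, 0.03, -0.01, 0.02]
days = 25
sharpe = annualized_sharpe(returns, trades_per_year=len(returns) * 365.0 / days)
yearly = annualized_return(0.25, days)
flags = reliability_flags(len(returns), days)
print(f"Sharpe {sharpe:.2f}, annualized return {yearly:.0%}, flags {flags}")
```

Five unremarkable trades produce a Sharpe near 8, and a 25% gain in 25 days compounds to roughly 2,500% a year. Both flags come back false, so neither number would be shown.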
Funding-cost provenance
Forven applies HyperLiquid funding costs to backtest trades by default (backtest_include_funding). Because a funding-blind backtest can look better than reality, two flags record what actually happened:
- funding_applied: at least one trade carried a funding cost.
- funding_complete: every trade had complete funding history.
If funding_complete is false, the promotion gates may reject the run rather than promote a strategy validated in a window with missing funding data. On a fresh install the first backtest auto-backfills funding history from the exchange, so this self-heals over time.
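The two provenance flags reduce to an "any" and an "all" over the ledger. The sketch below uses a hypothetical `Trade` record with `funding_cost` and `funding_data_complete` fields; these names are assumptions, not Forven's schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trade:
    pnl: float
    funding_cost: float           # hypothetical field: funding paid (+) or received (-)
    funding_data_complete: bool   # hypothetical field: full funding history was available


def funding_flags(trades: List[Trade]) -> Dict[str, bool]:
    return {
        # True if funding entered the picture for at least one trade.
        "funding_applied": any(t.funding_cost != 0.0 for t in trades),
        # True only if every trade had full history; an empty ledger counts as incomplete here.
        "funding_complete": bool(trades) and all(t.funding_data_complete for t in trades),
    }


ledger = [
    Trade(pnl=50.0, funding_cost=1.2, funding_data_complete=True),
    Trade(pnl=-20.0, funding_cost=0.0, funding_data_complete=False),
]
flags = funding_flags(ledger)
print(flags)
```

This ledger has funding applied but incomplete, the combination that a gate may reject because part of the run was funding-blind.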
How fees, slippage, and leverage enter the numbers
Metrics already include trading costs. The defaults applied to each round-trip:
| Cost | Config key | Default |
|---|---|---|
| Fee | backtest_fee_bps | 4.5 bps |
| Slippage | backtest_slippage_bps | 2.0 bps |
| Leverage | (per-backtest) | 3x |
Fees and slippage are applied as a round-trip cost (entry + exit). This means a strategy's metrics reflect a return after costs — there is no separate "gross vs net" toggle to forget.
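As a worked example of how the defaults might reduce a trade's return, the sketch below assumes each bps figure is charged once on entry and once on exit, and that leverage scales the after-cost return linearly. Both are assumptions; the source states only that costs are applied as a round trip, and funding is left out here.

```python
def net_trade_return(
    gross_return: float,
    leverage: float = 3.0,
    fee_bps: float = 4.5,       # mirrors backtest_fee_bps
    slippage_bps: float = 2.0,  # mirrors backtest_slippage_bps
) -> float:
    """Return on margin after round-trip fees and slippage (funding excluded)."""
    per_side_cost = (fee_bps + slippage_bps) / 10_000.0
    round_trip_cost = 2.0 * per_side_cost  # assumption: charged on entry and on exit
    return leverage * (gross_return - round_trip_cost)


net = net_trade_return(0.01)  # a 1% favourable price move
print(f"{net:.4%}")
```

Under these assumptions a 1% price move at 3x yields about 2.61% rather than 3%, and any move smaller than 13 bps is a net loss.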
How the gates read metrics
The promotion gates compute a single fitness score (0–100) from five weighted factors, all read from the OOS metrics:
fitness = 30% Sharpe (capped at 3.0)
+ 20% win rate
+ 20% profit factor (capped at 5.0)
+ 15% max drawdown (penalty: 0 at 10%+ drawdown)
+ 15% trade-count bonus (min 20 trades)
The thresholds that move a strategy along the pipeline:
- >= 60: eligible for paper (the public "candidate" stage)
- >= 70: eligible for deploy / live
- < 40: eligible for retirement
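The weights and caps above leave the normalization of each factor unspecified, so the sketch below fills it in with labeled guesses: each factor is scaled linearly to 0..1, drawdown falls linearly to zero at 10%, and the trade-count bonus reaches full value at 20 trades. The function names and the "hold" label are also assumptions.

```python
import math


def fitness_score(sharpe: float, win_rate: float, profit_factor: float,
                  max_drawdown_pct: float, trades: int) -> float:
    """Weighted 0-100 score from OOS metrics (normalization is assumed)."""
    sharpe_part = min(max(sharpe, 0.0), 3.0) / 3.0          # capped at 3.0
    win_part = min(max(win_rate, 0.0), 1.0)                 # fraction of winning trades
    pf_part = min(max(profit_factor, 0.0), 5.0) / 5.0       # capped at 5.0; inf becomes 5.0
    dd_part = max(0.0, 1.0 - max_drawdown_pct / 10.0)       # 0 at 10%+ drawdown
    trade_part = min(trades / 20.0, 1.0)                    # full bonus at 20+ trades
    return 100.0 * (0.30 * sharpe_part + 0.20 * win_part + 0.20 * pf_part
                    + 0.15 * dd_part + 0.15 * trade_part)


def pipeline_action(score: float) -> str:
    """Map a fitness score onto the thresholds listed above."""
    if score >= 70:
        return "live"
    if score >= 60:
        return "paper"
    if score < 40:
        return "retire"
    return "hold"  # assumption: 40-59 neither promotes nor retires


score = fitness_score(sharpe=1.5, win_rate=0.55, profit_factor=2.0,
                      max_drawdown_pct=4.0, trades=40)
print(round(score, 1), pipeline_action(score))
```

Under this normalization a respectable-looking strategy (Sharpe 1.5, profit factor 2.0) scores about 58 and stays just short of the paper threshold, which illustrates how the caps keep any single factor from carrying the score.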
A metrics integrity check (check_metrics_integrity()) runs before any promotion. Impossible ratios, NaN, and other anomalies quarantine the result and block promotion — a strategy cannot be promoted on numbers that don't add up.
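The internals of `check_metrics_integrity()` are not shown here, so the following is only a guess at the kind of checks such a function might run: NaN values, out-of-range ratios, and a profit factor that disagrees with its components. The function name `find_metric_anomalies` and the metric keys are hypothetical.

```python
import math
from typing import Any, Dict, List


def find_metric_anomalies(metrics: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for name, value in metrics.items():
        if isinstance(value, float) and math.isnan(value):
            problems.append(f"{name} is NaN")

    win_rate = metrics.get("win_rate", 0.0)
    if not math.isnan(win_rate) and not 0.0 <= win_rate <= 1.0:
        problems.append("win_rate outside [0, 1]")

    drawdown = metrics.get("max_drawdown_pct", 0.0)
    if not math.isnan(drawdown) and not 0.0 <= drawdown <= 100.0:
        problems.append("max_drawdown_pct outside [0, 100]")

    pf = metrics.get("profit_factor", 0.0)
    gross_profit = metrics.get("gross_profit", 0.0)
    gross_loss = metrics.get("gross_loss", 0.0)
    if math.isinf(pf) and not metrics.get("profit_factor_is_infinite", False):
        problems.append("infinite profit_factor without its flag")
    elif gross_loss > 0 and math.isfinite(pf) and not math.isclose(pf, gross_profit / gross_loss, rel_tol=1e-6):
        problems.append("profit_factor does not match gross_profit / gross_loss")
    return problems


clean = {"sharpe": 1.2, "win_rate": 0.5, "max_drawdown_pct": 6.0,
         "profit_factor": 3.0, "gross_profit": 300.0, "gross_loss": 100.0}
broken = {"sharpe": float("nan"), "win_rate": 1.4, "max_drawdown_pct": 6.0,
          "profit_factor": 9.0, "gross_profit": 300.0, "gross_loss": 100.0}
print(find_metric_anomalies(clean))
print(find_metric_anomalies(broken))
```

A caller would block promotion whenever the returned list is non-empty.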
Reading a result via the API
You can pull the raw metrics directly. Submit a run and read the result:
# Submit a backtest
curl.exe -X POST http://127.0.0.1:8003/api/backtesting/run `
-H "x-api-key: $env:FORVEN_API_KEY" `
-H "content-type: application/json" `
-d '{ "strategy_type": "rsi_momentum", "asset": "BTC", "timeframe": "1h" }'
# Retrieve the full ledger + metrics
curl.exe http://127.0.0.1:8003/api/backtesting/results/<result_id> `
-H "x-api-key: $env:FORVEN_API_KEY"
The response carries the in_sample, out_of_sample, and robustness blocks described above. In the UI, the same data renders on the /backtest/{id} page as a chart, trades table, and metrics panel, with regime shadings on the chart.
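Once the JSON is parsed, a client can apply the "start with out_of_sample, honor the flags" rule itself. The sketch below works on an already-decoded dict; the three block names and the flag names come from this page, while the metric keys (`sharpe`, `annualized_return`, `max_drawdown_pct`, `profit_factor`), the placement of flags inside the block, and the sample values are assumptions about the payload shape.

```python
from typing import Any, Dict, Optional


def trusted_oos_view(result: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Pick OOS metrics and blank out any number whose flag says not to trust it."""
    oos = result["out_of_sample"]
    return {
        "sharpe": oos.get("sharpe") if oos.get("sharpe_is_reliable") else None,
        "annualized_return": (
            oos.get("annualized_return") if oos.get("annualized_return_reliable") else None
        ),
        "max_drawdown_pct": oos.get("max_drawdown_pct"),
        # An infinite profit factor usually means too few trades, so do not report it as a number.
        "profit_factor": None if oos.get("profit_factor_is_infinite") else oos.get("profit_factor"),
    }


# Hand-written sample payload, not real API output.
sample_result = {
    "in_sample": {"sharpe": 2.9},
    "out_of_sample": {
        "sharpe": 4.1, "sharpe_is_reliable": False,
        "annualized_return": 18.7, "annualized_return_reliable": False,
        "max_drawdown_pct": 3.2,
        "profit_factor": 2.4, "profit_factor_is_infinite": False,
    },
    "robustness": {},
}
view = trusted_oos_view(sample_result)
print(view)
```

In this sample the eye-catching Sharpe and annualized return are both dropped, leaving only the drawdown and profit factor to reason about.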
Caveats
- Trust OOS, not IS. In-sample metrics are tuning artifacts. The gates only read out-of-sample, and so should you.
- Honor the flags. A great Sharpe under 20 trades or a four-digit annualized return on a three-week window is noise, not edge.
- Infinite profit factor is real. A strategy with zero losses reports profit_factor = inf plus profit_factor_is_infinite. It usually means too few trades, not a perfect strategy.
- Costs are baked in. Fees, slippage, and funding are already in the metrics; don't double-count them.
- Past is not prologue. Every number here describes historical behaviour on historical data. It is illustrative, not predictive, and nothing here is financial advice.
Related
- The gauntlet: the full robustness battery these metrics feed
- Backtesting a strategy: where these metrics are produced
- Promotion gates: how fitness and reliability flags gate the pipeline
- Market regimes: the regimes behind the by_regime breakdown