Metrics

Every metric Forven computes on a backtest, what it means, the reliability flags that suppress false confidence, and why you trust out-of-sample.

Every backtest Forven runs produces a block of metrics describing how a strategy behaved on historical data. This page explains each metric, the flags that mark a metric untrustworthy, and the one distinction that matters most: in-sample versus out-of-sample.

The short version: each backtest's data is split into a training window and an unseen window. Metrics computed on the unseen window are the ones the promotion gates read, and the ones you should read too.

Forven is a research tool. The numbers on these pages describe past behaviour on historical data. They are illustrative, they are not predictive, and nothing here is financial advice.

In-sample vs out-of-sample

Forven splits every backtest's data 70/30 at a fixed index:

  • In-sample (IS) — the first 70% of the data. Signals are generated and tuned here. IS Sharpe is optimistic and does not predict live behaviour.
  • Out-of-sample (OOS) — the last 30%, unseen during signal generation. OOS metrics are the ground truth for fitness gates and any deployment decision.
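The split can be sketched in a few lines of Python. This is a minimal illustration, not Forven's implementation: the function name `split_is_oos` and the integer-percent parameter are assumptions, and the only facts taken from the text are the 70/30 ratio and the fixed, chronological split index.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_is_oos(bars: Sequence[T], train_pct: int = 70) -> Tuple[Sequence[T], Sequence[T]]:
    """Split chronologically ordered bars at a fixed index (no shuffling).

    `split_is_oos` and `train_pct` are hypothetical names used for illustration.
    """
    if not 0 < train_pct < 100:
        raise ValueError("train_pct must be between 1 and 99")
    cut = len(bars) * train_pct // 100  # fixed split index, integer math
    return bars[:cut], bars[cut:]


bars = list(range(1000))  # stand-in for 1000 hourly candles
in_sample, out_of_sample = split_is_oos(bars)
print(len(in_sample), len(out_of_sample))
```

Because the cut is positional rather than random, every OOS bar is later in time than every IS bar, which is what keeps the unseen window unseen.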

The lab renders IS and OOS side by side. A strategy that looks excellent in-sample and poor out-of-sample is most likely overfit. A strategy that holds up across both is worth promoting to paper (the public site calls this a "candidate").

A backtest result returns three blocks: in_sample, out_of_sample, and robustness. When you read a result, start with out_of_sample.

The core metrics

compute_metrics() derives 30+ values from each trade ledger. The ones you will read most often:

Metric | What it measures | Read it as
Sharpe (annualized) | Risk-adjusted return, scaled to a year | Higher is steadier; treat with care on short windows (see flags)
Sortino | Like Sharpe but penalizes only downside volatility | Higher rewards strategies that are choppy on the upside, smooth on the downside
Max drawdown (%) | Largest peak-to-trough equity decline | Lower is calmer; the pain you would have had to sit through
Profit factor | Gross profit ÷ gross loss | Above 1.0 means wins outweighed losses (see infinite-PF caveat below)
Win rate | Fraction of trades that closed positive | Pairs with payoff; a low win rate can still be fine
Payoff / avg win vs avg loss | Average winner size relative to average loser | High payoff offsets a low win rate
Monthly / annualized return | Return scaled to a month or year | Illustrative only; suppressed on short windows
Avg bars held | Mean trade duration in bars | Tells you the holding horizon the metrics describe
Gross profit / gross loss | Raw totals before netting | The components behind profit factor

These are computed identically for the IS and OOS blocks, so you can compare the same strategy across the train/test split directly.
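To make a few of these definitions concrete, here is a small sketch that derives profit factor, win rate, payoff, and max drawdown from a list of per-trade returns. It is not the real compute_metrics(): the function name `basic_metrics`, the input shape, and the choice to compound returns into an equity curve are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def basic_metrics(trade_returns: List[float]) -> Dict[str, float]:
    """Hypothetical helper: a handful of ledger metrics from per-trade returns."""
    wins = [r for r in trade_returns if r > 0]
    losses = [r for r in trade_returns if r < 0]
    gross_profit = sum(wins)
    gross_loss = -sum(losses)  # reported as a positive number

    if gross_loss > 0:
        profit_factor = gross_profit / gross_loss
    else:
        # Wins with zero losses: the infinite-PF case described in the flags section.
        profit_factor = math.inf if gross_profit > 0 else 0.0

    avg_win = gross_profit / len(wins) if wins else 0.0
    avg_loss = gross_loss / len(losses) if losses else 0.0
    if avg_loss > 0:
        payoff = avg_win / avg_loss
    else:
        payoff = math.inf if avg_win > 0 else 0.0

    # Max drawdown: largest peak-to-trough decline of the compounded equity curve.
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in trade_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = max(max_dd, (peak - equity) / peak)

    return {
        "gross_profit": gross_profit,
        "gross_loss": gross_loss,
        "profit_factor": profit_factor,
        "win_rate": len(wins) / len(trade_returns) if trade_returns else 0.0,
        "payoff": payoff,
        "max_drawdown_pct": 100.0 * max_dd,
    }


metrics = basic_metrics([0.10, -0.05, 0.20, -0.10])
print(metrics)
```

Running the same function over the IS trades and the OOS trades separately gives two directly comparable blocks, which is the point of computing both identically.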

Breakdowns

Two slices help you see where performance came from:

  • by_side — long vs short. A strategy that only makes money on one side is really half a strategy.
  • by_regime — performance split across the four market regimes (trend_up, trend_down, range_bound, high_vol). This shows whether an edge is regime-specific or broad-based.
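Both slices are group-by operations over the trade ledger. The sketch below assumes each trade is a mapping with `side`, `regime`, and `pnl` fields. Those field names and the `breakdown` helper are hypothetical, and the real blocks very likely carry more than a trade count and net PnL.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def breakdown(trades: Iterable[Mapping[str, Any]], key: str) -> Dict[str, Dict[str, float]]:
    """Group trades by `key` and total the count and net PnL per group."""
    groups: Dict[str, Dict[str, float]] = defaultdict(lambda: {"trades": 0, "net_pnl": 0.0})
    for trade in trades:
        bucket = groups[str(trade[key])]
        bucket["trades"] += 1
        bucket["net_pnl"] += float(trade["pnl"])
    return dict(groups)


# Illustrative ledger; the values are made up for the example.
trades = [
    {"side": "long", "regime": "trend_up", "pnl": 120.0},
    {"side": "long", "regime": "range_bound", "pnl": -40.0},
    {"side": "short", "regime": "trend_down", "pnl": -15.0},
    {"side": "short", "regime": "high_vol", "pnl": -25.0},
]

by_side = breakdown(trades, "side")
by_regime = breakdown(trades, "regime")
print(by_side)
print(by_regime)
```

In this toy ledger the long side carries the whole result and the short side loses, which is the "half a strategy" pattern the by_side slice is meant to expose.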

Robustness and walk-forward

A single 70/30 split is one test. The gauntlet goes further with walk-forward analysis (WFA) — N-fold cross-validation that re-runs the IS/OOS split across multiple windows.

Two robustness figures matter:

  • Robustness score — defined as 1.0 - max(IS→OOS Sharpe degradation, 0). A high score means performance held up from training to test; a low score means the edge decayed out-of-sample.
  • Degradation %, defined as 1 - (avg_oos_sharpe / avg_is_sharpe) aggregated across folds. The WFA verdict is PASS when degradation is below 50% and there are at least 5 OOS trades. Degradation above 50% is the signature of overfitting.
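The two figures and the verdict rule can be written out as follows. This is a sketch of the formulas as stated above, under some assumptions: the function names are hypothetical, the "FAIL" label is a placeholder (the text only names PASS), and the handling of a non-positive average IS Sharpe is a guess, since the ratio is not meaningful there.

```python
from typing import List


def degradation(is_sharpes: List[float], oos_sharpes: List[float]) -> float:
    """1 - (avg_oos_sharpe / avg_is_sharpe), averaged across folds."""
    avg_is = sum(is_sharpes) / len(is_sharpes)
    avg_oos = sum(oos_sharpes) / len(oos_sharpes)
    if avg_is <= 0:
        # Assumption: treat this as an error rather than divide by zero or flip signs.
        raise ValueError("degradation is undefined when average IS Sharpe is not positive")
    return 1.0 - avg_oos / avg_is


def robustness_score(deg: float) -> float:
    """1.0 - max(IS->OOS Sharpe degradation, 0)."""
    return 1.0 - max(deg, 0.0)


def wfa_verdict(deg: float, oos_trades: int) -> str:
    """PASS when degradation is below 50% and there are at least 5 OOS trades."""
    return "PASS" if deg < 0.50 and oos_trades >= 5 else "FAIL"


# Five folds of illustrative Sharpe values.
is_sharpes = [2.0, 1.5, 2.5, 2.0, 2.0]
oos_sharpes = [1.0, 1.5, 1.0, 2.0, 1.5]

deg = degradation(is_sharpes, oos_sharpes)
print(f"degradation={deg:.2f} robustness={robustness_score(deg):.2f} "
      f"verdict={wfa_verdict(deg, oos_trades=12)}")
```

Note that the `max(..., 0)` clamp means a strategy that does better out-of-sample than in-sample scores 1.0 rather than something above it.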

Defaults: 5 folds, 70% IS / 30% OOS per fold (walkforward_folds, walkforward_train_ratio). WFA needs at least 420 bars of data; each fold needs enough OOS bars and warmup to be meaningful. Adequacy warnings are logged when an OOS window is shorter than 30 days or a fold has too few bars — the numbers still compute, but treat them as weak.

Reliability flags — read these first

Short windows produce metrics that look spectacular and mean nothing. Forven attaches boolean flags so the gates and the UI can suppress numbers that aren't yet trustworthy.

Flag | True when | Effect
sharpe_is_reliable | At least 20 trades | Sharpe is suppressed in displays/gates when false
annualized_return_reliable | At least 3 months of data | Annualized return suppressed when false
profit_factor_is_infinite | A strategy had wins and zero losses | PF returned as infinity; downstream must handle it explicitly
funding_applied | Any trade had funding costs applied | Provenance: were funding costs in the picture at all
funding_complete | All trades had complete funding data | Gates may reject a backtest run in a funding-blind window

Two reasons these exist:

  1. Sharpe annualization scales by sqrt(trades_per_year), which inflates wildly on short runs. Under 20 trades, the figure is flagged unreliable.
  2. Annualized return is only displayed once a run covers at least 3 months of data. A 25-day run can report a return in the thousands of percent, which is an artifact, not a result. The flag suppresses the display; it does not change the underlying fitness math.

When a reliability flag (sharpe_is_reliable or annualized_return_reliable) is false, the number is hidden rather than shown with a false air of confidence. That is deliberate: transparency over confidence.
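A short worked example shows why the annualized-return flag exists. The sketch assumes simple compounding to a 365-day year and approximates "3 months" as 90 days. Neither is confirmed as Forven's exact formula, and the helper names are hypothetical. Only the thresholds (20 trades, 3 months) come from the table above.

```python
from typing import Optional

MIN_TRADES_FOR_SHARPE = 20    # from the flag table
MIN_DAYS_FOR_ANNUALIZED = 90  # assumption: "3 months" approximated as 90 days


def annualized_return(total_return: float, days: float) -> float:
    """Compound a period return out to a 365-day year (assumed formula)."""
    return (1.0 + total_return) ** (365.0 / days) - 1.0


def display_value(value: float, reliable: bool) -> Optional[float]:
    """Return the value when its flag is true, otherwise None (hidden)."""
    return value if reliable else None


# A 25-day run with 15 trades that gained 30% (illustrative numbers).
run_days, n_trades, total_return = 25, 15, 0.30
raw_annualized = annualized_return(total_return, run_days)

sharpe_is_reliable = n_trades >= MIN_TRADES_FOR_SHARPE
annualized_return_reliable = run_days >= MIN_DAYS_FOR_ANNUALIZED

print(f"raw annualized return: {raw_annualized:.0%}")
print("shown in UI:", display_value(raw_annualized, annualized_return_reliable))
```

A 30% gain over 25 days is compounded more than fourteen times to fill a year, so the raw figure lands in the thousands of percent. The flag keeps that number out of view while leaving the stored value untouched.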

Funding-cost provenance

Forven applies HyperLiquid funding costs to backtest trades by default (backtest_include_funding). Because a funding-blind backtest can look better than reality, two flags record what actually happened:

  • funding_applied — at least one trade carried a funding cost.
  • funding_complete — every trade had complete funding history.

If funding_complete is false, the promotion gates may reject the run rather than promote a strategy validated in a window with missing funding data. On a fresh install the first backtest auto-backfills funding history from the exchange, so this self-heals over time.
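The relationship between the two flags is "any" versus "all" over the trade ledger. A minimal sketch, assuming a hypothetical `Trade` record with `funding_cost` and `funding_data_complete` fields (the real ledger schema is not shown on this page):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trade:
    funding_cost: float           # hypothetical field: funding charged on this trade
    funding_data_complete: bool   # hypothetical field: full funding history was available


def funding_flags(trades: List[Trade]) -> Dict[str, bool]:
    """Derive the two provenance flags from a ledger."""
    return {
        # At least one trade carried a funding cost.
        "funding_applied": any(t.funding_cost != 0.0 for t in trades),
        # Every trade had complete funding history (an empty ledger is not complete).
        "funding_complete": bool(trades) and all(t.funding_data_complete for t in trades),
    }


ledger = [
    Trade(funding_cost=0.8, funding_data_complete=True),
    Trade(funding_cost=0.0, funding_data_complete=False),  # window with missing history
]
flags = funding_flags(ledger)
print(flags)
```

This ledger yields funding_applied true and funding_complete false: funding was in the picture, but not for every trade, which is the case a gate may choose to reject.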

How fees, slippage, and leverage enter the numbers

Metrics already include trading costs. The defaults applied to each round-trip:

Cost | Config key | Default
Fee | backtest_fee_bps | 4.5 bps
Slippage | backtest_slippage_bps | 2.0 bps
Leverage | (per-backtest) | 3x

Fees and slippage are applied as a round-trip cost (entry + exit). This means a strategy's metrics reflect a return after costs — there is no separate "gross vs net" toggle to forget.
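As a rough illustration of how these defaults eat into a trade, the sketch below assumes the bps figures are charged per side (once on entry, once on exit) on the position's notional, and that leverage scales both the price move and the costs. The page does not spell out either detail, so treat the arithmetic as one plausible reading rather than the engine's exact cost model. The function name is hypothetical.

```python
FEE_BPS = 4.5        # backtest_fee_bps default
SLIPPAGE_BPS = 2.0   # backtest_slippage_bps default
LEVERAGE = 3.0       # per-backtest default


def net_trade_return(
    gross_move: float,
    fee_bps: float = FEE_BPS,
    slippage_bps: float = SLIPPAGE_BPS,
    leverage: float = LEVERAGE,
) -> float:
    """Return on margin after round-trip costs, under the assumptions above."""
    per_side_cost = (fee_bps + slippage_bps) / 10_000.0  # bps -> fraction
    round_trip_cost = 2.0 * per_side_cost                # entry + exit
    return leverage * (gross_move - round_trip_cost)


# A 1% favourable price move at the default settings.
print(f"{net_trade_return(0.01):.4%}")
# A flat trade still loses the round-trip cost.
print(f"{net_trade_return(0.0):.4%}")
```

Under this reading a trade needs the price to move more than 0.13% in its favour just to break even, which is why strategies with many small trades are the most sensitive to the cost settings.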

How the gates read metrics

The promotion gates compute a single fitness score (0–100) from five weighted factors, all read from the OOS metrics:

fitness = 30% Sharpe (capped at 3.0)
        + 20% win rate
        + 20% profit factor (capped at 5.0)
        + 15% max drawdown (penalty: 0 at 10%+ drawdown)
        + 15% trade-count bonus (min 20 trades)

The thresholds that move a strategy along the pipeline:

  • >= 60 — eligible for paper (the public "candidate" stage)
  • >= 70 — eligible for deploy / live
  • < 40 — eligible for retirement
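The weights, caps, and thresholds above can be combined into a sketch like the one below. The weights, the 3.0 and 5.0 caps, the 10% drawdown floor, the 20-trade minimum, and the 60/70/40 thresholds come from the text. Everything else is an assumption: linear scaling inside each factor, an all-or-nothing trade-count bonus, win rate expressed as a fraction, and the "hold" label for scores between 40 and 60. The function names are hypothetical.

```python
import math


def _clip01(x: float) -> float:
    """Clamp to the [0, 1] range."""
    return max(0.0, min(1.0, x))


def fitness_score(sharpe: float, win_rate: float, profit_factor: float,
                  max_drawdown_pct: float, trades: int) -> float:
    """Weighted 0-100 score from OOS metrics (assumed linear scaling per factor)."""
    values = (sharpe, win_rate, profit_factor, max_drawdown_pct)
    if any(math.isnan(v) for v in values):
        raise ValueError("NaN metrics must be caught before scoring")
    sharpe_part = _clip01(sharpe / 3.0)               # capped at 3.0
    win_part = _clip01(win_rate)                      # fraction of winning trades
    pf_part = _clip01(profit_factor / 5.0)            # capped at 5.0; inf clips to 1.0
    dd_part = _clip01(1.0 - max_drawdown_pct / 10.0)  # 0 at 10%+ drawdown
    count_part = 1.0 if trades >= 20 else 0.0         # assumption: all-or-nothing bonus
    return 100.0 * (0.30 * sharpe_part + 0.20 * win_part + 0.20 * pf_part
                    + 0.15 * dd_part + 0.15 * count_part)


def stage(score: float) -> str:
    """Map a score to the pipeline thresholds; "hold" is a placeholder label."""
    if score >= 70:
        return "deploy"
    if score >= 60:
        return "paper"
    if score < 40:
        return "retire"
    return "hold"


score = fitness_score(sharpe=1.5, win_rate=0.55, profit_factor=2.0,
                      max_drawdown_pct=4.0, trades=40)
print(f"{score:.1f} -> {stage(score)}")
```

Capping Sharpe and profit factor means a single extreme value cannot carry a strategy past a threshold on its own; the other factors still have to contribute.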

A metrics integrity check (check_metrics_integrity()) runs before any promotion. Impossible ratios, NaN, and other anomalies quarantine the result and block promotion — a strategy cannot be promoted on numbers that don't add up.
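The kinds of anomalies such a check looks for can be illustrated with a few rules. This is not the real check_metrics_integrity(), whose rules are not listed here. The helper `find_anomalies`, the metric key names, and the specific rules are assumptions chosen to match the examples in the text (NaN values, impossible ratios).

```python
import math
from typing import Any, Dict, List


def find_anomalies(metrics: Dict[str, Any]) -> List[str]:
    """Hypothetical sanity pass: return a list of problems, empty when clean."""
    problems: List[str] = []

    # Any NaN value makes the block unusable.
    for name, value in metrics.items():
        if isinstance(value, float) and math.isnan(value):
            problems.append(f"{name} is NaN")

    # A win rate is a fraction, so it must sit in [0, 1].
    win_rate = metrics.get("win_rate", 0.0)
    if not math.isnan(win_rate) and not 0.0 <= win_rate <= 1.0:
        problems.append("win_rate outside [0, 1]")

    # Profit factor is a ratio of two non-negative totals.
    profit_factor = metrics.get("profit_factor", 0.0)
    if profit_factor < 0:
        problems.append("profit_factor is negative")
    # Infinity is legitimate only when the companion flag says so.
    if math.isinf(profit_factor) and not metrics.get("profit_factor_is_infinite", False):
        problems.append("infinite profit_factor without profit_factor_is_infinite")

    return problems


clean = {"sharpe": 1.2, "win_rate": 0.55, "profit_factor": 1.8,
         "profit_factor_is_infinite": False}
broken = {"sharpe": float("nan"), "win_rate": 1.4, "profit_factor": float("inf"),
          "profit_factor_is_infinite": False}

print(find_anomalies(clean))
print(find_anomalies(broken))
```

A caller would quarantine any result with a non-empty list before the fitness score is ever consulted.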

Reading a result via the API

You can pull the raw metrics directly. Submit a run and read the result:

# Submit a backtest
curl.exe -X POST http://127.0.0.1:8003/api/backtesting/run `
  -H "x-api-key: $env:FORVEN_API_KEY" `
  -H "content-type: application/json" `
  -d '{ "strategy_type": "rsi_momentum", "asset": "BTC", "timeframe": "1h" }'

# Retrieve the full ledger + metrics
curl.exe http://127.0.0.1:8003/api/backtesting/results/<result_id> `
  -H "x-api-key: $env:FORVEN_API_KEY"

The response carries the in_sample, out_of_sample, and robustness blocks described above. In the UI, the same data renders on the /backtest/{id} page as a chart, trades table, and metrics panel, with regime shadings on the chart.
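The same two calls can be made from Python with only the standard library. The endpoints, the x-api-key header, and the three block names come from this page. The helper names are hypothetical, and the field that carries the new result's ID in the submit response is not documented here, so the sketch leaves that lookup to the caller.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://127.0.0.1:8003"


def build_run_request(api_key: str, payload: Dict[str, Any],
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """POST /api/backtesting/run with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/api/backtesting/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "content-type": "application/json"},
        method="POST",
    )


def build_result_request(api_key: str, result_id: str,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """GET /api/backtesting/results/<result_id>."""
    return urllib.request.Request(
        url=f"{base_url}/api/backtesting/results/{result_id}",
        headers={"x-api-key": api_key},
        method="GET",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def out_of_sample(result: Dict[str, Any]) -> Dict[str, Any]:
    """Start with the block the gates read."""
    return result["out_of_sample"]


# Usage (requires a running Forven instance and FORVEN_API_KEY in the environment):
#   key = os.environ["FORVEN_API_KEY"]
#   submitted = send(build_run_request(key, {"strategy_type": "rsi_momentum",
#                                            "asset": "BTC", "timeframe": "1h"}))
#   result = send(build_result_request(key, "<result_id>"))
#   print(out_of_sample(result))
```

Separating request construction from sending keeps the network call in one place, so the request shape can be checked without a server running.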

Caveats

  • Trust OOS, not IS. In-sample metrics are tuning artifacts. The gates only read out-of-sample, and so should you.
  • Honor the flags. A great Sharpe under 20 trades or a four-digit annualized return on a three-week window is noise, not edge.
  • Infinite profit factor is real. A strategy with zero losses reports profit_factor = inf plus profit_factor_is_infinite. It usually means too few trades, not a perfect strategy.
  • Costs are baked in. Fees, slippage, and funding are already in the metrics; don't double-count them.
  • Past is not prologue. Every number here describes historical behaviour on historical data. It is illustrative, not predictive, and nothing here is financial advice.