Metrics

Every metric Forven computes on a backtest, what it means, the reliability flags that suppress false confidence, and why out-of-sample numbers are the ones to trust.

Every backtest Forven runs produces a block of metrics describing how a strategy behaved on historical data. This page explains each metric, the flags that mark a metric untrustworthy, and the one distinction that matters most: in-sample versus out-of-sample.

The short version: each backtest's data is split into a training window and an unseen window. Metrics computed on the unseen window are the ones the promotion gates read, and the ones you should read too.

Forven is a research tool. The numbers on these pages describe past behaviour on historical data. They are illustrative, they are not predictive, and nothing here is financial advice.

In-sample vs out-of-sample

Forven splits every backtest's data 70/30 at a fixed index:

In-sample (IS) — the first 70% of the data. Signals are generated and tuned here. IS Sharpe is optimistic and does not predict live behaviour.
Out-of-sample (OOS) — the last 30%, unseen during signal generation. OOS metrics are the ground truth for fitness gates and any deployment decision.
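As a rough illustration of a fixed-index 70/30 split, the sketch below cuts a chronologically ordered series once, with no shuffling. The function name `split_is_oos` and the rounding rule are assumptions made for this example, not Forven's actual code.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_is_oos(bars: Sequence[T], train_ratio: float = 0.7) -> Tuple[Sequence[T], Sequence[T]]:
    """Split chronologically ordered bars at one fixed index (hypothetical helper)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    # Rounding to the nearest bar is an assumption; the page only says "fixed index".
    cut = int(round(len(bars) * train_ratio))
    return bars[:cut], bars[cut:]


bars = list(range(1000))  # stand-in for 1000 candles, oldest first
in_sample, out_of_sample = split_is_oos(bars)
print(len(in_sample), len(out_of_sample))
```

Because the cut is positional, the OOS window is always the most recent data, which is what keeps it unseen during signal generation.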

The lab renders IS and OOS side by side. A strategy that looks excellent in-sample but poor out-of-sample is likely overfit. A strategy that holds up across both is worth promoting to paper (the public site calls this a "candidate").

A backtest result returns three blocks: in_sample, out_of_sample, and robustness. When you read a result, start with out_of_sample.

The core metrics

compute_metrics() derives 30+ values from each trade ledger. The ones you will read most often:

Metric	What it measures	Read it as
Sharpe (annualized)	Risk-adjusted return, scaled to a year	Higher is steadier; treat with care on short windows (see flags)
Sortino	Like Sharpe but penalizes only downside volatility	Higher is better; upside volatility is not penalized, so large uneven gains do not lower it the way they lower Sharpe
Max drawdown (%)	Largest peak-to-trough equity decline	Lower is calmer; the pain you would have had to sit through
Profit factor	Gross profit ÷ gross loss	Above 1.0 means wins outweighed losses (see infinite-PF caveat below)
Win rate	Fraction of trades that closed positive	Pairs with payoff; a low win rate can still be fine
Payoff / avg win vs avg loss	Average winner size relative to average loser	High payoff offsets a low win rate
Monthly / annualized return	Return scaled to a month or year	Illustrative only; suppressed on short windows
Avg bars held	Mean trade duration in bars	Tells you the holding horizon the metrics describe
Gross profit / gross loss	Raw totals before netting	The components behind profit factor

These are computed identically for the IS and OOS blocks, so you can compare the same strategy across the train/test split directly.
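To make a few of the table rows concrete, here is a minimal sketch that derives profit factor, win rate, payoff, Sharpe, and max drawdown from a list of per-trade fractional returns. It is not `compute_metrics()` itself: the function name `basic_metrics`, the input shape, the zero risk-free rate, and the compounding used for the equity curve are all assumptions for illustration.

```python
import math
import statistics
from typing import Dict, List


def basic_metrics(trade_returns: List[float], trades_per_year: float) -> Dict[str, float]:
    """Summarize per-trade fractional returns (0.01 == +1%). Hypothetical helper."""
    if len(trade_returns) < 2:
        raise ValueError("need at least two trades")
    wins = [r for r in trade_returns if r > 0]
    losses = [r for r in trade_returns if r < 0]
    gross_profit = sum(wins)
    gross_loss = -sum(losses)  # reported as a positive number

    # Wins with zero losses yields an infinite profit factor.
    if gross_loss > 0:
        profit_factor = gross_profit / gross_loss
    else:
        profit_factor = math.inf if gross_profit > 0 else 0.0

    avg_win = gross_profit / len(wins) if wins else 0.0
    avg_loss = gross_loss / len(losses) if losses else 0.0
    if avg_loss > 0:
        payoff = avg_win / avg_loss
    else:
        payoff = math.inf if avg_win > 0 else 0.0

    # Sharpe with a zero risk-free rate (assumption), annualized by sqrt(trades_per_year).
    spread = statistics.stdev(trade_returns)
    sharpe = statistics.mean(trade_returns) / spread * math.sqrt(trades_per_year) if spread > 0 else 0.0

    # Max drawdown on a compounded equity curve that starts at 1.0 (assumption).
    equity = peak = 1.0
    max_dd = 0.0
    for r in trade_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = max(max_dd, (peak - equity) / peak)

    return {
        "profit_factor": profit_factor,
        "win_rate": len(wins) / len(trade_returns),
        "payoff": payoff,
        "sharpe": sharpe,
        "max_drawdown_pct": max_dd * 100.0,
    }


m = basic_metrics([0.02, -0.01, 0.03, -0.02, 0.01], trades_per_year=250.0)
print(m)
```

Running the same function over the IS trades and the OOS trades separately mirrors the side-by-side comparison described above.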

Breakdowns

Two slices help you see where performance came from:

by_side — long vs short. A strategy that only makes money on one side is really half a strategy.
by_regime — performance split across the four market regimes (trend_up, trend_down, range_bound, high_vol). This shows whether an edge is regime-specific or broad-based.
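Both breakdowns amount to grouping the trade ledger by one field and summarizing each group. The sketch below assumes each trade is a mapping with `side`, `regime`, and `pnl` keys; those field names and the `breakdown` helper are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def breakdown(trades: Iterable[Mapping[str, Any]], key: str) -> Dict[str, Dict[str, float]]:
    """Group trades by `key` and summarize each group (hypothetical helper)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for trade in trades:
        groups[str(trade[key])].append(float(trade["pnl"]))
    return {
        name: {
            "trades": len(pnls),
            "net_pnl": sum(pnls),
            "win_rate": sum(1 for p in pnls if p > 0) / len(pnls),
        }
        for name, pnls in groups.items()
    }


trades = [
    {"side": "long", "regime": "trend_up", "pnl": 120.0},
    {"side": "long", "regime": "range_bound", "pnl": -40.0},
    {"side": "short", "regime": "trend_down", "pnl": 30.0},
    {"side": "short", "regime": "high_vol", "pnl": -10.0},
    {"side": "long", "regime": "trend_up", "pnl": 60.0},
]
by_side = breakdown(trades, "side")
by_regime = breakdown(trades, "regime")
print(by_side)
print(by_regime)
```

In this toy ledger most of the profit comes from long trades in `trend_up`, which is the kind of concentration the two slices are meant to expose.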

Robustness and walk-forward

A single 70/30 split is one test. The gauntlet goes further with walk-forward analysis (WFA) — N-fold cross-validation that re-runs the IS/OOS split across multiple windows.

Two robustness figures matter:

Robustness score — defined as 1.0 - max(IS→OOS Sharpe degradation, 0). A high score means performance held up from training to test; a low score means the edge decayed out-of-sample.
Degradation % — 1 - (avg_oos_sharpe / avg_is_sharpe) aggregated across folds. The WFA verdict is PASS when degradation is below 50% and there are at least 5 OOS trades. Degradation above 50% is the signature of overfitting.
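The two formulas above can be sketched as follows. This assumes degradation is expressed as a fraction (0.3 means 30%), that the robustness score uses the same degradation figure, and that the 5-trade minimum applies to the total OOS trade count; the function names are hypothetical, and the page does not say how a non-positive IS Sharpe is handled, so the sketch simply refuses it.

```python
from statistics import mean
from typing import List, Tuple


def degradation(folds: List[Tuple[float, float]]) -> float:
    """1 - (avg_oos_sharpe / avg_is_sharpe) over (is_sharpe, oos_sharpe) pairs."""
    avg_is = mean(is_sharpe for is_sharpe, _ in folds)
    avg_oos = mean(oos_sharpe for _, oos_sharpe in folds)
    if avg_is <= 0:
        # Not specified by the page; raising is this sketch's choice.
        raise ValueError("average IS Sharpe must be positive")
    return 1.0 - (avg_oos / avg_is)


def robustness_score(deg: float) -> float:
    """1.0 - max(degradation, 0): OOS beating IS does not push the score above 1."""
    return 1.0 - max(deg, 0.0)


def wfa_passes(deg: float, oos_trades: int) -> bool:
    """PASS rule from the text: degradation below 50% and at least 5 OOS trades."""
    return deg < 0.5 and oos_trades >= 5


folds = [(2.0, 1.5), (1.8, 1.2), (2.2, 1.5)]  # made-up Sharpe pairs
deg = degradation(folds)
score = robustness_score(deg)
print(round(deg, 3), round(score, 3), wfa_passes(deg, oos_trades=12))
```

With these made-up folds the average Sharpe falls from 2.0 to 1.4, a degradation of about 30%, so the run passes as long as the trade minimum is met.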

Defaults: 5 folds, 70% IS / 30% OOS per fold (walkforward_folds, walkforward_train_ratio). WFA needs at least 420 bars of data; each fold needs enough OOS bars and warmup to be meaningful. Adequacy warnings are logged when an OOS window is shorter than 30 days or a fold has too few bars — the numbers still compute, but treat them as weak.

Reliability flags — read these first

Short windows produce metrics that look spectacular and mean nothing. Forven attaches boolean flags so the gates and the UI can suppress numbers that aren't yet trustworthy.

Flag	True when	Effect
`sharpe_is_reliable`	At least 20 trades	Sharpe is suppressed in displays/gates when false
`annualized_return_reliable`	At least 3 months of data	Annualized return suppressed when false
`profit_factor_is_infinite`	A strategy had wins and zero losses	PF returned as infinity; downstream must handle it explicitly
`funding_applied`	Any trade had funding costs applied	Provenance — were funding costs in the picture at all
`funding_complete`	All trades had complete funding data	Gates may reject a backtest run in a funding-blind window

Two reasons these exist:

Sharpe annualization scales by sqrt(trades_per_year). On a short run the trades-per-year estimate is extrapolated from a handful of trades, so the annualized figure can inflate wildly. Under 20 trades, the figure is flagged unreliable.
Annualized return is displayed only when the run covers at least 3 months of data. A 25-day run can report a return in the thousands of percent; that is an artifact of extrapolation, not a result. The flag suppresses the display; it does not change the underlying fitness math.

When sharpe_is_reliable or annualized_return_reliable is false, the corresponding number is hidden rather than shown with a false air of confidence. That is deliberate: transparency over confidence.
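A minimal sketch of how the first three flags could be derived, and why a short window produces an absurd annualized figure. The thresholds come from the table above; treating "3 months" as 90 days, the geometric annualization formula, and the helper names are assumptions for illustration only.

```python
from typing import Dict, Optional

MIN_TRADES_FOR_SHARPE = 20     # from the flag table
MIN_DAYS_FOR_ANNUALIZED = 90   # "3 months" approximated as 90 days (assumption)


def reliability_flags(trade_count: int, window_days: float,
                      gross_profit: float, gross_loss: float) -> Dict[str, bool]:
    """Hypothetical helper mirroring the flag definitions in the table."""
    return {
        "sharpe_is_reliable": trade_count >= MIN_TRADES_FOR_SHARPE,
        "annualized_return_reliable": window_days >= MIN_DAYS_FOR_ANNUALIZED,
        "profit_factor_is_infinite": gross_profit > 0 and gross_loss == 0,
    }


def annualize(period_return: float, window_days: float) -> float:
    """Geometric annualization (assumption), shown only to illustrate the artifact."""
    return (1.0 + period_return) ** (365.0 / window_days) - 1.0


def for_display(value: float, reliable: bool) -> Optional[float]:
    """Hide the number instead of showing it when its flag is false."""
    return value if reliable else None


flags = reliability_flags(trade_count=12, window_days=25, gross_profit=400.0, gross_loss=0.0)
raw = annualize(0.40, 25)  # a 40% gain over 25 days, compounded to a year
shown = for_display(raw, flags["annualized_return_reliable"])
print(flags)
print(f"raw annualized: {raw:.0%}, displayed: {shown}")
```

The raw figure still exists for the fitness math; only the displayed value is withheld.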

Funding-cost provenance

Forven applies HyperLiquid funding costs to backtest trades by default (backtest_include_funding). Because a funding-blind backtest can look better than reality, two flags record what actually happened:

funding_applied — at least one trade carried a funding cost.
funding_complete — every trade had complete funding history.

If funding_complete is false, the promotion gates may reject the run rather than promote a strategy validated in a window with missing funding data. On a fresh install the first backtest auto-backfills funding history from the exchange, so this self-heals over time.

How fees, slippage, and leverage enter the numbers

Metrics already include trading costs. The default cost and leverage settings are:

Setting	Config key	Default
Fee	`backtest_fee_bps`	4.5 bps
Slippage	`backtest_slippage_bps`	2.0 bps
Leverage	(per-backtest)	3x

Fees and slippage are applied as a round-trip cost (entry + exit). This means a strategy's metrics reflect a return after costs — there is no separate "gross vs net" toggle to forget.
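The sketch below shows one way a round-trip cost could enter a single trade's return. It rests on assumptions the page does not confirm: that the basis-point defaults are charged once per side (entry and exit), and that leverage multiplies the after-cost move. Both function names are hypothetical.

```python
BPS = 1e-4  # one basis point as a fraction


def round_trip_cost(fee_bps: float = 4.5, slippage_bps: float = 2.0, sides: int = 2) -> float:
    """Fractional cost of entering and exiting once.

    Assumption: the bps values are per side, so a round trip pays them twice.
    Pass sides=1 if the configured values already describe the full round trip.
    """
    return sides * (fee_bps + slippage_bps) * BPS


def net_trade_return(gross_move: float, leverage: float = 3.0,
                     fee_bps: float = 4.5, slippage_bps: float = 2.0) -> float:
    """Net return on margin for one trade.

    Assumption: leverage scales the after-cost price move.
    """
    return leverage * (gross_move - round_trip_cost(fee_bps, slippage_bps))


cost = round_trip_cost()
net = net_trade_return(0.01)  # a 1% favourable price move at the defaults
print(f"round-trip cost: {cost:.4%}, net return: {net:.4%}")
```

Under these assumptions a 1% move nets roughly 2.6% on margin rather than 3%, which is why small-edge strategies look much weaker once costs are included.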

How the gates read metrics

The promotion gates compute a single fitness score (0–100) from five weighted factors, all read from the OOS metrics:

fitness = 30% Sharpe (capped at 3.0)
        + 20% win rate
        + 20% profit factor (capped at 5.0)
        + 15% max drawdown (penalty: 0 at 10%+ drawdown)
        + 15% trade-count bonus (min 20 trades)

The thresholds that move a strategy along the pipeline:

>= 60 — eligible for paper (the public "candidate" stage)
>= 70 — eligible for deploy / live
< 40 — eligible for retirement
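To show how the weights, caps, and thresholds fit together, here is a hedged sketch. The weights, caps, and cut-offs are taken from the text; how each factor is normalized to a 0 to 1 range (linear scaling, a linear drawdown penalty, a linear ramp up to 20 trades) is an assumption, as are the function names and the "hold" label for scores between 40 and 60.

```python
import math


def fitness(sharpe: float, win_rate: float, profit_factor: float,
            max_drawdown_pct: float, trade_count: int) -> float:
    """Weighted 0-100 score from OOS metrics. Normalization choices are assumptions."""
    sharpe_part = min(max(sharpe, 0.0), 3.0) / 3.0          # capped at 3.0
    win_part = min(max(win_rate, 0.0), 1.0)
    pf_part = min(max(profit_factor, 0.0), 5.0) / 5.0       # capped at 5.0; inf caps too
    dd_part = max(0.0, 1.0 - max_drawdown_pct / 10.0)       # 0 at 10%+ drawdown
    count_part = min(trade_count / 20.0, 1.0)               # full bonus at 20 trades
    return 100.0 * (0.30 * sharpe_part + 0.20 * win_part + 0.20 * pf_part
                    + 0.15 * dd_part + 0.15 * count_part)


def stage(score: float) -> str:
    """Map a score to the pipeline thresholds listed above."""
    if score >= 70:
        return "deploy"
    if score >= 60:
        return "paper"
    if score < 40:
        return "retire"
    return "hold"  # hypothetical label; the page does not name this band


score = fitness(sharpe=1.5, win_rate=0.55, profit_factor=2.0,
                max_drawdown_pct=4.0, trade_count=40)
print(round(score, 1), stage(score))
```

Note that an infinite profit factor simply hits the 5.0 cap here, so it cannot dominate the score on its own.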

A metrics integrity check (check_metrics_integrity()) runs before any promotion. Impossible ratios, NaN, and other anomalies quarantine the result and block promotion — a strategy cannot be promoted on numbers that don't add up.

Reading a result via the API

You can pull the raw metrics directly. Submit a run and read the result:

# Submit a backtest
curl.exe -X POST http://127.0.0.1:8003/api/backtesting/run `
  -H "x-api-key: $env:FORVEN_API_KEY" `
  -H "content-type: application/json" `
  -d '{ "strategy_type": "rsi_momentum", "asset": "BTC", "timeframe": "1h" }'

# Retrieve the full ledger + metrics
curl.exe http://127.0.0.1:8003/api/backtesting/results/<result_id> `
  -H "x-api-key: $env:FORVEN_API_KEY"

The response carries the in_sample, out_of_sample, and robustness blocks described above. In the UI, the same data renders on the /backtest/{id} page as a chart, trades table, and metrics panel, with regime shadings on the chart.
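The same retrieval can be sketched in Python with only the standard library. The endpoint path and `x-api-key` header are taken from the commands above; the assumption that the three blocks sit at the top level of the JSON body, and the helper names, are illustrative only.

```python
import json
import urllib.request
from typing import Any, Dict, Tuple

BASE_URL = "http://127.0.0.1:8003"  # local address used in the commands above


def fetch_result(result_id: str, api_key: str) -> Dict[str, Any]:
    """GET one backtest result and parse the JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/backtesting/results/{result_id}",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def split_blocks(result: Dict[str, Any]) -> Tuple[Any, Any, Any]:
    """Return (out_of_sample, in_sample, robustness), OOS first on purpose.

    Assumption: the three blocks are top-level keys of the response.
    """
    names = ("out_of_sample", "in_sample", "robustness")
    missing = [name for name in names if name not in result]
    if missing:
        raise KeyError(f"result is missing blocks: {missing}")
    return result["out_of_sample"], result["in_sample"], result["robustness"]


# Offline example with a made-up payload; no server is contacted here.
sample = {"in_sample": {"trades": 80}, "out_of_sample": {"trades": 31}, "robustness": {}}
oos, ins, rob = split_blocks(sample)
print("OOS block:", oos)
```

Against a running instance, call `fetch_result` with the id returned by the run request and the key from the `FORVEN_API_KEY` environment variable, then pass the parsed body to `split_blocks`.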

Caveats

Trust OOS, not IS. In-sample metrics are tuning artifacts. The gates only read out-of-sample, and so should you.
Honor the flags. A great Sharpe under 20 trades or a four-digit annualized return on a three-week window is noise, not edge.
Infinite profit factor is real. A strategy with zero losses reports profit_factor = inf plus profit_factor_is_infinite. It usually means too few trades, not a perfect strategy.
Costs are baked in. Fees, slippage, and funding are already in the metrics; don't double-count them.
Past is not prologue. Every number here describes historical behaviour on historical data. It is illustrative, not predictive, and nothing here is financial advice.

The gauntlet — the full robustness battery these metrics feed
Backtesting a strategy — where these metrics are produced
Promotion gates — how fitness and reliability flags gate the pipeline
Market regimes — the regimes behind the by_regime breakdown
