Towards Financial World Modeling

A key input to decision making is a world model: a model which can be used to explore counter-factual outcomes and optimize ex ante decisions.

Markets are noisy, and that makes them a hard test for representation learning. We ask a direct question: can supervised and self-supervised methods that reshaped vision and language learn a useful world model of financial markets?

This is a first step, well short of that goal: a controlled test of whether modern representation learning recovers any useful structure from market data at all.

1. The Market-1T Dataset

~1T one-second aggregates, all US equities, 2008–2025.

Market-1T spans all equities traded on US markets from 2008 through 2025. Raw quotes and trades are aggregated into one-second intervals, roughly one trillion observations. The base data is mid-frequency; the same aggregations extend to minute, hourly, or daily scales.

Each one-second snapshot carries 11 features, bucketed into three groups:

Order book state: best bid/ask prices and sizes
Price summary: open, high, low, close, and VWAP
Activity: volume and trade count
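
As an illustration of the three feature groups, here is a minimal sketch of how one one-second snapshot could be assembled. The `Trade` and `Quote` record types and the field names are hypothetical; the source does not describe its raw schema or how seconds without trades are handled, so the sketch simply returns `None` for them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trade:  # hypothetical raw trade record
    price: float
    size: int


@dataclass
class Quote:  # hypothetical top-of-book record
    bid: float
    ask: float
    bid_size: int
    ask_size: int


def aggregate_second(trades: List[Trade], last_quote: Quote) -> Optional[dict]:
    """Collapse one second of time-ordered trades and the last quote into 11 features."""
    if not trades:
        return None  # empty-second handling is not specified in the source
    prices = [t.price for t in trades]
    volume = sum(t.size for t in trades)
    return {
        # order book state: last quote seen in the interval (assumption)
        "bid": last_quote.bid,
        "ask": last_quote.ask,
        "bid_size": last_quote.bid_size,
        "ask_size": last_quote.ask_size,
        # price summary
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
        "vwap": sum(t.price * t.size for t in trades) / volume,
        # activity
        "volume": volume,
        "trade_count": len(trades),
    }
```

The same function applied to longer intervals would give the minute, hourly, or daily scales mentioned above.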

Experiments focus on a liquid universe, recalculated monthly from prior-month activity (price > $5, average daily dollar volume (ADV) ≥ $20M).
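
The monthly filter can be sketched as below. This assumes "price" means the prior month's mean close and the volume threshold is applied to mean daily dollar volume; the source gives only the two thresholds, and the function name and input layout are hypothetical.

```python
from statistics import mean


def liquid_universe(prior_month, min_price=5.0, min_adv=20e6):
    """Select tickers for the coming month using only prior-month data.

    prior_month maps ticker -> list of (close_price, dollar_volume), one per day.
    """
    selected = []
    for ticker, days in prior_month.items():
        if not days:
            continue  # no activity last month, so not eligible
        avg_price = mean(close for close, _ in days)
        adv = mean(dollar_volume for _, dollar_volume in days)
        if avg_price > min_price and adv >= min_adv:
            selected.append(ticker)
    return sorted(selected)
```

Because the filter reads only the previous month, membership for a given month is fixed before any of that month's data is seen.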

2. Methodology

What we predict, how we crop a trading day, and how LeJEPA learns without labels.

2.1 Tasks

The core tasks come from portfolio choice. To pick weights, an investor needs three forecasts: expected return, risk (volatility and covariances), and trading cost. We train models to predict each one (forward return, change in volatility, and change in bid-ask spread) at horizons from 5 minutes to 2 hours. These forecasts are not a trading strategy; they are the inputs a strategy would consume.

Return (primary evaluation): 5-way classification of 15-minute forward mid-price returns, one bin for zeros, the rest empirical quantiles.
Volatility change: how much conditional volatility moves, the risk term in the portfolio problem.
Spread change: how the bid-ask spread moves, a standard proxy for transaction cost.
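
The return target can be sketched as follows. Several details here are assumptions not stated in the source: a simple (not log) return, quartile edges fitted on non-zero training returns only, and a class numbering that puts the zero bin at 0.

```python
from bisect import bisect_right
from statistics import quantiles


def forward_return(mids, t, horizon=900):
    """Simple mid-price return from second t to t + horizon (15 min = 900 s)."""
    return mids[t + horizon] / mids[t] - 1.0


def fit_edges(train_returns):
    """Three quartile cut points, computed on non-zero training returns only."""
    nonzero = [r for r in train_returns if r != 0.0]
    return quantiles(nonzero, n=4)


def label(r, edges):
    """Class 0 is the zero-return bin; classes 1..4 are the quantile bins."""
    if r == 0.0:
        return 0
    return 1 + bisect_right(edges, r)
```

Fitting the edges on training data and reusing them at evaluation time keeps the label definition itself free of look-ahead.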

Supervised models train on these labels directly, one task at a time or all three at once. SSL models see no labels; we read their representations afterward with linear probes on the same targets. We also probe for things finance does not usually label, namely ticker, time of day, and trading day within the month (plus a separate test on SEC-documented manipulation windows). These are nearly uncorrelated with return, volatility, and spread, so a supervised model has no reason to learn them. If a representation predicts them anyway, it is carrying structure beyond the task it was trained for.

2.2 Scale and crop

Vision SSL leans on strong augmentation, and the workhorse is random scaling and cropping. We port it to markets. Sample a scale s and a random window of the asset-day, then resize to a fixed token length L. The catch is the resize. Instead of interpolating pixels, we re-aggregate the way markets already do: take the last order book state, recompute OHLC and VWAP, and sum volume and trade count.

Global views cover broad intraday structure; local views zoom in on shorter horizons. They are sampled independently and need not overlap, so the encoder has to recognize the same day across different windows, resolutions, and sampling rates. We use two global views at L = 2048 with s ∈ [0.5, 1.0], and local views at L = 512 with s ∈ [0.05, 0.5]. This augmentation is the default for LeJEPA, DINO, and BYOL.
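
A minimal sketch of the scale-and-crop step, assuming an asset-day is a list of one-second bars stored as dicts. The key names, the near-equal chunking, and the fallback VWAP for zero-volume chunks are illustrative choices, not details taken from the source.

```python
import random

BOOK_KEYS = ("bid", "ask", "bid_size", "ask_size")


def reaggregate(chunk):
    """Merge consecutive bars into one bar using market-style aggregation."""
    last = chunk[-1]
    volume = sum(b["volume"] for b in chunk)
    notional = sum(b["vwap"] * b["volume"] for b in chunk)
    bar = {k: last[k] for k in BOOK_KEYS}  # last order book state
    bar.update(
        open=chunk[0]["open"],
        high=max(b["high"] for b in chunk),
        low=min(b["low"] for b in chunk),
        close=last["close"],
        # fallback for zero-volume chunks is an assumption
        vwap=notional / volume if volume else last["close"],
        volume=volume,
        trade_count=sum(b["trade_count"] for b in chunk),
    )
    return bar


def scale_and_crop(day, length, scale_range, rng):
    """Sample a scale and a random window, then resize to `length` tokens."""
    n = len(day)
    if n < length:
        raise ValueError("asset-day is shorter than the target token length")
    s = rng.uniform(*scale_range)
    width = max(length, min(n, round(s * n)))
    start = rng.randrange(n - width + 1)
    window = day[start:start + width]
    # split the window into `length` contiguous, non-empty chunks
    edges = [i * width // length for i in range(length + 1)]
    return [reaggregate(window[a:b]) for a, b in zip(edges, edges[1:])]
```

With the settings above, a global view would be `scale_and_crop(day, 2048, (0.5, 1.0), rng)` and a local view `scale_and_crop(day, 512, (0.05, 0.5), rng)`, each drawn with an independent random window.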

2.3 How LeJEPA works

LeJEPA is our main self-supervised baseline. It learns in representation space: rather than reconstruct raw prices, it predicts the latent of one view of an asset-day from another. The crop makes the views, the encoder embeds each one, and a predictor maps between them, with no return or volatility labels anywhere.

Asset-day (1s aggregates · 11 channels) → Scale & crop (2 global + 6 local views) → Encoder (ViT, default) → Predictor (latent → latent)

No labels: the target is another view of the same day, not a return or volatility label.

SIGReg (λ = 0.01): spreads the embeddings out so the predictor can't cheat by collapsing everything to one point.

Prediction alone has a trivial solution: map everything to the same point. SIGReg (default λ = 0.01) blocks that collapse by keeping the embedding distribution spread out. The default run is a ViT backbone, learning rate 5×10⁻⁴, six local views, batch size 128. We then freeze the backbone and read it with linear probes on the tasks above, alongside supervised models that did see the labels. On the core tasks the gap is widest at the supervised head and narrows once both are read the same way, by a probe.
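
To make the collapse argument concrete, here is a toy sketch of a prediction loss combined with a spread penalty. The penalty is a crude moment-matching stand-in (zero mean, unit variance per dimension), not SIGReg itself, and the additive weighting `pred + lam * penalty` is an assumption; the source states only the value of λ.

```python
from statistics import fmean, pvariance


def prediction_loss(pred, target):
    """Mean squared error between predicted and target latents (lists of vectors)."""
    return fmean(
        sum((p - t) ** 2 for p, t in zip(pv, tv)) / len(pv)
        for pv, tv in zip(pred, target)
    )


def spread_penalty(emb):
    """Stand-in regularizer: each dimension should have mean 0 and variance 1."""
    dims = list(zip(*emb))  # transpose batch -> per-dimension columns
    return fmean(fmean(d) ** 2 + (pvariance(d) - 1.0) ** 2 for d in dims)


def objective(pred, target, emb, lam=0.01):
    """Toy objective: prediction term plus weighted spread penalty."""
    return prediction_loss(pred, target) + lam * spread_penalty(emb)


# A collapsed encoder predicts perfectly but pays the penalty.
collapsed = [[0.5, 0.5]] * 4
spread = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
collapsed_cost = objective(collapsed, collapsed, collapsed)
spread_cost = objective(spread, spread, spread)
```

In this toy case the collapsed embeddings have zero prediction loss, so only the regularizer separates them from the spread-out embeddings.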

3. Evaluation

Strict temporal splits keep the future out; ΔAUC over a random-init baseline measures what training adds.

We adopt a strict temporal evaluation scheme. All hyperparameter selection and training happen on data prior to calendar time t; final models are evaluated strictly after t. Interleaving training and evaluation days introduces look-ahead bias and is disallowed.

Performance is reported as ΔAUC = AUC(trained) − AUC(random-init), with the random baseline averaged over 10 initializations. The baseline matters because even a probe on random projections can beat chance.
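
The metric can be sketched as below for a binary target. How AUC is extended to the 5-way return task (for example one-vs-rest averaging) is not stated in the source, so the binary case is shown only to illustrate the subtraction; the function names are hypothetical.

```python
from statistics import fmean


def auc(scores, labels):
    """Binary ROC AUC by pairwise comparison; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def delta_auc(trained_scores, random_init_runs, labels):
    """AUC of the trained model minus the mean AUC over random-init runs."""
    baseline = fmean(auc(run, labels) for run in random_init_runs)
    return auc(trained_scores, labels) - baseline
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch but would be replaced by a rank-based computation at scale.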

Rolling multi-regime evaluation samples 32 training months and evaluates on the immediately following month. The primary task is 5-way classification of 15-minute forward mid-price returns, with one bin reserved for zero returns and the rest empirical quantile bins.
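
One way to generate such splits is sketched below, assuming training months are drawn uniformly without replacement and each is paired with the month that follows it. The sampling scheme and the function names are assumptions; the source gives only the count and the train-then-next-month rule.

```python
import random


def month_range(start, end):
    """All (year, month) pairs from start to end, inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def rolling_splits(start, end, k=32, seed=0):
    """Sample k training months, each paired with the following month."""
    months = month_range(start, end)
    rng = random.Random(seed)
    # the last month cannot be a training month: nothing follows it
    picks = sorted(rng.sample(range(len(months) - 1), k))
    return [(months[i], months[i + 1]) for i in picks]


splits = rolling_splits((2008, 1), (2025, 12))
```

Every evaluation month lies strictly after its training month, so no split interleaves training and evaluation days.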

4. Results

Supervised heads lead on the named targets; SSL lags there but remembers structure those targets leave out — ticker identity, calendar, and the response to news.

We compare two families. Supervised models learn from labels, predicting future return, volatility change, or spread change from a past window. LeJEPA and the other self-supervised methods learn from views, aligning representations of global and local crops of the same asset-day with no labels.

We evaluate LeJEPA, MAE, CPC, DINO, BYOL, and supervised baselines on the core tasks: return, volatility change, and spread change. Supervised still wins. Across every backbone, the supervised head beats its self-supervised counterpart, and the gap shrinks, but does not close, once we read both with a probe instead of the head.

The tasks also help each other. A model trained on volatility or spread change transfers onto return prediction, and a single backbone with one head per task beats the return-only baseline. The spillover runs one way: into return, not out of it.

ΔAUC per backbone: self-supervised vs supervised

15-minute return, ΔAUC over random init (×100); supervised head on the x-axis, LeJEPA (λ=0.01) on the y-axis.

Below the dashed parity line, the supervised model wins. Against its trained head, every backbone sits well below the diagonal; the gap narrows once both families are read the same way (by a linear probe). Numbers are Table 1; bars are ±1 SE.

Task	Stronger head	Note
15-min return	Supervised	5-way return bins
Volatility change	Supervised	spills onto return
Spread change	Supervised	spills onto return
Multi-task backbone	Supervised	lifts the return baseline

Figure 1. Per-task summary. Supervised heads win every core target; the other SSL methods produce nontrivial representations, but none closes the gap to LeJEPA or the supervised baselines on these targets.

4.1 What the model remembers

SSL trails on the core tasks but recovers ticker, time-of-day, and calendar structure the return label never asked for.

A world model should know more than the next price move. It should know where it is: which asset it is watching, what kind of market state it is seeing, where in the calendar it sits. The figure below is a real t-SNE of market-window embeddings, six stocks from three sectors, one global-view crop per ticker-day, colored by ticker.

Under LeJEPA, the points sort into tight per-ticker islands, structure the return label never asked for. Under Supervised, the same six stocks collapse into a far more undifferentiated cloud. The islands are per-ticker, not per-sector: the two semis, two banks, and two oil majors do not merge. That points to each stock's microstructure fingerprint, not its industry.

Figure. Two-panel t-SNE of market-window embeddings, colored by ticker. LeJEPA (left) forms six tight, well-separated per-ticker islands; the supervised return classifier (right) collapses to a single diffuse, undifferentiated cloud.

Figure. t-SNE grid for six Tech stocks across June 2022, March 2023, and October 2023, colored by ticker. LeJEPA (top row) forms per-ticker islands each month; the supervised return classifier (bottom row) stays a cloud.

4.2 When news hits the tape

8-K disclosures shift post-event volatility more than direction.

Prices are not the only thing that moves after news. Sometimes the bigger effect is uncertainty. We condition a frozen LeJEPA encoder on SEC 8-K filings and ask whether disclosure content improves predictions of the post-event market.

Pre-event (2,048 s) → 8-K → Post-event (2,048 s)

frozen LeJEPA encoder → event-conditioned predictor → post-event latent

Post-event target	ΔAUC	Significance
Volatility change	+0.034	p < 10⁻³
Signed return	+0.010	p > 0.18 (n.s.)

Disclosure content shifts conditional variance more than conditional mean. The contrast survives a scrambled-text control.

The supervised model learns the label. LeJEPA remembers the world around the label.

5. What happens when the market changes?

A model that works in one month is not a market model. 2008–2025 stress-tests that.

A market model that only works in one month is not a market model. The evaluation has to survive different regimes — which is why Market-1T spans 2008 through 2025.

Regime timeline: GFC (2008) · Recovery (2012) · Vol spikes (2018) · COVID (2020) · Rate shock (2022) · AI / 2020s (2024)

Marker fill encodes severity (filled = crisis, hollow = calm). Models are trained on month t and evaluated forward; performance is reported across the full series so no single favorable regime drives the result.

During the early-COVID shock, predictive ability shifts around the major turning points.

Technical detail

Models trained and probed on month t−12 are evaluated through early 2020, with one global view sampled per day and results aggregated to a weekly mean. Early 2019, by contrast, was a smooth, upward market. Predictive ability tends to change after pandemic turning points (WHO declaration, Fed QE, market drawdowns).

6. So, are we close to financial world models?

Supervised is better at the targets we name; SSL keeps the structure those targets leave out — within real limits.

Raw market data (quotes & trades) → Market-1T (1s aggregates) → Scale & crop (global + local views) → Encoder (ViT / ResNet / …)

Supervised: better at the targets finance already names (return, volatility, spread)

LeJEPA: better at hidden structure (ticker, time of day, trading day)

The result is not “self-supervision wins” or “supervised learning is enough.” It is more interesting: supervised models are better at the targets finance already knows how to name, while LeJEPA preserves structure those targets leave out, such as asset identity, calendar structure, and microstructure fingerprints.

Finance has good labels, but the market holds structure those labels do not name. Closing the gap on the core tasks while keeping that structure is the open problem.

Supervised models are better at answering the question we asked. LeJEPA is better at remembering the world it came from.

6.1 What this does not prove

High AUC is not profit. Single-asset only. Market representations can be misused.

Not a trading strategy. High AUC does not imply profit after execution, transaction costs, latency, sizing, and risk constraints.
Not a full market model. The study focuses on single-asset representations; cross-asset structure and portfolio interactions are unexplored.
Not risk-free to release. Market representations could be misused for manipulation or surveillance evasion. Current data is withheld.