ECCV 2026

Long-term Traffic Simulation via Structured Autoregressive Modeling

The University of Hong Kong

TL;DR

RosettaSim jointly models motion and agent generation in a unified framework, and introduces a fairer realism assessment for long-horizon simulation.

Joint motion + generation LLM Unified Long-term Evaluation

Highlights

Representative long-horizon rollouts. Focus on stable traffic flow, natural insertions, and interaction consistency.

■ Ego vehicle
■ Initial agents
■ Spawned agents

Abstract

Interactive traffic simulation is a vital world model for autonomous driving. A central challenge in long-horizon simulation is modeling sustained multi-agent interactions, which is further exacerbated by dynamic token cardinality as agents continuously enter and exit the scene. In this work, we propose that the solution lies in the synergy between the architectural inductive biases and statistical priors of large-scale sequence models, such as LLMs. Our probing experiments reveal that the transferability of attention mechanisms and the distributional consistency between motion tokens and natural language enable small-scale, heavily frozen LLMs to rapidly adapt to traffic modeling. Building on this insight, we introduce RosettaSim, a unified framework that projects scene topology, agent states, and spawning intents into a structured autoregressive stream with variable length, achieving both strong short-term accuracy and stable long-horizon simulation fidelity. We further introduce Retrieval-based Traffic Evaluation (RTE), which retrieves semantically similar real-world scenarios as context-aware reference anchors. On WOSAC, RosettaSim achieves state-of-the-art performance in both short- and long-term simulation, while RTE exhibits a stronger correlation with standard metrics (r = 0.83) than existing long-term evaluation approaches (r = 0.74).

RosettaSim

RosettaSim mascot

RosettaSim uses pretrained LLMs as a structural prior for long-range traffic modeling, then rolls out scenes with parallel motion updates and autoregressive agent generation.

Motivation

Why do LLM priors transfer to traffic?

Two empirical observations motivate RosettaSim: traffic motion tokens exhibit Zipf-like statistics similar to language, and even small frozen or heavily constrained LLMs adapt quickly to traffic motion modeling. Together, these results suggest that pretrained LLMs provide a useful sequence prior beyond language.

Zipf-like traffic token distribution from the paper
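One common way to check for Zipf-like statistics is to sort token counts by rank and fit a line in log-log space; a slope near -1 indicates classic Zipf behavior. The sketch below is illustrative only: the function name `zipf_slope` and the synthetic token stream are assumptions, not the tokenizer or data used in the paper.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct tokens, otherwise the variance is zero.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic stand-in for a motion-token stream (assumption):
# token k appears roughly 1000 / k times, i.e. an idealized Zipf law.
tokens = [k for k in range(1, 51) for _ in range(round(1000 / k))]
slope = zipf_slope(tokens)
print(f"log-log slope: {slope:.3f}")
```

On real motion tokens the fit would be noisier, so the slope is best read alongside the rank-frequency plot rather than as a standalone test.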

Method

Structured rollout with motion update and agent generation

We formulate long-term traffic simulation as structured sequence generation with two coupled processes: parallel motion generation for active agents and autoregressive agent generation for scene evolution. Both processes share one token stream, which keeps interaction reasoning and population dynamics aligned throughout the rollout.

RosettaSim pipeline
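The two coupled processes can be pictured as a loop that alternates a parallel motion step with a one-by-one spawning step. The toy below is a minimal sketch under heavy assumptions: a 1-D road, constant speeds, and a random stop decision stand in for the learned model, and every name (`Agent`, `motion_step`, `spawn_agents`, `rollout`) is hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    agent_id: int
    x: float      # position along a 1-D road (toy assumption)
    speed: float


def motion_step(agents, dt):
    # Parallel motion update: every active agent advances in the same step.
    return [Agent(a.agent_id, a.x + a.speed * dt, a.speed) for a in agents]


def spawn_agents(agents, next_id, rng, max_agents, spawn_prob):
    # Autoregressive generation: new agents are emitted one at a time until
    # a stop decision. Here the stop decision is random and capped by the
    # agent count; a learned model would condition on the full token stream.
    spawned = []
    while len(agents) + len(spawned) < max_agents and rng.random() < spawn_prob:
        spawned.append(Agent(next_id + len(spawned), 0.0, rng.uniform(5.0, 15.0)))
    return spawned


def rollout(initial, steps, dt=0.5, road_length=100.0,
            max_agents=8, spawn_prob=0.3, seed=0):
    rng = random.Random(seed)
    agents = list(initial)
    next_id = max((a.agent_id for a in agents), default=-1) + 1
    history = []
    for _ in range(steps):
        agents = motion_step(agents, dt)
        # Agents that leave the road exit the scene, so the count varies.
        agents = [a for a in agents if a.x <= road_length]
        new_agents = spawn_agents(agents, next_id, rng, max_agents, spawn_prob)
        next_id += len(new_agents)
        agents = agents + new_agents
        history.append([a.agent_id for a in agents])
    return history


history = rollout([Agent(0, 10.0, 12.0), Agent(1, 40.0, 8.0)], steps=60)
print(history[-1])
```

The point of the sketch is the control flow: the number of agents changes from step to step, so the sequence length is variable even though each motion update is applied to all agents at once.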

Retrieval-based Traffic Evaluation (RTE)

Long-horizon rollouts break one-to-one agent matching. RTE replaces context-blind global matching with semantically retrieved reference scenarios.

Problem

Why standard long-horizon metrics fail

After a few seconds, strict agent-level correspondence breaks down. Global histogram matching is also unfair because it forces local scenarios to resemble a dataset-wide average rather than a context-aware reference.

Why standard metrics fail

Mechanism

RTE retrieves scenario-specific anchors

RTE uses a pretrained scene-VAE latent space to retrieve semantically similar real-world scenarios, then evaluates rollouts against those retrieved anchors instead of against the whole validation distribution.

RTE pipeline
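The retrieval step can be sketched as a nearest-neighbor search over scenario latents. This is only an illustration: cosine similarity, the value of `k`, and the 2-D latents are assumptions, since the page does not specify the similarity measure or the latent dimensionality, and `retrieve_anchors` is a hypothetical name.

```python
import math


def cosine_similarity(u, v):
    # Assumes neither vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_anchors(query_latent, reference_latents, k):
    """Indices of the k reference scenarios most similar to the query."""
    ranked = sorted(
        range(len(reference_latents)),
        key=lambda i: cosine_similarity(query_latent, reference_latents[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical latents; a real scene encoder would produce far more dimensions.
reference_latents = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query_latent = [1.0, 0.0]
anchor_ids = retrieve_anchors(query_latent, reference_latents, k=2)
print(anchor_ids)  # -> [0, 1]
```

A rollout would then be scored against statistics pooled from the retrieved scenarios only, rather than from the whole validation set.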

Validation

RTE aligns better with realism

Across 63 rollouts, RTE correlates more strongly with standard realism metrics (r = 0.83) than prior log-based long-horizon evaluation (r = 0.74), suggesting that it tracks actual simulation fidelity more closely.

Correlation comparison for RTE
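For reference, the correlation reported here is of the kind computed below: a Pearson coefficient between per-rollout scores from a candidate metric and from a standard realism metric. The score lists are made-up toy values chosen to be perfectly linear, not results from the paper, and `pearson_r` is a hypothetical helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy per-rollout scores (assumption, for illustration only).
standard_scores = [0.60, 0.65, 0.70, 0.75, 0.80]
candidate_scores = [0.20, 0.30, 0.40, 0.50, 0.60]
r = pearson_r(standard_scores, candidate_scores)
print(f"r = {r:.2f}")
```

A higher r means the candidate long-horizon metric ranks rollouts more consistently with the standard metric.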

Simulation Results

Representative long-term rollouts from RosettaSim. Videos are shown at 6× speed for compact browsing.

How to read these results: focus on stable flow, natural insertions, and interaction consistency over extended horizons rather than short isolated maneuvers.

BibTeX

Coming soon.