Language

Runtime Model¶

The detailed runtime contract is in ADR-0001 Runtime Model And Layer Boundaries and Developer 指南. This page keeps the English summary close to the code paths.

Two Runtime Shapes¶

Synchronous PPO Paths¶

scripts/train_rsl_rl.py and scripts/train_mlx_ppo.py compose Hydra config, call registry bootstrap, construct the env through registry.make(...), and run the learner in the same process. The RSL-RL path adapts NpEnv through src/unilab/training/rsl_rl.py; the MLX path uses src/unilab/algos/mlx/ppo/runner.py and src/unilab/algos/mlx/ppo/ppo.py.

Async APPO And Off-Policy Paths¶

APPO and off-policy runners use a CPU-sim-to-learner split:

CPU physics env loop -> shared IPC buffer -> learner
        ^                                      |
        +------------- SharedWeightSync -------+

APPO uses APPORunner, RolloutRingBuffer, and SharedWeightSync.
SAC, TD3, and FlashSAC use off-policy runners with ReplayBuffer and SharedWeightSync.
AsyncRunner in src/unilab/ipc/async_runner.py owns collector process startup, stop signaling, and shared-resource cleanup.

Boundary Rules¶

The env remains numpy/vectorized and returns NpEnvState.
GPU tensors and optimizer state belong to learner code, not env code.
Collector/learner protocols must reuse the existing IPC primitives instead of creating ad-hoc parallel protocols in scripts.

Evidence In Repo¶

PPO entrypoints: scripts/train_rsl_rl.py, scripts/train_mlx_ppo.py
APPO runner: src/unilab/algos/torch/appo/runner.py
Off-policy runner: src/unilab/algos/torch/offpolicy/runner.py
IPC primitives: src/unilab/ipc/async_runner.py, src/unilab/ipc/rollout_ring_buffer.py, src/unilab/ipc/replay_buffer.py, src/unilab/ipc/weight_sync.py