APPO

APPO is UniLab’s asynchronous PPO path. It uses scripts/train_appo.py, conf/appo/config.yaml, and the runtime under src/unilab/algos/torch/appo/. The config exposes algo.steps_per_env, training.collector_device, and training.replay_queue_size; the algorithm config includes V-trace clipping fields.

Quick Start

uv run train --algo appo --task go2_joystick_flat --sim mujoco
uv run train --algo appo --task g1_motion_tracking --sim motrix training.no_play=true

Common Overrides

uv run train --algo appo --task go2_joystick_flat --sim mujoco \
  algo.num_envs=2048 \
  algo.max_iterations=300 \
  training.replay_queue_size=2

Playback and checkpoint selection use uv run eval:

uv run eval --algo appo --task go2_joystick_flat --sim mujoco --load-run -1

Runtime Model

  • The collector runs CPU simulation while the learner runs GPU training.

  • Rollouts are published into a replay queue that the learner consumes.

  • APPO applies a V-trace importance-sampling correction, so its update semantics differ from synchronous PPO.

  • The collector/learner pipeline is backed by a 4-slot ring buffer.

Key Fields

  • algo.steps_per_env: rollout length per environment.

  • training.replay_queue_size: learner-side cache depth.

  • training.collector_device: collector device; defaults to following the learner.

  • algo.save_interval: checkpoint save interval.

The default log root is logs/appo/<task>/, from algo.algo_log_name=appo in conf/appo/config.yaml.