unilab.algos.torch.offpolicy

Unified infrastructure for off-policy RL algorithms.

class unilab.algos.torch.offpolicy.OffPolicyLogger[source]

Bases: BaseTrainingLogger

Rich logger for off-policy RL algorithms (SAC, TD3, etc.).

__init__(algo_name='RL', max_iterations=1500, num_envs=4096, env_name='', obs_dim=0, action_dim=0, refresh_per_second=4, log_dir='', log_backend='tensorboard', wandb_project='unilab', wandb_entity=None, wandb_name='', wandb_group=None, wandb_job_type=None, wandb_tags=None, wandb_notes=None)[source]
start(*, status='Warming up...')[source]
Parameters:

status (str)

finish(*, title='Training Summary', extra_summary='')[source]
Parameters:
  • title (str)

  • extra_summary (str)

log_buffer_fill(current, target)[source]
update_collector_timing(timing_ms)[source]
Parameters:

timing_ms (dict[str, float])

update_done_rates(timeout_rate, terminated_rate)[source]
update_buffer_utilization(utilization)[source]
Parameters:

utilization (float)

update_replay_queue(current_len, max_size)[source]
Parameters:
  • current_len (int)

  • max_size (int)

update_staging_pool(current_len, max_size)[source]
Parameters:
  • current_len (int)

  • max_size (int)

set_collection_sync(enabled, env_steps_per_sync=0)[source]
Parameters:
  • enabled (bool)

  • env_steps_per_sync (int)

log_collector(total_steps, buffer_size, mean_reward=0.0)[source]
Parameters:
  • total_steps (int)

  • buffer_size (int)

  • mean_reward (float)

log_step(iteration, metrics=None, reward=None, reward_metrics=None, reward_components=None, train_time=0.0, wait_time=0.0, learner_incremental_h2d_time=0.0, weight_sync_time=0.0, extra_info=None)[source]
log_status(status)[source]
Parameters:

status (str)

class unilab.algos.torch.offpolicy.OffPolicyRunner[source]

Bases: AsyncRunner

Unified runner for SAC and TD3.

Parameters:
  • env_name (str)

  • algo_type (str)

  • num_envs (int)

  • replay_buffer_n (int)

  • batch_size (int)

  • learning_starts (int)

  • updates_per_step (int)

  • policy_frequency (int)

  • sync_collection (bool)

  • env_steps_per_sync (int)

  • device (str | None)

  • actor_hidden_dim (int)

  • use_layer_norm (bool)

  • obs_normalization (bool)

  • sim_backend (str)

  • env_cfg_override (dict | None)

  • actor_kwargs (dict | None)

  • seed (int | None)

  • trace_enabled (bool)

  • trace_output_dir (str | None)

  • trace_thread_time (bool)

  • trace_cuda_events (bool)

__init__(learner, env_name, algo_type, num_envs=4096, replay_buffer_n=1024, batch_size=8192, learning_starts=0, updates_per_step=8, policy_frequency=4, sync_collection=True, env_steps_per_sync=1, device=None, actor_hidden_dim=512, use_layer_norm=True, obs_normalization=False, sim_backend='mujoco', env_cfg_override=None, actor_kwargs=None, seed=None, trace_enabled=False, trace_output_dir=None, trace_thread_time=False, trace_cuda_events=True)[source]

learn(max_iterations=1500, save_interval=50, log_dir='logs', logger_type='tensorboard')[source]

Unified training loop for off-policy algorithms.

Parameters:
  • max_iterations (int)

  • save_interval (int)

  • log_dir (str)

  • logger_type (str)

Return type:

None
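
The defaults updates_per_step=8 and policy_frequency=4 suggest a TD3-style delayed actor update inside each training iteration. The sketch below is only an illustration under that assumption: it is not taken from the unilab source, and the helper update_schedule is hypothetical. It assumes the actor is updated once after every policy_frequency critic updates.

```python
def update_schedule(updates_per_step=8, policy_frequency=4):
    """List the updates performed in one learner iteration.

    ASSUMPTION (not confirmed by the unilab docs): the critic is updated
    `updates_per_step` times, and the actor once after every
    `policy_frequency`-th critic update, as in TD3-style delayed updates.
    """
    schedule = []
    for i in range(1, updates_per_step + 1):
        schedule.append("critic")
        if i % policy_frequency == 0:
            schedule.append("actor")
    return schedule


plan = update_schedule()
n_critic = plan.count("critic")  # 8 with the default arguments
n_actor = plan.count("actor")    # 2 with the default arguments
```

Under this reading, raising policy_frequency makes actor updates rarer relative to critic updates; the actual schedule used by OffPolicyRunner may differ.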

close()[source]
Return type:

None

class unilab.algos.torch.offpolicy.MultiGPUOffPolicyRunner[source]

Bases: OffPolicyRunner

Multi-GPU off-policy runner.

Keeps a single Collector on CPU and spawns num_gpus Learner workers via torch.multiprocessing.spawn. Each worker processes independent mini-batches from the same shared ReplayBuffer, and gradients are averaged with NCCL all_reduce. Each update is therefore equivalent to training on an effective batch num_gpus× larger in roughly the same wall-clock time.

Falls back transparently to single-GPU when num_gpus <= 1.
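
The batch-size equivalence described above can be checked with a small, dependency-free sketch. It does not use torch or NCCL: all_reduce_mean is a hypothetical stand-in for an all_reduce sum divided by the world size, and the model is a toy one-parameter regression chosen only for illustration. It assumes equal-sized shards and a mean-reduced loss.

```python
import random


def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for NCCL all_reduce(SUM) followed by division by world size."""
    return sum(values) / len(values)


rng = random.Random(0)
num_gpus, per_gpu_batch = 4, 8

# One large batch, split into equal shards (one per hypothetical GPU worker).
xs = [rng.uniform(-1.0, 1.0) for _ in range(num_gpus * per_gpu_batch)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]
shards = [
    data[i * per_gpu_batch:(i + 1) * per_gpu_batch] for i in range(num_gpus)
]

w = 0.5
averaged = all_reduce_mean([grad_mse(w, shard) for shard in shards])
full = grad_mse(w, data)  # gradient over the combined batch
```

Here averaged and full agree up to floating-point error, which is why averaging per-worker gradients behaves like a single update on a batch num_gpus times larger.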

static validate_capabilities(*, algo_type, learner_kwargs, num_gpus)[source]
Return type:

None

__init__(learner, env_name, algo_type, learner_kwargs, num_gpus=1, **kwargs)[source]
learn(max_iterations=1500, save_interval=50, log_dir='logs', logger_type='tensorboard')[source]

Unified training loop for off-policy algorithms.

Parameters:
  • max_iterations (int)

  • save_interval (int)

  • log_dir (str)

  • logger_type (str)

Return type:

None

unilab.algos.torch.offpolicy.off_policy_collector_fn(stop_event, env_name, num_envs, replay_buffer, weight_sync_name, weight_param_shapes, algo_type='sac', actor_hidden_dim=512, use_layer_norm=True, learning_starts=0, metrics_queue=None, weight_sync_lock=None, sync_collection=False, collection_ready_queue=None, trainer_done_queue=None, env_steps_per_sync=1, obs_normalization=False, shared_obs_normalizer_stats=None, sim_backend='mujoco', env_cfg_override=None, obs_dim=None, action_dim=None, actor_kwargs=None, seed=None, trace_enabled=False, trace_thread_time=False, collector_pack_request_queue=None, collector_pack_ready_queue=None, collector_pack_shared_slots=None, **kwargs)[source]

Entry point for the off-policy collector subprocess.

Error handling is provided by _collector_entry_wrapper in async_runner.py.
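
The parameters stop_event, sync_collection, collection_ready_queue, trainer_done_queue and env_steps_per_sync hint at a lockstep handshake between collector and trainer. The sketch below shows one way such a handshake could work. It is an assumption for illustration, not the unilab implementation: it uses threads and queue.Queue instead of a subprocess, and the functions collector and trainer are hypothetical.

```python
import queue
import threading


def collector(stop_event, ready_q, done_q, env_steps_per_sync, log):
    """Collect a fixed number of env steps, then wait for the trainer."""
    step = 0
    while not stop_event.is_set():
        for _ in range(env_steps_per_sync):
            step += 1
            log.append(("collect", step))  # placeholder for env.step()
        ready_q.put(step)  # tell the trainer that fresh data is available
        done_q.get()       # block until the trainer finishes its update


def trainer(stop_event, ready_q, done_q, iterations, log):
    """Run one update per collection round, then release the collector."""
    for it in range(1, iterations + 1):
        ready_q.get()               # wait for the collector's round
        log.append(("train", it))   # placeholder for gradient updates
        if it == iterations:
            stop_event.set()        # ask the collector to exit
        done_q.put(it)


log = []
stop_event = threading.Event()
ready_q, done_q = queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=collector, args=(stop_event, ready_q, done_q, 2, log)),
    threading.Thread(target=trainer, args=(stop_event, ready_q, done_q, 3, log)),
]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Because each side blocks on the other's queue, collection and training strictly alternate: with env_steps_per_sync=2 the log shows two collect entries before every train entry.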

Modules

double_buffer_runner

Off-policy runner using CPU-pinned double-buffer replay pipeline (B path).

multi_gpu_runner

Multi-GPU off-policy runner using NCCL all-reduce for FastSAC.

runner

Unified runner for off-policy RL algorithms (SAC, TD3).

runtime

Runtime resolution helpers for off-policy script assembly.

worker

Off-policy collector for SAC and TD3.