unilab.algos.torch.offpolicy
Off-policy RL unified infrastructure.
- class unilab.algos.torch.offpolicy.OffPolicyLogger
Bases: BaseTrainingLogger
Rich logger for off-policy RL algorithms (SAC, TD3, etc.).
- __init__(algo_name='RL', max_iterations=1500, num_envs=4096, env_name='', obs_dim=0, action_dim=0, refresh_per_second=4, log_dir='', log_backend='tensorboard', wandb_project='unilab', wandb_entity=None, wandb_name='', wandb_group=None, wandb_job_type=None, wandb_tags=None, wandb_notes=None)
- class unilab.algos.torch.offpolicy.OffPolicyRunner
Bases: AsyncRunner
Unified runner for SAC and TD3.
- Parameters:
  - env_name (str)
  - algo_type (str)
  - num_envs (int)
  - replay_buffer_n (int)
  - batch_size (int)
  - learning_starts (int)
  - updates_per_step (int)
  - policy_frequency (int)
  - sync_collection (bool)
  - env_steps_per_sync (int)
  - actor_hidden_dim (int)
  - use_layer_norm (bool)
  - obs_normalization (bool)
  - sim_backend (str)
  - trace_enabled (bool)
  - trace_thread_time (bool)
  - trace_cuda_events (bool)
- __init__(learner, env_name, algo_type, num_envs=4096, replay_buffer_n=1024, batch_size=8192, learning_starts=0, updates_per_step=8, policy_frequency=4, sync_collection=True, env_steps_per_sync=1, device=None, actor_hidden_dim=512, use_layer_norm=True, obs_normalization=False, sim_backend='mujoco', env_cfg_override=None, actor_kwargs=None, seed=None, trace_enabled=False, trace_output_dir=None, trace_thread_time=False, trace_cuda_events=True)
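The keyword defaults in the `__init__` signature above can be gathered into a small configuration object before a runner is constructed. The following is a minimal sketch under stated assumptions: `OffPolicyRunnerConfig` is a hypothetical helper that is not part of unilab, it mirrors only a subset of the documented defaults, and `"MyEnv-v0"` is a placeholder environment name.

```python
from dataclasses import asdict, dataclass


@dataclass
class OffPolicyRunnerConfig:
    """Hypothetical container mirroring some OffPolicyRunner keyword defaults."""

    env_name: str
    algo_type: str
    num_envs: int = 4096
    replay_buffer_n: int = 1024
    batch_size: int = 8192
    learning_starts: int = 0
    updates_per_step: int = 8
    policy_frequency: int = 4
    sync_collection: bool = True
    env_steps_per_sync: int = 1
    actor_hidden_dim: int = 512
    use_layer_norm: bool = True
    obs_normalization: bool = False
    sim_backend: str = "mujoco"

    def to_kwargs(self) -> dict:
        """Return the fields as a dict that can be passed as **kwargs."""
        return asdict(self)


# "MyEnv-v0" is a placeholder, not a real unilab task name.
cfg = OffPolicyRunnerConfig(env_name="MyEnv-v0", algo_type="sac", num_envs=1024)
kwargs = cfg.to_kwargs()

# With unilab installed and a learner object available, the call would
# presumably look like this (not executed here):
# runner = OffPolicyRunner(learner, **kwargs)
```

Keeping the overrides in one object makes it easy to log or diff the settings of a run; only `num_envs` differs from the documented defaults in this example.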
- class unilab.algos.torch.offpolicy.MultiGPUOffPolicyRunner
Bases: OffPolicyRunner
Multi-GPU off-policy runner.
Keeps a single Collector on CPU and spawns num_gpus Learner workers via torch.multiprocessing.spawn. Each worker processes independent mini-batches from the same shared ReplayBuffer; gradients are averaged with NCCL all_reduce, which is equivalent to training on a num_gpus× larger effective batch size per wall-clock second.
Falls back transparently to single-GPU when num_gpus <= 1.
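The equivalence between averaged gradients and a larger effective batch can be checked on a toy problem. The sketch below is plain Python, not NCCL or torch.distributed code: for a one-parameter least-squares loss, the mean of the per-worker gradients over equal-sized mini-batches matches the gradient over the concatenated batch. The function names and data are illustrative assumptions, not unilab code.

```python
import math


def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for an all_reduce(SUM) followed by division by world size."""
    return sum(values) / len(values)


w = 0.5
# Two "workers", each holding an independent mini-batch of equal size.
worker_batches = [
    [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)],
    [(4.0, 8.2), (5.0, 9.8), (6.0, 12.1)],
]

per_worker_grads = [grad_mse(w, b) for b in worker_batches]
averaged_grad = all_reduce_mean(per_worker_grads)

# Gradient computed once over the concatenated batch.
combined = [pair for b in worker_batches for pair in b]
combined_grad = grad_mse(w, combined)

assert math.isclose(averaged_grad, combined_grad, rel_tol=1e-9)
```

The match relies on every worker using the same mini-batch size; with unequal sizes the average would need to be weighted by batch size.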
- unilab.algos.torch.offpolicy.off_policy_collector_fn(stop_event, env_name, num_envs, replay_buffer, weight_sync_name, weight_param_shapes, algo_type='sac', actor_hidden_dim=512, use_layer_norm=True, learning_starts=0, metrics_queue=None, weight_sync_lock=None, sync_collection=False, collection_ready_queue=None, trainer_done_queue=None, env_steps_per_sync=1, obs_normalization=False, shared_obs_normalizer_stats=None, sim_backend='mujoco', env_cfg_override=None, obs_dim=None, action_dim=None, actor_kwargs=None, seed=None, trace_enabled=False, trace_thread_time=False, collector_pack_request_queue=None, collector_pack_ready_queue=None, collector_pack_shared_slots=None, **kwargs)
Entry point for the off-policy collector subprocess.
Error handling is provided by _collector_entry_wrapper in async_runner.py.
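The general pattern of a stop-event-driven worker whose errors are caught by an outer wrapper can be sketched as follows. This is not the unilab implementation: `collector_loop`, `entry_wrapper`, and the queues are hypothetical stand-ins, and a thread is used instead of a subprocess so the example stays small and self-contained.

```python
import queue
import threading


def collector_loop(stop_event, out_queue, max_steps=1000):
    """Hypothetical collector: emit placeholder transitions until stopped."""
    step = 0
    while not stop_event.is_set() and step < max_steps:
        out_queue.put(("transition", step))  # placeholder for env data
        step += 1


def failing_collector(stop_event, out_queue):
    """Hypothetical collector that fails immediately."""
    raise RuntimeError("env crashed")


def entry_wrapper(fn, error_queue, *args):
    """Hypothetical wrapper: run fn and report any exception to the parent."""
    try:
        fn(*args)
    except Exception as exc:
        error_queue.put((type(exc).__name__, str(exc)))


stop_event = threading.Event()
out_q = queue.Queue()
err_q = queue.Queue()

# Normal run: start the worker, read one item, then request shutdown.
worker = threading.Thread(
    target=entry_wrapper, args=(collector_loop, err_q, stop_event, out_q)
)
worker.start()
first_item = out_q.get(timeout=5)
stop_event.set()
worker.join(timeout=5)
clean_exit = (not worker.is_alive()) and err_q.empty()

# Failing run: the wrapper turns the exception into a message on the queue.
entry_wrapper(failing_collector, err_q, stop_event, out_q)
reported_error = err_q.get_nowait()
```

Keeping the try/except in a wrapper leaves the worker function free of error-reporting logic, and the parent only needs to poll one queue to learn that a worker has died.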
Modules
- Off-policy runner using CPU-pinned double-buffer replay pipeline (B path).
- Multi-GPU off-policy runner using NCCL all-reduce for FastSAC.
- Unified runner for off-policy RL algorithms (SAC, TD3).
- Runtime resolution helpers for off-policy script assembly.
- Off-policy collector for SAC and TD3.