unilab.algos.torch.offpolicy.runner

Unified runner for off-policy RL algorithms (SAC, TD3).

Functions

build_reward_comparison_metrics(...)

Return the latest collector-side 100-episode mean for reward comparison.

compute_train_start_threshold(batch_size, ...)

Return the minimum replay size required before learner updates may start.

replay_buffer_ready_for_learning(...)

Return whether the replay buffer has enough samples for the first learner step.

Classes

OffPolicyRunner

Unified runner for SAC and TD3.

unilab.algos.torch.offpolicy.runner.compute_train_start_threshold(batch_size, learning_starts, num_envs)[source]

Return the minimum replay size required before learner updates may start.

Parameters:
  • batch_size (int)

  • learning_starts (int)

  • num_envs (int)

Return type:

int
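
This page does not state the formula behind the threshold. The following is only a minimal sketch, assuming the threshold must cover at least one full batch and that `learning_starts` counts warm-up steps per environment. Both are assumptions, and `sketch_train_start_threshold` is a hypothetical name, not the library function.

```python
def sketch_train_start_threshold(batch_size: int, learning_starts: int, num_envs: int) -> int:
    """Hypothetical stand-in for compute_train_start_threshold (NOT the library code).

    Assumption: learner updates need at least one full batch, and
    learning_starts is measured in per-environment steps.
    """
    if batch_size <= 0 or num_envs <= 0 or learning_starts < 0:
        raise ValueError("batch_size and num_envs must be positive; learning_starts non-negative")
    # Transitions gathered during warm-up across all parallel environments.
    warmup_transitions = learning_starts * num_envs
    # Never start before a full batch can be sampled.
    return max(batch_size, warmup_transitions)


# With the OffPolicyRunner defaults (batch_size=8192, learning_starts=0, num_envs=4096),
# this sketch reduces to the batch size.
threshold = sketch_train_start_threshold(8192, 0, 4096)
```

Under these assumptions, the default configuration would need two collection steps of 4096 environments before the first update.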

unilab.algos.torch.offpolicy.runner.replay_buffer_ready_for_learning(replay_buffer_size, *, batch_size, learning_starts, num_envs)[source]

Return whether the replay buffer has enough samples for the first learner step.

Parameters:
  • replay_buffer_size (int)

  • batch_size (int)

  • learning_starts (int)

  • num_envs (int)

Return type:

bool
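
Note that every argument after `replay_buffer_size` is keyword-only (the `*` in the signature). The sketch below mirrors that calling convention with a hypothetical stand-in; the comparison against `max(batch_size, learning_starts * num_envs)` is an assumption, since the actual rule is not documented here.

```python
def sketch_replay_buffer_ready(
    replay_buffer_size: int,
    *,
    batch_size: int,
    learning_starts: int,
    num_envs: int,
) -> bool:
    """Hypothetical stand-in for replay_buffer_ready_for_learning (NOT the library code)."""
    # Assumed threshold: one full batch, or the warm-up transitions, whichever is larger.
    threshold = max(batch_size, learning_starts * num_envs)
    return replay_buffer_size >= threshold


# One collection step of 4096 environments is not enough for a batch of 8192.
after_one_step = sketch_replay_buffer_ready(
    4096, batch_size=8192, learning_starts=0, num_envs=4096
)
# Two collection steps fill exactly one batch.
after_two_steps = sketch_replay_buffer_ready(
    8192, batch_size=8192, learning_starts=0, num_envs=4096
)
```

Passing `batch_size`, `learning_starts`, or `num_envs` positionally raises a `TypeError`, which guards against swapping the three integer arguments by accident.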

unilab.algos.torch.offpolicy.runner.build_reward_comparison_metrics(reward_history, smoothed_reward)[source]

Return the latest collector-side 100-episode mean for reward comparison.

Parameters:
  • reward_history

  • smoothed_reward

Return type:

dict[str, float]
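
The parameter types and the returned keys are not documented on this page. As an illustration only, a 100-episode mean over a reward history could be computed as below. The function name, the `"reward_mean_100"` key, and the fallback to `smoothed_reward` for an empty history are all hypothetical choices made for this sketch.

```python
from typing import Iterable


def sketch_reward_comparison_metrics(
    reward_history: Iterable[float], smoothed_reward: float
) -> dict[str, float]:
    """Hypothetical stand-in for build_reward_comparison_metrics (NOT the library code)."""
    # Keep only the most recent 100 episode returns.
    window = list(reward_history)[-100:]
    if window:
        mean_reward = sum(window) / len(window)
    else:
        # Assumption: fall back to the smoothed value when no episode has finished yet.
        mean_reward = float(smoothed_reward)
    # The key name is invented for this example.
    return {"reward_mean_100": float(mean_reward)}


metrics = sketch_reward_comparison_metrics([1.0, 2.0, 3.0], smoothed_reward=0.0)
```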

class unilab.algos.torch.offpolicy.runner.OffPolicyRunner[source]

Bases: AsyncRunner

Unified runner for SAC and TD3.

Parameters:
  • env_name (str)

  • algo_type (str)

  • num_envs (int)

  • replay_buffer_n (int)

  • batch_size (int)

  • learning_starts (int)

  • updates_per_step (int)

  • policy_frequency (int)

  • sync_collection (bool)

  • env_steps_per_sync (int)

  • device (str | None)

  • actor_hidden_dim (int)

  • use_layer_norm (bool)

  • obs_normalization (bool)

  • sim_backend (str)

  • env_cfg_override (dict | None)

  • actor_kwargs (dict | None)

  • seed (int | None)

  • trace_enabled (bool)

  • trace_output_dir (str | None)

  • trace_thread_time (bool)

  • trace_cuda_events (bool)

__init__(learner, env_name, algo_type, num_envs=4096, replay_buffer_n=1024, batch_size=8192, learning_starts=0, updates_per_step=8, policy_frequency=4, sync_collection=True, env_steps_per_sync=1, device=None, actor_hidden_dim=512, use_layer_norm=True, obs_normalization=False, sim_backend='mujoco', env_cfg_override=None, actor_kwargs=None, seed=None, trace_enabled=False, trace_output_dir=None, trace_thread_time=False, trace_cuda_events=True)[source]

learn(max_iterations=1500, save_interval=50, log_dir='logs', logger_type='tensorboard')[source]

Unified training loop for off-policy algorithms.

Parameters:
  • max_iterations (int)

  • save_interval (int)

  • log_dir (str)

  • logger_type (str)

Return type:

None
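
The internals of the training loop are not described here. To give a feel for how `updates_per_step` and `policy_frequency` might interact, the sketch below assumes the common TD3-style convention in which the critic is updated on every learner update and the actor only on every `policy_frequency`-th one. That convention and the helper name are assumptions, not a statement of what `learn()` does.

```python
def sketch_update_schedule(updates_per_step: int = 8, policy_frequency: int = 4) -> list[str]:
    """Hypothetical per-step update schedule (NOT the library's training loop)."""
    if updates_per_step < 0 or policy_frequency <= 0:
        raise ValueError("updates_per_step must be >= 0 and policy_frequency > 0")
    schedule = []
    for update_idx in range(updates_per_step):
        # Assumed delayed policy update: the actor steps once per
        # policy_frequency critic updates.
        actor_due = (update_idx + 1) % policy_frequency == 0
        schedule.append("critic+actor" if actor_due else "critic")
    return schedule


# With the OffPolicyRunner defaults (updates_per_step=8, policy_frequency=4).
default_schedule = sketch_update_schedule()
```

Under this assumed convention, the defaults would yield eight critic updates and two actor updates per collection step.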

close()[source]
Return type:

None