Top-level modules

Small leaf modules that don’t fit elsewhere.

unilab.cli

Thin package CLI for routing to existing UniLab training entrypoints.

class unilab.cli.Route[source]

Bases: object

Route(script_name: 'str', config_group: 'str', owner_task: 'str', generated_overrides: 'tuple[str, ...]')

Parameters:
script_name: str
config_group: str
owner_task: str
generated_overrides: tuple[str, ...]
__init__(script_name, config_group, owner_task, generated_overrides)
unilab.cli.repo_root()[source]
Return type:

Path

unilab.cli.build_route(algo, task, sim, profile=None)[source]
Return type:

Route

unilab.cli.build_command(*, mode, algo, task, sim, overrides, profile=None, load_run=None, render_mode=None, root=None)[source]
Return type:

list[str]

unilab.cli.train_main(argv=None)[source]
Parameters:

argv (Optional[Sequence[str]])

Return type:

int

unilab.cli.eval_main(argv=None)[source]
Parameters:

argv (Optional[Sequence[str]])

Return type:

int

unilab.cli.demo_main(argv=None)[source]
Parameters:

argv (Optional[Sequence[str]])

Return type:

int
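
The three entrypoints above share one shape: an optional argv sequence goes in and an integer exit status comes out. The following is a minimal sketch of that convention only. The function name example_main and the flags --algo, --task, and --sim are illustrative assumptions, not the actual unilab.cli options.

```python
import argparse
from typing import Optional, Sequence


def example_main(argv: Optional[Sequence[str]] = None) -> int:
    """Hypothetical entrypoint mirroring the (argv=None) -> int shape."""
    parser = argparse.ArgumentParser(prog="example")
    # These flags are assumptions made for illustration only.
    parser.add_argument("--algo", required=True)
    parser.add_argument("--task", required=True)
    parser.add_argument("--sim", required=True)
    # parse_known_args keeps unrecognized tokens so they can be forwarded.
    # When argv is None, argparse falls back to sys.argv[1:].
    args, overrides = parser.parse_known_args(argv)
    print(f"algo={args.algo} task={args.task} sim={args.sim} overrides={overrides}")
    return 0  # zero signals success to the calling shell


status = example_main(["--algo", "sac", "--task", "walk", "--sim", "demo", "seed=3"])
```

Accepting argv as a parameter lets tests call the function with an explicit list, while a console script can call it with no arguments and pass the result to sys.exit.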

unilab.demo

Demo entrypoint: fetch checkpoint from HF, then launch interactive playback.

class unilab.demo.DemoSpec[source]

Bases: object

DemoSpec(algo: 'str', task: 'str', sim: 'str', entry: 'str')

Parameters:
algo: str
task: str
sim: str
entry: str
__init__(algo, task, sim, entry)
unilab.demo.get_demo_spec(demo_name)[source]
Parameters:

demo_name (str)

Return type:

DemoSpec

unilab.demo.build_demo_command(*, demo_name, checkpoint_path, device=None, root=None)[source]

Assemble the subprocess command for a given demo + resolved checkpoint.

Return type:

list[str]

unilab.demo.run_demo(*, demo_name, refresh=False, device=None)[source]
Return type:

int
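
build_demo_command returns a list[str] and run_demo returns an int, which matches the usual pattern of assembling an argument list and handing it to a subprocess. The sketch below shows that pattern in isolation. The helper names build_example_command and run_example are hypothetical, and whether run_demo uses subprocess.run internally is an assumption.

```python
import subprocess
import sys


def build_example_command(message: str) -> list[str]:
    # Hypothetical stand-in: each argument is its own list element,
    # so no shell quoting is needed.
    return [sys.executable, "-c", "import sys; print(sys.argv[1])", message]


def run_example(message: str) -> int:
    cmd = build_example_command(message)
    # check=False returns the exit status instead of raising on failure.
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


exit_code = run_example("hello from a child process")
```

Keeping command assembly separate from execution means the list can be inspected or asserted on without launching anything.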

unilab.structured_configs

Typed dataclass configs for all training algorithms.

Replaces ml_collections.ConfigDict factory functions. Use OmegaConf / Hydra to compose these at runtime.

class unilab.structured_configs.BaseConfig[source]

Bases: object

to_dict()[source]
Return type:

dict[str, Any]
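
The config classes below are dataclasses whose nested fields show a <factory> default, and BaseConfig.to_dict() returns a plain dict[str, Any]. A minimal stdlib-only sketch of that arrangement follows. The Example* classes are hypothetical stand-ins, and implementing to_dict with dataclasses.asdict is an assumption about how BaseConfig might work, not its confirmed implementation.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ExampleAlgoParams:
    # Hypothetical nested parameter group.
    alpha_lr: float = 0.0003
    use_compile: bool = True


@dataclass
class ExampleBaseConfig:
    def to_dict(self) -> dict[str, Any]:
        # asdict recurses into nested dataclasses, lists, and dicts.
        return asdict(self)


@dataclass
class ExampleConfig(ExampleBaseConfig):
    seed: int = 1
    num_envs: int = 4096
    # Nested defaults use default_factory so instances do not share state;
    # Sphinx renders such defaults as <factory>.
    algo_params: ExampleAlgoParams = field(default_factory=ExampleAlgoParams)


cfg = ExampleConfig(seed=7)
plain = cfg.to_dict()
```

The resulting dictionary contains only built-in types, which makes it convenient for logging or serialization.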

class unilab.structured_configs.SACAlgoParams[source]

Bases: object

SACAlgoParams(alpha_lr: 'float' = 0.0003, alpha_init: 'float' = 0.01, target_entropy_ratio: 'float' = 0.0, max_grad_norm: 'float' = 0.0, amp_dtype: 'str' = 'auto', use_compile: 'bool' = True)

Parameters:
alpha_lr: float = 0.0003
alpha_init: float = 0.01
target_entropy_ratio: float = 0.0
max_grad_norm: float = 0.0
amp_dtype: str = 'auto'
use_compile: bool = True
__init__(alpha_lr=0.0003, alpha_init=0.01, target_entropy_ratio=0.0, max_grad_norm=0.0, amp_dtype='auto', use_compile=True)
class unilab.structured_configs.SACConfig[source]

Bases: BaseConfig

SACConfig(algo: 'str' = 'sac', algo_log_name: 'str' = 'fast_sac', runtime_impl: 'Optional[str]' = None, runtime_resolver: 'Optional[str]' = None, seed: 'int' = 1, num_envs: 'int' = 4096, batch_size: 'int' = 8192, replay_buffer_n: 'int' = 512, updates_per_step: 'int' = 4, learning_starts: 'int' = 1, policy_frequency: 'int' = 4, env_steps_per_sync: 'int' = 1, max_iterations: 'int' = 500, save_interval: 'int' = 500, gamma: 'float' = 0.97, tau: 'float' = 0.125, actor_lr: 'float' = 0.0003, critic_lr: 'float' = 0.0003, actor_hidden_dim: 'int' = 512, critic_hidden_dim: 'int' = 768, num_atoms: 'int' = 101, obs_normalization: 'bool' = True, use_layer_norm: 'bool' = True, use_symmetry: 'bool' = False, actor: 'dict[str, Any]' = <factory>, algo_params: 'SACAlgoParams' = <factory>)

Parameters:
algo: str = 'sac'
algo_log_name: str = 'fast_sac'
runtime_impl: Optional[str] = None
runtime_resolver: Optional[str] = None
seed: int = 1
num_envs: int = 4096
batch_size: int = 8192
replay_buffer_n: int = 512
updates_per_step: int = 4
learning_starts: int = 1
policy_frequency: int = 4
env_steps_per_sync: int = 1
max_iterations: int = 500
save_interval: int = 500
gamma: float = 0.97
tau: float = 0.125
actor_lr: float = 0.0003
critic_lr: float = 0.0003
actor_hidden_dim: int = 512
critic_hidden_dim: int = 768
num_atoms: int = 101
obs_normalization: bool = True
use_layer_norm: bool = True
use_symmetry: bool = False
actor: dict[str, Any]
algo_params: SACAlgoParams
__init__(algo='sac', algo_log_name='fast_sac', runtime_impl=None, runtime_resolver=None, seed=1, num_envs=4096, batch_size=8192, replay_buffer_n=512, updates_per_step=4, learning_starts=1, policy_frequency=4, env_steps_per_sync=1, max_iterations=500, save_interval=500, gamma=0.97, tau=0.125, actor_lr=0.0003, critic_lr=0.0003, actor_hidden_dim=512, critic_hidden_dim=768, num_atoms=101, obs_normalization=True, use_layer_norm=True, use_symmetry=False, actor=<factory>, algo_params=<factory>)
class unilab.structured_configs.TD3AlgoParams[source]

Bases: object

TD3AlgoParams(weight_decay: 'float' = 0.1, v_min: 'float' = -10.0, v_max: 'float' = 10.0, init_scale: 'float' = 0.01, log_std_min: 'float' = -0.9, log_std_max: 'float' = 0.0, policy_noise: 'float' = 0.2, noise_clip: 'float' = 0.5, use_cdq: 'bool' = True)

Parameters:
weight_decay: float = 0.1
v_min: float = -10.0
v_max: float = 10.0
init_scale: float = 0.01
log_std_min: float = -0.9
log_std_max: float = 0.0
policy_noise: float = 0.2
noise_clip: float = 0.5
use_cdq: bool = True
__init__(weight_decay=0.1, v_min=-10.0, v_max=10.0, init_scale=0.01, log_std_min=-0.9, log_std_max=0.0, policy_noise=0.2, noise_clip=0.5, use_cdq=True)
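
The v_min and v_max fields here, together with num_atoms on TD3Config, resemble the parameters of a categorical (C51-style) distributional critic, where returns are represented on a fixed grid of evenly spaced atoms. The sketch below computes such a grid from the documented defaults. That UniLab uses these fields this way is an assumption, and atom_support is a hypothetical helper rather than part of the package.

```python
def atom_support(v_min: float, v_max: float, num_atoms: int) -> list[float]:
    """Evenly spaced value atoms from v_min to v_max, endpoints included."""
    if num_atoms < 2:
        raise ValueError("num_atoms must be at least 2")
    if v_max <= v_min:
        raise ValueError("v_max must be greater than v_min")
    step = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * step for i in range(num_atoms)]


# Documented defaults: v_min=-10.0, v_max=10.0, num_atoms=101.
support = atom_support(-10.0, 10.0, 101)
spacing = support[1] - support[0]
```

With these defaults the atoms are roughly 0.2 apart, so under this reading the critic could resolve returns only within the range [-10, 10].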
class unilab.structured_configs.TD3Config[source]

Bases: BaseConfig

TD3Config(algo: 'str' = 'td3', algo_log_name: 'str' = 'fast_td3', seed: 'int' = 1, num_envs: 'int' = 4096, batch_size: 'int' = 8192, replay_buffer_n: 'int' = 1000, updates_per_step: 'int' = 4, learning_starts: 'int' = 1, policy_frequency: 'int' = 2, env_steps_per_sync: 'int' = 1, max_iterations: 'int' = 5000, save_interval: 'int' = 500, gamma: 'float' = 0.97, tau: 'float' = 0.1, actor_lr: 'float' = 0.0003, critic_lr: 'float' = 0.0003, actor_hidden_dim: 'int' = 256, critic_hidden_dim: 'int' = 512, num_atoms: 'int' = 101, obs_normalization: 'bool' = True, use_layer_norm: 'bool' = False, algo_params: 'TD3AlgoParams' = <factory>)

Parameters:
  • algo (str)

  • algo_log_name (str)

  • seed (int)

  • num_envs (int)

  • batch_size (int)

  • replay_buffer_n (int)

  • updates_per_step (int)

  • learning_starts (int)

  • policy_frequency (int)

  • env_steps_per_sync (int)

  • max_iterations (int)

  • save_interval (int)

  • gamma (float)

  • tau (float)

  • actor_lr (float)

  • critic_lr (float)

  • actor_hidden_dim (int)

  • critic_hidden_dim (int)

  • num_atoms (int)

  • obs_normalization (bool)

  • use_layer_norm (bool)

  • algo_params (TD3AlgoParams)

algo: str = 'td3'
algo_log_name: str = 'fast_td3'
seed: int = 1
num_envs: int = 4096
batch_size: int = 8192
replay_buffer_n: int = 1000
updates_per_step: int = 4
learning_starts: int = 1
policy_frequency: int = 2
env_steps_per_sync: int = 1
max_iterations: int = 5000
save_interval: int = 500
gamma: float = 0.97
tau: float = 0.1
actor_lr: float = 0.0003
critic_lr: float = 0.0003
actor_hidden_dim: int = 256
critic_hidden_dim: int = 512
num_atoms: int = 101
obs_normalization: bool = True
use_layer_norm: bool = False
algo_params: TD3AlgoParams
__init__(algo='td3', algo_log_name='fast_td3', seed=1, num_envs=4096, batch_size=8192, replay_buffer_n=1000, updates_per_step=4, learning_starts=1, policy_frequency=2, env_steps_per_sync=1, max_iterations=5000, save_interval=500, gamma=0.97, tau=0.1, actor_lr=0.0003, critic_lr=0.0003, actor_hidden_dim=256, critic_hidden_dim=512, num_atoms=101, obs_normalization=True, use_layer_norm=False, algo_params=<factory>)
Parameters:
  • algo (str)

  • algo_log_name (str)

  • seed (int)

  • num_envs (int)

  • batch_size (int)

  • replay_buffer_n (int)

  • updates_per_step (int)

  • learning_starts (int)

  • policy_frequency (int)

  • env_steps_per_sync (int)

  • max_iterations (int)

  • save_interval (int)

  • gamma (float)

  • tau (float)

  • actor_lr (float)

  • critic_lr (float)

  • actor_hidden_dim (int)

  • critic_hidden_dim (int)

  • num_atoms (int)

  • obs_normalization (bool)

  • use_layer_norm (bool)

  • algo_params (TD3AlgoParams)

class unilab.structured_configs.FlashSACAlgoParams[source]

Bases: object

FlashSACAlgoParams(normalize_reward: 'bool' = True, normalized_g_max: 'float' = 5.0, actor_num_blocks: 'int' = 2, critic_num_blocks: 'int' = 2, actor_bc_alpha: 'float' = 0.0, actor_noise_zeta_mu: 'float' = 2.0, actor_noise_zeta_max: 'int' = 16, critic_min_v: 'float' = -5.0, critic_max_v: 'float' = 5.0, temp_initial_value: 'float' = 0.01, temp_target_sigma: 'float' = 0.15, temp_target_entropy: 'float | None' = None, learning_rate_init: 'float' = 0.0003, learning_rate_peak: 'float' = 0.0003, learning_rate_end: 'float' = 0.00015, learning_rate_warmup_steps: 'int' = 0, learning_rate_decay_steps: 'int' = 500000, n_step: 'int' = 1, amp_dtype: 'str' = 'auto', use_compile: 'bool' = True)

Parameters:
  • normalize_reward (bool)

  • normalized_g_max (float)

  • actor_num_blocks (int)

  • critic_num_blocks (int)

  • actor_bc_alpha (float)

  • actor_noise_zeta_mu (float)

  • actor_noise_zeta_max (int)

  • critic_min_v (float)

  • critic_max_v (float)

  • temp_initial_value (float)

  • temp_target_sigma (float)

  • temp_target_entropy (float | None)

  • learning_rate_init (float)

  • learning_rate_peak (float)

  • learning_rate_end (float)

  • learning_rate_warmup_steps (int)

  • learning_rate_decay_steps (int)

  • n_step (int)

  • amp_dtype (str)

  • use_compile (bool)

normalize_reward: bool = True
normalized_g_max: float = 5.0
actor_num_blocks: int = 2
critic_num_blocks: int = 2
actor_bc_alpha: float = 0.0
actor_noise_zeta_mu: float = 2.0
actor_noise_zeta_max: int = 16
critic_min_v: float = -5.0
critic_max_v: float = 5.0
temp_initial_value: float = 0.01
temp_target_sigma: float = 0.15
temp_target_entropy: float | None = None
learning_rate_init: float = 0.0003
learning_rate_peak: float = 0.0003
learning_rate_end: float = 0.00015
learning_rate_warmup_steps: int = 0
learning_rate_decay_steps: int = 500000
n_step: int = 1
amp_dtype: str = 'auto'
use_compile: bool = True
__init__(normalize_reward=True, normalized_g_max=5.0, actor_num_blocks=2, critic_num_blocks=2, actor_bc_alpha=0.0, actor_noise_zeta_mu=2.0, actor_noise_zeta_max=16, critic_min_v=-5.0, critic_max_v=5.0, temp_initial_value=0.01, temp_target_sigma=0.15, temp_target_entropy=None, learning_rate_init=0.0003, learning_rate_peak=0.0003, learning_rate_end=0.00015, learning_rate_warmup_steps=0, learning_rate_decay_steps=500000, n_step=1, amp_dtype='auto', use_compile=True)
Parameters:
  • normalize_reward (bool)

  • normalized_g_max (float)

  • actor_num_blocks (int)

  • critic_num_blocks (int)

  • actor_bc_alpha (float)

  • actor_noise_zeta_mu (float)

  • actor_noise_zeta_max (int)

  • critic_min_v (float)

  • critic_max_v (float)

  • temp_initial_value (float)

  • temp_target_sigma (float)

  • temp_target_entropy (float | None)

  • learning_rate_init (float)

  • learning_rate_peak (float)

  • learning_rate_end (float)

  • learning_rate_warmup_steps (int)

  • learning_rate_decay_steps (int)

  • n_step (int)

  • amp_dtype (str)

  • use_compile (bool)

class unilab.structured_configs.FlashSACConfig[source]

Bases: BaseConfig

FlashSACConfig(algo: 'str' = 'flashsac', algo_log_name: 'str' = 'flash_sac', seed: 'int' = 1, num_envs: 'int' = 1024, batch_size: 'int' = 2048, replay_buffer_n: 'int' = 512, updates_per_step: 'int' = 2, learning_starts: 'int' = 98, policy_frequency: 'int' = 2, env_steps_per_sync: 'int' = 1, max_iterations: 'int' = 5000, save_interval: 'int' = 1000, gamma: 'float' = 0.97, tau: 'float' = 0.01, actor_lr: 'float' = 0.0003, critic_lr: 'float' = 0.0003, actor_hidden_dim: 'int' = 128, critic_hidden_dim: 'int' = 256, num_atoms: 'int' = 101, obs_normalization: 'bool' = False, use_layer_norm: 'bool' = False, algo_params: 'FlashSACAlgoParams' = <factory>)

Parameters:
algo: str = 'flashsac'
algo_log_name: str = 'flash_sac'
seed: int = 1
num_envs: int = 1024
batch_size: int = 2048
replay_buffer_n: int = 512
updates_per_step: int = 2
learning_starts: int = 98
policy_frequency: int = 2
env_steps_per_sync: int = 1
max_iterations: int = 5000
save_interval: int = 1000
gamma: float = 0.97
tau: float = 0.01
actor_lr: float = 0.0003
critic_lr: float = 0.0003
actor_hidden_dim: int = 128
critic_hidden_dim: int = 256
num_atoms: int = 101
obs_normalization: bool = False
use_layer_norm: bool = False
algo_params: FlashSACAlgoParams
__init__(algo='flashsac', algo_log_name='flash_sac', seed=1, num_envs=1024, batch_size=2048, replay_buffer_n=512, updates_per_step=2, learning_starts=98, policy_frequency=2, env_steps_per_sync=1, max_iterations=5000, save_interval=1000, gamma=0.97, tau=0.01, actor_lr=0.0003, critic_lr=0.0003, actor_hidden_dim=128, critic_hidden_dim=256, num_atoms=101, obs_normalization=False, use_layer_norm=False, algo_params=<factory>)
class unilab.structured_configs.APPOAlgorithmConfig[source]

Bases: object

APPOAlgorithmConfig(num_learning_epochs: 'int' = 5, num_mini_batches: 'int' = 4, clip_param: 'float' = 0.2, gamma: 'float' = 0.99, lam: 'float' = 0.95, value_loss_coef: 'float' = 1.0, entropy_coef: 'float' = 0.01, learning_rate: 'float' = 0.001, max_grad_norm: 'float' = 1.0, use_clipped_value_loss: 'bool' = True, schedule: 'str' = 'adaptive', desired_kl: 'float' = 0.01, adaptive_kl_factor: 'float' = 1.2, adaptive_lr_factor: 'float' = 1.1, optimizer: 'str' = 'adam', tau: 'float' = 1.0, target_update_freq: 'int' = 1, vtrace_clip_rho: 'float' = 1.0, vtrace_clip_c: 'float' = 1.0, enable_compile: 'bool' = True)

Parameters:
num_learning_epochs: int = 5
num_mini_batches: int = 4
clip_param: float = 0.2
gamma: float = 0.99
lam: float = 0.95
value_loss_coef: float = 1.0
entropy_coef: float = 0.01
learning_rate: float = 0.001
max_grad_norm: float = 1.0
use_clipped_value_loss: bool = True
schedule: str = 'adaptive'
desired_kl: float = 0.01
adaptive_kl_factor: float = 1.2
adaptive_lr_factor: float = 1.1
optimizer: str = 'adam'
tau: float = 1.0
target_update_freq: int = 1
vtrace_clip_rho: float = 1.0
vtrace_clip_c: float = 1.0
enable_compile: bool = True
__init__(num_learning_epochs=5, num_mini_batches=4, clip_param=0.2, gamma=0.99, lam=0.95, value_loss_coef=1.0, entropy_coef=0.01, learning_rate=0.001, max_grad_norm=1.0, use_clipped_value_loss=True, schedule='adaptive', desired_kl=0.01, adaptive_kl_factor=1.2, adaptive_lr_factor=1.1, optimizer='adam', tau=1.0, target_update_freq=1, vtrace_clip_rho=1.0, vtrace_clip_c=1.0, enable_compile=True)
class unilab.structured_configs.APPODistributionConfig[source]

Bases: object

APPODistributionConfig(class_name: 'str' = 'rsl_rl.modules.distribution.GaussianDistribution', init_std: 'float' = 1.0, std_type: 'str' = 'scalar')

Parameters:
class_name: str = 'rsl_rl.modules.distribution.GaussianDistribution'
init_std: float = 1.0
std_type: str = 'scalar'
__init__(class_name='rsl_rl.modules.distribution.GaussianDistribution', init_std=1.0, std_type='scalar')
class unilab.structured_configs.APPOActorConfig[source]

Bases: object

APPOActorConfig(class_name: 'str' = 'rsl_rl.models.MLPModel', hidden_dims: 'list' = <factory>, activation: 'str' = 'elu', distribution_cfg: 'APPODistributionConfig' = <factory>)

Parameters:
class_name: str = 'rsl_rl.models.MLPModel'
hidden_dims: list
activation: str = 'elu'
distribution_cfg: APPODistributionConfig
__init__(class_name='rsl_rl.models.MLPModel', hidden_dims=<factory>, activation='elu', distribution_cfg=<factory>)
class unilab.structured_configs.APPOCriticConfig[source]

Bases: object

APPOCriticConfig(class_name: 'str' = 'rsl_rl.models.MLPModel', hidden_dims: 'list' = <factory>, activation: 'str' = 'elu')

Parameters:
  • class_name (str)

  • hidden_dims (list)

  • activation (str)

class_name: str = 'rsl_rl.models.MLPModel'
hidden_dims: list
activation: str = 'elu'
__init__(class_name='rsl_rl.models.MLPModel', hidden_dims=<factory>, activation='elu')
Parameters:
  • class_name (str)

  • hidden_dims (list)

  • activation (str)

class unilab.structured_configs.APPOConfig[source]

Bases: BaseConfig

APPOConfig(algo: 'str' = 'appo', algo_log_name: 'str' = 'appo', seed: 'int' = 1, num_envs: 'int' = 2048, steps_per_env: 'int' = 24, max_iterations: 'int' = 150, save_interval: 'int' = 50, obs_groups: 'dict' = <factory>, actor: 'APPOActorConfig' = <factory>, critic: 'APPOCriticConfig' = <factory>, algorithm: 'APPOAlgorithmConfig' = <factory>)

Parameters:
algo: str = 'appo'
algo_log_name: str = 'appo'
seed: int = 1
num_envs: int = 2048
steps_per_env: int = 24
max_iterations: int = 150
save_interval: int = 50
obs_groups: dict
actor: APPOActorConfig
critic: APPOCriticConfig
algorithm: APPOAlgorithmConfig
__init__(algo='appo', algo_log_name='appo', seed=1, num_envs=2048, steps_per_env=24, max_iterations=150, save_interval=50, obs_groups=<factory>, actor=<factory>, critic=<factory>, algorithm=<factory>)
class unilab.structured_configs.PPOPolicyConfig[source]

Bases: object

PPOPolicyConfig(init_noise_std: 'float' = 1.0, actor_hidden_dims: 'list' = <factory>, critic_hidden_dims: 'list' = <factory>, activation: 'str' = 'elu', class_name: 'str' = 'ActorCritic')

Parameters:
  • init_noise_std (float)

  • actor_hidden_dims (list)

  • critic_hidden_dims (list)

  • activation (str)

  • class_name (str)

init_noise_std: float = 1.0
actor_hidden_dims: list
critic_hidden_dims: list
activation: str = 'elu'
class_name: str = 'ActorCritic'
__init__(init_noise_std=1.0, actor_hidden_dims=<factory>, critic_hidden_dims=<factory>, activation='elu', class_name='ActorCritic')
Parameters:
  • init_noise_std (float)

  • actor_hidden_dims (list)

  • critic_hidden_dims (list)

  • activation (str)

  • class_name (str)

class unilab.structured_configs.PPOAlgorithmConfig[source]

Bases: object

PPOAlgorithmConfig(class_name: 'str' = 'unilab.algos.torch.rsl_rl_ppo:FinalObservationAwarePPO', value_loss_coef: 'float' = 1.0, use_clipped_value_loss: 'bool' = True, clip_param: 'float' = 0.2, entropy_coef: 'float' = 0.01, num_learning_epochs: 'int' = 5, num_mini_batches: 'int' = 4, learning_rate: 'float' = 0.001, schedule: 'str' = 'adaptive', gamma: 'float' = 0.99, lam: 'float' = 0.95, desired_kl: 'float' = 0.01, target_kl_stop: 'Optional[float]' = None, max_grad_norm: 'float' = 1.0, adaptive_kl_beta: 'float' = 0.9, adaptive_lr_growth: 'float' = 1.1, adaptive_lr_decay: 'float' = 1.2, adaptive_lr_update_interval: 'int' = 5, metrics_interval: 'int' = 8, finite_check_interval: 'int' = 8, enable_compile: 'bool' = True, warmup_strict_iters: 'int' = 10, warmup_metrics_interval: 'int' = 2, warmup_finite_check_interval: 'int' = 2, disable_finite_checks: 'bool' = True)

Parameters:
  • class_name (str)

  • value_loss_coef (float)

  • use_clipped_value_loss (bool)

  • clip_param (float)

  • entropy_coef (float)

  • num_learning_epochs (int)

  • num_mini_batches (int)

  • learning_rate (float)

  • schedule (str)

  • gamma (float)

  • lam (float)

  • desired_kl (float)

  • target_kl_stop (Optional[float])

  • max_grad_norm (float)

  • adaptive_kl_beta (float)

  • adaptive_lr_growth (float)

  • adaptive_lr_decay (float)

  • adaptive_lr_update_interval (int)

  • metrics_interval (int)

  • finite_check_interval (int)

  • enable_compile (bool)

  • warmup_strict_iters (int)

  • warmup_metrics_interval (int)

  • warmup_finite_check_interval (int)

  • disable_finite_checks (bool)

class_name: str = 'unilab.algos.torch.rsl_rl_ppo:FinalObservationAwarePPO'
value_loss_coef: float = 1.0
use_clipped_value_loss: bool = True
clip_param: float = 0.2
entropy_coef: float = 0.01
num_learning_epochs: int = 5
num_mini_batches: int = 4
learning_rate: float = 0.001
schedule: str = 'adaptive'
gamma: float = 0.99
lam: float = 0.95
desired_kl: float = 0.01
target_kl_stop: Optional[float] = None
max_grad_norm: float = 1.0
adaptive_kl_beta: float = 0.9
adaptive_lr_growth: float = 1.1
adaptive_lr_decay: float = 1.2
adaptive_lr_update_interval: int = 5
metrics_interval: int = 8
finite_check_interval: int = 8
enable_compile: bool = True
warmup_strict_iters: int = 10
warmup_metrics_interval: int = 2
warmup_finite_check_interval: int = 2
disable_finite_checks: bool = True
__init__(class_name='unilab.algos.torch.rsl_rl_ppo:FinalObservationAwarePPO', value_loss_coef=1.0, use_clipped_value_loss=True, clip_param=0.2, entropy_coef=0.01, num_learning_epochs=5, num_mini_batches=4, learning_rate=0.001, schedule='adaptive', gamma=0.99, lam=0.95, desired_kl=0.01, target_kl_stop=None, max_grad_norm=1.0, adaptive_kl_beta=0.9, adaptive_lr_growth=1.1, adaptive_lr_decay=1.2, adaptive_lr_update_interval=5, metrics_interval=8, finite_check_interval=8, enable_compile=True, warmup_strict_iters=10, warmup_metrics_interval=2, warmup_finite_check_interval=2, disable_finite_checks=True)
Parameters:
  • class_name (str)

  • value_loss_coef (float)

  • use_clipped_value_loss (bool)

  • clip_param (float)

  • entropy_coef (float)

  • num_learning_epochs (int)

  • num_mini_batches (int)

  • learning_rate (float)

  • schedule (str)

  • gamma (float)

  • lam (float)

  • desired_kl (float)

  • target_kl_stop (Optional[float])

  • max_grad_norm (float)

  • adaptive_kl_beta (float)

  • adaptive_lr_growth (float)

  • adaptive_lr_decay (float)

  • adaptive_lr_update_interval (int)

  • metrics_interval (int)

  • finite_check_interval (int)

  • enable_compile (bool)

  • warmup_strict_iters (int)

  • warmup_metrics_interval (int)

  • warmup_finite_check_interval (int)

  • disable_finite_checks (bool)

class unilab.structured_configs.PPOConfig[source]

Bases: BaseConfig

PPOConfig(algo: 'str' = 'ppo', algo_log_name: 'str' = 'rsl_rl_ppo', seed: 'int' = 1, num_envs: 'int' = 4096, num_steps_per_env: 'int' = 24, max_iterations: 'int' = 101, save_interval: 'int' = 100, empirical_normalization: 'bool' = False, runner_class_name: 'str' = 'OnPolicyRunner', obs_groups: 'dict' = <factory>, experiment_name: 'str' = 'test', run_name: 'str' = '', resume: 'bool' = False, load_run: 'str' = '-1', checkpoint: 'int' = -1, resume_path: 'Optional[str]' = None, policy: 'PPOPolicyConfig' = <factory>, algorithm: 'PPOAlgorithmConfig' = <factory>)

Parameters:
algo: str = 'ppo'
algo_log_name: str = 'rsl_rl_ppo'
seed: int = 1
num_envs: int = 4096
num_steps_per_env: int = 24
max_iterations: int = 101
save_interval: int = 100
empirical_normalization: bool = False
runner_class_name: str = 'OnPolicyRunner'
obs_groups: dict
experiment_name: str = 'test'
run_name: str = ''
resume: bool = False
load_run: str = '-1'
checkpoint: int = -1
resume_path: Optional[str] = None
policy: PPOPolicyConfig
algorithm: PPOAlgorithmConfig
__init__(algo='ppo', algo_log_name='rsl_rl_ppo', seed=1, num_envs=4096, num_steps_per_env=24, max_iterations=101, save_interval=100, empirical_normalization=False, runner_class_name='OnPolicyRunner', obs_groups=<factory>, experiment_name='test', run_name='', resume=False, load_run='-1', checkpoint=-1, resume_path=None, policy=<factory>, algorithm=<factory>)

unilab.dtype_config

Global dtype configuration for environments.

unilab.dtype_config.get_global_dtype()[source]

Get the global dtype for environment computations.

Return type:

dtype[Any]