Reward Porting
Reward terms are where most porting bugs hide. This cookbook captures the common terms and their UniLab idiom.
Pattern: linear / quadratic tracking error
# Legged Gym
def _reward_tracking_lin_vel(self):
    # Squared xy velocity tracking error per env, passed through an exponential kernel.
    err = torch.sum(torch.square(self.commands[:, :2] - self.base_lin_vel[:, :2]), dim=1)
    return torch.exp(-err / self.cfg.rewards.tracking_sigma)

# UniLab
def reward_tracking_lin_vel(self, state):
    # Same computation in NumPy, reading from the state batch instead of self.
    err = np.sum((state.commands[:, :2] - state.base_lin_vel[:, :2]) ** 2, axis=1)
    return np.exp(-err / self.cfg.tracking_sigma)
Notes:
UniLab reward terms operate on a state batch (NumPy on CPU); no per-env loop, no torch.
Return a per-env scalar reward (shape (n_envs,)).
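As a quick shape check, the UniLab-style term above can be run against a stand-in state. This is only a sketch: the SimpleNamespace object is a hypothetical substitute for the real UniLab state, the function is written as a free function rather than a method, and tracking_sigma=0.25 is an assumed value, not a UniLab default.

```python
from types import SimpleNamespace

import numpy as np


def reward_tracking_lin_vel(state, tracking_sigma):
    # Squared xy velocity error per env, then an exponential kernel.
    err = np.sum((state.commands[:, :2] - state.base_lin_vel[:, :2]) ** 2, axis=1)
    return np.exp(-err / tracking_sigma)


# Hypothetical stand-in for a state batch with n_envs = 3.
state = SimpleNamespace(
    commands=np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 0.0]]),
    base_lin_vel=np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.1], [0.0, 1.0, 0.0]]),
)

# One scalar per env: perfect tracking gives 1.0, larger errors decay toward 0.
tracking_reward = reward_tracking_lin_vel(state, tracking_sigma=0.25)
```

Only the first two columns (x and y) enter the error, so the z component in the second env has no effect on the result.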
Pattern: contact-conditional bonus
def reward_feet_air_time(self, state):
    contact = state.foot_contact    # bool, (n_envs, n_feet)
    air_time = state.last_air_time  # float, (n_envs, n_feet)
    first_contact = contact & ~state.prev_contact
    reward = (air_time - self.cfg.air_time_threshold) * first_contact
    return reward.sum(axis=1)
Notes:
UniLab's state carries prev_contact, so you don't need to manage edge detection yourself. See unilab.envs.locomotion.common.rewards.
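The edge detection in the term above is easiest to see on a tiny batch. The sketch below uses a hypothetical SimpleNamespace in place of the UniLab state and an assumed threshold of 0.5; only the field names foot_contact, prev_contact, and last_air_time come from the snippet above.

```python
from types import SimpleNamespace

import numpy as np


def reward_feet_air_time(state, air_time_threshold):
    # A foot "lands" when it is in contact now but was not on the previous step.
    first_contact = state.foot_contact & ~state.prev_contact
    # Only landing feet contribute; swings shorter than the threshold count negatively.
    return ((state.last_air_time - air_time_threshold) * first_contact).sum(axis=1)


# Hypothetical state: 2 envs, 2 feet.
state = SimpleNamespace(
    foot_contact=np.array([[True, False], [True, True]]),
    prev_contact=np.array([[False, False], [True, False]]),
    last_air_time=np.array([[0.8, 0.3], [0.1, 0.2]]),
)

air_time_reward = reward_feet_air_time(state, air_time_threshold=0.5)
```

In the first env only the first foot lands, after a long swing, so the term is positive. In the second env the first foot was already down and is ignored, while the second foot lands after a short swing and contributes a negative value.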
Pattern: action smoothness penalty
def reward_action_rate(self, state):
    return -np.sum((state.action - state.prev_action) ** 2, axis=1)
Already a stock helper in unilab.envs.locomotion.common.rewards.
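One thing worth checking when porting this term is the sign. If the source term returns a positive magnitude and gets its sign from a negative scale in the config, carrying that negative scale over to a term that is already negated would flip the penalty into a bonus. The sketch below illustrates this with a hypothetical SimpleNamespace state and an assumed weight of 0.01; how UniLab actually applies weights is not shown in this page.

```python
from types import SimpleNamespace

import numpy as np


def reward_action_rate(state):
    # Negative squared change in action between consecutive steps.
    return -np.sum((state.action - state.prev_action) ** 2, axis=1)


# Hypothetical state: 2 envs, 3 action dimensions.
state = SimpleNamespace(
    action=np.array([[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]]),
    prev_action=np.array([[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]),
)
action_rate_reward = reward_action_rate(state)

# Assumed weight, for illustration only. The term is already negative,
# so a positive weight keeps it a penalty.
weight = 0.01
weighted = weight * action_rate_reward
```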
Pattern: posture penalty
def reward_dof_pos_limits(self, state):
    lower = self.cfg.dof_pos_lower
    upper = self.cfg.dof_pos_upper
    deviation = (
        np.maximum(0, lower - state.dof_pos) +
        np.maximum(0, state.dof_pos - upper)
    )
    return -np.sum(deviation, axis=1)
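The limit penalty above relies on NumPy broadcasting per-joint limits of shape (n_dof,) against joint positions of shape (n_envs, n_dof). A minimal sketch with made-up limits and a hypothetical SimpleNamespace state shows that only the amount outside the limits is penalized:

```python
from types import SimpleNamespace

import numpy as np


def reward_dof_pos_limits(state, lower, upper):
    # Per-joint distance outside [lower, upper]; zero for joints inside the range.
    deviation = np.maximum(0, lower - state.dof_pos) + np.maximum(0, state.dof_pos - upper)
    return -np.sum(deviation, axis=1)


# Assumed limits for 2 joints; these values are illustrative only.
lower = np.array([-1.0, -0.5])
upper = np.array([1.0, 0.5])

# Hypothetical state: 3 envs, 2 joints.
state = SimpleNamespace(
    dof_pos=np.array([[0.0, 0.0], [1.2, -0.7], [-1.5, 0.5]]),
)

limit_reward = reward_dof_pos_limits(state, lower, upper)
```

A joint sitting exactly on a limit, like the second joint in the last env, contributes nothing.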
Termination handling
UniLab separates the terminal signal from the terminal penalty. The env's terminations() returns a boolean mask; the reward registry can include a termination_penalty term that consumes it.
def reward_termination(self, state):
    return -state.termination.astype(np.float32) * self.cfg.termination_penalty
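To make the mask-to-penalty step concrete, the sketch below applies the same expression to a small boolean mask. The SimpleNamespace state and the penalty magnitude of 10.0 are assumptions for illustration; the real mask would come from the env's terminations().

```python
from types import SimpleNamespace

import numpy as np


def reward_termination(state, termination_penalty):
    # Cast the boolean mask to float so terminated envs get -penalty and the rest get 0.
    return -state.termination.astype(np.float32) * termination_penalty


# Hypothetical state: 3 envs, only the second one terminated this step.
state = SimpleNamespace(termination=np.array([False, True, False]))

termination_reward = reward_termination(state, termination_penalty=10.0)
```

Here termination_penalty is treated as a positive magnitude, with the negation living in the term itself.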
See also
unilab.training.reward
unilab.envs.locomotion.common.rewards