Simulation Configuration
To test the performance of the trained policies, GOPS offers a gops.sys_simulator.sys_run.PolicyRunner class which can automatically simulate multiple trained policies and visualize the results for intuitive comparison, including the reward curve, state curves, action curves, etc.
In addition, PolicyRunner allows comparing the trained policies with an optimal controller, for example, the MPC-based optimal controller gops.sys_simulator.opt_controller.OptController (See Model Predictive Control (MPC) Module for details).
This page illustrates how to use PolicyRunner for simulation.
Parameters
log_policy_dir_list(list): A list of strings, each giving the directory of a trained policy to be simulated.
trained_policy_iteration_list(list): A list of strings, each specifying the iteration at which the policy model to be simulated was saved.
Note
The elements of log_policy_dir_list and trained_policy_iteration_list correspond one to one, and each pair, when joined together, identifies a specific policy model (i.e., a .pkl file). The two lists must therefore have the same length, and the order of their elements matters.
save_render(bool): (Optional) Whether to save the environment animation. Default to False.
plot_range(list): (Optional) A list of two integers customizing the plot range. If this is set to [a, b], only time steps in [a, b] will be plotted when visualizing the results. If None, the whole trajectories will be plotted. Default to None.
legend_list(list): (Optional) A list of strings specifying the legend of each policy in the result figures. Default to None.
Note
Likewise, the elements of legend_list correspond one to one to those of log_policy_dir_list.
constrained_env(bool): (Optional) Whether the simulation environment is a constrained environment. For example, for policies trained on the environment gops.env.env_ocp.pyth_veh3dofconti_surrcstr_data.SimuVeh3dofcontiSurrCstr, which contains constraints, this parameter should be set to True. Default to None.
is_tracking(bool): (Optional) Whether the simulation environment is a tracking problem. For example, for policies trained on the environment gops.env.env_ocp.pyth_veh3dofconti_data.SimuVeh3dofconti, where the goal is to control a vehicle to follow reference paths, this parameter should be set to True. Default to None.
use_dist(bool): (Optional) Whether the simulation environment has adversarial actions. Default to False.
dt(float): (Optional) Time interval between steps. If None, the result figures use Time step for the x-axis. Otherwise, the result figures use Time (s) for the x-axis, and time steps are converted to time using dt.
is_init_info(bool): (Optional) Whether to customize the initial information. Default to False.
init_info(dict): (Optional) A dictionary specifying the initial information.
Note
This parameter is ignored if is_init_info is False. It is passed to the environment’s reset() method, which you may want to check for the initial information supported by each environment.
use_opt(bool): (Optional) Whether to use the optimal controller for comparison. Default to False.
opt_args(dict): (Optional) A dictionary specifying the arguments of the optimal controller. Default to None. See Usage Example for details.
obs_noise_type(str): (Optional) Type of observation noise. Valid value: {"normal", "uniform", None}. If None, no noise is added to the observation; otherwise, normally or uniformly distributed random noise is added. Default to None.
obs_noise_data(list): (Optional) A list specifying the parameters of the noise distribution. It should contain two lists, each of which has a length equal to the dimension of the observations. For normally distributed noise, the first list gives the mean and the second list gives the standard deviation. For uniformly distributed noise, the first list gives the lower bound and the second list gives the upper bound. Default to None.
action_noise_type(str): (Optional) Type of action noise. Valid value: {"normal", "uniform", None}. If None, no noise is added to the action; otherwise, normally or uniformly distributed random noise is added. Default to None.
action_noise_data(list): (Optional) A list specifying the parameters of the noise distribution. It should contain two lists, each of which has a length equal to the dimension of the actions. For normally distributed noise, the first list gives the mean and the second list gives the standard deviation. For uniformly distributed noise, the first list gives the lower bound and the second list gives the upper bound. Default to None.
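The list-length and noise-data conventions described above can be checked before a PolicyRunner is constructed. The following is a minimal sketch using only the Python standard library. The helpers check_runner_lists and sample_noise are hypothetical names introduced for illustration and are not part of GOPS; the sketch only mirrors the documented meaning of the two inner lists and does not show how PolicyRunner samples noise internally.

```python
import random
from typing import List, Optional, Sequence


def check_runner_lists(
    log_policy_dir_list: Sequence[str],
    trained_policy_iteration_list: Sequence[str],
    legend_list: Optional[Sequence[str]] = None,
) -> None:
    """Hypothetical pre-flight check (not part of GOPS) for the one-to-one lists."""
    if len(log_policy_dir_list) != len(trained_policy_iteration_list):
        raise ValueError("directory and iteration lists must have the same length")
    if legend_list is not None and len(legend_list) != len(log_policy_dir_list):
        raise ValueError("legend_list must have one entry per policy")


def sample_noise(
    noise_type: Optional[str],
    noise_data: Optional[Sequence[Sequence[float]]],
    dim: int,
    rng: random.Random,
) -> List[float]:
    """Hypothetical illustration of how the two inner lists of noise_data are read."""
    if noise_type is None:
        return [0.0] * dim  # no noise is added
    if noise_data is None or len(noise_data) != 2:
        raise ValueError("noise_data must contain exactly two lists")
    first, second = noise_data
    if len(first) != dim or len(second) != dim:
        raise ValueError("both inner lists must match the observation/action dimension")
    if noise_type == "normal":
        # first list: mean, second list: standard deviation
        return [rng.gauss(mean, std) for mean, std in zip(first, second)]
    if noise_type == "uniform":
        # first list: lower bound, second list: upper bound
        return [rng.uniform(low, high) for low, high in zip(first, second)]
    raise ValueError(f"unknown noise type: {noise_type!r}")


if __name__ == "__main__":
    check_runner_lists(["dir_a", "dir_b"], ["4000", "5000"], ["A", "B"])
    print(sample_noise("uniform", [[-0.1] * 4, [0.1] * 4], dim=4, rng=random.Random(0)))
```

Running such a check first surfaces mismatched list lengths or wrongly sized noise lists before any policy is loaded.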
Usage Example
A template for using PolicyRunner is provided in gops.example_run.template_run_environ_alg.py, which shows possible settings for all the parameters. The most involved part is configuring opt_args, which is explained in detail here.
To enable an optimal controller for comparison, first make sure to set
runner = PolicyRunner(
    # other parameters for calling PolicyRunner
    use_opt=True,
)
For some specific environments, the optimal control inputs can be computed analytically, so a theoretically optimal controller is implemented for each of these environments. To specify an optimal controller, you may set
runner = PolicyRunner(
    # other parameters for calling PolicyRunner
    use_opt=True,
    opt_args={"opt_controller_type": "OPT"},
)
For most environments, only the MPC-based optimal controller gops.sys_simulator.opt_controller.OptController is available, in which case the code should look like
runner = PolicyRunner(
    # other parameters for calling PolicyRunner
    use_opt=True,
    opt_args={
        "opt_controller_type": "MPC",
        "num_pred_step": 5,
        # other parameters for instantiating an OptController
    },
)
This will instantiate an OptController with a prediction horizon of 5 steps.
Note
The parameter model needed for OptController’s initialization is automatically created and passed according to your simulation environment, so it won’t be necessary to specify a model in opt_args.
To customize other parameters, just add a key-value pair into opt_args. You may refer to Model Predictive Control (MPC) Module for the meaning of OptController’s parameters.
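As an illustration of adding key-value pairs, the sketch below assembles an opt_args dictionary and checks it against the constraints documented for OptController (ctrl_interval is a factor of num_pred_step, gamma lies in [0, 1], and mode is "shooting" or "collocation"). The function build_mpc_opt_args is a hypothetical helper written for this page and is not part of GOPS; the fallback values it uses for the checks are the documented defaults.

```python
from typing import Any, Dict


def build_mpc_opt_args(num_pred_step: int, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper (not part of GOPS) that assembles opt_args for the MPC controller."""
    opt_args: Dict[str, Any] = {
        "opt_controller_type": "MPC",
        "num_pred_step": num_pred_step,
    }
    # Any further OptController parameter is just another key-value pair.
    opt_args.update(overrides)

    # Check the documented constraints, falling back to the documented defaults.
    ctrl_interval = opt_args.get("ctrl_interval", 1)
    if ctrl_interval < 1 or num_pred_step % ctrl_interval != 0:
        raise ValueError("ctrl_interval should be a factor of num_pred_step")
    gamma = opt_args.get("gamma", 1.0)
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma should lie in [0, 1]")
    mode = opt_args.get("mode", "collocation")
    if mode not in ("shooting", "collocation"):
        raise ValueError("mode should be 'shooting' or 'collocation'")
    return opt_args


if __name__ == "__main__":
    print(build_mpc_opt_args(10, ctrl_interval=2, gamma=0.99))
```

The returned dictionary can then be passed as the opt_args argument of PolicyRunner together with use_opt=True.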
The terminal cost function requires some care. By default, no terminal cost function is used. To enable one, set
runner = PolicyRunner(
    # other parameters for calling PolicyRunner
    use_opt=True,
    opt_args={
        "opt_controller_type": "MPC",
        "num_pred_step": 5,
        "use_terminal_cost": True,
        # other parameters for instantiating an OptController
    },
)
This works for environments whose models come with a default terminal cost function. For environment models without one, you need to define your own terminal cost function and pass it to OptController by
from gops.sys_simulator.call_terminal_cost import load_apprfunc

# Load the learned value function approximator
value_net = load_apprfunc("../results/INFADP/lqs4a2_poly", "115000_opt").v

# Define the terminal cost of the MPC controller
def terminal_cost(obs):
    # Add a batch dimension before calling the value network
    obs = obs.unsqueeze(0)
    return -value_net(obs).squeeze(-1)

runner = PolicyRunner(
    # other parameters for calling PolicyRunner
    use_opt=True,
    opt_args={
        "opt_controller_type": "MPC",
        "num_pred_step": 5,
        "use_terminal_cost": True,
        "terminal_cost": terminal_cost,
        # other parameters for instantiating an OptController
    },
)
In this example, a state value network learned by the reinforcement learning algorithm is used as a terminal cost function.
A complete example of using PolicyRunner for simulation is as follows.
from gops.sys_simulator.call_terminal_cost import load_apprfunc
from gops.sys_simulator.sys_run import PolicyRunner

# Load the learned value function approximator
value_net = load_apprfunc("../results/INFADP/lqs4a2_poly", "115000_opt").v

# Define the terminal cost of the MPC controller
def terminal_cost(obs):
    # Add a batch dimension before calling the value network
    obs = obs.unsqueeze(0)
    return -value_net(obs).squeeze(-1)

runner = PolicyRunner(
    # Parameters for policies to be run
    log_policy_dir_list=[
        "../results/INFADP/lqs4a2_mlp",
        "../results/INFADP/lqs4a2_mlp",
        "../results/INFADP/lqs4a2_mlp",
        "../results/INFADP/lqs4a2_poly",
    ],
    trained_policy_iteration_list=[
        "4000",
        "5000",
        "6000",
        "115000_opt",
    ],
    # Save environment animation or not
    save_render=False,
    # Customize plot range
    plot_range=[0, 100],
    # Legends for each policy in figures
    legend_list=[
        "InfADP-4000-mlp",
        "InfADP-5000-mlp",
        "InfADP-6000-mlp",
        "InfADP-115000-poly",
    ],
    # Constrained environment or not
    constrained_env=False,
    # Tracking problem or not
    is_tracking=False,
    # Use adversarial action or not
    use_dist=False,
    # Parameter for time interval between steps
    dt=0.1,
    # Parameters for environment initial info
    is_init_info=True,
    init_info={"init_state": [0.5, 0.2, 0.5, 0.1]},
    # Parameters for optimal controller
    use_opt=True,
    opt_args={
        "opt_controller_type": "MPC",
        "num_pred_step": 5,
        "ctrl_interval": 1,
        "gamma": 0.99,
        "minimize_options": {
            "max_iter": 200,
            "tol": 1e-4,
            "acceptable_tol": 1e-2,
            "acceptable_iter": 10,
        },
        "use_terminal_cost": True,
        "terminal_cost": terminal_cost,
        "verbose": 0,
        "mode": "collocation",
    },
    # Parameters for observation noise
    obs_noise_type="normal",
    obs_noise_data=[[0.0] * 4, [0.1] * 4],
    # Parameters for action noise
    action_noise_type=None,
    action_noise_data=None,
)
runner.run()
Model Predictive Control (MPC) Module
The MPC method is implemented in GOPS to provide a baseline for comparing different algorithms. It offers the following features:
Nonlinear model predictive control
Support for environment models implemented in PyTorch
Support for two solving methods: direct collocation and direct shooting
User-assigned control interval for move blocking strategy
Support for passing user-defined functions as terminal cost (including neural networks)
Configurable discount factor, enabling a problem formulation unified with that of the RL community
Flexible optimization options for tuning
We adopt IPOPT (Interior Point Optimizer), an open source software package for large-scale nonlinear optimization, to solve the nonlinear programming problem constructed at each timestep.
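For orientation, the nonlinear program solved at each timestep can be sketched as a discounted finite-horizon optimal control problem. The notation below is assumed for illustration and does not come from the GOPS source: N stands for num_pred_step, \gamma for gamma, \ell for the stage cost, f for the environment model, V_f for the terminal cost (zero when none is used), and x(t) for the current state. Whether GOPS discounts the terminal term by \gamma^{N}, and how environment constraints enter, is not stated on this page, so this should be read as a generic formulation rather than the exact one implemented.

```latex
\begin{aligned}
\min_{u_0,\dots,u_{N-1}} \quad & \sum_{k=0}^{N-1} \gamma^{k}\, \ell(x_k, u_k) + \gamma^{N} V_f(x_N) \\
\text{s.t.} \quad & x_{k+1} = f(x_k, u_k), \quad k = 0,\dots,N-1, \\
& x_0 = x(t).
\end{aligned}
```

Only the first control input of the solution is applied before the problem is solved again from the next state, which is the usual receding-horizon scheme of MPC.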
This page illustrates the meaning of the parameters for instantiating a gops.sys_simulator.opt_controller.OptController class.
Parameters
model(gops.env.env_ocp.pyth_base_model.PythBaseModel): Model of the environment to work on.
num_pred_step(int): Total number of prediction steps, specifying how far to look into the future.
ctrl_interval(Optional[int]): Optimal control inputs are computed every ctrl_interval steps. For example, if num_pred_step equals 10 and ctrl_interval equals 2, control inputs are computed at timesteps 0, 2, 4, 6 and 8. Control inputs at the remaining timesteps are set in a zero-order hold manner. Default to 1.
Note
ctrl_interval should be a factor of num_pred_step.
gamma(Optional[float]): Discount factor. Valid range: [0, 1]. Default to 1.0.
use_terminal_cost(Optional[bool]): Whether to use a terminal cost. Default to False.
terminal_cost(Optional[Callable[[torch.Tensor], torch.Tensor]]): User-defined terminal cost function returning a Tensor of shape [] (scalar). If use_terminal_cost is True and terminal_cost is None, OptController uses the default terminal cost function of the environment model (if one exists). Default to None.
minimize_options(Optional[dict]): Options for the minimization, passed to IPOPT. See IPOPT Options for details. Default to None.
verbose(Optional[int]): Whether to print summary statistics. Valid value: {0, 1}. Default to 0.
mode(Optional[str]): Method used to solve the optimal control problem. Valid value: {"shooting", "collocation"}. Default to "collocation".
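The effect of ctrl_interval can be illustrated with a small standalone sketch of move blocking with a zero-order hold. The functions decision_steps and hold_controls are hypothetical names used only for this illustration and are not part of GOPS; they reproduce the example given for ctrl_interval (num_pred_step of 10 and ctrl_interval of 2) without claiming to match the internal implementation of OptController.

```python
from typing import List, Sequence


def decision_steps(num_pred_step: int, ctrl_interval: int = 1) -> List[int]:
    """Timesteps at which a new control input is computed (hypothetical helper)."""
    if ctrl_interval < 1 or num_pred_step % ctrl_interval != 0:
        raise ValueError("ctrl_interval should be a factor of num_pred_step")
    return list(range(0, num_pred_step, ctrl_interval))


def hold_controls(
    block_controls: Sequence[float],
    num_pred_step: int,
    ctrl_interval: int = 1,
) -> List[float]:
    """Expand one control per block to one control per timestep (zero-order hold)."""
    steps = decision_steps(num_pred_step, ctrl_interval)
    if len(block_controls) != len(steps):
        raise ValueError("expected one control value per decision step")
    # Each timestep k reuses the control computed at the start of its block.
    return [block_controls[k // ctrl_interval] for k in range(num_pred_step)]


if __name__ == "__main__":
    print(decision_steps(10, 2))  # [0, 2, 4, 6, 8]
    print(hold_controls([1.0, 2.0, 3.0, 4.0, 5.0], 10, 2))
```

With a larger ctrl_interval the optimizer has fewer decision variables over the same horizon, which is the purpose of a move blocking strategy.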
Usage
The MPC controller is meant to be used by a gops.sys_simulator.sys_run.PolicyRunner to provide an optimal baseline while simulating trained policies. See Simulation Configuration for details.