Reinforcement learning (RL) provides a framework and a set of tools for designing complex, hard-to-engineer behaviors in robotics. There is a strong relationship between these two disciplines: the use of reinforcement learning algorithms to learn behaviors on robots highlights the close links between the two research areas, which I attempt to describe in this paper.
Reinforcement learning defines a family of algorithms by which a variety of problems in robotics may be solved. These algorithms enable a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. Instead of explicitly hard-coding the solution to a problem, the designer of a control task in reinforcement learning provides feedback in the form of a scalar objective function that rates the single-step performance of the robot.
Consider the variety of robots that have learned solutions to tasks using reinforcement learning, and in particular the case of training a robot to return a table tennis ball over the net. Here the robot can observe dynamic variables describing the ball's position and velocity, and the full dynamics of these coordinates combined. Such observations capture the state of the system, giving a complete statistic for forecasting future observations.
The actions available to the robot might be the torques sent to its motors, or the desired accelerations sent to a dynamics control system. Let 'π' denote the policy: a function that generates the motor commands (actions) based on the current internal observations, for example of the incoming ball. The reinforcement learning problem is to determine a policy that optimizes the long-term sum of rewards 'R'('s', 'a'); a reinforcement learning algorithm is one designed to find such a (near-)optimal policy.
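As a minimal sketch of these two ingredients (all names, gains, and thresholds here are hypothetical illustrations, not a real table tennis controller), a deterministic policy can be written as a function from the observed ball state to motor commands, and the reward 'R'('s', 'a') as a scalar-valued function of state and action:

```python
import numpy as np

def policy(ball_position, ball_velocity):
    """Hypothetical deterministic policy pi: maps the observed ball state
    to motor commands. The target point and gains are purely illustrative."""
    target = np.array([0.0, 0.0, 0.76])          # assumed interception point
    error = target - np.asarray(ball_position)   # positional error
    # PD-style command: attract toward the target, damp the ball's motion
    return 2.0 * error - 0.5 * np.asarray(ball_velocity)

def reward(state, action):
    """Hypothetical scalar reward R(s, a): penalize energy use, reward a hit."""
    energy_cost = 0.01 * float(np.sum(np.square(action)))
    hit_bonus = 1.0 if state.get("ball_returned", False) else 0.0
    return hit_bonus - energy_cost
```

A learning algorithm would then adjust the parameters inside `policy` so that the sum of such rewards over an episode is maximized.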
The function we want to optimize, called the reward function, could in this example be based on energy consumption (if energy efficiency matters) or on success in hitting the ball. Reinforcement learning is an area of Machine Learning (ML) in which an agent (an artificial intelligence system) receives feedback about its choices and uses it to explore the space of possible strategies and actions; from this information the reinforcement learning algorithm derives an optimal policy or strategy. Reinforcement learning differs from other areas of machine learning in two respects that help relate it to techniques widely used within robotics: the complexity of sequential interaction and the complexity of the reward structure. Consider problems such as binary classification, supervised learning, cost-sensitive learning, and structured prediction. Table 1 presents these problems with regard to interactive and reward complexity.
Table 1
Learning problems with their complexity in interactive and reward aspects

| Technique name          | Interactive complexity | Reward complexity |
|-------------------------|------------------------|-------------------|
| binary classification   | none                   | none              |
| supervised learning     | none                   | middle            |
| imitation learning      | middle                 | middle            |
| cost-sensitive learning | low                    | high              |
| structured prediction   | middle                 | high              |
| RL                      | the highest            | the highest       |
It is important to understand the diversity of these learning problems and the relations between them; Table 1 depicts the relationships between the areas of machine learning relevant to robotics. Reinforcement learning subsumes much of the scope of classical machine learning, including the structured prediction problems and the other problems listed in Table 1. One of the main techniques in machine learning, reduction algorithms, converts efficient solutions of one class of problems into efficient solutions of another. In supervised learning, which covers binary classification and regression, the learner's goal is to map observations to actions, usually a discrete set of classes (for classification) or real values (for regression). There is no interactive component, unless paradigms such as online learning are considered. Supervised learning therefore ignores the decisions made by the learner: its algorithms operate in a space of actions in which each decision has no effect on future events. Action choices in supervised learning are also unambiguous, since during the training phase the algorithm receives the labels of the correct answers. In cost-sensitive learning, by contrast, the reward structure is more sophisticated: every training sample and every action/prediction is labeled with a numeric cost for making that prediction.
Associative reinforcement learning problems add the fundamental dilemma of exploration versus exploitation: feedback is available only for the action actually chosen. These problems find widespread application in settings like next-video suggestion (as done at Netflix or YouTube) or friend suggestion (as at Facebook).
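The exploration-versus-exploitation dilemma can be sketched with the classical ε-greedy bandit strategy (the arm means, noise level, and ε below are illustrative): the learner observes only the reward of the arm it actually pulled, so it must occasionally explore to avoid committing to a bad arm forever.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: with probability epsilon explore a random arm,
    otherwise exploit the arm with the highest estimated mean reward.
    Only the chosen arm's reward is observed, as in associative RL."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = true_means[arm] + rng.gauss(0.0, 0.1)            # noisy feedback
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm] # running mean
    return estimates
```

After enough interactions the estimate of the best arm dominates, even though the learner never saw the rewards of the arms it did not choose.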
Structured prediction, one of the most important techniques in computer vision and robotics, may be seen as a simplified variant of imitation learning, in which predictions are made jointly so as to exploit the inter-relations between them. The concept of imitation learning is that an expert whom we are trying to imitate provides demonstrations of a task. The complexity in this type of learning arises because any error by the learner changes the future observations away from those seen so far, since it is the agent itself that has chosen the controls. Such problems demonstrably lead to compounding errors and violate the basic assumption of independent samples on which supervised learning relies.
Reinforcement learning covers all of these traits: interactivity, sequential prediction as in imitation learning, and complex reward structures with feedback only on the chosen actions. This combination is what allows many robotics problems to be formulated in reinforcement learning terms, but it is also what makes these problems computationally hard. A variant of the standard reinforcement learning problem gives the learner the additional advantage of drawing initial states from a distribution provided by an expert; it is termed 'baseline distribution reinforcement learning', and this additional knowledge of favorable initial states dramatically affects the learning complexity.
In the scope of optimal control, problems such as dynamic programming, stochastic programming, classical optimal control theory, stochastic search, simulation-optimization, and optimal stopping are closely related to reinforcement learning. Both optimal control and reinforcement learning address the problem of finding a policy (a controller) that optimizes an objective function, be it a reward or an aggregated cost, and both rely on the notion of a system described by an underlying set of states.
Crucially, optimal control assumes complete knowledge of the system's description in the form of a model; that is, we should know which state the agent will be in next. Optimal control thus offers strong guarantees which, however, often do not hold due to model and computational approximations. In stark contrast, reinforcement learning operates directly on measured information and on rewards obtained from interaction with the environment.
Reinforcement learning research has placed great focus on solving, with data-driven techniques and approximations, problems that are analytically intractable. One of the most substantial approaches to reinforcement learning is the application of classical optimal control techniques (such as differential dynamic programming and linear-quadratic regulation) to system models learned through repeated interaction with the environment. In this sense, reinforcement learning can be viewed as adaptive optimal control.
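The adaptive optimal control view can be sketched in two steps (a simplified illustration, not any specific published method): fit a linear model to observed transitions by least squares, then run a finite-horizon LQR backward recursion on the learned model.

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of x' ~ A x + B u from interaction data."""
    X = np.hstack([states, actions])                    # (N, n + m)
    theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    n = states.shape[1]
    return theta[:n].T, theta[n:].T                     # A (n, n), B (n, m)

def lqr_gains(A, B, Q, R, horizon=50):
    """Finite-horizon LQR backward (Riccati) recursion on the learned model.
    Returns the feedback gain K so that u = -K x."""
    P = Q.copy()
    K = None
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K
```

With a double-integrator toy system, the gains computed from the fitted model stabilize the closed loop; in a real robot-learning setting the same idea is applied after each batch of interaction data, refitting the model as experience accumulates.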
Robotics as a reinforcement learning domain differs greatly from most reinforcement learning benchmark problems. Problems in robotics are commonly represented with high-dimensional, continuous states and actions. It is usually unrealistic in robotics to assume that the true state is completely observable and noise-free: the learning agent will not know for certain which state it is in, as even vastly different states may look extremely similar. Robotics reinforcement learning problems are therefore frequently modeled as partially observed, and it is often vital to maintain an information state of the environment that contains not only the raw observations but also a notion of the uncertainty of its estimates. Experience on a real physical system is expensive and tedious to obtain, and usually hard to reproduce; for the robot table tennis system, for example, it is impossible even to return to the same initial state. Each trial run is costly, and such applications therefore force us to concentrate on difficulties that do not arise as frequently in classical reinforcement learning benchmark examples.
Real-world experience must be used despite its high cost, because it usually cannot be replaced by simple simulations. For highly dynamic tasks, but also for others, even small modeling errors in analytical or learned models of the system can accumulate into substantially different behavior; even small errors matter. Algorithms therefore need to be robust both to models that do not capture all the details of the real system (also referred to as undermodeling) and to model uncertainty.
The design of suitable reward functions is one more challenge commonly faced in robot reinforcement learning. To cope with the cost of real-world experience, we need rewards that guide the learning system swiftly to success. This problem is called 'reward shaping' and represents a significant manual contribution: specifying good reward functions in robotics requires plenty of domain knowledge and may often be hard in practice. Nor is every reinforcement learning method equally suitable for the robotics domain: rather than being value function-based, many of the methods demonstrated thus far on difficult problems have been model-based, and robot learning systems often employ policy search methods.
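One standard way to make reward shaping concrete is potential-based shaping, where a progress term (here, the negative distance to the goal) is added to a sparse success reward; the structure below is the classical form, while the goal, tolerance, and constants are illustrative.

```python
import math

def sparse_reward(state, goal, tolerance=0.05):
    """Sparse reward: success only when the goal is reached."""
    return 1.0 if math.dist(state, goal) < tolerance else 0.0

def shaped_reward(state, next_state, goal, gamma=0.99):
    """Potential-based shaping: add F = gamma * phi(s') - phi(s), with
    phi(s) = -distance(s, goal). This guides learning toward the goal
    while preserving which policies are optimal."""
    phi = lambda s: -math.dist(s, goal)
    return sparse_reward(next_state, goal) + gamma * phi(next_state) - phi(state)
```

A step toward the goal now receives a higher reward than a step away from it, long before the sparse success signal is ever triggered, which is exactly the guidance that cuts down on expensive real-world trials.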
Reinforcement learning algorithms aim to maximize the accumulated reward over the agent's lifetime. In an episodic setting, where the task is restarted after each episode ends, the aim is to maximize the total reward per episode. If the task is continuous, without an apparent beginning and end, either the average reward over the entire lifetime or a discounted return can be optimized. In such reinforcement learning problems, the agent and its environment are modeled so that the agent can perform actions 'a' ∈ 'A' and be in states 's' ∈ 'S', each of which may be multi-dimensional and belong to either discrete or continuous sets. A state 's' contains all the information about the current situation needed to predict future states (also called observables); an example is the current position of a robot in a navigation task. An action 'a' is used to control (or alter) the system's state; in the navigation task, actions might correspond to the torques applied to the wheels. At every step the agent receives a reward 'R', a scalar-valued function of the state and action. In the navigation task, a reward could be defined on the basis of the energy costs for the actions taken and rewards for reaching goals.
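The episodic and discounted objectives just mentioned can be written down directly (a minimal illustration over a given reward sequence):

```python
def episodic_return(rewards):
    """Episodic objective: the total reward accumulated in one episode."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """Discounted objective for continuing tasks: sum of gamma**t * r_t,
    where 0 <= gamma < 1 keeps the infinite sum finite."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

The discount factor gamma trades off immediate against future rewards; as gamma approaches 1, the discounted objective weights the distant future almost as heavily as the present.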
The goal of reinforcement learning is to find a mapping from states 'S' to actions 'A', called the policy 'π', that picks actions 'a' in given states 's' in such a way that the cumulative expected reward is maximized. The policy 'π' is either deterministic or probabilistic. The former always uses the exact same action for a given state, in the form 'a' = 'π'('s'); the latter draws an action from a distribution over actions when it arrives at a state: 'a' ∼ 'π'('s', 'a') = P('a'|'s'). The reinforcement learning agent needs to discover the relations between states, actions, and rewards; hence exploration is required. This exploration can either be directly included in the policy or performed separately, only as part of the learning routine.
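The two policy types can be contrasted in a few lines (the one-dimensional state and the two actions here are toy placeholders): the deterministic policy is a plain function of the state, while the stochastic policy samples from a state-dependent distribution and thereby carries its exploration within itself.

```python
import random

def deterministic_policy(state):
    """a = pi(s): always the same action for a given state."""
    return "left" if state < 0 else "right"

def stochastic_policy(state, rng=random):
    """a ~ pi(a|s): draws an action from a state-dependent distribution,
    which also provides a built-in form of exploration."""
    p_right = 0.9 if state >= 0 else 0.1   # illustrative probabilities
    return "right" if rng.random() < p_right else "left"
```

With the deterministic policy, exploration must be added externally (for example by perturbing the chosen action); the stochastic policy explores on its own by occasionally sampling the less likely action.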
Classical reinforcement learning approaches are based on the assumption of a Markov Decision Process, consisting of the set of states 'S', the set of actions 'A', the rewards 'R', and transition probabilities 'T' that capture the dynamics of the system. Transition probabilities 'T'('s'_{new}, 'a', 's') = P('s'_{new}|'s', 'a') describe the effects of the actions on the state. They generalize the notion of deterministic dynamics to allow the modeling of stochastic outcomes. The Markov property demands that the next state 's'_{new} and the reward depend only on the previous state 's' and action 'a', without regard to any additional information about past states and actions. The Markov property recapitulates the idea of state: a state is a sufficient statistic for predicting the future, rendering previous observations irrelevant. In robotics, however, we can often provide only some approximate notion of state.
Different types of reward functions are commonly used, including rewards depending on the current state and action, 'R' = 'R'('s', 'a'); rewards depending only on the current state, 'R' = 'R'('s'); and rewards including the transition, 'R' = 'R'('s'_{new}, 'a', 's'). Most of the theoretical guarantees hold only if the problem adheres to a Markov structure; in practice, however, many approaches work well for problems that do not satisfy this requirement.
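On a small finite MDP with known 'T' and 'R', this structure supports classical solution methods such as value iteration, sketched below on a toy two-state problem (the transition probabilities and rewards are invented for illustration).

```python
def value_iteration(n_states, actions, T, R, gamma=0.9, tol=1e-8):
    """Value iteration on a finite MDP.
    T[s][a] is a list of (s_new, probability) pairs; R(s, a) is the reward."""
    V = [0.0] * n_states
    while True:
        V_new = [
            max(
                R(s, a) + gamma * sum(p * V[s2] for s2, p in T[s][a])
                for a in actions
            )
            for s in range(n_states)
        ]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new

# Toy two-state MDP: action 1 moves (stochastically) toward the
# rewarding state 1; action 0 stays put.
T = {
    0: {0: [(0, 1.0)], 1: [(1, 0.8), (0, 0.2)]},
    1: {0: [(1, 1.0)], 1: [(1, 1.0)]},
}
R = lambda s, a: 1.0 if s == 1 else 0.0
V = value_iteration(2, [0, 1], T, R, gamma=0.9)
```

The Markov property is exactly what makes this backup valid: the expectation over 's'_{new} needs nothing beyond the current 's' and 'a'.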