
State Similarity Based Approach for Improving Performance in RL

posted Sep 2, 2008, 10:33 AM by Brian Tanner [updated Sep 2, 2008, 10:54 AM]

Sertan Girgin, Faruk Polat and Reda Alhajj


In most realistic and complex domains, the task that a reinforcement learning agent tries to solve contains various subtasks, each of which repeats many times in different regions of the state space. Although the solutions of instances of the same or similar subtasks are almost identical, without any guidance an agent has to learn the related sub-policies independently of each other; this causes the agent to pass through similar learning stages again and again, making it harder to converge to optimal behavior in a reasonable time. The main reason for this problem is the lack of connections that would allow solution sharing. Since states with similar patterns of behavior constitute the regions of the state space corresponding to different instances of similar subtasks, the notion of state equivalence can be used as a mechanism to facilitate the sharing of solutions. By reflecting experience acquired in one state onto all similar states, connections between similar subtasks can be established implicitly; this reduces repetition in learning and consequently improves performance.

In this paper, based on the (partial) MDP homomorphism notion of Ravindran and Barto, we propose a method to identify states with similar sub-policies without requiring a model of the MDP or predefined equivalence relations, and show how it can be integrated into the reinforcement learning framework. As learning progresses, potentially useful policy fragments are generated from the observed history of events and stored in a labeled tree. A metric based on the number of common action-reward sequences is devised to measure the similarity between two states, and the tree is used to apply this metric and identify states with similar sub-policy behavior. Eligibility requirements are imposed on the structure of the tree to keep its size manageable and to focus on recent and frequently used sub-policies. Updates to the value function of a state are then reflected onto all similar states, expanding the influence of new experiences. The proposed method can be treated as a meta-heuristic and applied on top of existing reinforcement learning algorithms. We demonstrate the effectiveness of the approach by reporting test results on sample problems.
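To make the sharing step concrete, the sketch below illustrates the general idea in Python: per-state action-reward sequences kept in a tree-like store, a similarity score based on the number of shared sequences, and a Q-learning update that is mirrored onto all sufficiently similar states. The names (`SequenceTree`, `similar_states`, the threshold and the normalization) are illustrative assumptions, not the authors' actual data structure, which additionally enforces the eligibility requirements described in the abstract.

```python
# Minimal sketch only; class/parameter names and the normalization are assumptions.
from collections import defaultdict


class SequenceTree:
    """Stores observed action-reward sequences per state (stand-in for the labeled tree)."""

    def __init__(self):
        # Maps each state to the set of action-reward sequences observed from it.
        self.sequences = defaultdict(set)

    def add(self, state, action_reward_seq):
        # A sequence is a tuple of (action, reward) pairs observed after `state`.
        self.sequences[state].add(tuple(action_reward_seq))

    def similarity(self, s1, s2):
        # Similarity metric: number of common action-reward sequences,
        # normalized by the smaller sequence set (one plausible choice).
        common = self.sequences.get(s1, set()) & self.sequences.get(s2, set())
        denom = min(len(self.sequences.get(s1, ())), len(self.sequences.get(s2, ()))) or 1
        return len(common) / denom

    def similar_states(self, state, threshold=0.8):
        # States whose observed sub-policy behavior overlaps sufficiently with `state`.
        return [s for s in list(self.sequences)
                if s != state and self.similarity(state, s) >= threshold]


def q_update_with_sharing(Q, tree, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Standard Q-learning update whose change is also reflected onto similar states.

    Q is assumed to be a nested mapping, e.g. defaultdict(lambda: defaultdict(float)).
    """
    td_target = r + gamma * max(Q[s_next].values(), default=0.0)
    delta = alpha * (td_target - Q[s][a])
    Q[s][a] += delta
    # Reflect the new experience onto states with similar sub-policy behavior.
    for s_sim in tree.similar_states(s):
        Q[s_sim][a] += delta
```

In this reading, the sharing step is orthogonal to the underlying learner, which is why the method can sit on top of existing reinforcement learning algorithms as a meta-heuristic.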