Machine learning techniques usually assume that the training and test data are drawn from the same statistical distribution. This assumption, however, breaks down in many real-life situations, particularly when an adversary may supply data that violates it, as in adversarial games such as Go and Chess.
DeepMind’s AlphaGo and its successors demonstrated that a learned policy and heuristic function, built on the PUCT (Polynomial Upper Confidence Trees) search algorithm, can be quite effective for guiding search in adversarial games. However, PUCT is computationally inefficient and offers no guarantees on its search effort. Other methods, such as LevinTS, provide guarantees on the number of search steps but do not use a heuristic function.
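To make the selection rule concrete, here is a minimal sketch of the PUCT child-selection score used in AlphaGo-style search: a value estimate Q plus a prior-weighted exploration bonus that shrinks as a child accumulates visits. The exploration constant `c_puct` and the dictionary-based node representation are illustrative assumptions, not details from the source.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """PUCT score: exploitation term (Q) plus an exploration bonus
    weighted by the policy prior and decaying with child visits."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration

def select_child(children):
    """Pick the child with the highest PUCT score.
    Each child is a dict with keys 'q', 'prior', and 'visits'."""
    parent_visits = sum(ch["visits"] for ch in children) or 1
    return max(children, key=lambda ch: puct_score(
        ch["q"], ch["prior"], parent_visits, ch["visits"]))
```

Note how an unvisited child with a strong prior can outrank a mediocre but well-explored one; this prior-driven exploration is what a trained policy network contributes to the search.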
A team of researchers from DeepMind and the University of Alberta addresses these deficiencies. They propose Policy-guided Heuristic Search (PHS), a new search algorithm that uses both a heuristic function and a policy, while also offering guarantees on the search loss that relate to the quality of both.
The PUCT search algorithm is designed to ensure that the value function of the model converges to the true value function in the limit of exploration. However, this guarantee no longer holds when actual rewards are replaced with estimated values. Moreover, PUCT's generality makes it a questionable fit for complex deterministic single-agent problems.
More recently, A* combined with a learned heuristic function has tended to work better than PUCT, trading off solution quality for search time. However, A* is designed to find minimum-cost solutions, an objective that can conflict with minimizing the search loss.
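This trade-off is often made explicit with Weighted A*, which scales the heuristic term in A*'s evaluation function, f(n) = g(n) + w·h(n): with w = 1 this is plain A*, while w > 1 biases the search toward the goal at the expense of solution quality. The sketch below is a generic implementation under an assumed adjacency-list graph interface, not code from the paper.

```python
import heapq

def weighted_astar(start, goal, neighbors, h, w=1.5):
    """Weighted A*: best-first search ordered by f(n) = g(n) + w*h(n).
    neighbors(n) yields (successor, edge_cost) pairs; h is the heuristic.
    Returns (path, cost) or (None, inf) if no path exists."""
    frontier = [(w * h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, cost in neighbors(node):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + w * h(nxt), ng, nxt, path + [nxt]))
    return None, float("inf")
```

With an admissible heuristic and w = 1 the returned path is cost-optimal; larger weights typically expand fewer nodes but may return longer paths.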
Levin Tree Search (LevinTS) is another method that uses a learned policy to guide its search, and it bounds the number of search steps as a function of the quality of that policy. Still, it cannot take advantage of a learned heuristic function.
The PHS algorithm proposed by the researchers combines LevinTS’s policy-guided search with the heuristic-guided search of A* in a novel algorithm. PHS is designed so that, given a set of unknown tasks, it solves all of them as quickly as possible. This differs from traditional search algorithms, which typically seek minimum-cost paths for each task; PHS instead minimizes the total search loss, and therefore does not require its solutions to be path-cost optimal.
PHS generalizes LevinTS by introducing a heuristic factor that is greater than or equal to one, and by allowing an arbitrary non-negative loss rather than fixing the loss to one. The researchers show that for any non-negative loss function, any given policy, and any given heuristic factor, PHS returns a solution node. They also prove an upper bound on PHS's search loss and demonstrate that an accurate heuristic can significantly reduce the number of search steps.
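A minimal sketch of this idea follows, assuming a best-first search ordered by a cost of the form ϕ(n) = η(n)·g(n)/π(n), where g(n) is the accumulated non-negative loss, π(n) is the product of policy probabilities along the path to n, and η(n) ≥ 1 is the heuristic factor. The toy graph interface and the duplicate-pruning set are illustrative assumptions, not the paper's implementation.

```python
import heapq

def phs(start, goal, successors, eta):
    """Policy-guided Heuristic Search sketch: expand nodes in order of
    phi(n) = eta(n) * g(n) / pi(n).
    successors(n) yields (child, loss, policy_prob) triples; eta(n) >= 1
    is the heuristic factor. With eta == 1 and unit losses, the ordering
    reduces to a LevinTS-style cost of depth over policy probability."""
    # Frontier entries: (phi, g, pi, node, path). The root has zero loss
    # and probability one, so its cost is zero.
    frontier = [(0.0, 0.0, 1.0, start, [start])]
    seen = set()  # simple duplicate pruning (illustrative; PHS searches a tree)
    while frontier:
        phi, g, pi, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for child, loss, prob in successors(node):
            ng, npi = g + loss, pi * prob
            heapq.heappush(frontier, (eta(child) * ng / npi, ng, npi,
                                      child, path + [child]))
    return None
```

The division by π(n) is what makes the search policy-guided: paths the policy considers likely get low cost and are expanded first, while the factor η(n) lets a heuristic further postpone unpromising nodes.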
The researchers compared their PHS algorithm against the policy-guided and heuristic search algorithms PUCT, LevinTS, A*, Weighted A* (WA*), and Greedy Best-First Search (GBFS). They evaluated the algorithms on the 5×5 sliding-tile puzzle, Sokoban (Boxoban), and a puzzle from the 2016 video game The Witness.
They found that the PHSh and PHS* variants obtained the best results on Sokoban, while on the sliding-tile puzzle WA* and PHS* clearly outperformed all other algorithms. The study shows that PHS enables rapid learning of both a policy and a heuristic function and performs well across all three test domains in terms of the number of problems solved and the search time required.