Our solver is based on Partially Observable Monte-Carlo Planning
(POMCP), a Monte-Carlo algorithm for online planning that has been
shown to be especially effective in very large POMDP problems [Silver
and Veness, 2010]. Unlike many other online planners, POMCP does not
use the branch-and-bound approach which requires initial bounds from
an offline planner. Our motivation for using an online planner, and
POMCP in particular, is that we anticipated that, given the
time-constrained format of the competition, many of the problem instances
would be too large for an offline solver to produce a reasonable
solution or useful bounds in the time available for each
instance. POMCP makes use of a black-box simulator which, given a
state and action, generates a sample of the next state, observation
and reward. This is used to generate trajectories of states,
observations and rewards; from which a search tree of histories
(action-observation sequences) is build. As the search tree is
constructed, the set of sample states encountered by the simulator is
stored in each node of the search tree. To simulate each trajectory,
the start state is sampled from the current node, an action is chosen,
and the next state and observation are sampled from the simulator. The
action and observation bring us to the next node, and the process
repeats. The value of each node in the tree is the average return of
simulated trajectories from that node. The algorithm's approach for
balancing exploration and exploitation during search is to choose
actions using the UCB1 algorithm [Auer et al., 2002], which augments
the value of an action by an exploration bonus that is highest for
rarely tried actions.

In lieu of writing a parser, we opted to make
use of existing software. We used the parser module of the Symbolic
Perseus software [Poupart] to parse problem instances and its
simulation module as the black-box simulator required by the POMCP
algorithm.
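The simulation loop described above can be sketched as follows. This is a minimal illustration only, assuming a toy deterministic simulator and hypothetical names (`Node`, `toy_simulator`, `select_action`, `simulate`, `search`); it is not the competition code or Silver and Veness's implementation, and it omits refinements such as preferred actions, particle reinvigoration, and tree reuse:

```python
import math
import random

class Node:
    """A node of the history tree: visit count, mean return, children,
    and the sample states (particles) stored at this history."""
    def __init__(self):
        self.N = 0           # visit count
        self.V = 0.0         # mean return of simulations through this node
        self.children = {}   # keyed by action (at history nodes) or observation
        self.particles = []  # sample states encountered by the simulator

def toy_simulator(state, action):
    """Hypothetical black-box simulator G(s, a) -> (s', o, r)."""
    next_state = (state + action) % 3
    observation = next_state % 2       # partial observation of the state
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, observation, reward

def select_action(node, actions, c=1.0):
    """UCB1: value estimate plus an exploration bonus that is highest
    for rarely tried actions (untried actions score infinity)."""
    def score(a):
        child = node.children.get(a)
        if child is None or child.N == 0:
            return float("inf")
        return child.V + c * math.sqrt(math.log(node.N) / child.N)
    return max(actions, key=score)

def simulate(state, node, depth, actions, gamma=0.95):
    """Simulate one trajectory from `state`, growing the history tree
    and backing up the return along the visited nodes."""
    if depth == 0:
        return 0.0
    node.particles.append(state)           # store the sample state
    a = select_action(node, actions)
    s2, o, r = toy_simulator(state, a)
    anode = node.children.setdefault(a, Node())
    onode = anode.children.setdefault(o, Node())  # next history node
    ret = r + gamma * simulate(s2, onode, depth - 1, actions, gamma)
    node.N += 1
    anode.N += 1
    anode.V += (ret - anode.V) / anode.N   # running mean of returns
    return ret

def search(belief_particles, n_sims=200, depth=10, actions=(0, 1)):
    """Run simulations from states sampled from the current belief, then
    return the greedy action (no exploration bonus at the end)."""
    root = Node()
    for _ in range(n_sims):
        s = random.choice(belief_particles)  # sample start state from belief
        simulate(s, root, depth, actions)
    return max(actions,
               key=lambda a: root.children[a].V if a in root.children
               else -float("inf"))
```

In the toy problem, action 1 from state 2 reaches the rewarding state immediately, so the search should prefer it once the exploration bonus has been amortized over many simulations.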
[Silver and Veness, 2010] D. Silver and J. Veness, "Monte-Carlo
Planning in Large POMDPs", Neural Information Processing Systems
(NIPS), 2010.
[Auer et al., 2002] P. Auer, N. Cesa-Bianchi, and P. Fischer,
"Finite-time analysis of the multi-armed bandit problem", Machine
Learning, 47(2-3):235-256, 2002.
[Poupart]
http://www.cs.uwaterloo.ca/~ppoupart/software.html#symbolic-perseus