Glutton is an anytime MDP solver based on the RTDP algorithm that
solves planning scenarios expressed in PPDDL optimally if given enough
resources. PPDDL versions of IPPC-2011 problems differ subtly but
fundamentally from benchmarks of previous competitions; the number of
action outcomes in them is often comparable to the size of the
problem's state space. Since RTDP and other Value Iteration-based
approaches enumerate action outcomes explicitly during Bellman backup,
they fail to scale under these circumstances. Glutton addresses this
issue by sampling just a few outcomes of each action in a given state
(typically no more than 50) and performing Bellman backups using the
sampled outcomes only.
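The sketch below illustrates a sampled Bellman backup of this kind under simplifying assumptions: \texttt{reward\_fn} and \texttt{sample\_successor} are hypothetical hooks standing in for the problem's reward and transition model, the value function is a plain dictionary, and the horizon index is omitted.
\begin{verbatim}
def sampled_backup(state, actions, V, reward_fn, sample_successor,
                   num_samples=50):
    # Approximate each action's outcome distribution with a small set
    # of sampled successors instead of enumerating all outcomes.
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(num_samples):
            s_next = sample_successor(state, a)  # draw one outcome of a
            total += V.get(s_next, 0.0)          # default value for unseen states
        q = reward_fn(state, a) + total / num_samples
        best = max(best, q)
    return best
\end{verbatim}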
Moreover, we observe that the effects of the "noop" action, which
describes the so-called \emph{natural dynamics} of the given problem,
are compiled into all other actions in every IPPC-2011 problem. For
extra efficiency, Glutton separates the natural dynamics effects out
of all actions in a given state, samples these effects once, and
then re-combines the generated outcomes with samples of each action's
\emph{proper} effects. Thus, Glutton saves on resampling noop effects
for each action separately. However, sampling actions' effects in this
way on every visit to a state would still be too slow. To avoid this,
Glutton \emph{caches} the sampled action outcomes for many state-action
pairs. With its size adjusted automatically, the cache can devour
nearly all of the available RAM, giving Glutton both its name and speed
gains that can reach several orders of magnitude.
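The following sketch illustrates both ideas under simplifying assumptions; \texttt{sample\_noop}, \texttt{sample\_proper}, and \texttt{combine} are hypothetical hooks for the problem's compiled effect model, and the cache is a plain bounded dictionary rather than Glutton's automatically sized, RAM-filling structure.
\begin{verbatim}
def sample_outcomes(state, actions, sample_noop, sample_proper, combine,
                    num_samples=50):
    # Sample the natural-dynamics ("noop") effects once per state ...
    noop_samples = [sample_noop(state) for _ in range(num_samples)]
    outcomes = {}
    for a in actions:
        # ... and re-combine them with samples of the action's proper
        # effects, so the noop effects are never re-sampled per action.
        outcomes[a] = [combine(state, noop, sample_proper(state, a))
                       for noop in noop_samples]
    return outcomes

cache = {}           # maps (state, action) to its sampled outcomes
CACHE_LIMIT = 10**6  # illustrative bound; Glutton sizes the cache to RAM

def cached_outcomes(state, action, sampler):
    key = (state, action)
    if key not in cache:
        if len(cache) >= CACHE_LIMIT:
            cache.pop(next(iter(cache)))  # evict an arbitrary entry
        cache[key] = sampler(state, action)
    return cache[key]
\end{verbatim}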
To provide good anytime performance, Glutton runs RTDP in an iterative
deepening fashion. Given a problem with a finite horizon of H steps,
it first solves a 1-step version of the problem, then uses the
obtained solution to produce a 2-step version of the problem, and so
on as long as there is time remaining. Once time runs out, Glutton
considers the resulting policy and, for each action, an additional
candidate policy that consists of repeating that action at every step
until the horizon is reached. The motivation for the action-specific
candidate policies is the observation that in many finite-horizon
scenarios, good policies tend to be cyclic, i.e., they repeat the same
sequence of actions over and over again. Glutton's candidate policies
are the simplest examples of this class. Glutton compares the quality
of all these policies by sampling, chooses the one with the highest
estimated expected reward, and submits it to the server. On a
significant number of large problems, some of the primitive cyclic
policies are appreciably better than the one produced by RTDP under a
tight time constraint, providing Glutton with a reasonable fall-back
solution.
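A rough sketch of this anytime scheme, assuming hypothetical hooks \texttt{solve\_h\_step} (which runs RTDP on the h-step problem seeded with the previous solution) and \texttt{estimate\_reward} (which estimates a policy's expected reward by sampling trajectories):
\begin{verbatim}
import time

def glutton_anytime(H, actions, solve_h_step, estimate_reward, deadline):
    # Iterative deepening: solve 1-step, 2-step, ... versions of the
    # problem while time remains, seeding each with the previous solution.
    policy = None
    for h in range(1, H + 1):
        if time.time() >= deadline:
            break
        policy = solve_h_step(h, policy)

    # Candidate set: the RTDP policy plus one primitive cyclic policy
    # per action ("repeat this action at every step until the horizon").
    candidates = [] if policy is None else [policy]
    candidates += [(lambda s, t, a=a: a) for a in actions]

    # Compare the candidates by sampling and keep the best estimate.
    return max(candidates, key=estimate_reward)
\end{verbatim}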
A final component of Glutton's tactics is a mechanism that learns
which problem domains appear "hard", dynamically allocates more time
to the hard problems, and schedules the expected easiest instances to
be attempted first.
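A simplified illustration of such a schedule, using a hypothetical \texttt{estimate\_hardness} scoring hook; Glutton's actual mechanism is more involved:
\begin{verbatim}
def schedule(problems, estimate_hardness, total_time):
    # Attempt the expected easiest problems first ...
    ranked = sorted(problems, key=estimate_hardness)
    # ... and give problems that look harder a larger share of the time.
    weights = [1.0 + estimate_hardness(p) for p in ranked]
    budget = total_time / sum(weights)
    return [(p, w * budget) for p, w in zip(ranked, weights)]
\end{verbatim}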