A More Expressive Behavioral Logic for Decision-Theoretic Planning


Charles Gretton


Overview

Decision-Theoretic Planning with Non-Markovian Rewards

  • (Non-)Markovian Reward Decision Processes
  • Past and Future-Looking Linear Temporal Logics
  • Expressing Rewards using Linear Temporal Logics
  • Non-Markovian to Markovian via Equivalence
  • Inference for Processes with Non-Markovian Rewards
  • Expressivity of Linear Temporal Logics
  • A More Expressive Forward-Looking Language
\( \newcommand{\until}{U} \newcommand{\always}{\Box} \newcommand{\eventually}{\Diamond} \newcommand{\alphabet}{\Sigma} \newcommand{\nxt}{{\scriptstyle \bigcirc}} \newcommand{\reward}{R} \newcommand{\dollar}{\$} \newcommand{\since}{S} \newcommand{\last}{\circleddash} \newcommand{\once}{\blacklozenge} \newcommand{\alwaysPast}{\boxminus} \newcommand{\xfltl}{\$FLTL} \newcommand{\mxfltl}{\$^*FLTL} \newcommand{\pltl}{\mbox{PLTL}} \newcommand{\implies}{\rightarrow} \newcommand{\xor}{\oplus} \newcommand{\true}{\mbox{$\top$}} \newcommand{\false}{\mbox{$\bot$}} \newcommand{\modelsb}[1]{\models_{\scriptsize #1}} \newcommand{\notmodelsb}[1]{\not\models_{\scriptsize #1}} \newcommand{\fut}{{\bf U}} \newcommand{\fbx}{\square} \newcommand{\fdi}{\diamond} \newcommand{\pbx}{\blacksquare} \newcommand{\pdi}{\blacklozenge} \newcommand{\snc}{\bf S} \newcommand{\prv}{\circleddash} \)

A Decision Process

  • Finite set of states \(S\) with start state \(s_0\) and actions \(A\)
  • States are characterised by propositions, e.g. heads, holdingBall
  • Actions induce stochastic state transitions
  • \(S^*\) is the set of finite sequences of states over \(S\)
  • Real-valued reward function:
  • MDP \( R:S \to \Re \)
    NMRDP \(R:S^* \to \Re\)
  • A solution is \(\pi\):
  • MDP \( \pi :S \to A \)
    NMRDP \( \pi :S^* \to A\)
  • Executing \(\pi\) yields a sequence of states \(\Gamma \)
  • By \(\Gamma_i\) we mean the state of index \(i\) in \(\Gamma\)
  • By \(\Gamma(i)\) we mean the prefix \(\langle \Gamma_0, \ldots, \Gamma_i \rangle\) of \(\Gamma\)

Value of \(\pi\)

  • Value of \(\pi\) is expected discounted cumulative reward:
  • In the case of an MDP we have:
  • \( V(\pi) = \lim_{n\rightarrow\infty} E \bigg[ \sum_{i=0}^{n} \beta^{i} R(\Gamma_i) \mid \pi, \Gamma_{0} = s_{0} \bigg] \)
  • In the case of an NMRDP we have:
  • \(V(\pi) = \lim_{n\rightarrow\infty} E \bigg[ \sum_{i=0}^{n} \beta^{i} R(\Gamma(i)) \mid \pi, \Gamma_{0} = s_{0} \bigg]\)
  • Policy \(\overline\pi\) is optimal if, for all \(\pi\), we have that \(V(\overline\pi) \ge V(\pi)\)
  • There is an optimal MDP policy that is stationary
  • There may be no optimal NMRDP policy that is stationary
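A minimal Python sketch (an illustration of the definitions above, not part of the original slides; all names are mine) of estimating \(V(\pi)\) by simulation, truncating the discounted sum at a finite horizon; the only NMRDP-specific point is that reward and policy see the whole prefix \(\Gamma(i)\), not just \(\Gamma_i\):

def estimate_value(step, reward, policy, s0, beta=0.95, horizon=200, runs=1000):
    """Monte-Carlo estimate of V(pi) = E[sum_i beta^i R(Gamma(i))].

    step(s, a)     -- samples a successor of state s under action a
    reward(prefix) -- R(Gamma(i)); an MDP reward ignores all but prefix[-1]
    policy(prefix) -- pi(Gamma(i)); a stationary policy ignores all but prefix[-1]
    """
    total = 0.0
    for _ in range(runs):
        prefix = [s0]  # Gamma(0) = <s0>
        for i in range(horizon):
            total += (beta ** i) * reward(prefix)      # beta^i * R(Gamma(i))
            prefix.append(step(prefix[-1], policy(prefix)))
    return total / runs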

Specifying Rewarding Behaviours

  • Rewarding NMRDP behaviours are described using:

\(\pltl\) -- Past-Looking Linear Temporal Logic

\(\xfltl\) -- Fragment of Forward-Looking Linear Temporal Logic with One Uninterpreted Operator

\(\mxfltl\) -- New Forward-Looking Linear Temporal Logic with Multiple Uninterpreted Operators

  • \(R:S^* \to \Re\) is described finitely via a list of \(\langle \)formula\(, \)reward\(\rangle \) pairs

\(\pltl\): A Tool for Specifying
Rewarding Behaviours

  • \(\pltl\) is propositional logic with:
    • \(\last\), pronounced "previously"
      \( \Gamma(i) \models \last f \) IFF \( i > 0 \) and \( \Gamma(i-1) \models f \)
    • \(\since\), pronounced "since"
      \( \Gamma(i) \models f \since g\) IFF \( \exists j \leq i, \Gamma(j) \models g \) and \( \forall k, j< k \leq i, \Gamma(k) \models f \)
  • Operators \(\pbx\) and \(\pdi\) are syntactic sugar:
    \( \pdi f =_{df} \true \since f\), pronounced "sometime in the past"
    \( \pbx f=_{df} \neg\pdi\neg f \), pronounced "always in the past"
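A minimal Python sketch of these semantics (my illustration, not from the slides; the tuple encoding of formulas is an assumption), with states as sets of propositions:

def holds(prefix, i, f):
    """Gamma(i) |= f for PLTL; prefix[0..i] are sets of propositions.

    Formulas: ('true',), ('prop', p), ('not', f), ('and', f, g),
    ('prev', f) for "previously", and ('since', f, g) for "since"."""
    op = f[0]
    if op == 'true':
        return True
    if op == 'prop':
        return f[1] in prefix[i]
    if op == 'not':
        return not holds(prefix, i, f[1])
    if op == 'and':
        return holds(prefix, i, f[1]) and holds(prefix, i, f[2])
    if op == 'prev':    # i > 0 and Gamma(i-1) |= f
        return i > 0 and holds(prefix, i - 1, f[1])
    if op == 'since':   # some j <= i satisfies g, and f holds for j < k <= i
        return any(holds(prefix, j, f[2]) and
                   all(holds(prefix, k, f[1]) for k in range(j + 1, i + 1))
                   for j in range(i + 1))
    raise ValueError(op)

# "sometime in the past p" is sugar, true since p:
once_p = ('since', ('true',), ('prop', 'p'))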

\(\xfltl\) for Behaviours

  • Writing \(B\) for a set of sequences of states, i.e. \(B\subseteq S^*\)
  • \(\xfltl\) is propositional logic where:
    • Symbol \(\$\) is a 0-ary operator -- "get reward now"
      \( (\Gamma, i) \modelsb{B} \$ \) IFF \( \Gamma(i) \in B\)
    • Unary operator \(\nxt\) means "in the next state"
      \( (\Gamma, i) \modelsb{B} \nxt f \) IFF \( (\Gamma, i + 1) \modelsb{B} f \)
    • \(\fut\), pronounced "(weak) until"
      \( (\Gamma, i) \modelsb{B} f \fut g\) IFF \( \forall k \geq i \),
      IF \( \forall j, i \leq j \leq k,\; (\Gamma, j) \notmodelsb{B} g \) THEN \( (\Gamma, k) \modelsb{B} f \)
  • Operator \(\fbx\) is syntactic sugar:
    \( \fbx f =_{df} f \fut \bot \), pronounced "always"

\(\xfltl\) Behavior Specification

  • \(\xfltl\) consequence relation indexed with \(B\)
  • By formula \(f\) we intend behaviour \(B_f\)
  • \(B_f\) is the minimal behaviour for which \(f\) is valid:
  • \(B_f \;=\; \bigcap\{ B \;|\; \modelsb{B} f \} \)
  • Online procedures require a deterministic test \(\Gamma(i) \in B_f\)
    • Without examining \(\Gamma(i+j)\), \(j>0\)
  • Some exclusions include
    • \(\$ \lor \nxt \$ \) -- i.e. non-deterministic
    • \(\nxt a \to \$ \) -- i.e. requires we glimpse the future
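A worked example, consistent with the definition of \(B_f\) above (not from the slides): for \(f = \fbx(p \implies \$)\), every admissible \(B\) must contain each prefix ending in a \(p\)-state, and the minimal such set is

\( B_f = \{\, \Gamma(i) \;|\; p \in \Gamma_i \,\} \)

so \(f\) specifies "reward every state in which \(p\) holds".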

\(\xfltl\) Fragment in Practice

  • In practice we use a "no funny business" fragment
  • \(F ::= \$ \;|\; \nxt F \;|\; M \lor F \;|\; F \fut M \;|\; F \land F \)
  • Where \(M\) is modality free propositional formula
  • Operator \(\$\) cannot be negated
  • To allocate reward \(\langle f, r \rangle\) at \(\Gamma_i\) we use a special formula progression, defined next

\(\xfltl\) Progression

  • Progression has a special first argument:
    • Are we allocating reward?

prg\(_\$(\top,\Gamma_i,\$)\) \(=_{df}\) \(\top\)
prg\(_\$(\bot,\Gamma_i,\$)\) \(=_{df}\) \(\bot\)
prg\(_\$(b,\Gamma_i,p)\) \(=_{df}\) \(\top\) if \(p \in \Gamma_i\), \(\bot\) otherwise
prg\(_\$(b,\Gamma_i,f_1 \wedge f_2)\) \(=_{df}\) prg\(_\$(b,\Gamma_i,f_1) \wedge\) prg\(_\$(b,\Gamma_i,f_2)\)
prg\(_\$(b,\Gamma_i,f_1 \vee f_2)\) \(=_{df}\) prg\(_\$(b,\Gamma_i,f_1) \vee\) prg\(_\$(b,\Gamma_i,f_2)\)
prg\(_\$(b,\Gamma_i,\nxt f)\) \(=_{df}\) \(f\)
prg\(_\$(b,\Gamma_i,f_{1} \fut f_{2})\) \(=_{df}\) prg\(_\$(b,\Gamma_i,f_{2}) \vee (\)prg\(_\$(b,\Gamma_i,f_{1}) \wedge f_{1} \fut f_{2}) \)
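Below is a Python sketch of prg\(_\$\) over this fragment; it is my illustration rather than the authors' implementation, and the tuple encoding of formulas and the boolean simplifier are assumptions added so that falsity of a progressed formula is syntactically detectable:

def simplify(f):
    """Propagate the constants True/False through and/or; temporal subformulas
    are left intact (they are never equivalent to a constant by themselves)."""
    if f is True or f is False:
        return f
    if f[0] in ('and', 'or'):
        a, b = simplify(f[1]), simplify(f[2])
        if f[0] == 'and':
            if a is False or b is False: return False
            if a is True: return b
            if b is True: return a
        else:
            if a is True or b is True: return True
            if a is False: return b
            if b is False: return a
        return (f[0], a, b)
    return f

def prg(b, state, f):
    """prg_$(b, Gamma_i, f); b answers "are we allocating reward?".

    Formulas: True, False, ('$',), ('prop', p), ('nprop', p) for a negated
    atom, ('and', f, g), ('or', f, g), ('next', f), ('until', f, g)."""
    if f is True or f is False:
        return f
    op = f[0]
    if op == '$':
        return b                     # $ progresses to the allocation flag
    if op == 'prop':
        return f[1] in state         # atoms are evaluated against Gamma_i
    if op == 'nprop':
        return f[1] not in state
    if op == 'and':
        return simplify(('and', prg(b, state, f[1]), prg(b, state, f[2])))
    if op == 'or':
        return simplify(('or', prg(b, state, f[1]), prg(b, state, f[2])))
    if op == 'next':
        return f[1]                  # nxt f progresses to f
    if op == 'until':                # f1 U f2 = prg(f2) or (prg(f1) and f1 U f2)
        return simplify(('or', prg(b, state, f[2]),
                         ('and', prg(b, state, f[1]), f)))
    raise ValueError(op)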

  • To allocate reward \(\langle f, r \rangle\) at \(\Gamma_i\):

Rew\((\Gamma_i,f)\) \(=_{df}\) \(\top\) iff prg\(_{\$}(\bot,\Gamma_i,f)=\bot\)
\(\dollar\)prg\((\Gamma_i,f)\) \(=_{df}\) prg\(_{\$}(\)Rew\((\Gamma_i,f),\Gamma_i,f)\)

  • prg\(_{\$}\) is our special version of formula progression
  • Its first argument, Rew\((\Gamma_i,f)\), is \(\top\) iff \(\Gamma(i) \in B_f\)
  • If \(\Gamma(i) \in B_f\) then reward \(r\) is allocated at \(\Gamma_i\)
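Continuing the sketch above (again my illustration, not the authors' code): Rew and \(\dollar\)prg, demonstrated on \(\fbx(heads \implies \$)\), written in the fragment as \((\lnot heads \lor \$) \fut \bot\):

def rew(state, f):
    """Rew(Gamma_i, f): reward is due iff withholding it falsifies f."""
    return prg(False, state, f) is False

def dollar_prg(state, f):
    """$prg(Gamma_i, f) = prg_$(Rew(Gamma_i, f), Gamma_i, f)."""
    return prg(rew(state, f), state, f)

f = ('until', ('or', ('nprop', 'heads'), ('$',)), False)
for state in [{'heads'}, set(), {'heads'}]:
    print(rew(state, f))   # True, False, True: reward exactly in heads-states
    f = dollar_prg(state, f)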

NMRDP Solution Procedures

  • Using \( \pltl \) we have offline approaches
    • Pre-compute a minimum-size equivalent MDP
    • Compute a factored (Algebraic Decision Diagram) representation of the value function
  • Using \(\xfltl\) we have online approaches
    • Interleave expansion of the equivalent MDP with calculation of the policy
    • e.g. execute LAO\(^*\) on the equivalent MDP
    • Computes an optimal policy without enumerating all states

Example NMRDP


  • Associate numeric rewards with LTL formulae
\(\langle \fbx (heads~\implies~\$), 1.0 \rangle \)
\( \langle \fbx ((heads \land (\nxt heads) \land (\nxt^{2} \lnot heads)) \implies \nxt^{2} \$), 1.0 \rangle \)
\( \langle \fbx ((\bigwedge_{k=1}^{3} \nxt^{k} \lnot heads \land \nxt^{4} heads) \implies \nxt^{4} \$), 1.0 \rangle \)
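For example (a trace of mine, consistent with the formulas above): on \(\Gamma = \langle \{heads\}, \{heads\}, \{\}, \ldots \rangle\), the first formula allocates 1.0 at indices 0 and 1, and the second allocates a further 1.0 at index 2, the third state of a heads, heads, \(\lnot\)heads pattern.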

  • Dynamics are described using a stochastic automaton
[Figure: the example NMRDP, drawn as a stochastic automaton]

An Equivalent MDP

[Figure: an MDP equivalent to the example NMRDP]

Behaviours and Languages

  • Let \(\alpha\) be an alphabet symbol -- e.g. a state is a "letter"
  • A regular expression is generated by the following grammar:
  • regx \(::= \alpha \;|\;\) regx \(\cup\) regx \(\;|\;\) regx \(\circ\) regx \(\;|\;\) regx\(^*\)
    • \(\cup\) is union
    • \(\circ\) is concatenation
    • \(*\) is Kleene closure
  • A language is called star-free, or equivalently non-counting, if it is generated by the grammar:
  • regx \(::= \alpha \;|\;\) regx \(\cup\) regx \(\;|\;\) regx \(\circ\) regx \(\;|\;\) regx\(^{-1}\)
    • regx\(^{-1}\) is complement

Star-Free Expressions Describe
Non-Counting Languages

  • Write \(\alphabet\) for the alphabet
  • \(\alphabet^*\) is the Kleene closure of the alphabet
  • \(y^n\) is shorthand for \(n\) concatenated occurrences of \(y\)

  • Definition:
    A language \(L\) is non-counting iff there exists a finite \(n\) s.t. \(\forall x, y, z \in \alphabet^*\),
    \(x \circ y^n \circ z \in L\) iff \(x \circ y^{n + 1} \circ z \in L\).
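  • Example (not from the slides): \(L = \{\, a^{2k} \;|\; k \geq 0 \,\}\) over \(\alphabet = \{a\}\) is counting, since for every finite \(n\), taking \(x = z = \epsilon\) and \(y = a\) gives \(a^{n} \in L\) iff \(a^{n+1} \notin L\)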

Existing LTL Formalisms are
Expressively Weak

  • Take behaviours to be languages over states
  • \(\pltl\) can only express non-counting behaviours
  • \(\xfltl\) can only express non-counting behaviours
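  • e.g. the behaviour "reward exactly the even-indexed states" is counting, so neither logic can specify it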

New More Expressive Language
Called \(\mxfltl\)

  • Consider the logic generated by the grammar:
  • \(F ::= M \;|\; \$_i \;|\; \top \;|\; \bot \;|\; F \to F \;|\; \lnot F \;|\; \nxt F \;|\; F \fut F \)
  • \(i \in \mathbb{N}\) is a positive index,
    • i.e. there are multiple "reward" symbols
  • We can write \(\$_1 \to \nxt \$_2\) -- i.e. think concatenation
    • Reward now implies reward later
  • union \((\$_1 \lor \$_2 ) \to \$_3\)
  • intersection \( (\$_1 \land \$_2 ) \to \$_3\)
  • complement \((\lnot \$_1) \to \$_3\)

Semantics for \(\mxfltl\)

  • Consequence is indexed using a list of behaviours \(\mathcal{B}\)
  • \((\Gamma, i) \models_{\mathcal{B}} \$_j \) IFF \( \Gamma(i) \in \mathcal{B}[j]\)
  • Semantics amenable to online solution procedures:
  • \(\mathcal{B}^*_f \) \(=_{df}\) \( \{\mathcal{B} \;|\; \models_{\mathcal{B}} f \}\)
    \(\mathcal{B}_f[0]\) \(=\) \(\bigcap_{\mathcal{B} \in \mathcal{B}^*_f} \mathcal{B}[0] \)
    \(\mathcal{B}_f[i]\) \(=\) \(\bigcap_{\{\mathcal{B} \in \mathcal{B}^*_f \;:\; \mathcal{B}[0]=\mathcal{B}_f[0], \ldots, \mathcal{B}[i-1]=\mathcal{B}_f[i-1]\}} \mathcal{B}[i] \)
  • Allocation to lower-indexed symbols is decided first
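A worked example, consistent with the semantics above (not from the slides): for \(f = \fbx(p \implies \$_1) \land \fbx(\$_1 \implies \nxt \$_2)\), minimising the behaviour of \(\$_1\) first gives

\( \mathcal{B}_f[1] = \{\, \Gamma(i) \;|\; p \in \Gamma_i \,\} \)

and then, with \(\mathcal{B}_f[1]\) fixed,

\( \mathcal{B}_f[2] = \{\, \Gamma(i+1) \;|\; p \in \Gamma_i \,\} \)

i.e. \(\$_2\) rewards the successors of \(p\)-states, the concatenation reading promised above.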

New Variant can Count


  • letter-rule: \(M \to \$_i\)
  • complement: \(F_j \land \fbx (\$_j \lor \$_i) \quad \{j < i\}\)
  • intersection: \(F_j \land F_k \land \fbx ( (\$_j \land \$_k) \to \$_i ) \quad \{j < k < i\} \)
  • union: \(F_j \land F_k \land \fbx ( (\$_j \lor \$_k) \to \$_i ) \quad \{j < k < i\}\)
  • concatenation: \(F_i \land F_{i+1} \land \ldots \land F_n \land \fbx (\$_n \to \nxt (F_{n+1}\land F_{n+2}\land \ldots \land F_{n+m} ) )\)
  • closure: \(F_i \land F_{i+1}\land \ldots \land F_n \land \fbx (\$_n \to (\$_{n+1} \land \nxt (F_i\land F_{i+1}\land \ldots\land F_n) ))\)
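A sketch of counting with these rules, consistent with the minimisation semantics above (the example is mine, not from the slides):

\( f \;=\; \$_1 \land \fbx(\$_1 \implies \nxt \$_2) \land \fbx(\$_2 \implies \nxt \$_1) \)

Minimising \(\mathcal{B}_f[1]\) first forces in exactly the prefixes \(\Gamma(i)\) with \(i\) even, after which \(\mathcal{B}_f[2]\) contains exactly those with \(i\) odd; attaching a reward to \(\$_1\) therefore rewards precisely the even-indexed states, a counting behaviour that neither \(\pltl\) nor \(\xfltl\) can express.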

Conclusions

  • NMRDPs are solved using a translation to an equivalent MDP
  • Rewarding behaviours described using \(\pltl\) & \(\xfltl\)
  • Both \(\pltl\) and \(\xfltl\) are expressively weak
    • Non-counting languages only
  • Developed new more expressive language \(\mxfltl\)
    • Regular languages
  • Model checking \(\mxfltl\) has same runtime as \(\xfltl\)
  • Note, there are other expressive LTL-based languages
    • Quantified Propositional Temporal Logics
    • Extended Temporal Logic