A More Expressive Behavioral Logic for Decision-Theoretic Planning


Charles Gretton


Overview

Decision-Theoretic Planning with Non-Markovian Rewards

  • (Non-)Markovian Reward Decision Processes
  • Past and Future-Looking Linear Temporal Logics
  • Expressing Rewards using Linear Temporal Logics
  • Non-Markovian to Markovian via Equivalence
  • Inference for Processes with Non-Markovian Rewards
  • Expressivity of Linear Temporal Logics
  • A More Expressive Forward-Looking Language
\( \newcommand{\until}{U} \newcommand{\always}{\Box} \newcommand{\eventually}{\Diamond} \newcommand{\alphabet}{\Sigma} \newcommand{\nxt}{{\scriptstyle \bigcirc}} \newcommand{\reward}{R} \newcommand{\dollar}{\$} \newcommand{\since}{S} \newcommand{\last}{\circleddash} \newcommand{\once}{\blacklozenge} \newcommand{\alwaysPast}{\boxminus} \newcommand{\xfltl}{\$FLTL} \newcommand{\mxfltl}{\$^*FLTL} \newcommand{\pltl}{\mbox{PLTL}} \newcommand{\implies}{\rightarrow} \newcommand{\xor}{\oplus} \newcommand{\true}{\mbox{$\top$}} \newcommand{\false}{\mbox{$\bot$}} \newcommand{\modelsb}[1]{\models_{\scriptsize #1}} \newcommand{\notmodelsb}[1]{\not\models_{\scriptsize #1}} \newcommand{\fut}{{\bf U}} \newcommand{\fbx}{\square} \newcommand{\fdi}{\diamond} \newcommand{\pbx}{\blacksquare} \newcommand{\pdi}{\blacklozenge} \newcommand{\snc}{\bf S} \newcommand{\prv}{\circleddash} \)

A Decision Process

  • Finite set of states \(S\) with start state \(s_0\) and actions \(A\)
  • States are characterised by propositions, e.g. heads, holdingBall
  • Actions induce stochastic state transitions
  • \(S^*\) is the set of finite sequences of states over \(S\)
  • Real-valued reward function:
  • MDP \( R:S \to \Re \)
    NMRDP \(R:S^* \to \Re\)
  • A solution is \(\pi\):
  • MDP \( \pi :S \to A \)
    NMRDP \( \pi :S^* \to A\)
  • Executing \(\pi\) yields a sequence of states \(\Gamma \)
  • By \(\Gamma_i\) we mean the state of index \(i\) in \(\Gamma\)
  • By \(\Gamma(i)\) we mean the prefix \(\langle \Gamma_0, \ldots, \Gamma_i \rangle\) of \(\Gamma\)

Value of \(\pi\)

  • Value of \(\pi\) is expected discounted cumulative reward:
  • In the case of an MDP we have:
  • \( V(\pi) = \lim_{n\rightarrow\infty} E \bigg[ \sum_{i=0}^{n} \beta^{i} R(\Gamma_i) \mid \pi, \Gamma_{0} = s_{0} \bigg] \)
  • In the case of an NMRDP we have:
  • \(V(\pi) = \lim_{n\rightarrow\infty} E \bigg[ \sum_{i=0}^{n} \beta^{i} R(\Gamma(i)) \mid \pi, \Gamma_{0} = s_{0} \bigg]\)
  • Policy \(\overline\pi\) is optimal if, for all \(\pi\), we have that \(V(\overline\pi) \ge V(\pi)\)
  • There is an optimal MDP policy that is stationary
  • There may be no optimal NMRDP policy that is stationary
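A minimal Python sketch (an illustration of the definitions above, not part of the original slides; all names are mine) of estimating \(V(\pi)\) by simulation, truncating the discounted sum at a finite horizon; the only NMRDP-specific point is that reward and policy see the whole prefix \(\Gamma(i)\), not just \(\Gamma_i\):

def estimate_value(step, reward, policy, s0, beta=0.95, horizon=200, runs=1000):
    """Monte-Carlo estimate of V(pi) = E[sum_i beta^i R(Gamma(i))].

    step(s, a)     -- samples a successor of state s under action a
    reward(prefix) -- R(Gamma(i)); an MDP reward ignores all but prefix[-1]
    policy(prefix) -- pi(Gamma(i)); a stationary policy ignores all but prefix[-1]
    """
    total = 0.0
    for _ in range(runs):
        prefix = [s0]  # Gamma(0) = <s0>
        for i in range(horizon):
            total += (beta ** i) * reward(prefix)      # beta^i * R(Gamma(i))
            prefix.append(step(prefix[-1], policy(prefix)))
    return total / runs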

Specifying Rewarding Behaviours

  • Rewarding NMRDP behaviours are described using:

\(\pltl\) -- Past-Looking Linear Temporal Logic

\(\xfltl\) -- Fragment of Forward-Looking Linear Temporal Logic with One Uninterpreted Operator

\(\mxfltl\) -- New Forward-Looking Linear Temporal Logic with Multiple Uninterpreted Operators

  • \(R:S^* \to \Re\) is described finitely via a list of \(\langle \)formula\(, \)reward\(\rangle \) pairs

\(\pltl\): A Tool for Specifying
Rewarding Behaviours

  • \(\pltl\) is propositional logic with:
    • \(\last\), pronounced "previously"
      \( \Gamma(i) \models \last f \) IFF \( i > 0 \) and \( \Gamma(i-1) \models f \)
    • \(\since\), pronounced "since"
      \( \Gamma(i) \models f \since g\) IFF \( \exists j \leq i, \Gamma(j) \models g \) and \( \forall k, j< k \leq i, \Gamma(k) \models f \)
  • Operators \(\pbx\) and \(\pdi\) are syntactic sugar:
    \( \pdi f =_{df} \true \since f\), pronounced "sometime in the past"
    \( \pbx f=_{df} \neg\pdi\neg f \), pronounced "always in the past"
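A minimal Python sketch of these semantics (my illustration, not from the slides; the tuple encoding of formulas is an assumption), with states as sets of propositions:

def holds(prefix, i, f):
    """Gamma(i) |= f for PLTL; prefix[0..i] are sets of propositions.

    Formulas: ('true',), ('prop', p), ('not', f), ('and', f, g),
    ('prev', f) for "previously", and ('since', f, g) for "since"."""
    op = f[0]
    if op == 'true':
        return True
    if op == 'prop':
        return f[1] in prefix[i]
    if op == 'not':
        return not holds(prefix, i, f[1])
    if op == 'and':
        return holds(prefix, i, f[1]) and holds(prefix, i, f[2])
    if op == 'prev':    # i > 0 and Gamma(i-1) |= f
        return i > 0 and holds(prefix, i - 1, f[1])
    if op == 'since':   # some j <= i satisfies g, and f holds for j < k <= i
        return any(holds(prefix, j, f[2]) and
                   all(holds(prefix, k, f[1]) for k in range(j + 1, i + 1))
                   for j in range(i + 1))
    raise ValueError(op)

# "sometime in the past p" is sugar, true since p:
once_p = ('since', ('true',), ('prop', 'p'))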

\(\xfltl\) for Behaviours

  • Writing \(B\) for a set of sequences of states, i.e. \(B\subseteq S^*\)
  • \(\xfltl\) is propositional logic where:
    • Symbol \(\$\) is a 0-ary operator -- "get reward now"
      \( (\Gamma, i) \modelsb{B} \$ \) IFF \( \Gamma(i) \in B\)
    • Unary operator \(\nxt\) means "in the next state"
      \( (\Gamma, i) \modelsb{B} \nxt f \) IFF \( (\Gamma, i + 1) \modelsb{B} f \)
    • \(\fut\), pronounced "(weak) until"
      \( (\Gamma, i) \modelsb{B} f \fut g\) IFF \( \forall k \geq i \),
      IF \( \forall j, i \leq j \leq k,\; (\Gamma, j) \notmodelsb{B} g \) THEN \( (\Gamma, k) \modelsb{B} f \)
  • Operator \(\fbx\) is syntactic sugar:
    \( \fbx f =_{df} f \fut \bot \), pronounced "always"

\(\xfltl\) Behavior Specification

  • \(\xfltl\) consequence relation indexed with \(B\)
  • By formula \(f\) we intend behaviour \(B_f\)
  • \(B_f\) is the minimal behaviour for which \(f\) is valid:
  • \(B_f \;=\; \bigcap\{ B \;|\; \modelsb{B} f \} \)
  • Online procedures require a deterministic test \(\Gamma(i) \in B_f\)
    • Without examining \(\Gamma(i+j)\), \(j>0\)
  • Some exclusions include
    • \(\$ \lor \nxt \$ \) -- i.e. non-deterministic
    • \(\nxt a \to \$ \) -- i.e. requires we glimpse the future
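A worked example, consistent with the definition of \(B_f\) above (not from the slides): for \(f = \fbx(p \implies \$)\), every admissible \(B\) must contain each prefix ending in a \(p\)-state, and the minimal such set is

\( B_f = \{\, \Gamma(i) \;|\; p \in \Gamma_i \,\} \)

so \(f\) specifies "reward every state in which \(p\) holds".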

\(\xfltl\) Fragment in Practice

  • In practice we use a "no funny business" fragment
  • \(F ::= \$ \;|\; \nxt F \;|\; M \lor F \;|\; F \fut M \;|\; F \land F \)
  • Where \(M\) is modality free propositional formula
  • Operator \(\$\) cannot be negated
  • To allocate reward \(\langle f, r \rangle\) at \(\Gamma_i\) we use a special formula progression, defined next

\(\xfltl\) Progression

  • Progression has a special first argument:
    • Are we allocating reward?

prg\(_\$(\top,\Gamma_i,\$)\) \(=_{df}\) \(\top\)
prg\(_\$(\bot,\Gamma_i,\$)\) \(=_{df}\) \(\bot\)
prg\(_\$(b,\Gamma_i,p)\) \(=_{df}\) \(\top\) if \(p \in \Gamma_i\), \(\bot\) otherwise
prg\(_\$(b,\Gamma_i,f_1 \wedge f_2)\) \(=_{df}\) prg\(_\$(b,\Gamma_i,f_1) \wedge\) prg\(_\$(b,\Gamma_i,f_2)\)
prg\(_\$(b,\Gamma_i,f_1 \vee f_2)\) \(=_{df}\) prg\(_\$(b,\Gamma_i,f_1) \vee\) prg\(_\$(b,\Gamma_i,f_2)\)
prg\(_\$(b,\Gamma_i,\nxt f)\) \(=_{df}\) \(f\)
prg\(_\$(b,\Gamma_i,f_{1} \fut f_{2})\) \(=_{df}\) prg\(_\$(b,\Gamma_i,f_{2}) \vee (\)prg\(_\$(b,\Gamma_i,f_{1}) \wedge f_{1} \fut f_{2}) \)
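Below is a Python sketch of prg\(_\$\) over this fragment; it is my illustration rather than the authors' implementation, and the tuple encoding of formulas and the boolean simplifier are assumptions added so that falsity of a progressed formula is syntactically detectable:

def simplify(f):
    """Propagate the constants True/False through and/or; temporal subformulas
    are left intact (they are never equivalent to a constant by themselves)."""
    if f is True or f is False:
        return f
    if f[0] in ('and', 'or'):
        a, b = simplify(f[1]), simplify(f[2])
        if f[0] == 'and':
            if a is False or b is False: return False
            if a is True: return b
            if b is True: return a
        else:
            if a is True or b is True: return True
            if a is False: return b
            if b is False: return a
        return (f[0], a, b)
    return f

def prg(b, state, f):
    """prg_$(b, Gamma_i, f); b answers "are we allocating reward?".

    Formulas: True, False, ('$',), ('prop', p), ('nprop', p) for a negated
    atom, ('and', f, g), ('or', f, g), ('next', f), ('until', f, g)."""
    if f is True or f is False:
        return f
    op = f[0]
    if op == '$':
        return b                     # $ progresses to the allocation flag
    if op == 'prop':
        return f[1] in state         # atoms are evaluated against Gamma_i
    if op == 'nprop':
        return f[1] not in state
    if op == 'and':
        return simplify(('and', prg(b, state, f[1]), prg(b, state, f[2])))
    if op == 'or':
        return simplify(('or', prg(b, state, f[1]), prg(b, state, f[2])))
    if op == 'next':
        return f[1]                  # nxt f progresses to f
    if op == 'until':                # f1 U f2 = prg(f2) or (prg(f1) and f1 U f2)
        return simplify(('or', prg(b, state, f[2]),
                         ('and', prg(b, state, f[1]), f)))
    raise ValueError(op)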

  • To allocate reward \(\langle f, r \rangle\) at \(\Gamma_i\):

Rew\((\Gamma_i,f)\) \(=_{df}\) \(\top\) iff prg\(_{\$}(\bot,\Gamma_i,f)=\bot\)
\(\dollar\)prg\((\Gamma_i,f)\) \(=_{df}\) prg\(_{\$}(\)Rew\((\Gamma_i,f),\Gamma_i,f)\)

  • prg\(_{\$}\) is our special version of formula progression
  • Its first argument, Rew\((\Gamma_i,f)\), is \(\top\) iff \(\Gamma(i) \in B_f\)
  • If \(\Gamma(i) \in B_f\) then reward \(r\) is allocated at \(\Gamma_i\)
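Continuing the sketch above (again my illustration, not the authors' code): Rew and \(\dollar\)prg, demonstrated on \(\fbx(heads \implies \$)\), written in the fragment as \((\lnot heads \lor \$) \fut \bot\):

def rew(state, f):
    """Rew(Gamma_i, f): reward is due iff withholding it falsifies f."""
    return prg(False, state, f) is False

def dollar_prg(state, f):
    """$prg(Gamma_i, f) = prg_$(Rew(Gamma_i, f), Gamma_i, f)."""
    return prg(rew(state, f), state, f)

f = ('until', ('or', ('nprop', 'heads'), ('$',)), False)
for state in [{'heads'}, set(), {'heads'}]:
    print(rew(state, f))   # True, False, True: reward exactly in heads-states
    f = dollar_prg(state, f)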

NMRDP Solution Procedures

  • Using \( \pltl \) we have offline approaches
    • Pre-compute a minimum-size equivalent MDP
    • Compute a factored (Algebraic Decision Diagram) representation of the value function
  • Using \(\xfltl\) we have online approaches
    • Interleave expansion of the equivalent MDP with calculation of the policy
    • e.g. execute LAO\(^*\) on the equivalent MDP
    • Computes an optimal policy without enumerating all states

Example NMRDP


  • Associate numeric rewards with LTL formulae
\(\langle \fbx (heads~\implies~\$), 1.0 \rangle \)
\( \langle \fbx ((heads \land (\nxt heads) \land (\nxt^{2} \lnot heads)) \implies \nxt^{2} \$), 1.0 \rangle \)
\( \langle \fbx ((\bigwedge_{k=1}^{3} \nxt^{k} \lnot heads \land \nxt^{4} heads) \implies \nxt^{4} \$), 1.0 \rangle \)
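For example (a trace of mine, consistent with the formulas above): on \(\Gamma = \langle \{heads\}, \{heads\}, \{\}, \ldots \rangle\), the first formula allocates 1.0 at indices 0 and 1, and the second allocates a further 1.0 at index 2, the third state of a heads, heads, \(\lnot\)heads pattern.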

  • Dynamics are described using a stochastic automaton
[Figure: the example NMRDP, drawn as a stochastic automaton]

An Equivalent MDP

[Figure: an MDP equivalent to the example NMRDP]

Behaviours and Languages

  • Let \(\alpha\) be an alphabet symbol -- e.g. a state is a "letter"
  • A regular expression is generated by the following grammar:
  • regx \(::= \alpha \;|\;\) regx \(\cup\) regx \(\;|\;\) regx \(\circ\) regx \(\;|\;\) regx\(^*\)
    • \(\cup\) is union
    • \(\circ\) is concatenation
    • \(*\) is Kleene closure
  • A language is called star-free, or equivalently non-counting, if it is generated by the grammar:
  • regx \(::= \alpha \;|\;\) regx \(\cup\) regx \(\;|\;\) regx \(\circ\) regx \(\;|\;\) regx\(^{-1}\)
    • regx\(^{-1}\) is complement

Star-Free Expressions Describe
Non-Counting Languages

  • Write \(\alphabet\) for the alphabet
  • \(\alphabet^*\) is the Kleene closure of the alphabet
  • \(y^n\) is shorthand for \(n\) concatenated occurrences of \(y\)

  • Definition:
    A language \(L\) is non-counting iff there exists a finite \(n\) s.t. \(\forall x, y, z \in \alphabet^*\),
    \(x \circ y^n \circ z \in L\) iff \(x \circ y^{n + 1} \circ z \in L\).
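  • Example (not from the slides): \(L = \{\, a^{2k} \;|\; k \geq 0 \,\}\) over \(\alphabet = \{a\}\) is counting, since for every finite \(n\), taking \(x = z = \epsilon\) and \(y = a\) gives \(a^{n} \in L\) iff \(a^{n+1} \notin L\)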

Existing LTL Formalisms are
Expressively Weak

  • Take behaviours to be languages over states
  • \(\pltl\) can only express non-counting behaviours
  • \(\xfltl\) can only express non-counting behaviours
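  • e.g. the behaviour "reward exactly the even-indexed states" is counting, so neither logic can specify it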

New More Expressive Language
Called \(\mxfltl\)

  • Consider the logic generated by the grammar:
  • \(F ::= M \;|\; \$_i \;|\; \top \;|\; \bot \;|\; F \to F \;|\; \lnot F \;|\; \nxt F \;|\; F \fut F \)
  • \(i \in \mathbb{N}\) is a positive index,
    • i.e. there are multiple "reward" symbols
  • We can write \(\$_1 \to \nxt \$_2\) -- i.e. think concatenation
    • Reward now implies reward later
  • union \((\$_1 \lor \$_2 ) \to \$_3\)
  • intersection \( (\$_1 \land \$_2 ) \to \$_3\)
  • complement \((\lnot \$_1) \to \$_3\)

Semantics for \(\mxfltl\)

  • Consequence is indexed using a list of behaviours \(\mathcal{B}\)
  • \((\Gamma, i) \models_{\mathcal{B}} \$_j \) IFF \( \Gamma(i) \in \mathcal{B}[j]\)
  • Semantics amenable to online solution procedures:
  • \(\mathcal{B}^*_f \) \(=_{df}\) \( \{\mathcal{B} \;|\; \models_{\mathcal{B}} f \}\)
    \(\mathcal{B}_f[0]\) \(=\) \(\bigcap_{\mathcal{B} \in \mathcal{B}^*_f} \mathcal{B}[0] \)
    \(\mathcal{B}_f[i]\) \(=\) \(\bigcap_{\{\mathcal{B} \in \mathcal{B}^*_f \;:\; \mathcal{B}[0]=\mathcal{B}_f[0], \ldots, \mathcal{B}[i-1]=\mathcal{B}_f[i-1]\}} \mathcal{B}[i] \)
  • Allocation to lower-indexed symbols is decided first
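A worked example, consistent with the semantics above (not from the slides): for \(f = \fbx(p \implies \$_1) \land \fbx(\$_1 \implies \nxt \$_2)\), minimising the behaviour of \(\$_1\) first gives

\( \mathcal{B}_f[1] = \{\, \Gamma(i) \;|\; p \in \Gamma_i \,\} \)

and then, with \(\mathcal{B}_f[1]\) fixed,

\( \mathcal{B}_f[2] = \{\, \Gamma(i+1) \;|\; p \in \Gamma_i \,\} \)

i.e. \(\$_2\) rewards the successors of \(p\)-states, the concatenation reading promised above.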

New Variant can Count


  • letter-rule: \(M \to \$_i\)
  • complement: \(F_j \land \fbx (\$_j \lor \$_i) \quad \{j < i\}\)
  • intersection: \(F_j \land F_k \land \fbx ( (\$_j \land \$_k) \to \$_i ) \quad \{j < k < i\} \)
  • union: \(F_j \land F_k \land \fbx ( (\$_j \lor \$_k) \to \$_i ) \quad \{j < k < i\}\)
  • concatenation: \(F_i \land F_{i+1} \land \ldots \land F_n \land \fbx (\$_n \to \nxt (F_{n+1}\land F_{n+2}\land \ldots \land F_{n+m} ) )\)
  • closure: \(F_i \land F_{i+1}\land \ldots \land F_n \land \fbx (\$_n \to (\$_{n+1} \land \nxt (F_i\land F_{i+1}\land \ldots\land F_n) ))\)
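A sketch of counting with these rules, consistent with the minimisation semantics above (the example is mine, not from the slides):

\( f \;=\; \$_1 \land \fbx(\$_1 \implies \nxt \$_2) \land \fbx(\$_2 \implies \nxt \$_1) \)

Minimising \(\mathcal{B}_f[1]\) first forces in exactly the prefixes \(\Gamma(i)\) with \(i\) even, after which \(\mathcal{B}_f[2]\) contains exactly those with \(i\) odd; attaching a reward to \(\$_1\) therefore rewards precisely the even-indexed states, a counting behaviour that neither \(\pltl\) nor \(\xfltl\) can express.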

Conclusions

  • NMRDPs are solved using a translation to an equivalent MDP
  • Rewarding behaviours described using \(\pltl\) & \(\xfltl\)
  • Both \(\pltl\) and \(\xfltl\) are expressively weak
    • Non-counting languages only
  • Developed new more expressive language \(\mxfltl\)
    • Regular languages
  • Model checking \(\mxfltl\) has same runtime as \(\xfltl\)
  • Note, there are other expressive LTL-based languages
    • Quantified Propositional Temporal Logics
    • Extended Temporal Logic