pyrieef.planning package
Submodules
pyrieef.planning.algorithms module
pyrieef.planning.algorithms.best_policy(mdp, U)
    Given an MDP and a utility function U, determine the best policy, as a mapping from state to action. (Equation 17.4)
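    As a hedged illustration only (the actual pyrieef code may differ), a policy of this form can be obtained by taking, in every state, the action that maximizes the expected utility under U; the attribute name mdp.states is an assumption, while mdp.actions and expected_utility are documented in this package:

        from pyrieef.planning.algorithms import expected_utility

        def best_policy_sketch(mdp, U):
            # For each state, pick the action with maximal one-step
            # expected utility under U (Equation 17.4).
            pi = {}
            for s in mdp.states:   # mdp.states: assumed attribute name
                pi[s] = max(mdp.actions(s),
                            key=lambda a: expected_utility(a, s, U, mdp))
            return pi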
pyrieef.planning.algorithms.expected_utility(a, s, U, mdp)
    The expected utility of doing a in state s, according to the MDP and U.
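    Since T(s, a) returns (probability, result-state) pairs (see the MDP class below), the expected utility is the probability-weighted sum of successor utilities. A minimal sketch, assuming U is a dict from states to numbers:

        def expected_utility_sketch(a, s, U, mdp):
            # Sum of p * U(s') over the (p, s') pairs returned by the
            # transition model.
            return sum(p * U[s1] for (p, s1) in mdp.T(s, a))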
pyrieef.planning.algorithms.policy_evaluation(pi, U, mdp, k=20)
    Return an updated utility mapping U from each state in the MDP to its utility, using an approximation (modified policy iteration).
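    A sketch of the modified policy iteration step this describes: k sweeps of the Bellman update with the action fixed by the policy pi. The attribute names mdp.states and mdp.gamma are assumptions; R and T are the methods documented below:

        def policy_evaluation_sketch(pi, U, mdp, k=20):
            # Approximate the utility of following pi by k in-place sweeps of
            # U[s] = R(s) + gamma * sum_{s'} P(s' | s, pi[s]) * U[s'].
            R, T, gamma = mdp.R, mdp.T, mdp.gamma
            for _ in range(k):
                for s in mdp.states:
                    U[s] = R(s) + gamma * sum(p * U[s1] for (p, s1) in T(s, pi[s]))
            return U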
pyrieef.planning.algorithms.policy_iteration(mdp)
    Solve an MDP by policy iteration [Figure 17.7]
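    A hedged sketch of the overall loop in [Figure 17.7], reusing the policy_evaluation and expected_utility functions documented in this module; initializing the policy with random actions and the mdp.states attribute are assumptions, not necessarily what pyrieef does:

        import random
        from pyrieef.planning.algorithms import policy_evaluation, expected_utility

        def policy_iteration_sketch(mdp):
            # Alternate policy evaluation and greedy policy improvement
            # until no state changes its action.
            U = {s: 0 for s in mdp.states}                   # mdp.states assumed
            pi = {s: random.choice(mdp.actions(s)) for s in mdp.states}
            while True:
                U = policy_evaluation(pi, U, mdp)
                unchanged = True
                for s in mdp.states:
                    a = max(mdp.actions(s),
                            key=lambda a: expected_utility(a, s, U, mdp))
                    if a != pi[s]:
                        pi[s] = a
                        unchanged = False
                if unchanged:
                    return pi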
pyrieef.planning.algorithms.value_iteration(mdp, epsilon=0.001)
    Solve an MDP by value iteration. [Figure 17.4]
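    A sketch of the Bellman-update loop in [Figure 17.4]; the termination test delta < epsilon * (1 - gamma) / gamma is the standard bound from the textbook and, like the mdp.states attribute, is an assumption about this implementation:

        def value_iteration_sketch(mdp, epsilon=0.001):
            U1 = {s: 0 for s in mdp.states}
            R, T, gamma = mdp.R, mdp.T, mdp.gamma
            while True:
                U, delta = U1.copy(), 0
                for s in mdp.states:
                    # Bellman update: reward plus the best action's expected utility.
                    U1[s] = R(s) + gamma * max(
                        sum(p * U[s1] for (p, s1) in T(s, a))
                        for a in mdp.actions(s))
                    delta = max(delta, abs(U1[s] - U[s]))
                if delta < epsilon * (1 - gamma) / gamma:
                    return U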
pyrieef.planning.common_imports module
pyrieef.planning.mdp module
Markov Decision Processes (Chapter 17)

First we define an MDP, and the special case of a GridMDP, in which states are laid out in a 2-dimensional grid. We also represent a policy as a dictionary of {state: action} pairs, and a Utility function as a dictionary of {state: number} pairs. We then define the value_iteration and policy_iteration algorithms.

>>> pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .01))
>>> sequential_decision_environment.to_arrows(pi)
[['>', '>', '>', '.'], ['^', None, '^', '.'], ['^', '>', '^', '<']]
>>> from utils import print_table
>>> print_table(sequential_decision_environment.to_arrows(pi))
>   >      >   .
^   None   ^   .
^   >      ^   <
>>> print_table(sequential_decision_environment.to_arrows(policy_iteration(sequential_decision_environment)))
>   >      >   .
^   None   ^   .
^   >      ^   <
class pyrieef.planning.mdp.GridMDP(grid, terminals, init=(0, 0), gamma=0.9)
    Bases: pyrieef.planning.mdp.MDP

    A two-dimensional grid MDP, as in [Figure 17.1]. All you have to do is specify the grid as a list of lists of rewards; use None for an obstacle (unreachable state). Also, you should specify the terminal states. An action is an (x, y) unit vector; e.g. (1, 0) means move east. See the construction example after this class.

    T(state, action)
        Transition model. From a state and an action, return a list of (probability, result-state) pairs.

    calculate_T(state, action)

    go(state, direction)
        Return the state that results from going in this direction.

    to_arrows(policy)

    to_grid(mapping)
        Convert a mapping from (x, y) to v into a [[…, v, …]] grid.
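    For example, the 4x3 world of [Figure 17.1] can be built as follows; the reward values, the row ordering (top row first), and the terminal coordinates follow the textbook convention and are assumptions about this implementation, which the module-level sequential_decision_environment presumably already encodes:

        from pyrieef.planning.mdp import GridMDP
        from pyrieef.planning.algorithms import value_iteration, best_policy

        # -0.04 living reward everywhere, None for the obstacle cell,
        # +1 and -1 for the two terminal cells.
        world = GridMDP([[-0.04, -0.04, -0.04, +1],
                         [-0.04, None,  -0.04, -1],
                         [-0.04, -0.04, -0.04, -0.04]],
                        terminals=[(3, 2), (3, 1)])

        pi = best_policy(world, value_iteration(world, .01))
        print(world.to_arrows(pi))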
class pyrieef.planning.mdp.MDP(init, actlist, terminals, transitions=None, reward=None, states=None, gamma=0.9)
    Bases: object

    A Markov Decision Process, defined by an initial state, transition model, and reward function. We also keep track of a gamma value, for use by algorithms. The transition model is represented somewhat differently from the text. Instead of P(s' | s, a) being a probability number for each state/state/action triplet, we instead have T(s, a) return a list of (p, s') pairs (see the example after this class). We also keep track of the possible states, terminal states, and actions for each state. [page 646]

    R(state)
        Return a numeric reward for this state.

    T(state, action)
        Transition model. From a state and an action, return a list of (probability, result-state) pairs.

    actions(state)
        Return a list of actions that can be performed in this state. By default, a fixed list of actions, except for terminal states. Override this method if you need to specialize by state.

    check_consistency()

    get_states_from_transitions(transitions)
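    To make the T(s, a) convention concrete, a toy two-state example; the state and action names are invented for illustration, and the exact types accepted for the reward and states arguments (here dict and set) are assumptions about this constructor:

        from pyrieef.planning.mdp import MDP

        # From 'a', action 'go' reaches 'b' with probability 0.8 and
        # stays in 'a' with probability 0.2; 'b' is terminal.
        transitions = {'a': {'go': [(0.8, 'b'), (0.2, 'a')]},
                       'b': {'go': [(1.0, 'b')]}}

        toy = MDP(init='a', actlist=['go'], terminals=['b'],
                  transitions=transitions, reward={'a': -0.1, 'b': 1.0},
                  states={'a', 'b'}, gamma=0.9)

        print(toy.T('a', 'go'))   # [(0.8, 'b'), (0.2, 'a')]
        print(toy.R('b'))         # 1.0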
class pyrieef.planning.mdp.MDP2(init, actlist, terminals, transitions, reward=None, gamma=0.9)
    Bases: pyrieef.planning.mdp.MDP

    Handles terminal states, and transitions to and from terminal states, better.

    T(state, action)
        Transition model. From a state and an action, return a list of (probability, result-state) pairs.