pyrieef.planning package

Submodules

pyrieef.planning.algorithms module

pyrieef.planning.algorithms.best_policy(mdp, U)

Given an MDP and a utility function U, determine the best policy, as a mapping from state to action. (Equation 17.4)
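
A minimal sketch of this greedy extraction, assuming the documented interface (mdp.states, mdp.actions(s)) and the expected_utility function below; the name best_policy_sketch is illustrative, not part of the package:

from pyrieef.planning.algorithms import expected_utility

def best_policy_sketch(mdp, U):
    # For each state, pick the action that maximizes the expected
    # utility of the one-step lookahead under U.
    return {s: max(mdp.actions(s),
                   key=lambda a: expected_utility(a, s, U, mdp))
            for s in mdp.states}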

pyrieef.planning.algorithms.expected_utility(a, s, U, mdp)

The expected utility of doing a in state s, according to the MDP and U.
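
In terms of the transition model T(s, a), this is the one-step lookahead value. A sketch under the same interface assumptions (the _sketch suffix marks it as illustrative):

def expected_utility_sketch(a, s, U, mdp):
    # Sum of p * U[s'] over the (probability, result-state) pairs
    # returned by the transition model.
    return sum(p * U[s1] for (p, s1) in mdp.T(s, a))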

pyrieef.planning.algorithms.policy_evaluation(pi, U, mdp, k=20)

Return an updated utility mapping U from each state in the MDP to its utility, using an approximation (modified policy iteration).
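
A sketch of the approximation: k sweeps of the simplified Bellman update with the policy's action held fixed, assuming mdp.R, mdp.T, mdp.gamma and mdp.states behave as documented for the MDP class below.

def policy_evaluation_sketch(pi, U, mdp, k=20):
    # Repeatedly back up utilities under the fixed policy pi instead
    # of solving the linear system exactly (modified policy iteration).
    for _ in range(k):
        for s in mdp.states:
            U[s] = mdp.R(s) + mdp.gamma * sum(
                p * U[s1] for (p, s1) in mdp.T(s, pi[s]))
    return U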

pyrieef.planning.algorithms.policy_iteration(mdp)

Solve an MDP by policy iteration. [Figure 17.7]
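
A sketch of the loop, reusing policy_evaluation and expected_utility from this module: alternate evaluation and greedy improvement until the policy stops changing. Starting from a random policy is one common choice, not necessarily what the package does.

import random

from pyrieef.planning.algorithms import expected_utility, policy_evaluation

def policy_iteration_sketch(mdp):
    U = {s: 0 for s in mdp.states}
    pi = {s: random.choice(mdp.actions(s)) for s in mdp.states}
    while True:
        U = policy_evaluation(pi, U, mdp)
        unchanged = True
        for s in mdp.states:
            # Greedy improvement step.
            a = max(mdp.actions(s),
                    key=lambda a: expected_utility(a, s, U, mdp))
            if a != pi[s]:
                pi[s] = a
                unchanged = False
        if unchanged:
            return pi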

pyrieef.planning.algorithms.value_iteration(mdp, epsilon=0.001)

Solve an MDP by value iteration. [Figure 17.4]
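
A sketch of the update under the same interface assumptions; the stopping test delta <= epsilon * (1 - gamma) / gamma is the standard bound guaranteeing the returned utilities are within epsilon of the true ones (for gamma < 1).

def value_iteration_sketch(mdp, epsilon=0.001):
    U1 = {s: 0 for s in mdp.states}
    while True:
        U, delta = U1.copy(), 0
        for s in mdp.states:
            # Bellman update: reward plus discounted best one-step lookahead.
            U1[s] = mdp.R(s) + mdp.gamma * max(
                sum(p * U[s1] for (p, s1) in mdp.T(s, a))
                for a in mdp.actions(s))
            delta = max(delta, abs(U1[s] - U[s]))
        if delta <= epsilon * (1 - mdp.gamma) / mdp.gamma:
            return U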

pyrieef.planning.common_imports module

pyrieef.planning.mdp module

Markov Decision Processes (Chapter 17)

First we define an MDP, and the special case of a GridMDP, in which states are laid out in a 2-dimensional grid. We also represent a policy as a dictionary of {state: action} pairs, and a Utility function as a dictionary of {state: number} pairs. We then define the value_iteration and policy_iteration algorithms.

>>> pi = best_policy(sequential_decision_environment, value_iteration(sequential_decision_environment, .01))
>>> sequential_decision_environment.to_arrows(pi)
[['>', '>', '>', '.'], ['^', None, '^', '.'], ['^', '>', '^', '<']]
>>> from utils import print_table
>>> print_table(sequential_decision_environment.to_arrows(pi))
>   >      >   .
^   None   ^   .
^   >      ^   <
>>> print_table(sequential_decision_environment.to_arrows(policy_iteration(sequential_decision_environment)))
>   >      >   .
^   None   ^   .
^   >      ^   <

class pyrieef.planning.mdp.GridMDP(grid, terminals, init=(0, 0), gamma=0.9)

Bases: pyrieef.planning.mdp.MDP

A two-dimensional grid MDP, as in [Figure 17.1].

All you have to do is specify the grid as a list of lists of rewards; use None for an obstacle (unreachable state). Also, you should specify the terminal states. An action is an (x, y) unit vector; e.g. (1, 0) means move east.
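
For example, the 4x3 grid of Figure 17.1 can be built as below. This is a usage sketch; it assumes the (x, y) indexing of the AIMA code this module follows, with x increasing east and y increasing north, so the +1 and -1 terminals sit at (3, 2) and (3, 1).

>>> from pyrieef.planning.mdp import GridMDP
>>> from pyrieef.planning.algorithms import best_policy, value_iteration
>>> grid_world = GridMDP([[-0.04, -0.04, -0.04, +1],
...                       [-0.04, None,  -0.04, -1],
...                       [-0.04, -0.04, -0.04, -0.04]],
...                      terminals=[(3, 2), (3, 1)])
>>> pi = best_policy(grid_world, value_iteration(grid_world, .01))
>>> grid_world.to_arrows(pi)
[['>', '>', '>', '.'], ['^', None, '^', '.'], ['^', '>', '^', '<']]

The expected arrows reproduce the doctest in the module docstring above, since this is the same environment.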

T(state, action)

Transition model. From a state and an action, return a list of (probability, result-state) pairs.

calculate_T(state, action)

go(state, direction)

Return the state that results from going in this direction.

to_arrows(policy)

to_grid(mapping)

Convert a mapping from (x, y) to v into a [[…, v, …]] grid.

class pyrieef.planning.mdp.MDP(init, actlist, terminals, transitions=None, reward=None, states=None, gamma=0.9)

Bases: object

A Markov Decision Process, defined by an initial state, transition model, and reward function.

We also keep track of a gamma value, for use by algorithms. The transition model is represented somewhat differently from the text. Instead of P(s' | s, a) being a probability number for each state/state/action triplet, we instead have T(s, a) return a list of (p, s') pairs. We also keep track of the possible states, terminal states, and actions for each state. [page 646]
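
A toy construction, assuming (as in the AIMA code) that transitions is a nested dict {state: {action: [(p, s'), ...]}} and reward is a {state: number} dict; the state and action names are illustrative.

from pyrieef.planning.mdp import MDP

# Two states: 's0' is the start, 's1' is terminal with reward 1.
transitions = {
    's0': {'stay': [(1.0, 's0')],
           'go':   [(0.9, 's1'), (0.1, 's0')]},
    's1': {'stay': [(1.0, 's1')],
           'go':   [(1.0, 's1')]},
}
rewards = {'s0': 0.0, 's1': 1.0}
toy = MDP(init='s0', actlist=['stay', 'go'], terminals=['s1'],
          transitions=transitions, reward=rewards,
          states={'s0', 's1'}, gamma=0.9)

Transitions out of terminal states are a known wrinkle of this representation; as noted below, MDP2 handles them better.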

R(state)

Return a numeric reward for this state.

T(state, action)

Transition model. From a state and an action, return a list of (probability, result-state) pairs.

actions(state)

Return a list of actions that can be performed in this state. By default, a fixed list of actions, except for terminal states. Override this method if you need to specialize by state.
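
For example, a hypothetical subclass could restrict the actions available in part of a grid; OneWayGridMDP and its column rule are made up for illustration, and the sketch assumes the terminal states are stored on the instance as self.terminals.

from pyrieef.planning.mdp import GridMDP

class OneWayGridMDP(GridMDP):

    def actions(self, state):
        # In column x == 0, only allow moving east; everywhere else
        # (including terminal states) defer to the default behaviour.
        if state not in self.terminals and state[0] == 0:
            return [(1, 0)]
        return super().actions(state)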

check_consistency()

get_states_from_transitions(transitions)

class pyrieef.planning.mdp.MDP2(init, actlist, terminals, transitions, reward=None, gamma=0.9)

Bases: pyrieef.planning.mdp.MDP

Handles terminal states, and transitions to and from terminal states better.

T(state, action)

Transition model. From a state and an action, return a list of (probability, result-state) pairs.

Module contents