
\documentclass{seminar}

\newpagestyle{mypagestyle}{}{\hbox{Bayle Shanks}} \pagestyle{mypagestyle}

\begin{document}

Preliminaries

Reinforcement Learning

``The basic paradigm of reinforcement learning is as follows: The learning agent observes an input state or input pattern, it produces an output signal (most commonly thought of as an ``action'' or ``control signal''), and then it receives a scalar ``reward'' or ``reinforcement'' feedback signal from the environment indicating how good or bad its output was.

The goal of learning is to generate the optimal actions leading to maximal reward.

In many cases the reward is also delayed (i.e., is given at the end of a long sequence of inputs and outputs). In this case the learner has to solve what is known as the ``temporal credit assignment'' problem (i.e., it must figure out how to apportion credit and blame to each of the various inputs and outputs leading to the ultimate final reward signal).'' \begin{scriptsize}(from http://www.research.ibm.com/massive/tdl.html)\end{scriptsize}

Elements of Reinforcement Learning

State
- Simplified state in our case
Reward function
Value function

Goals of Reinforcement Learning

Estimated value function
Policy

Estimated reward is $-V'$

% \includegraphics{2127.eps}{.1}

party example

Equation defining V(t)

\[
V(t) = E[r(t) + \gamma V(t + 1)]
\]
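
Unrolling this recursion makes the role of the discount factor explicit. As a sketch, assuming $0 \le \gamma < 1$ and bounded rewards (neither assumption is stated above), substituting the definition into itself repeatedly gives
\[
V(t) = E\Big[ r(t) + \gamma r(t+1) + \gamma^2 r(t+2) + \cdots \Big] = E\Big[ \sum_{k=0}^{\infty} \gamma^k r(t+k) \Big]
\]
so a reward arriving $k$ steps in the future is weighted by $\gamma^k$.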

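Rearrange to get the equation for $\delta(t)$. With a perfect estimate $\hat{V}$, the following would hold (in expectation):

\[
0 = r(t) + \gamma \hat{V}(t + 1) - \hat{V}(t)
\]
In general the estimate is not perfect, and the discrepancy defines $\delta(t)$:
\[
\delta(t) = r(t) + \gamma \hat{V}(t + 1) - \hat{V}(t)
\]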
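
As a worked illustration with hypothetical numbers (not taken from the talk), take $\gamma = 0.9$, $r(t) = 1$, $\hat{V}(t+1) = 2$ and $\hat{V}(t) = 2.5$. Then
\[
\delta(t) = 1 + 0.9 \times 2 - 2.5 = 0.3 ,
\]
a positive error: the outcome was better than $\hat{V}(t)$ predicted. Under the standard temporal-difference update (assumed here; the step size $\alpha$ is not defined above), the estimate is nudged in the direction of the error,
\[
\hat{V}(t) \leftarrow \hat{V}(t) + \alpha \, \delta(t) ,
\]
so with $\alpha = 0.1$ the new estimate would be $2.5 + 0.1 \times 0.3 = 2.53$.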

Sequences

Open questions

\end{document}