Outline for CS282r lecture (03/12/15): LSPI algorithm

Summary Portion

  1. Minjae: Brief overview of the paper's contributions, summary of the decomposition of the Q-function into k basis functions, and a quick derivation of the weights (see the sketch after this list).
  2. Dustin: LSQ, asymptotic behavior, the incremental update, and LSPI.
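To make the basis-function decomposition concrete, here is a minimal Python sketch of the linear architecture, assuming a toy scalar state, an integer action, and a hypothetical polynomial basis (`phi` and `q_value` are illustrative names, not from the paper):

```python
import numpy as np

def phi(s, a, k=4):
    """Hypothetical basis: k polynomial features of the (state, action) pair.
    The paper's actual basis functions (e.g., for the bicycle task) differ."""
    return np.array([(s + a) ** i for i in range(k)], dtype=float)

def q_value(w, s, a):
    """Linear Q-function approximation: Q(s, a) = phi(s, a)^T w.
    Linear in the weights w, not necessarily in the underlying data."""
    return phi(s, a) @ w
```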

Some things to reinforce:

  • That it's linear with respect to the parameters, not necessarily the data.
  • That it computes the parameters via a least-squares approach (in other words, decision theory) rather than MLE or MAP estimation (see the LSTDQ sketch after this list).
  • That its primary advantages are intuition, simplicity, and good theoretical properties (convergence, asymptotics, etc.).
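As a concrete sketch of the least-squares weight computation, the following LSTDQ-style routine accumulates A and b over a batch of (s, a, r, s') samples and solves A w = b. It reuses the hypothetical `phi` above; `pi` is the policy being evaluated and `gamma` the discount factor:

```python
import numpy as np

def lstdq(samples, phi, pi, gamma, k):
    """Sketch of LSTDQ: solve the least-squares fixed point A w = b, where
    A = sum over samples of phi(s,a) (phi(s,a) - gamma * phi(s', pi(s')))^T
    and b = sum over samples of r * phi(s,a). Each sample is (s, a, r, s').
    Each sample contributes a rank-one term, which is what makes an
    incremental (e.g., Sherman-Morrison) update of A^{-1} possible."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    # lstsq rather than inv() so a (near-)singular A doesn't blow up.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w
```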

Discussion Questions (not all of them will be used)

  1. In general, with these function approximations there is always something more involved than the standard Bellman update in Q-learning. What is (and isn't) model-free about this approach?
  2. How does this compare to standard OLS, e.g., what makes this a least-squares approach?
  3. Following 2, what assumptions must hold in order to obtain good results with a linear model?
  4. What makes this policy iteration, and what makes it different? (somewhat broad; may need to be more specific for actual direction)
    • Because the regression is noisy, the policy iteration is not guaranteed to be monotonically improving (I believe). However, it should be roughly monotonically improving, plus or minus noise with lower variance than value iteration.
  5. Discussion of the derivation of the weights using the commutative diagram.
  6. General intuition for doing least squares in RL, using the bicycle as the prime example.
  7. How does this paper's approach compare to using least squares as the regression in the fitted Q iteration algorithm? (see the LSPI loop sketched after this list)
    • Essentially, I believe the difference comes down to value iteration versus policy iteration.
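For the fitted Q iteration comparison, here is a minimal sketch of the LSPI loop itself: alternate LSTDQ evaluation with greedy policy improvement until the weights stop moving. It assumes a finite action set and reuses the hypothetical `phi` and `lstdq` sketches above:

```python
import numpy as np

def lspi(samples, phi, actions, gamma, k, n_iters=20, tol=1e-6):
    """Sketch of LSPI: policy iteration where each evaluation step is a
    single batch least-squares solve (LSTDQ) over the same fixed samples."""
    w = np.zeros(k)
    for _ in range(n_iters):
        # Greedy policy with respect to the current weights (improvement step).
        pi = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, phi, pi, gamma, k)  # evaluation step
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Toy usage (assumed data, two actions):
# samples = [(0.0, 0, 1.0, 0.5), (0.5, 1, 0.0, 1.0)]
# w = lspi(samples, phi, actions=[0, 1], gamma=0.9, k=4)
```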