Author: Gareth McCaughan
Date: 09:18:39 10/21/01
James Swafford wrote:
> First, it's obvious that the evaluation parameters must somehow
> be 'vectorized', or placed in a vector. OK. Additionally,
> eval(pos,w) must be a differentiable function of its parameters
> w (w1...wk).
>
> Exactly what does that mean? I've had some calculus, and I've
> had some linear algebra, but that eludes me.
If you've done some calculus, you'll know the notion of
"derivative", at least for functions of one variable.
Not all functions *have* derivatives. For instance,
f(x) = abs(x) doesn't have a derivative when x=0. f(x)=floor(x)
doesn't have a derivative -- it's not even *continuous* --
when x is an integer.
What's required here is that your evaluation function,
if you fix the position it's evaluating and vary the
weights, is differentiable in each of the weights. In
other words, when you vary any one of the weights by a
very small amount, the function behaves approximately
linearly.
Any function you use in practice will be differentiable,
or at least will be an approximation of something that
is differentiable.
Note that it's *not* required to be a differentiable
function of the *position* (whatever that might mean).
What matters is that it behaves nicely as the weights
vary.
> Next - at the end of the game, TDLeaf will update the parameter
> vector. It does this, if I understand the paper correctly,
> by the following:
>
> w = w + lr * s * t,
>
> where lr is a scalar for learning rate, s is the sum of
> the vectors of partial derivatives of the evaluation at each
> position with respect to its parameters (w1....wk), and I don't
> want to get into t right now for fear of complicating my question.
>
> How do I compute a vector of partial derivatives of the eval
> at any position with respect to its parameter weights?
Remi Coulom made one suggestion, which will probably work
OK, but it doesn't feel like the Right Thing to do. It would,
I think, be better -- albeit more work, and easier to
mess up -- to compute the derivatives exactly.
I should begin by explaining the terminology. "Partial
derivative" just means: hold all but one of your variables
constant, so that the function you're looking at (in this
case, that's what your eval returns for a fixed position,
in terms of the weights) becomes a function of just one
variable. (That is: keep all but one weight constant; then
the only thing for the evaluation to depend on is the one
remaining weight.) Then calculate the derivative of that
function. (That is: how fast does the eval change as that
one weight changes?) What you've then got is the "partial
derivative with respect to the one variable you didn't fix".
[Skip the next paragraph if you don't care why they're
called "partial".]
Obviously there's one of these for each variable. That is,
one for each weight. Between them, they make up what's called
the "gradient" of the function. (If your function depends
on only two variables, and you plot a 3d "graph" -- a
surface whose height at a given point is the value of the
function -- then the gradient vector always points uphill
in the steepest direction at any point; and its size
indicates just how steep that is.) This gradient vector
is the nearest analogue to the derivative of a function
of one variable; the "partial derivatives" are parts of it.
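In symbols (the standard definition, not notation from the
question; e_j means the vector with a 1 in slot j and 0
everywhere else):

    \nabla_w \mathrm{eval}(pos, w)
        = \left( \frac{\partial \mathrm{eval}}{\partial w_1},
                 \ldots,
                 \frac{\partial \mathrm{eval}}{\partial w_k} \right),
    \quad\text{where}\quad
    \frac{\partial \mathrm{eval}}{\partial w_j}
        = \lim_{h \to 0}
          \frac{\mathrm{eval}(pos,\, w + h\,e_j) - \mathrm{eval}(pos, w)}{h}.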
OK. How to calculate these things? Well, if you have a
really simple evaluation function then life is very easy.
Specifically, if what you do is to look at the position
and calculate a bunch of quantities Q1, Q2, ..., Qn, and
then combine them linearly (eval := w1*Q1 + ... + wn*Qn)
then the gradient (i.e., vector of partial derivatives)
is just (Q1,Q2,...,Qn). This applies no matter how
complicated the definitions of Q1,...,Qn are, provided
they don't involve the weights. (In other words, provided
they don't involve anything you're trying to train.)
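To make that concrete, here's a minimal sketch in C (my
names, nothing standard: Position, compute_features and
NWEIGHTS are placeholders for whatever your engine really
has):

    #define NWEIGHTS 3                    /* however many weights you train */

    typedef struct Position Position;     /* your engine's position type */

    /* Fills q[0..NWEIGHTS-1] with Q1..Qn.  These must not depend
       on the weights, or the gradient below is wrong. */
    void compute_features(const Position *pos, double q[NWEIGHTS]);

    double eval_and_gradient(const Position *pos,
                             const double w[NWEIGHTS],
                             double grad[NWEIGHTS])
    {
        double q[NWEIGHTS];
        compute_features(pos, q);

        double e = 0.0;
        for (int j = 0; j < NWEIGHTS; j++) {
            e += w[j] * q[j];             /* eval = w1*Q1 + ... + wn*Qn */
            grad[j] = q[j];               /* d(eval)/d(wj) = Qj, exactly */
        }
        return e;
    }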
If you're doing something that doesn't fit into this
scheme, then you have to work a little harder.
Approach 1. Do what Remi suggested: to calculate the
j'th partial derivative, do your evaluation with some
small value added to the j'th weight and without,
then subtract and divide by the "small value". To get
the best possible accuracy, you want the small value
to be on the order of sqrt(eps) times the weight it
affects, where eps is your "machine epsilon". Hmm. Or
maybe you just want the value to be on the order of
sqrt(eps). It depends a bit on the function. I think
I suggest using h*(1+w), where w is the weight in question
and h is sqrt(eps).
"Machine epsilon? What on earth is that?" I hear you cry.
Your machine epsilon is the smallest value h such that
1+h is not equal to 1 on your machine. If you're using
IEEE double-precision numbers then it's about 10^-16.
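In C, approach 1 might look like this (a sketch; evaluate()
stands in for your real evaluation function, and the fabs()
is my addition so the step stays positive for negative
weights):

    #include <float.h>                    /* DBL_EPSILON, about 2.2e-16 */
    #include <math.h>                     /* sqrt, fabs */

    typedef struct Position Position;
    double evaluate(const Position *pos, const double *w, int n);

    void numeric_gradient(const Position *pos, double *w, int n,
                          double *grad)
    {
        const double h = sqrt(DBL_EPSILON);       /* about 1.5e-8 */
        const double base = evaluate(pos, w, n);
        for (int j = 0; j < n; j++) {
            const double saved = w[j];
            const double step = h * (1.0 + fabs(saved));
            w[j] = saved + step;
            grad[j] = (evaluate(pos, w, n) - base) / step;
            w[j] = saved;                 /* restore before the next weight */
        }
    }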
Approach 2. Go through your evaluation very carefully,
from start to end, and every time you do any calculation
work out what the partial derivatives of the result of
that calculation should be, as functions of the weights.
This is less painful than it sounds, because usually
most of them will be 0. Combine results according to the
following rules (there's a sketch in code after the list):
- If none of the things you're calculating with depend
on one of the weights, the result doesn't either, so
the partial derivative is 0.
- When you add things up, you add their partial derivatives.
When you subtract things, you subtract their partial
derivatives.
- When you multiply two things (x*y, say) and x doesn't
depend on one of the weights, the p.d. of x*y is
x * p.d.(y).
- When you multiply two things which both depend on the
same weight, the partial derivatives combine like
this: p.d.(x*y) = x*p.d.(y) + y*p.d.(x).
- There are a whole lot more rules, for division and
powers and logarithms and so on, but there's a good
chance that I've already told you all you need.
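Here's one way those rules can look in code (a sketch with
made-up names): carry each intermediate quantity around
together with its vector of partial derivatives, and apply
the appropriate rule at every operation.

    #define NWEIGHTS 3

    typedef struct {
        double v;                 /* the quantity itself */
        double d[NWEIGHTS];       /* its p.d. w.r.t. each weight */
    } Dual;

    /* A quantity that doesn't involve the weights: all p.d.s 0. */
    static Dual dual_const(double c) {
        Dual r = { c, {0.0} };
        return r;
    }

    /* The j'th weight itself: p.d. 1 w.r.t. itself, 0 elsewhere. */
    static Dual dual_weight(const double w[NWEIGHTS], int j) {
        Dual r = { w[j], {0.0} };
        r.d[j] = 1.0;
        return r;
    }

    /* Sum rule: p.d.(x+y) = p.d.(x) + p.d.(y). */
    static Dual dual_add(Dual x, Dual y) {
        Dual r;
        r.v = x.v + y.v;
        for (int j = 0; j < NWEIGHTS; j++)
            r.d[j] = x.d[j] + y.d[j];
        return r;
    }

    /* Product rule: p.d.(x*y) = x*p.d.(y) + y*p.d.(x). */
    static Dual dual_mul(Dual x, Dual y) {
        Dual r;
        r.v = x.v * y.v;
        for (int j = 0; j < NWEIGHTS; j++)
            r.d[j] = x.v * y.d[j] + y.v * x.d[j];
        return r;
    }

A term like w1*Q1 then becomes
dual_mul(dual_weight(w, 0), dual_const(q1)), and the Dual your
evaluation finally returns carries both the score and the whole
gradient. (This is essentially what the software in approach 3
would generate for you.)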
Approach 3. There is computer software for doing this sort
of thing automatically (it's called "automatic differentiation").
Unfortunately, most a.d. software is designed for scientists and
expects you to be programming in Fortran. What it typically does
is: you feed it the definition of a function, and it gives you
the definition of a function that does the same thing except that
it also calculates derivatives.
Approach 1 will probably be the slowest to run and the least
accurate, but it's very easy to implement. Approach 2 will
run fast and give accurate answers, but implementing it will
be a pain and the hassle may discourage you from changing
the evaluation function in future. Approach 3 will be quite
efficient (though probably less so than a well-implemented approach 2)
and about as accurate as approach 2. It's a lot less work
if you can find AD software that does what you need; otherwise
you'd have to write your own, which I conjecture you wouldn't
enjoy much.
--
g