Stochastic gradient descent learning note

The stochastic gradient descent method is widely used in machine learning. Understanding it well helps us build a more comprehensive mental picture of how machine learning works.

To build intuition by analogy, we begin with the two-dimensional case. Suppose a 2-D function:
$$
f(x,y) = x^2 + y^2 + 1
$$
and a point $P_0(1, 2)$. We want to find the direction of steepest descent from $P_0$.

We first define a unit vector $\boldsymbol{l_0} = (\cos\alpha_0, \sin\alpha_0)$ that indicates a direction.

The derivative of $f$ along this direction is:

$$\begin{align}
\frac{\partial f}{\partial\boldsymbol{l_0}}
& = \lim_{\left\|\overrightarrow{P_0P}\right\|\to0}\frac{f(P) - f(P_0)}{\left\|\overrightarrow{P_0P}\right\|} \\
& = \lim_{\left\|\overrightarrow{P_0P}\right\|\to0} \frac{\frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y + o\left(\left\|\overrightarrow{P_0P}\right\|\right)}{\left\|\overrightarrow{P_0P}\right\|} \\
& = \frac{\partial f}{\partial x} \cos \alpha_0 + \frac{\partial f}{\partial y} \sin \alpha_0 \\
& = 2 \cos \alpha_0 + 4 \sin \alpha_0 \\
& \leq \sqrt{2^2 + 4^2} = 2\sqrt{5}
\end{align}$$

Here $P$ approaches $P_0$ along $\boldsymbol{l_0}$, so $\Delta x = \left\|\overrightarrow{P_0P}\right\|\cos\alpha_0$ and $\Delta y = \left\|\overrightarrow{P_0P}\right\|\sin\alpha_0$; in the last two steps we substitute $\frac{\partial f}{\partial x} = 2x = 2$ and $\frac{\partial f}{\partial y} = 2y = 4$ at $P_0(1, 2)$.

The bound is attained when $\boldsymbol{l_0}$ points in the direction of the gradient, that is, when
$$\left\{\begin{aligned}
\cos\alpha_0 &= \frac{f_x}{\sqrt{f_x^2 + f_y^2}} = \frac{2}{2\sqrt{5}} = \frac{\sqrt{5}}{5} \\
\sin\alpha_0 &= \frac{f_y}{\sqrt{f_x^2 + f_y^2}} = \frac{4}{2\sqrt{5}} = \frac{2\sqrt{5}}{5}
\end{aligned}\right.$$
where $f_x = \frac{\partial f}{\partial x}$ and $f_y = \frac{\partial f}{\partial y}$ are evaluated at $P_0$.
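
As a quick numerical sanity check (a small sketch of my own, not part of the original derivation), we can scan directions and confirm that the directional derivative at $P_0(1, 2)$ is maximized along the gradient direction:

    import math

    # Directional derivative of f(x, y) = x^2 + y^2 + 1 at P0 = (1, 2):
    # D(alpha) = fx * cos(alpha) + fy * sin(alpha), with (fx, fy) = (2, 4).
    fx, fy = 2.0, 4.0
    angles = [k * math.pi / 1800 for k in range(3600)]          # 0.1-degree grid
    best = max(angles, key=lambda a: fx * math.cos(a) + fy * math.sin(a))
    print(math.cos(best), math.sin(best))                       # ~ (0.447, 0.894)
    print(fx * math.cos(best) + fy * math.sin(best))            # ~ 4.472 = 2 * sqrt(5)
    print(math.sqrt(fx**2 + fy**2))                             # gradient norm, same bound

The maximizing direction matches $(\cos\alpha_0, \sin\alpha_0)$ computed above.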

So, to search for the minimum of this function starting from a point $P_0(x_0, y_0)$, we step against this direction:

$$P_1 = P_0 - s \boldsymbol{l}_0 $$

We call $s$ the step size. That is,

$$\begin{align}
x_1 &= x_0 - s \cos \alpha_0 \\
y_1 &= y_0 - s \sin \alpha_0 \\
\cos\alpha_1 &= \left.\frac{f_x}{\sqrt{f_x^2 + f_y^2}}\right|_{P_1} \\
\sin\alpha_1 &= \left.\frac{f_y}{\sqrt{f_x^2 + f_y^2}}\right|_{P_1}
\end{align}$$

From this, we can write the algorithm in pseudocode:

    Given x, y, f(x, y) = x*x + y*y + 1
    do:
        fx := 2*x
        fy := 2*y
        norm := sqrt(fx^2 + fy^2)
        cos_a := fx / norm
        sin_a := fy / norm
        x := x - s * cos_a
        y := y - s * sin_a
    while (f is decreasing)
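
For reference, here is a minimal Python sketch of the same loop (my own reconstruction, not the original experiment script; the function, starting point, and stopping rule follow the pseudocode above):

    import math

    def f(x, y):
        return x * x + y * y + 1

    def gradient_descent(x, y, s):
        """Fixed-step descent along the normalized gradient of f."""
        value = f(x, y)
        iterations = 0
        while True:
            fx, fy = 2 * x, 2 * y                     # partial derivatives of f
            norm = math.sqrt(fx * fx + fy * fy)
            if norm == 0:                             # already at the stationary point
                break
            new_x = x - s * fx / norm
            new_y = y - s * fy / norm
            new_value = f(new_x, new_y)
            if new_value >= value:                    # stop once f is no longer decreasing
                break
            x, y, value = new_x, new_y, new_value
            iterations += 1
        return x, y, value, iterations

    print(gradient_descent(1.0, 2.0, s=0.1))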

I ran a Python experiment based on this algorithm; the results are summarized in the following table:

| step size $s$ | 0.1 | 0.01 | 0.001 |
| --- | --- | --- | --- |
| number of iterations | 23 | 224 | 2230 |
| optimized value of $f$ | 1.0302809 | 1.0294120 | 1.0294120 |

As you can see, convergence is fast, taking only a few dozen steps at the largest step size. However, when the step size shrinks by a factor of 10 and the number of iterations grows by a factor of 10 accordingly, the optimized value barely improves, and the final result remains about 0.03 away from the theoretical minimum $f(0, 0) = 1$.

As for improving this algorithm, in my personal opinion: first, the step size could vary with the magnitude of the gradient, that is, take a large step when the gradient is large and a small step when it is small (see the sketch below). Second, near the end of the iteration, instead of following the exact gradient direction, we could deviate from it a little; this might yield a better result and also gives the method some ability to jump out of local optima.
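
A minimal sketch of the first idea, under my own assumptions (the names, tolerance, and iteration cap are illustrative): scaling the step by the gradient magnitude amounts to the unnormalized update $x \leftarrow x - s\,f_x$, $y \leftarrow y - s\,f_y$.

    import math

    def f(x, y):
        return x * x + y * y + 1

    def descent_with_adaptive_step(x, y, s=0.1, tol=1e-8, max_iter=10000):
        """Step length proportional to the gradient magnitude: large gradient, large step."""
        for i in range(max_iter):
            fx, fy = 2 * x, 2 * y
            if math.sqrt(fx * fx + fy * fy) < tol:    # gradient has (almost) vanished
                break
            x, y = x - s * fx, y - s * fy             # step scales with the gradient
        return x, y, f(x, y), i

    print(descent_with_adaptive_step(1.0, 2.0))

For this particular function and a small enough step size, the steps shrink automatically as the gradient shrinks, so the iterate approaches the minimizer $(0, 0)$ and the final value gets much closer to 1. The second idea, a small random deviation late in the run, could be added on top as a perturbation of the update direction.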
