User:Hosiet/沙盒/随机梯度下降法

随机梯度下降法（Stochastic Gradient Descent，英文中常缩写为SGD），又被称为增量梯度下降法，是一个在最小化目标函数时梯度下降优化（英语：gradient descent optimization）方法中的随机逼近（英语：stochastic approximation）方法，且被写为可微函数的和的形式。换言之，随机梯度下降方法试图结合迭代方法与随机性来找到最小值或最大值。

背景[编辑]

统计估计和机器学习领域都会考虑对具有如下和的形式的目标函数进行极小化的问题：

Q(w)=\sum _{i=1}^{n}Q_{i}(w),

其中对 $Q(w)$ 进行极小化的参数 $w$ 需要进行估计。每一个被加项函数 $Q_{i}$ 通常与在第 $i$ 个用来进行训练的数据集的观测相关联。

在经典统计学中，最小化和函数问题在最小二乘法和极大似然估计中被（独立地）提出。一类一般的对和的极小值的估计量被称为M-estimator（英语：M-estimator）。然而在统计学中长期以来存在共识，即 However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation.^[1] Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization: In this case, $Q_{i}(w)$ is the value of the loss function at $i$ -th example, and $Q(w)$ is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations :

w:=w-\eta \nabla Q(w)=w-\eta \sum _{i=1}^{n}\nabla Q_{i}(w),

where $\eta$ is a step size (sometimes called the learning rate in machine learning).

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.

However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.^[2]

迭代方法[编辑]

Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.

在随机（或称“在线”）梯度下降中， $Q(w)$ 的真实梯度 is approximated by a gradient at a single example:

w:=w-\eta \nabla Q_{i}(w).

As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.

使用伪代码，随机梯度下降法可以使用下面的方式展现：

Choose an initial vector of parameters $w$ and learning rate $\eta$ .
Repeat until an approximate minimum is obtained:
- Randomly shuffle examples in the training set.
- For $\!i=1,2,...,n$ $\!i=1,2,...,n$ , do:
  - $\!w:=w-\eta \nabla Q_{i}(w).$

A compromise between computing the true gradient and the gradient at a single example, is to compute the gradient against more than one training example (called a "mini-batch") at each step. This can perform significantly better than true stochastic gradient descent because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\eta$ decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.^[3] ^[4] This is in fact a consequence of the Robbins-Siegmund theorem.^[5]

示例[编辑]

Let's suppose we want to fit a straight line $y=\!w_{1}+w_{2}x$ to a training set of two-dimensional points $(x_{1},y_{1}),\ldots ,(x_{n},y_{n})$ using least squares. The objective function to be minimized is:

Q(w)=\sum _{i=1}^{n}Q_{i}(w)=\sum _{i=1}^{n}\left(w_{1}+w_{2}x_{i}-y_{i}\right)^{2}.

The last line in the above pseudocode for this specific problem will become:

{\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}:={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}2(w_{1}+w_{2}x_{i}-y_{i})\\2x_{i}(w_{1}+w_{2}x_{i}-y_{i})\end{bmatrix}}.

应用[编辑]

Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical models.^[6] When combined with the backpropagation algorithm, it is the de facto standard algorithm for training artificial neural networks.^[7]

Stochastic gradient descent competes with the L-BFGS algorithm,^{[來源請求]} which is also widely used. Stochastic gradient descent has been used since at least 1960 for training linear regression models, originally under the name ADALINE.^[8]

Another popular stochastic gradient descent algorithm is the least mean squares (LMS) adaptive filter.

Extensions and variants[编辑]

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function $η t$ of the iteration number $t$ , giving a learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on $k$ -means clustering.^[9]

Momentum[编辑]

Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams' seminal paper on backpropagation learning.^[10] stochastic gradient descent with momentum remembers the update $Δ w$ at each iteration, and determines the next update as a convex combination of the gradient and the previous update:^[11]

\Delta w:=\eta \nabla Q_{i}(w)+\alpha \Delta w

w:=w-\Delta w

or as a mathematically equivalent formulation:^[12]

\Delta w:=-\eta \nabla Q_{i}(w)+\alpha \Delta w

w:=w+\Delta w

that leads to:

w:=w-\eta \nabla Q_{i}(w)+\alpha \Delta w

where the parameter $w$ which minimizes $Q(w)$ is to be estimated, and $\eta$ is a step size (sometimes called the learning rate in machine learning).

The name momentum stems from an analogy to momentum in physics: the weight vector, thought of as a particle traveling through parameter space,^[10] incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully for several decades.^[13]

Averaging[编辑]

Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of^[14]

{\bar {w}}={\frac {1}{t}}\sum _{i=0}^{t-1}w_{i}

.

When optimization is done, this averaged parameter vector takes the place of $w$ .

AdaGrad[编辑]

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent with per-parameter learning rate, first published in 2011.^[15]^[16] Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.^[15] It still has a base learning rate $η$ , but this is multiplied with the elements of a vector ${G j, j}$ which is the diagonal of the outer product matrix.

G=\sum _{\tau =1}^{t}g_{\tau }g_{\tau }^{\mathsf {T}}

where $g_{\tau }=\nabla Q_{i}(w)$ , the gradient, at iteration $τ$ . The diagonal is given by

G_{j,j}=\sum _{\tau =1}^{t}g_{\tau ,j}^{2}

.

This vector is updated after every iteration. The formula for an update is now

w:=w-\eta \,\mathrm {diag} (G)^{-{\frac {1}{2}}}\circ g

^[a]

or, written as per-parameter updates,

w_{j}:=w_{j}-{\frac {\eta }{\sqrt {G_{j,j}}}}g_{j}.

Each ${G (i, i)}$ gives rise to a scaling factor for the learning rate that applies to a single parameter $w i$ . Since the denominator in this factor, ${\sqrt {G_{i}}}={\sqrt {\sum _{\tau =1}^{t}g_{\tau }^{2}}}$ is the ℓ₂ norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.^[13]

While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.^[17]

RMSProp[编辑]

RMSProp (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. the idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.^[18] So, first the running average is calculated in terms of means square,

v(w,t):=\gamma v(w,t-1)+(1-\gamma )(\nabla Q_{i}(w))^{2}

where, $\gamma$ is the forgetting factor.

And the the parameters are updated as,

w:=w-{\frac {\eta }{\sqrt {v(w,t)}}}\nabla Q_{i}(w)

RMSProp has shown excellent adaptation of learning rate in different applications.

Adam[编辑]

'Adam' ^[19] (for Adaptive Moment Estimation) is an update to RMSProp optimizer. In this running average of both the gradients and their magnitudes are used. The three equations that define this optimizer is as follows,

m(w,t):=\gamma _{1}*m(w,t-1)+(1-\gamma _{1})*\nabla Q_{i}(w)

v(w,t):=\gamma _{2}*v(w,t-1)+(1-\gamma _{2})*(\nabla Q_{i}(w))^{2}

{\hat {m}}(w,t):={\frac {m(w,t)}{(1-\gamma _{1}^{t})}}

{\hat {v}}(w,t):={\frac {v(w,t)}{(1-\gamma _{2}^{t})}}

w:=w-{\frac {\eta }{{\sqrt {{\hat {v}}(w,t)}}+\epsilon }}*{\hat {m}}(w,t)

where, $\gamma _{1}$ and $\gamma _{2}$ are two forgetting factors of the algorithm, respectively for gradients and magnitude of gradients.

注释[编辑]

^ $\circ$ is the element-wise product.

参见[编辑]

线性分类器
在线机器学习（英语：Online machine learning）

参考[编辑]

^ Ferguson, Thomas S. An inconsistent maximum likelihood estimate. Journal of the American Statistical Association. 1982, 77 (380): 831–834. JSTOR 2287314. doi:10.1080/01621459.1982.10477894.
^ Bottou, Léon; Bousquet, Olivier. The Tradeoffs of Large Scale Learning. Advances in Neural Information Processing Systems 20: 161–168. 2008.
^ Bottou, Léon. Online Algorithms and Stochastic Approximations. Online Learning and Neural Networks. Cambridge University Press. 1998. ISBN 978-0-521-65263-6.
^ Kiwiel, Krzysztof C. Convergence and efficiency of subgradient methods for quasiconvex minimization 90 (1). Berlin, Heidelberg: Springer. 2001: 1–25. ISSN 0025-5610. MR 1819784. doi:10.1007/PL00011414. |journal=被忽略 (帮助)
^ Robbins, Herbert; Siegmund, David O. A convergence theorem for non negative almost supermartingales and some applications. Rustagi, Jagdish S. (编). Optimizing Methods in Statistics. Academic Press. 1971.
^ Jenny Rose Finkel, Alex Kleeman, Christopher D. Manning (2008). Efficient, Feature-based, Conditional Random Field Parsing. Proc. Annual Meeting of the ACL.
^ Yann Lecun et. al., Efficient Backprop Triks
^ Avi Pfeffer. CS181 Lecture 5 — Perceptrons (PDF). Harvard University.
^ Cited by Darken, Christian; Moody, John. Fast adaptive k-means clustering: some empirical results. Int'l Joint Conf. on Neural Networks (IJCNN). IEEE. 1990.
^ ^10.0 ^10.1 Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. Learning representations by back-propagating errors. Nature. 8 October 1986, 323 (6088): 533–536. doi:10.1038/323533a0.
^ Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey E. Sanjoy Dasgupta and David Mcallester , 编. On the importance of initialization and momentum in deep learning (PDF). In Proceedings of the 30th international conference on machine learning (ICML-13) 28. Atlanta, GA: 1139–1147. June 2013 [14 January 2016].
^ Sutskever, Ilya. Training recurrent neural networks (PDF) (学位论文). University of Toronto: 74. 2013.
^ ^13.0 ^13.1 Zeiler, Matthew D. ADADELTA: An adaptive learning rate method. 2012. arXiv:1212.5701 .
^ Polyak, Boris T.; Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization. 1992, 30 (4): 838–855.
^ ^15.0 ^15.1 Duchi, John; Hazan, Elad; Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization (PDF). JMLR. 2011, 12: 2121–2159.
^ Perla, Joseph. Notes on AdaGrad (PDF). 2014.
^ Gupta, Maya R.; Bengio, Samy; Weston, Jason. Training highly multiclass classifiers (PDF). JMLR. 2014, 15 (1): 1461–1492.
^ Tieleman, Tijmen and Hinton, Geoffrey (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning
^ Diederik, Kingma; Ba, Jimmy. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980 .

延伸阅读[编辑]

Bertsekas, Dimitri P., Nonlinear Programming 2nd, Cambridge, MA.: Athena Scientific, 1999, ISBN 1-886529-00-0 .

Bertsekas, Dimitri, Convex Analysis and Optimization, Athena Scientific, 2003 .

Bottou, Léon, Stochastic Learning, Advanced Lectures on Machine Learning, LNAI 3176, Springer: 146–168, 2004, ISBN 978-3-540-23122-6 .

Davidon, W.C., New least-square algorithms, Journal of Optimization Theory and Applications, 1976, 18 (2): 187–197, MR 0418461, doi:10.1007/BF00935703 .

Duda, Richard O.; Hart, Peter E.; Stork, David G., Pattern Classification 2nd, Wiley, 2000, ISBN 978-0-471-05669-0 .

Kiwiel, Krzysztof C., Convergence of approximate and incremental subgradient methods for convex optimization, SIAM Journal of Optimization, 2004, 14 (3): 807–840, MR 2085944, doi:10.1137/S1052623400376366 . (Extensive list of references)

Spall, James C., Introduction to Stochastic Search and Optimization, Wiley, 2003, ISBN 978-0-471-33052-3 .

软件[编辑]

sgd: an LGPL C++ library which uses stochastic gradient descent to fit SVM and conditional random field models.
CRF-ADF A C# toolkit of stochastic gradient descent and its feature-frequency-adaptive variation for training conditional random field models.
Vowpal Wabbit: BSD licence, fast scalable learning by John Langford and others. Includes several stochastic gradient descent variants. Source repository on github

外部链接[编辑]

Using stochastic gradient descent in C++, Boost, Ublas for linear regression

[17] $\circ$ is the element-wise product.

[1] Ferguson, Thomas S. An inconsistent maximum likelihood estimate. Journal of the American Statistical Association. 1982, 77 (380): 831–834. JSTOR 2287314. doi:10.1080/01621459.1982.10477894.

[2] Bottou, Léon; Bousquet, Olivier. The Tradeoffs of Large Scale Learning. Advances in Neural Information Processing Systems 20: 161–168. 2008.

[3] Bottou, Léon. Online Algorithms and Stochastic Approximations. Online Learning and Neural Networks. Cambridge University Press. 1998. ISBN 978-0-521-65263-6.

[4] Kiwiel, Krzysztof C. Convergence and efficiency of subgradient methods for quasiconvex minimization 90 (1). Berlin, Heidelberg: Springer. 2001: 1–25. ISSN 0025-5610. MR 1819784. doi:10.1007/PL00011414. |journal=被忽略 (帮助)

[5] Robbins, Herbert; Siegmund, David O. A convergence theorem for non negative almost supermartingales and some applications. Rustagi, Jagdish S. (编). Optimizing Methods in Statistics. Academic Press. 1971.

[6] Jenny Rose Finkel, Alex Kleeman, Christopher D. Manning (2008). Efficient, Feature-based, Conditional Random Field Parsing. Proc. Annual Meeting of the ACL.

[7] Yann Lecun et. al., Efficient Backprop Triks

[8] Avi Pfeffer. CS181 Lecture 5 — Perceptrons (PDF). Harvard University.

[9] Cited by Darken, Christian; Moody, John. Fast adaptive k-means clustering: some empirical results. Int'l Joint Conf. on Neural Networks (IJCNN). IEEE. 1990.

[Rumelhart1986-10] 10.0 ^10.1 Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. Learning representations by back-propagating errors. Nature. 8 October 1986, 323 (6088): 533–536. doi:10.1038/323533a0.

[Sutskever2013-11] Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey E. Sanjoy Dasgupta and David Mcallester , 编. On the importance of initialization and momentum in deep learning (PDF). In Proceedings of the 30th international conference on machine learning (ICML-13) 28. Atlanta, GA: 1139–1147. June 2013 [14 January 2016].

[SutskeverPhD-12] Sutskever, Ilya. Training recurrent neural networks (PDF) (学位论文). University of Toronto: 74. 2013.

[Zeiler_2012-13] 13.0 ^13.1 Zeiler, Matthew D. ADADELTA: An adaptive learning rate method. 2012. arXiv:1212.5701 .

[14] Polyak, Boris T.; Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization. 1992, 30 (4): 838–855.

[duchi-15] 15.0 ^15.1 Duchi, John; Hazan, Elad; Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization (PDF). JMLR. 2011, 12: 2121–2159.

[16] Perla, Joseph. Notes on AdaGrad (PDF). 2014.

[18] Gupta, Maya R.; Bengio, Samy; Weston, Jason. Training highly multiclass classifiers (PDF). JMLR. 2014, 15 (1): 1461–1492.

[19] Tieleman, Tijmen and Hinton, Geoffrey (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning

[Adam2014-20] Diederik, Kingma; Ba, Jimmy. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980 .

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[a]

[17]

[18]

[19]