
Deep Learning Optimization Glossary

A small list of concepts related to deep learning optimization.

February 20, 2023

AdaGrad | Adam | Batch Optimization | Bias | Bias-variance decomposition | Bootstrap aggregating | Dropout | Early stopping | Empirical risk minimization | Gradient | Gradient descent | Hessian | L1 Regularization | L2 Regularization | Label smoothing | Learning problem | Mini-batch optimization | Momentum | Nesterov momentum | Noise Robustness | RMSProp | Stochastic gradient descent (SGD) | Validation set |

AdaGrad

AdaGrad belongs to a class of gradient descent algorithms with adaptive learning rates. The algorithm is as follows:

  1. Choose a learning rate \(\epsilon\) and initial parameters \(\theta_0\). Do the following until a stopping criterion is met:

    1. Take a mini-batch \(\{(x^{(i)}, y^{(i)})\}_{i=1,\cdots, m}\) from the training set.

    2. Compute the gradient estimate \[g_t = \frac{1}{m} \sum_{i=1}^m \nabla_{\theta} L(f(x^{(i)}; \theta_t), y^{(i)})\]

    3. Aggregate the outer product of all the previous gradient estimates \(G_t = \sum_{i=0}^{t} g_i g_i^T\).

    4. Update the parameter estimate \(\theta_{t+1} \leftarrow \theta_t - \epsilon G_t^{-1/2} g_t.\)

Instead of using \(G_t\) to update the parameters, it is common to use \(\text{diag}(G_t + \delta I)\) with small \(\delta > 0\) to avoid singularity and simplify computation; hence the update rule becomes \(\theta_{t+1} \leftarrow \theta_t - \epsilon \cdot \text{diag}(G_t + \delta I)^{-1/2} g_t\).

To see why this is an adaptive learning algorithm, observe \(\epsilon \cdot \text{diag}(G_t + \delta I)^{-1/2}\). The diagonal of \(G_t + \delta I\) consists only of positive elements, and each element increases with each iteration. This means that the diagonal elements of \(\text{diag}(G_t + \delta I)^{-1/2}\) will shrink closer and closer to zero with each iteration - adapting the learning rate.
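
As a concrete (if minimal) sketch of the diagonal update above - the toy objective, data-free "mini-batch" gradient, and hyperparameter values are illustrative, not prescribed:

```python
import numpy as np

def adagrad_step(theta, g, G_diag, eps=0.01, delta=1e-8):
    """One diagonal AdaGrad update.

    theta  : current parameters
    g      : mini-batch gradient estimate at theta
    G_diag : running sum of squared gradients (the diagonal of G_t)
    """
    G_diag = G_diag + g * g                          # accumulate diag(g_t g_t^T)
    theta = theta - eps * g / np.sqrt(G_diag + delta)
    return theta, G_diag

# toy usage: minimize f(theta) = ||theta||_2^2, whose gradient is 2 * theta
theta, G_diag = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    g = 2 * theta                                    # stand-in for a mini-batch gradient
    theta, G_diag = adagrad_step(theta, g, G_diag)
```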

The primary disadvantage of AdaGrad is that the learning rate decays based on the entire accumulated history of squared gradients, so the effective learning rate can become very small early in training. The RMSProp algorithm addresses this weakness by exponentially weighting the past gradient aggregations.

Adam

Adam belongs to a class of gradient descent algorithms with adaptive learning rates. The algorithm is as follows:

  1. Choose a learning rate \(\epsilon\), two decay rates \(\beta_0, \beta_1 \in [0,1)\).

  2. Choose initial first and second moment estimates of the gradient \(m_0 = v_0 = 0\).

  3. Choose initial parameters \(\theta_0\).

  4. Do the following until a stopping criterion is met:

    1. Take a mini-batch \(\{(x^{(i)}, y^{(i)})\}_{i=1,\cdots, m}\) from the training set.

    2. Compute the gradient estimate \[g_t = \frac{1}{m} \sum_{i=1}^m \nabla_{\theta} L(f(x^{(i)}; \theta_t), y^{(i)})\]

    3. Update the first moment estimate \(m_t \leftarrow \beta_0 \cdot m_{t-1} + (1 - \beta_0) \cdot g_t\).

    4. Update the second raw moment estimate \(v_t \leftarrow \beta_1 \cdot v_{t-1} + (1 - \beta_1) \cdot g_t^2\).

    5. Correct the first moment estimate \(m_t' \leftarrow \frac{m_t}{1-\beta_0^t}\).

    6. Correct the second raw moment estimate \(v_t' \leftarrow \frac{v_t}{1 - \beta_1^t}\).

    7. Update the parameter estimate \(\theta_{t+1} \leftarrow \theta_{t} - \frac{\epsilon}{\delta + \sqrt{v_t'}} \cdot m_t'\) for small \(\delta > 0\).

Adam is an adaptive learning algorithm - one need only look at the update rule \(\theta_{t+1} \leftarrow \theta_{t} - \frac{\epsilon}{\delta + \sqrt{v_t'}} \cdot m_t'\) and notice that the effective step size for each parameter shrinks when the recent squared gradients have been large and grows when they have been small.
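
A minimal sketch of one Adam step, following steps 3-7 above (the default hyperparameter values here are common choices, not mandated by the algorithm):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eps=0.001, beta0=0.9, beta1=0.999, delta=1e-8):
    """One Adam update for iteration t (t starts at 1)."""
    m = beta0 * m + (1 - beta0) * g            # first moment estimate
    v = beta1 * v + (1 - beta1) * g * g        # second raw moment estimate
    m_hat = m / (1 - beta0 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta1 ** t)               # bias-corrected second raw moment
    theta = theta - eps * m_hat / (delta + np.sqrt(v_hat))
    return theta, m, v
```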

The exponentially weighted sum of the gradients provides a biased estimate of the first moment of the stochastic gradient, taken from some true gradient distribution. Likewise, the exponentially weighted sum of the squared gradients gives a biased estimate of the second raw moment of the stochastic gradient. The biased estimates are then corrected in order to update the parameters. To see how the bias is corrected by a simple multiplication, unroll the recursion (with \(m_0 = 0\)): \[\begin{aligned} \mathbb{E}(m_t) &= \mathbb{E}\left[(1 - \beta_0) \sum_{i=0}^{t-1} \beta_0^i \cdot g_{t-i}\right]\\ &= \mathbb{E}(g_t) \cdot (1 - \beta_0^t) + C. \end{aligned}\]

If \(\{g_t\}\) is stationary, then \(C = 0\). Otherwise, \(C\) can be kept small because the decay rate \(\beta_0\) can be chosen to heavily discount gradients seen in earlier steps. Therefore, the first moment \(\mathbb{E}(g_t)\) can be recovered via the corrected estimate \((1 - \beta_0^t)^{-1} \cdot \mathbb{E}(m_t)\).

Batch optimization

Batch optimization uses the entire training set for each gradient step. It is very computationally expensive for large datasets since the model must evaluate every single data point for every step. However, it gives the “true” empirical gradient.

Bias

Suppose we have an estimate \(\hat{\theta}\) of a true parameter \(\theta\). Note that the estimator \(\hat{\theta}\) is stochastic. The bias of \(\hat{\theta}\) is defined as \[\begin{aligned} \text{Bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta. \end{aligned}\]

If \(\text{Bias}(\hat{\theta}) = 0\), then \(\hat{\theta}\) is called an unbiased estimator.

Bias-variance decomposition

Suppose we have an estimate \(\hat{\theta}\) of a true parameter \(\theta\). Note that the estimator \(\hat{\theta}\) is stochastic. Then the mean squared error of \(\hat{\theta}\) can be decomposed as follows: \[\begin{aligned} \text{MSE}(\hat{\theta}) &= \mathbb{E}[(\hat{\theta} - \theta)^2]\\ &= \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta}). \end{aligned}\]
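
The decomposition follows by adding and subtracting \(\mathbb{E}(\hat{\theta})\) inside the square; the cross term vanishes because \(\mathbb{E}(\hat{\theta}) - \theta\) is a constant and \(\mathbb{E}[\hat{\theta} - \mathbb{E}(\hat{\theta})] = 0\): \[\begin{aligned} \mathbb{E}[(\hat{\theta} - \theta)^2] &= \mathbb{E}[(\hat{\theta} - \mathbb{E}(\hat{\theta}) + \mathbb{E}(\hat{\theta}) - \theta)^2]\\ &= \mathbb{E}[(\hat{\theta} - \mathbb{E}(\hat{\theta}))^2] + 2\, \mathbb{E}[\hat{\theta} - \mathbb{E}(\hat{\theta})] \cdot (\mathbb{E}(\hat{\theta}) - \theta) + (\mathbb{E}(\hat{\theta}) - \theta)^2\\ &= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2. \end{aligned}\]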

Bootstrap aggregating

Bootstrap aggregating (or “bagging”) is a model averaging method. The bagging algorithm is:

  1. Suppose the training dataset has \(n\) data points. For \(i \in \{1, 2, \cdots, k\}\), create training set \(T_i\) by sampling with replacement from the training set \(n\) times.

  2. Train model \(i\) with training set \(T_i\).

  3. Average the outputs of the results from each model.

A quick argument for why bagging works: take \(k\) models, and suppose that model \(i\) makes an error \(\epsilon_i\) on a given example, where the vector of errors satisfies \(\epsilon = (\epsilon_1, \cdots, \epsilon_k)^T \sim \mathcal{N}(0, vI)\). Then the expected squared error from bagging is \[\begin{aligned} \mathbb{E}\left[\left(k^{-1} \sum_{i} \epsilon_i\right)^2\right] = k^{-2} \cdot kv = \frac{v}{k}. \end{aligned}\]

Hence if the errors are uncorrelated, then we have cut down on the expected squared error just by averaging.
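
A small sketch of the bagging procedure with ordinary least squares as the base model (the data, number of models, and choice of base model are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_least_squares(X, y):
    # plain least-squares fit; stands in for any base model
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bag(X, y, k=10):
    """Train k models on bootstrap resamples of the training set."""
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)      # sample n points with replacement
        models.append(fit_least_squares(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    # average the outputs of the individual models
    return np.mean([X @ w for w in models], axis=0)

# toy usage
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
models = bag(X, y, k=10)
preds = bagged_predict(models, X)
```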

Ensemble models are a good example of trading extra compute and memory for better predictive performance.

Dropout

Dropout is a regularization method that cheaply approximates bagging. It is also a method of applying noise to input and hidden units. The dropout algorithm works as follows:

  1. Each time a mini-batch is loaded, randomly sample a binary vector \(\mu \in \{0,1\}^d\), where \(d\) is the number of input and hidden units in the neural network. This vector represents which units to include/exclude in the model.

  2. Run forward and backward propagation with \(\mu\) applied element-wise to the corresponding units, and report the cost \(J(\theta, \mu)\).

  3. The goal of dropout training is to minimize \(\mathbb{E}_{\mu} J(\theta, \mu)\); each sampled \(\mu\) provides an unbiased estimate of the gradient of this expectation.

Dropout really is similar to bagging - a “new” model is created every time we sample a new binary vector \(\mu\), and each model sees a different “training set” (the current mini-batch). On the other hand, the key difference between bagging models and dropout models is that dropout models share parameters with the “parent” model, i.e. the neural network with \(\mu = [1, 1, \cdots, 1]\). Furthermore, aggregating the predictions of the dropout models is not as straightforward as averaging the outputs of separately trained models.

Because of the included stochasticity, the output of the model at test time should intuitively be the expected output during training time. Suppose a unit is included in a dropout model with probability \(p\), and the weights branching from this unit are \(\textbf{w}\). Averaged over all dropout combinations, the expected weights branching from the unit are \(p \textbf{w}\), because the weights are zero whenever the unit is excluded. Hence, we can combine the results of the different dropout models by multiplying each of the weights by the probability of inclusion. Alternatively, we can multiply the weights by \(1/p\) during training, and use the vanilla weights during testing.

Although dropout applies a random binary vector, using a random vector of real values e.g. \(\mu \sim N(1, I)\) can yield similar results. This supports the idea that dropout is like adding noise to the neural network units.
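
A minimal sketch of (inverted) dropout applied to one layer's activations, using the \(1/p\) training-time scaling described above; the keep probability and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.8, train=True):
    """Inverted dropout on a layer's activations h.

    At training time, zero each unit with probability 1 - p_keep and scale the
    survivors by 1 / p_keep; at test time, use the activations unchanged.
    """
    if not train:
        return h
    mu = rng.random(h.shape) < p_keep      # the binary mask mu
    return (mu * h) / p_keep

h = rng.normal(size=(4, 16))               # a mini-batch of hidden activations
h_train = dropout(h, p_keep=0.8, train=True)
h_test = dropout(h, train=False)
```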

Early stopping

Early stopping is an implicit regularization method where you stop the training process just before the model starts to overfit. The early stopping meta-algorithm works as follows:

  1. For an epoch \(k\), train the model and retrieve an evaluation score by testing on a validation set.

  2. If the evaluation score has improved from the last epoch, save the model parameters and the epoch number. Continue to the next epoch.

  3. If the evaluation score has not improved for more than \(p\) epochs, then exit. (\(p\) is called the “patience”)

  4. If the evaluation score has not improved, but it has not been more than \(p\) epochs since it last improved, continue to the next epoch.

The idea behind early stopping as an implicit regularizer is that the number of training steps is a hyperparameter - this means that stopping training after a variable number of steps produces a variety of models. Given that the model is complex enough to sufficiently learn the task, the validation curve should be U-shaped as a function of the number of training steps. By early stopping, we are constraining the effective capacity of the model.
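
A sketch of the meta-algorithm; `train_one_epoch` and `evaluate` are hypothetical, user-supplied callables (one epoch of training, and a validation-set score where higher is better), not part of any particular library:

```python
import copy

def early_stopping_fit(model, train_one_epoch, evaluate, patience=5, max_epochs=200):
    """Train until the validation score stops improving for `patience` epochs."""
    best_score, best_model, best_epoch = float("-inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate(model)                    # score on the validation set
        if score > best_score:                     # improvement: save a snapshot
            best_score, best_epoch = score, epoch
            best_model = copy.deepcopy(model)
        elif epoch - best_epoch > patience:        # no improvement for > patience epochs
            break
    return best_model, best_epoch
```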

Empirical risk minimization

Empirical risk minimization is the framework that sets up the details of the learning problem. Our ultimate goal is to minimize the expected generalization error (otherwise known as the risk): with a data generating distribution \(p_{\text{data}}\), the expected generalization error is \[\begin{aligned} J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} L(f(x; \theta), y). \end{aligned}\]

Because we do not have access to \(p_{\text{data}}\), the risk is intractable. Instead, we choose to optimize the empirical risk, which is the expected loss on the training set. Since the training set can be seen as being drawn from the empirical distribution \(\hat{p}_{\text{data}}\), the expected loss on the training set is defined as \[\begin{aligned} J(\theta) &= \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y) \\ &= \frac{1}{n} \sum_{i=1}^n L(f(x_i; \theta), y_i). \end{aligned}\]

In practice, if the original loss function \(L\) does not have nice properties (not continuous e.g. 0-1 loss), a surrogate loss function is used as a proxy during training. For example, negative log likelihood loss is used as a surrogate for 0-1 loss. Regardless, the framework still holds - the goal is to find the parameters that minimize the empirical risk with the hope that true risk will also be minimized.
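
As a small illustration of empirical risk and a surrogate loss, the sketch below computes both the 0-1 loss and a logistic (negative log-likelihood) surrogate for a linear classifier on made-up data:

```python
import numpy as np

def zero_one_loss(scores, y):
    # empirical risk under the 0-1 loss: predict class 1 when the score is positive
    return np.mean((scores > 0).astype(float) != y)

def logistic_loss(scores, y):
    # negative log-likelihood surrogate for the 0-1 loss
    p = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # made-up binary labels
w = np.array([1.0, 0.2])                    # some candidate parameters
scores = X @ w
print(zero_one_loss(scores, y), logistic_loss(scores, y))
```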

Gradient

Given a function \(f: \mathbb{R}^n \to \mathbb{R}\), the gradient of \(f\) at point \(x_0\) is denoted as \[\begin{aligned} \nabla f(x_0) = \begin{bmatrix} \frac{\partial}{\partial x_1} f(x_0) & \cdots & \frac{\partial}{\partial x_n} f(x_0) \end{bmatrix}^T. \end{aligned}\]

At a point \(x_0\), the gradient points in the direction of steepest ascent. This can be shown through the directional derivative. We want to find a unit vector \(u^*\) such that \[\begin{aligned} u^* &= \arg \max_{||u||_2 = 1} \lim_{h \to 0} \frac{f(x_0 + h u) - f(x_0)}{h} \\ &= \arg \max_{||u||_2 = 1} (u^T) (\nabla f(x_0))\\ &= \arg \max_{||u||_2 = 1} ||u||_2 ||\nabla f(x_0)||_2 \cos \theta. \end{aligned}\]

Because \(\nabla f(x_0)\) is unaffected by \(u\) and \(||u||_2 = 1\), the above optimization is equivalent to \[\begin{aligned} u^* = \arg \max_{||u||_2 = 1} \cos \theta. \end{aligned}\]

This is maximized when \(\theta = 0\), so \(u^*\) is the normalized vector of \(\nabla f(x_0)\).

In contrast, \(u_* = \arg \min_{||u||_2 = 1} ||u||_2 ||\nabla f(x_0)||_2 \cos \theta = - u^*.\) Hence the gradient points in the direction of steepest ascent, and its negative points in the direction of steepest descent.
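
A quick numerical check of the steepest-ascent claim on a toy function (the function and directions below are arbitrary): finite-difference directional derivatives along random unit vectors never exceed the one along the normalized gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2            # toy function from R^2 to R

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])       # its exact gradient

x0 = np.array([1.0, -1.0])
g = grad_f(x0)
u_star = g / np.linalg.norm(g)                  # claimed steepest-ascent direction

h = 1e-5
d_star = (f(x0 + h * u_star) - f(x0)) / h       # directional derivative along u_star
for _ in range(5):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                      # a random unit direction
    d_u = (f(x0 + h * u) - f(x0)) / h
    assert d_u <= d_star + 1e-3                 # no direction beats the gradient
```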

Gradient descent

Gradient descent is an iterative algorithm for solving a minimization problem. Typically, we are trying to find a point \(x_*\) such that \(x_* = \arg \min_{x} f(x)\) for some objective function \(f\). The core idea of gradient descent depends on the fact that the negative of the gradient points in the direction of steepest descent. Hence the algorithm works by iteratively stepping in the opposite direction to the gradient, using a step size \(\epsilon\): \[\begin{aligned} x^{(i+1)} \leftarrow x^{(i)} - \epsilon \cdot \nabla f(x^{(i)}). \end{aligned}\]

The only parameter that we can choose is the step size \(\epsilon\). To derive further insights about the step size, we can use the second order Taylor approximation. For notational convenience, we denote \(\nabla f(x^{(i)})\) as \(g\): \[\begin{aligned} f(x^{(i+1)}) &= f(x^{(i)} - \epsilon \cdot g)\\ &\approx f(x^{(i)}) - \epsilon \cdot g^T g + \frac{\epsilon^2}{2} \cdot g^T H g. \end{aligned}\]

Keep in mind that the goal is to iteratively find the minimum of \(f\). Depending on the value of \(g^T H g\), there are a few possible behaviors:

  1. If \(g^T H g \leq 0\), then according to this quadratic approximation, taking \(\epsilon \to \infty\) makes the cost \(f({x^{(i+1)}})\) arbitrarily smaller than \(f(x^{(i)})\) (in practice the approximation breaks down for large steps).

  2. If \(g^T H g > 0\), then taking the partial derivative of the above Taylor approximation with respect to \(\epsilon\), we get the optimal step size (the step size that gives the largest decrease going from \(x^{(i)} \to x^{(i+1)}\)) \[\begin{aligned} \epsilon^* = \frac{g^T g}{g^T H g}. \end{aligned}\]
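
A short sketch of gradient descent on a toy quadratic, using the optimal step size from case 2 (the matrix \(H\) here is a made-up positive definite Hessian):

```python
import numpy as np

# toy quadratic f(x) = 0.5 * x^T H x, with gradient H x and constant Hessian H
H = np.array([[3.0, 0.0],
              [0.0, 1.0]])

x = np.array([2.0, -2.0])
for _ in range(50):
    g = H @ x                          # gradient at the current point
    if g @ H @ g <= 0:                 # cannot happen here: H is positive definite
        break
    eps = (g @ g) / (g @ H @ g)        # optimal step size from case 2 above
    x = x - eps * g                    # gradient descent step
print(x)                               # approaches the minimizer at the origin
```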

Hessian

Given a function \(f: \mathbb{R}^n \to \mathbb{R}\) with second-order partial derivatives, the Hessian at point \(x_0\) is a matrix \(H(x_0) \in \mathbb{R}^{n \times n}\) such that \[\begin{aligned} H(x_0) = \begin{bmatrix} \frac{\partial^2}{\partial x_1^2} f(x_0) & \cdots & \frac{\partial^2}{\partial x_1 x_n} f(x_0)\\ \vdots & \ddots & \vdots\\ \frac{\partial^2}{\partial x_n x_1} f(x_0)& \cdots & \frac{\partial^2}{\partial x_n^2} f(x_0) \end{bmatrix}. \end{aligned}\]

The function \(f\) usually has continuous second-order partial derivatives, which means that the Hessian is a real symmetric matrix. Hence the Hessian is unitarily diagonalizable (eigenbasis consists of orthonormal vectors) by the spectral theorem.

L1 Regularization

\(L^1\) regularization is when you add an \(L^1\) penalty term to the cost function. In the case of simple linear regression, where \(\hat{y} = X\hat{\beta}\), \(L^1\) regularization yields the cost function \[\begin{aligned} J(\hat{\beta}) = ||y - X \hat{\beta}||_2^2 + \lambda ||\hat{\beta}||_1. \end{aligned}\]

In the case of a fully connected neural network with \(l\) hidden layers, we have \[\begin{aligned} J(\theta) &= J(W^{(1)}, b^{(1)}, \cdots, W^{(l)}, b^{(l)})\\ &= ||y - f(X)||_2^2 + \lambda \sum_{i=1}^{l} ||W^{(i)}||_1. \end{aligned}\]

Recall that the matrix \(1\)-norm goes by the nickname “max column sum”.
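
A minimal sketch of the \(L^1\)-penalized least-squares cost from the linear regression formula above (the data and \(\lambda\) are illustrative):

```python
import numpy as np

def l1_cost(beta, X, y, lam):
    """L1-penalized least-squares cost for linear regression."""
    residual = y - X @ beta
    return residual @ residual + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=50)
print(l1_cost(np.zeros(5), X, y, lam=0.1))
```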

L2 Regularization

\(L^2\) regularization is when you add an \(L^2\) penalty term to the cost function. In the case of simple linear regression, where \(\hat{y} = X\hat{\beta}\), \(L^2\) regularization yields the cost function \[\begin{aligned} J(\hat{\beta}) = ||y - X \hat{\beta}||_2^2 + \lambda ||\hat{\beta}||_2^2. \end{aligned}\]

In the case of a fully connected neural network with \(l\) hidden layers, we have \[\begin{aligned} J(\theta) &= J(W^{(1)}, b^{(1)}, \cdots, W^{(l)}, b^{(l)})\\ &= ||y - f(X)||_2^2 + \lambda \sum_{i=1}^{l} ||W^{(i)}||_F^2. \end{aligned}\]

Recall that \(|| \cdot ||_F\) is the Frobenius norm: \(||X||_F^2 = ||\text{Vec}(X)||_2^2\).

It is worth noting that adding small-variance noise to the input vectors has an \(L^2\) regularization effect. Given points sampled i.i.d. from a distribution \(p(x,y)\), where \(x \in \mathbb{R}^d, y \in \mathbb{R}^c\), we have the cost function \[\begin{aligned} E(\hat{f}) &= \frac{1}{2} \iint ||y - \hat{f}(x)||_2^2 \cdot p(x,y) \text{ }dx \text{ }dy\\ &= \frac{1}{2} \sum_{k=1}^c \iint (y_k - \hat{f}_k(x))^2 \cdot p(y_k | x) \cdot p(x) \text{ }dx \text{ }dy_k. \end{aligned}\]

Suppose \(\zeta \sim \tilde{p}\) is a \(d\)-dimensional random vector, where \[\begin{aligned} \mathbb{E}(\zeta) = \vec{0} \text{ and } \int \zeta_i \zeta_j \tilde{p}(\zeta) \text{ }d \zeta = \eta^2 \delta_{ij}. \end{aligned}\]

Note that \(\delta_{ij} = 1\) if \(i=j\), and \(0\) otherwise. We perturb the inputs and evaluate to get the cost function \[\begin{aligned} \tilde{E}(\hat{f}) = \frac{1}{2} \sum_{k=1}^c \iiint (y_k - \hat{f}_k(x + \zeta))^2 \cdot p(y_k | x) \cdot p(x) \cdot \tilde{p}(\zeta) \text{ }dx \text{ }dy_k \text{ }d\zeta. \end{aligned}\]

Given sufficiently small \(\eta\), we can approximate \(\hat{f}_k(x + \zeta)\) with a second order Taylor approximation. Denoting the Hessian of \(\hat{f}_k\) evaluated at \(x\) as \(H_k\), we have: \[\begin{aligned} \hat{f}_k(x + \zeta) \approx \hat{f}_k(x) + \zeta^T \nabla \hat{f}_k(x) + \frac{1}{2}\zeta^T H_k \zeta. \end{aligned}\]

Substituting the approximation into \(\tilde{E}(\hat{f})\) and keeping terms up to order \(\eta^2\) (the term linear in \(\zeta\) vanishes since \(\mathbb{E}(\zeta) = \vec{0}\)), we obtain \[\begin{aligned} \tilde{E}(\hat{f}) &= E(\hat{f}) + \eta^2 E^{R}(\hat{f}), \end{aligned}\] where \[\begin{aligned} E^{R}(\hat{f}) &= \frac{1}{2} \iint \sum_k (\nabla \hat{f}_k(x))^T (\nabla \hat{f}_k(x)) \cdot p(y_k | x) p(x) \text{ }dx \text{ }dy_k\\ &- \frac{1}{2} \iint \sum_k (y_k - \hat{f}_k(x)) \operatorname{tr}(H_k) \cdot p(y_k | x) p(x) \text{ }dx \text{ }dy_k. \end{aligned}\]

The second term involves the residual \(y_k - \hat{f}_k(x)\), which averages to zero near a minimum of \(E\), so the effective penalty is the squared-gradient term - a Tikhonov regularizer. Keeping in mind that fully connected neural networks are compositions of non-linear functions and affine transformations, this gradient is built from the weight matrices; in the simplest case of a linear model \(\hat{f}(x) = Wx\), each \(\nabla \hat{f}_k(x)\) is the \(k\)-th row of \(W\), and the penalty is proportional to \(||W||_F^2\) - the \(L^2\) penalty above.
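
As a quick numerical sanity check, consider the simplest case of a linear model \(\hat{f}(x) = w^T x\), where the Hessian term vanishes and the identity \(\mathbb{E}_{\zeta}[(y - w^T(x + \zeta))^2] = (y - w^T x)^2 + \eta^2 ||w||_2^2\) holds exactly; the numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.array([1.5, -0.5, 2.0])                        # linear model f(x) = w^T x
x = np.array([0.3, 1.0, -0.7])
y = 1.0
eta = 0.1

# Monte Carlo estimate of the expected squared error under input noise
zeta = eta * rng.normal(size=(200_000, 3))            # E[zeta zeta^T] = eta^2 I
noisy = np.mean((y - (x + zeta) @ w) ** 2)

# clean squared error plus the predicted penalty eta^2 ||w||_2^2
clean_plus_penalty = (y - x @ w) ** 2 + eta ** 2 * (w @ w)

print(noisy, clean_plus_penalty)                      # the two should roughly agree
```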

Label smoothing

Label smoothing replaces a classifier's hard \(0/1\) targets with softened targets; it is equivalent to adding noise to the output labels.
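
A minimal sketch of the common uniform-mixture form of label smoothing (the smoothing amount \(\alpha\) and the class count are illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """Mix one-hot targets with the uniform distribution over k classes."""
    k = y_onehot.shape[-1]
    # correct class gets 1 - alpha + alpha / k, every other class gets alpha / k
    return (1 - alpha) * y_onehot + alpha / k

y = np.eye(4)[[0, 2, 3]]          # three one-hot labels over 4 classes
print(smooth_labels(y, alpha=0.1))
```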

Learning problem

A learning problem is a triple \((T, E, P)\), where \(T\) is the task, \(E\) is experience, and \(P\) is performance. For a specific task \(T\), we have a bundle of data points \(E\) that will allow the model to experience the task, and we evaluate the model’s performance using a performance metric \(P\).

Take the example of a student learning how to solve linear equations. We might assign the student some homework problems, and later evaluate the student based on how many questions he/she got correct out of 20 questions. Here, the learning problem is (solving linear equations, homework assignments, test with 20 questions that is graded based on proportion of correctly answered questions).

Mini-batch optimization

Mini-batch optimization uses only a subset of the training set for each gradient step. For large datasets, mini-batch optimization is preferred to its batch counterpart due to the computational demands of batch optimization.

A quick heuristic argument in favor of mini-batch optimization: for both batch and mini-batch optimization, the calculated gradient is an averaged estimator of the true gradient. Given a batch of size \(n\), the standard error is \(\frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the true standard deviation. While the cost of a gradient step grows linearly with the batch size, the corresponding decrease in standard error scales only with \(\sqrt{n}\) - for example, using 10,000 examples per step instead of 100 costs 100 times more compute but cuts the standard error only by a factor of 10. At some point, the extra gain in variance reduction is not worth the compute.
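
A quick numerical illustration of the \(\frac{\sigma}{\sqrt{n}}\) scaling (the distribution and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                       # true standard deviation

for n in [100, 10_000]:
    # estimate the standard error of an n-sample average over 1,000 trials
    means = rng.normal(0.0, sigma, size=(1_000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))     # empirical vs. sigma / sqrt(n)
```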

Momentum

Momentum is an add-on to stochastic gradient descent to deal with the noisy aspect of the stochastic gradient. The algorithm is as follows:

  1. Choose a learning rate \(\epsilon\), a dampening rate \(\alpha\), initial momentum \(v_0\), and initial parameters \(\theta_0\). Do the following until a stopping criterion is met:

    1. Take a mini-batch \(\{(x^{(i)}, y^{(i)})\}_{i=1,\cdots, m}\) from the training set.

    2. Compute the gradient estimate \(\hat{g}_t = \frac{1}{m} \sum_{i=1}^m \nabla_{\theta} L(f(x^{(i)}; \theta_t), y^{(i)})\)

    3. Compute the momentum \(v_{t+1} \leftarrow \alpha v_{t} - \epsilon \hat{g}_t\).

    4. Update the parameter estimate \(\theta_{t+1} \leftarrow \theta_{t} + v_{t+1}.\)

Abusing notation just a bit, we can see that the new parameter estimate is found through a decaying accumulation of past gradient estimates: \[\begin{aligned} \theta_{t+1} &= \theta_t + v_{t+1}\\ &= \theta_t - \epsilon \hat{g}_t + \alpha v_t\\ &= \theta_t - \epsilon \hat{g}_t + \alpha (-\epsilon \hat{g}_{t-1} + \alpha v_{t-1})\\ &\vdots\\ &= \theta_t - \epsilon \left(\sum_{j=0}^{t} \alpha^{j} \hat{g}_{t-j} \right) + \alpha^{t+1} v_0. \end{aligned}\]
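
In code, one SGD-with-momentum update looks like the following sketch (the hyperparameter values are typical choices, not prescribed):

```python
def momentum_step(theta, g, v, eps=0.01, alpha=0.9):
    """One SGD-with-momentum update, following steps 3-4 above."""
    v = alpha * v - eps * g        # decaying accumulation of past gradient steps
    theta = theta + v
    return theta, v
```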

Nesterov momentum

Nesterov momentum is an add-on to stochastic gradient descent with momentum. The key difference is where the gradient is evaluated. The algorithm is as follows:

  1. Choose a learning rate \(\epsilon\), a dampening rate \(\alpha\), initial momentum \(v_0\), and initial parameters \(\theta_0\). Do the following until a stopping criterion is met:

    1. Take a mini-batch \(\{(x^{(i)}, y^{(i)})\}_{i=1,\cdots, m}\) from the training set.

    2. Compute the gradient estimate \[\hat{g}_t = \frac{1}{m} \sum_{i=1}^m \nabla_{\theta} L(f(x^{(i)}; \theta_t + \alpha v_t), y^{(i)})\]

    3. Compute the momentum \(v_{t+1} \leftarrow \alpha v_{t} - \epsilon \hat{g}_t\).

    4. Update the parameter estimate \(\theta_{t+1} \leftarrow \theta_{t} + v_{t+1}.\)

The gradient is evaluated closer to the next parameter update. Notice that in SGD with momentum, the parameter update can be written as \(\theta_{t+1} \leftarrow [\theta_t + \alpha v_t] - \epsilon \hat{g}_t\). With Nesterov momentum, the gradient is instead evaluated at \(\theta_t + \alpha v_t\) - closer to where the parameters will end up after the update.
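
The only change relative to the momentum sketch above is where the gradient is evaluated; here `grad` is a hypothetical callable mapping parameters to a mini-batch gradient estimate:

```python
def nesterov_step(theta, v, grad, eps=0.01, alpha=0.9):
    """One Nesterov-momentum update."""
    g = grad(theta + alpha * v)    # gradient at the lookahead point, not at theta
    v = alpha * v - eps * g
    return theta + v, v
```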

Noise Robustness

Small perturbations a.k.a. “noise” can be added to the input points, the hidden layer units, the output labels, or the weights themselves. This has a regularizing effect - one could even say that it makes the network “antifragile”.

RMSProp

RMSProp belongs to a class of gradient descent algorithms with adaptive learning rates. RMSProp is closely related to AdaGrad - instead of aggregating all of the previous gradient estimates, RMSProp takes an exponentially weighted average. The RMSProp algorithm is as follows:

  1. Choose a learning rate \(\epsilon\), decay rate \(\alpha \in [0,1)\), and initial parameters \(\theta_0\). Do the following until a stopping criterion is met:

    1. Take a mini-batch \(\{(x^{(i)}, y^{(i)})\}_{i=1,\cdots, m}\) from the training set.

    2. Compute the gradient estimate \[g_t = \frac{1}{m} \sum_{i=1}^m \nabla_{\theta} L(f(x^{(i)}; \theta_t), y^{(i)})\]

    3. Aggregate and exponentially weight the outer product of all the previous gradient estimates \(G_t = \alpha G_{t-1} + (1 - \alpha) g_t g_t^T\).

    4. Update the parameter estimate \(\theta_{t+1} \leftarrow \theta_t - \epsilon G_t^{-1/2} g_t.\)

Instead of using \(G_t\) to update the parameters, it is common to use \(\text{diag}(G_t + \delta I)\) with small \(\delta > 0\) to avoid singularity and simplify computation; hence the update rule becomes \(\theta_{t+1} \leftarrow \theta_t - \epsilon \cdot \text{diag}(G_t + \delta I)^{-1/2} g_t\).

To see why this is an adaptive learning algorithm, observe \(\epsilon \cdot \text{diag}(G_t + \delta I)^{-1/2}\). The diagonal of \(G_t + \delta I\) consists only of positive elements since \(\alpha \in [0,1)\). Because \(G_t\) is an exponentially weighted average of the squared gradients, each diagonal element of \(\text{diag}(G_t + \delta I)^{-1/2}\) shrinks when recent gradients have been large and grows when they have been small - adapting the learning rate to the recent gradient history.

Notice that RMSProp is the same as AdaGrad except that exponential weighting is applied to the gradient accumulation.
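
In code, the diagonal RMSProp update differs from the AdaGrad sketch above only in how the squared gradients are accumulated (hyperparameter values are typical choices):

```python
import numpy as np

def rmsprop_step(theta, g, G_diag, eps=0.001, alpha=0.9, delta=1e-8):
    """One diagonal RMSProp update: AdaGrad with an exponentially
    weighted accumulation of the squared gradients."""
    G_diag = alpha * G_diag + (1 - alpha) * g * g
    theta = theta - eps * g / np.sqrt(G_diag + delta)
    return theta, G_diag
```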

Stochastic gradient descent (SGD)

The stochastic gradient descent algorithm (SGD) is a mini-batch optimization algorithm. The algorithm is as follows:

  1. Choose a learning rate \(\epsilon\) and initial parameters \(\theta_0\). Do the following until a stopping criterion is met (e.g. early stopping):

    1. Take a mini-batch \(\{(x^{(i)}, y^{(i)})\}_{i=1,\cdots, m}\) from the training set.

    2. Compute the gradient estimate \(\hat{g}_t = \frac{1}{m} \sum_{i=1}^m \nabla_{\theta} L(f(x^{(i)}; \theta_t), y^{(i)})\)

    3. Update the parameter estimate \(\theta_{t+1} \leftarrow \theta_{t} - \epsilon \hat{g}_t.\)

SGD can be too noisy and/or unstable, and finding a good learning rate is critical. Momentum and adaptive learning algorithms solve the former and latter issues, respectively.
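
Putting it together, here is a small end-to-end sketch of SGD on a toy least-squares problem; the data, batch size, and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data and model: y = X @ theta_true + noise, squared-error loss
X = rng.normal(size=(1_000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=1_000)

theta = np.zeros(5)
eps, m = 0.01, 32                                # learning rate and batch size
for step in range(2_000):
    idx = rng.integers(0, len(X), size=m)        # draw a mini-batch
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ theta - yb) / m         # mini-batch gradient estimate
    theta = theta - eps * g                      # SGD update
```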

Validation set

The validation set is used to decide which model to use. The validation set is usually used in tandem with the training set - at the end of every training epoch, the trained model is evaluated on the validation set, which is held out from training and never used to update the parameters. To evaluate the model, use the original loss instead of the surrogate loss used during training.