Three recent breakthroughs due to AI in arts and science serve as motivation: an award-winning digital image, protein folding, fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solids, fluids, finite-element technology) are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in the traditional integration methods. Here, methods (1) and (2) relied on the Long Short-Term Memory (LSTM) architecture, with method (3) relying on convolutional neural networks. Pure ML methods to solve (nonlinear) PDEs are represented by Physics-Informed Neural Network (PINN) methods, which could be combined with the attention mechanism to address discontinuous solutions. Both LSTM and attention architectures are extensively reviewed, together with modern optimizers and classic optimizers generalized to include stochasticity for DL networks. Kernel machines, including Gaussian processes, are covered in sufficient depth for more advanced works such as shallow networks with infinite width. The paper does not address only experts: readers are assumed to be familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming at bringing first-time learners quickly to the forefront of research. History and limitations of AI are recounted and discussed, with particular attention to pointing out misstatements or misconceptions of the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.

George Santayana

Three representative applications of deep learning in computational mechanics–involving numerical integration for finite elements, a complex constitutive model in solid mechanics, and proper orthogonal decomposition in fluid mechanics–are reviewed in detail, and used as motivation for a further in-depth review of some key technologies of deep learning, building up from the basics to the state of the art, focusing on, to the extent possible, the most recent papers that had an important impact on the field.

Both static and dynamic (time-dependent) problems are discussed. Discrete time-dependent problems, as sequences of data, can be modeled with recurrent neural networks, using the classic 1997 architecture Long Short-Term Memory (LSTM), but also recent 2017-18 architectures such as the transformer, based on the concept of attention, all of which are discussed in detail. Continuous recurrent neural networks, originally developed in neuroscience to model the brain, and their connection to their discrete counterparts in deep learning are also discussed in detail.

For training networks–i.e., finding optimal parameters that yield low training error and lowest validation error–both classic deterministic optimization methods (using the full batch) and stochastic optimization methods (using minibatches) are reviewed in detail, and at times even derived. Deterministic gradient descent with classical line-search methods, such as Armijo’s rule, is generalized to add stochasticity. Detailed pseudocodes for these methods are provided. The classic stochastic gradient descent (SGD), with add-on tricks such as momentum, step-length decay, cyclic annealing, and weight decay, is presented, often with detailed derivations.
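As a minimal sketch of these ingredients, the following illustrates gradient descent with momentum and step-length decay on a toy quadratic; the hyperparameter values are illustrative assumptions, not taken from any method reviewed here:

```python
import numpy as np

def sgd_momentum(grad, theta0, lr0=0.1, beta=0.9, decay=0.01, n_steps=500):
    """Gradient descent with momentum and step-length (learning-rate) decay.

    grad(theta) returns the gradient; replacing it by a minibatch estimate
    yields stochastic gradient descent (SGD).
    """
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for t in range(n_steps):
        lr = lr0 / (1.0 + decay * t)                   # step-length decay
        velocity = beta * velocity - lr * grad(theta)  # momentum accumulation
        theta = theta + velocity
    return theta

# Toy problem: minimize f(theta) = 0.5 * ||theta - 3||^2, so grad = theta - 3
theta_opt = sgd_momentum(lambda th: th - 3.0, np.zeros(2))
```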

Step-length decay is shown to be equivalent to simulated annealing, using a stochastic differential equation that is the continuous equivalent of the discrete parameter update. A consequence is that the minibatch size can be increased instead of decaying the step length. In particular, we obtain a new result for minibatch-size increase.

Highly popular adaptive step-length (learning-rate) methods are discussed in a unified manner, which covers AdaGrad, RMSProp, the “immensely successful” Adam and its variants, through to the recent AdamW.
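The common structure of these adaptive methods can be sketched with the standard Adam update: exponential smoothing of the gradient and of its element-wise square, with bias correction. The hyperparameter values below are the commonly cited defaults, shown for illustration only:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update in its standard formulation: m and v are exponentially
    smoothed first and second moments of the gradient g; t counts from 1."""
    m = beta1 * m + (1 - beta1) * g        # smoothed gradient (1st moment)
    v = beta2 * v + (1 - beta2) * g * g    # smoothed squared gradient (2nd moment)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step length
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, gradient 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
```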

Overlooked in (or unknown to) other review papers and even well-known books on deep learning, exponential smoothing of time series, the key technique of adaptive methods, originating from the field of forecasting and dating from the 1950s, is carefully explained.

Particular attention is given to a recent criticism of adaptive methods, revealing their marginal value for generalization, compared to good old SGD with effective initial step-length tuning and decay. The results were confirmed in three recent independent papers.

Kernel machines, including Gaussian processes, a most important class of non-parametric modeling with accurate uncertainty estimates, are introduced in sufficient detail to prepare for more advanced works on networks with infinite width, which constitute the 2021 breakthrough in computer science.

Applications of deep learning in computational mechanics often aim at reducing computational cost, which naturally connects to the field of (nonlinear) model-order reduction (MOR). We review how LSTM networks were trained to predict the rate-dependent constitutive response in a multi-scale problem of porous media, and how they were used as time-integrators for reduced-order models (ROMs) inferred from highly-resolved direct numerical simulations of turbulent flows. Autoencoders based on shallow networks provide an effective means for nonlinear manifold-based MOR and the hyper-reduction methods built on top of it.

A rare feature of the present paper is a detailed review of some important classics to connect them to the relevant concepts in the modern literature, sometimes revealing misunderstandings in recent works, likely due to a lack of verification of the assertions made against the corresponding classics. For example, the first artificial neural network, conceived by Rosenblatt (1957) [

The experiments in the 1950s that discovered the rectified linear behavior in neuronal axon, modeled as a circuit with a diode, together with the use of the rectified linear activation function in neural networks in neuroscience years before being adopted for use in deep-learning networks, are reviewed.

The use of Volterra series to model the nonlinear behavior of a neuron in terms of input and output firing rates, leading to continuous recurrent neural networks, is examined in detail. The linear term of the Volterra series is a convolution integral that provides a theoretical foundation for the use of a linear combination of inputs to a neuron, with weights and biases.

A goal of this in-depth review is not only to provide the state of the art to computational-mechanics readers with some familiarity with deep-learning networks, but also to serve first-time learners, by developing the relevant fundamental concepts from the basics. Moreover, for the convenience of the readers, detailed references are provided, e.g., page numbers in thick books, and links to online references and open reviews where available.

^{1}

See also the Midjourney

In 2021, an AI software achieved a feat that human researchers had not been able to accomplish in the previous 50 years: predicting protein structures quickly and on a large scale. This feat was named the scientific breakthrough of the year; Figure

On 2022.10.05, DeepMind published a paper on breaking a 50-year record of fast matrix multiplication by reducing the number of multiplications in multiplying two ^{2}

Their goal was of course to discover fast multiplication algorithms for matrices of arbitrarily large size. See also “Discovering novel algorithms with AlphaTensor,” DeepMind, 2022.10.05,

Since the preprint of this paper was posted on the arXiv in Dec 2022 [

^{3}

Tensors are not matrices. Other concepts include the summation convention on repeated indices, the chain rule, and the matrix index convention for natural conversion from component form to matrix (and then tensor) form. See Section

For readers not familiar with deep learning, unlike many other review papers, this review paper is not just a summary of papers in the literature for people who already have some familiarity with this topic,^{4}

See the review papers on deep learning, e.g., [

^{5}

An example of a confusing point for

Readers already familiar with neural networks may find the presentation refreshing,^{6}

Particularly the top-down approach for both feedforward network (Section

^{7}

It took five years from the publication of Rumelhart

Fully-connected feedforward neural networks were employed to make element-matrix integration more efficient, while retaining the accuracy of the traditional Gauss-Legendre quadrature [^{8}

It would be interesting to investigate how the adjusted integration weights using the method in [

Recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units^{9}

It is only a coincidence that (1) Hochreiter (1997), the first author in [

RNNs with LSTM units were employed to obtain reduced-order models for turbulence in fluids based on the proper orthogonal decomposition (POD), a classic linear projection method also known as principal component analysis (PCA) [

The results of deep-learning numerical integration [

All of the deep-learning concepts identified from the above selected papers for in-depth study are subsequently explained in detail in Sections

The parallelism between computational mechanics, neuroscience, and deep learning is summarized in Section

Both time-independent (static) and time-dependent (dynamic) problems are discussed. The architecture of (static, time-independent) feedforward multilayer neural networks in Section

Backpropagation, explained in Section

For training networks—i.e., finding optimal parameters that yield low training error and lowest validation error—both classic deterministic optimization methods (using full batch) and stochastic optimization methods (using minibatches) are reviewed in detail, and at times even derived, in Section

The examples used in training a network form the training set, which is complemented by the validation set (to determine when to stop the optimization iterations) and the test set (to see whether the resulting network could work on examples never seen before); see Section
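A minimal sketch of such a three-way split; the fractions and the function name are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

def split_dataset(n_examples, frac_train=0.7, frac_val=0.15, seed=0):
    """Shuffle example indices and split into a training set (to optimize the
    parameters), a validation set (to decide when to stop), and a test set
    (to estimate the generalization error)."""
    idx = np.random.default_rng(seed).permutation(n_examples)
    n_tr = int(round(frac_train * n_examples))
    n_va = int(round(frac_val * n_examples))
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

train_idx, val_idx, test_idx = split_dataset(1000)
```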

In Section

Overlooked in (or unknown to) other review papers and even well-known books on deep learning, exponential smoothing of time series, the key technique of adaptive methods, originating from the field of forecasting and dating from the 1950s, is carefully explained in Section
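Exponential smoothing itself takes only a few lines: the recursion s_t = β s_{t−1} + (1 − β) x_t applied to a series. The data and the smoothing factor below are a toy illustration:

```python
import numpy as np

def exponential_smoothing(series, beta):
    """Exponentially weighted moving average of a time series, the smoothing
    recursion underlying adaptive methods such as RMSProp and Adam:
    s_t = beta * s_{t-1} + (1 - beta) * x_t."""
    s = np.zeros_like(series, dtype=float)
    s[0] = series[0]
    for t in range(1, len(series)):
        s[t] = beta * s[t - 1] + (1 - beta) * series[t]
    return s

# Toy data: a sine wave corrupted by noise, then smoothed
rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 2 * np.pi, 200)) + 0.3 * rng.standard_normal(200)
smooth = exponential_smoothing(noisy, beta=0.9)
```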

The first adaptive methods that employed exponential smoothing were

Particular attention is then given to a recent criticism of adaptive methods in [

Dynamics, sequential data, and sequence modeling are the subjects of Section

The features of several popular, open-source deep-learning frameworks and libraries—such as TensorFlow, Keras, PyTorch, etc.—are summarized in Section

As mentioned above, detailed formulations of deep learning applied to computational mechanics in [

A rare feature is a detailed review of some important classics to connect them to the relevant concepts in the modern literature, sometimes revealing misunderstandings in recent works, likely due to a lack of verification of the assertions made against the corresponding classics. For example, the first artificial neural network, conceived by Rosenblatt (1957) [

The use of Volterra series to model the nonlinear behavior of a neuron in terms of input and output firing rates, leading to continuous recurrent neural networks, is examined in detail. The linear term of the Volterra series is a convolution integral that provides a theoretical foundation for the use of a linear combination of inputs to a neuron, with weights and biases [

The experiments in the 1950s by Furshpan et al. [

^{10}

While in the long run an original website may be moved or even deleted, the same website captured on the Internet Archive (also known as Web Archive or Wayback Machine) remains there permanently.

In Dec 2021, the journal

The 3-D shape of a protein, obtained by folding a linear chain of amino acids, determines how this protein would interact with other molecules, and thus establishes its biological functions [^{11}

See also AlphaFold Protein Structure Database

On the 2019 new-year day,

Go is the most complex game that mankind ever created, with more combinations of possible moves than chess, and indeed more than the number of atoms in the observable universe.^{12}

The number of atoms in the observable universe is estimated at

This breakthrough is the crowning achievement in a string of astounding successes of deep learning (and reinforcement learning) in taking on this difficult challenge for AI.^{13}

See [

In its long history, AI research went through several cycles of ups and downs, in and out of fashion, as described in [

“THE TERM “artificial intelligence” has been associated with hubris and disappointment since its earliest days. It was coined in a research proposal from 1956, which imagined that significant progress could be made in getting machines to “solve kinds of problems now reserved for humans if a carefully selected group of scientists work on it together for a summer”. That proved to be rather optimistic, to say the least, and despite occasional bursts of progress and enthusiasm in the decades that followed, AI research became notorious for promising much more than it could deliver. Researchers mostly ended up avoiding the term altogether, preferring to talk instead about “expert systems” or “neural networks”. But in the past couple of years there has been a dramatic turnaround. Suddenly AI systems are achieving impressive results in a range of tasks, and people are once again using the term without embarrassment.”

The recent resurgence of enthusiasm for AI research and applications dates only from 2012, with a spectacular success of almost halving the error rate in image classification in the ImageNet competition,^{14}

“ImageNet is an online database of millions of images, all labelled by hand. For any given word, such as “balloon” or “strawberry”, ImageNet contains several hundred images. The annual ImageNet contest encourages those in the field to compete and measure their progress in getting computers to recognise and label images automatically” [

^{15}

For a report on the human image classification error rate of 5.1%, see [

The 2012 success^{16}

Actually, the first success of deep learning occurred three years earlier in 2009 in speech recognition; see Section

^{17}

See [

Availability of much larger datasets for training deep neural networks (i.e., to find optimized parameters). It is possible to say that without ImageNet, there would be no spectacular success in 2012, and thus no resurgence of AI. Once the importance of having large datasets to develop versatile, working deep networks was realized, many more large datasets have been developed. See, e.g., [

Emergence of more powerful computers than in the 1990s, e.g., the graphical processing unit (or GPU), “which packs thousands of relatively simple processing cores on a single chip” for use “to process and display complex imagery, and to provide fast actions in today’s video games” [

Advanced software infrastructure (libraries) that facilitates faster development of deep-learning applications, e.g., TensorFlow, PyTorch, Keras, MXNet, etc. [

Larger neural networks and better training techniques (i.e., optimizing network parameters) that were not available in the 1980s. Today’s much larger networks, which can solve once intractable problems, are “one of the most important trends in the history of deep learning”, but are still much smaller than the nervous system of a frog [^{18}

The authors of [

Successful applications to difficult, complex problems that help people in their every-day lives, e.g., image recognition, speech translation, etc.

^{19}

Intensive Care Unit.

and the FDA^{20}

Food and Drug Administration.

^{21}

At video time 1:51. In less than a year, this 2018 April TED talk had more than two million views as of 2019 March.

“About 10 years ago, the grand AI discovery was made by three North American scientists,^{22}

See Footnote

Section

It was, however, disappointing that despite the above-mentioned exciting outcomes of AI, during the Covid-19 pandemic beginning in 2020,^{23}

“The World Health Organization declares COVID-19 a pandemic” on 2020 Mar 11,

^{24}

Krisher T.,

An image-recognition software useful for computational mechanicists is ^{25}

We thank Kerem Uguz for informing the senior author LVQ about Mathpix.

which recognizes hand-written math equations, and transforms them into LaTeX code. For example, a hand-written equation is transformed into this LaTeX code “

that ^{26}

We want to immediately clarify the meaning of the terminologies “Artificial Intelligence” (AI), “Machine Learning” (ML), and “Deep Learning” (DL), since their casual use could be confusing for first-time learners.

For example, it was stated in a review of primarily two computer-science topics called “Neural Networks” (NNs) and “Support Vector Machines” (SVMs) and a physics topic that [^{27}

We are only concerned with NNs, not SVMs, in the present paper.

“The respective underlying fields of basic research—quantum information versus machine learning (ML) and artificial intelligence (AI)—have their own specific questions and challenges, which have hitherto been investigated largely independently.”

Questions would immediately arise in the mind of first-time learners: Are ML and AI two different fields, or the same fields with different names? If one field is a subset of the other, then would it be more general to just refer to the larger set? On the other hand, would it be more specific to just refer to the subset?

In fact, Deep Learning is a subset of methods inside a larger set of methods known as Machine Learning, which in itself is a subset of methods generally known as Artificial Intelligence. In other words, Deep Learning is Machine Learning, which is Artificial Intelligence; [^{28}

References to books are accompanied with page numbers for specific information cited here so readers don’t waste time to wade through an 800-page book to look for such information.

On the other hand, Artificial Intelligence is not necessarily Machine Learning, which in itself is not necessarily Deep Learning. The review in [^{29}

Network depth and size are discussed in Section

^{30}

See, e.g., [

Based on the above relationship between AI, ML, and DL, it would be much clearer if the phrase “machine learning (ML) and artificial intelligence (AI)” in both the title of [^{31}

For more on Support Vector Machine (SVM), see [

^{32}

See [

Another reason for simplifying the title in [

The engine of neuromorphic computing, also known as spiking computing, is a hardware network built into the IBM TrueNorth chip, which contains “1 million programmable spiking neurons and 256 million configurable synapses”,^{33}

The neurons are the computing units, and the synapses the memory. Instead of grouping the computing units into a central processing unit (CPU), separate from the memory, with the CPU and the memory connected via a bus, which creates a communication bottleneck, each neuron in the TrueNorth chip, like in the brain, has its own synapses (local memory).

and consumes “extremely low power” [^{34}

In [

As motivation, we present in this section the results in three recent papers in computational mechanics, mentioned in the Opening Remarks in Section

To integrate efficiently and accurately the element matrices in a general finite element mesh of 3-D hexahedral elements (including distorted elements), the power of Deep Learning was harnessed in two applications of ^{35}

MLN is also called MultiLayer Perceptron (MLP); see Footnote

(1) Application 1.1: For each element (particularly distorted elements), find the number of integration points that provides accurate integration within a given error tolerance. Section

(2) Application 1.2: Uniformly use ^{36}

The

To ^{37}

See Section

While Application 1.1 used one

To train the classifier network, 10,000 element shapes were selected from the prepared dataset of 20,000 hexahedrals, which were divided into a ^{38}

For the definition of training set and test set, see Section

To train the second regression network, 10,000 element shapes were selected for which quadrature could be improved by adjusting the quadrature weights [

Again, the training set and the test set comprised 5000 elements each. The parameters of the neural networks (

The best results were obtained from a classifier with four ^{39}

Information provided by author A. Oishi of [

To quantify the effectiveness of the approach in [

For most element shapes of both the training set (a) and the test set (b), each of which comprised 5000 elements, the blue bars in Figure

Readers familiar with Deep Learning and neural networks can go directly to Section

Readers not familiar with Deep Learning and neural networks will find below a list of the concepts that will be explained in subsequent sections. To facilitate the reading, we also provide the section number (and the link to jump to) for each concept.

(1) Feedforward neural network (Figure

(2) Neuron (Figure

(3) Inputs, output, hidden layers, Section

(4) Network depth and width: Section

(5) Parameters, weights, biases

(6) Activation functions: Section

(7) What is “deep” in “deep networks” ? Size, architecture, Section

(8) Backpropagation, computation of gradient: Section

(9) Loss (cost, error) function, Section

(10) Training, optimization, stochastic gradient descent: Section

(11) Training error, validation error, test (or generalization) error: Section

This list is continued further

One way that deep learning can be used in solid mechanics is to model complex, nonlinear constitutive behavior of materials. In single physics, balance of linear momentum and strain-displacement relation are considered as definitions or “universal principles”, leaving the constitutive law, or stress-strain relation, to a large number of models that have limitations, no matter how advanced [

Deep ^{40}

Porosity is the ratio of void volume over total volume. Permeability is a scaling factor, which when multiplied by the negative of the pressure gradient, and divided by the fluid dynamic viscosity, gives the fluid velocity in Darcy’s law, Eq. (
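A minimal numerical sketch of this relation (Darcy's law); the values below are illustrative, for a water-like fluid and a rock of one-darcy-scale permeability, not data from the cited studies:

```python
def darcy_velocity(k, mu, grad_p):
    """Darcy's law: fluid velocity from permeability k [m^2], dynamic
    viscosity mu [Pa s], and pressure gradient grad_p [Pa/m]:
    v = -(k / mu) * grad_p."""
    return -(k / mu) * grad_p

# Illustrative values: 1 darcy ~ 1e-12 m^2, water viscosity ~ 1e-3 Pa s,
# pressure dropping by 10 kPa per meter (negative gradient -> positive flow)
v = darcy_velocity(k=1.0e-12, mu=1.0e-3, grad_p=-1.0e4)
```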

The

Since 60% of the world’s oil reserves and 40% of the world’s gas reserves are held in carbonate rocks, there has been a clear interest in developing an understanding of the mechanical behavior of carbonate rocks such as limestones, ranging from low porosity (Solenhofen at 3%) to high porosity (e.g., Majella at 30%). Chalk (Lixhe) is a carbonate rock with the highest porosity, at 42.8%. Carbonate-rock reservoirs are also considered for storing carbon dioxide and nuclear waste [

In oil-reservoir simulations in which the primary interest is the flow of oil, water, and solvent, the porosity (and pore size) within each domain (rock matrix or fracture system) is treated as constant and homogeneous [^{41}

See, e.g., [

Moreover, pores have different sizes, and can be classified into different pore sub-systems. For the Majella limestone in Figure

Likewise, the meaning of “dual permeability” is different in [^{42}

At least at the beginning of Section 2 in [

In the problem investigated in [

Instead of coupling multiple simulation models online, two (adjacent) scales were linked by a neural network that was trained offline using data generated by simulations on the smaller scale [

(1)

(2)

Path-dependence is a common characteristic feature of the constitutive models that are often realized as neural networks; see, e.g., [

An important observation is that including micro-structural data—the porosity ^{44}

The U(BH_{4})_{4} complex (Wikipedia version 08:38, 12 March 2019) has 12 hydrogen atoms bonded to the central uranium atom.

Figure

(12) Recurrent neural network (RNN), Section

(13) Long Short-Term Memory (LSTM), Section

(14) Attention and Transformer, Section

(15) Dropout layer and dropout rate,^{45}

Briefly, dropout means to drop or to remove non-output units (neurons) from a base network, thus creating an ensemble of sub-networks (or models) to be trained for each example. It can also be considered as a way to add noise to the inputs, particularly of hidden layers, to train the base network, thus making it more robust, since neural networks were known not to be robust to noise. Adding noise is also equivalent to increasing the size of the dataset for training, [

Details of the formulation in [

The accurate simulation of turbulence in fluid flows ranks among the most demanding tasks in computational mechanics. Owing to the required spatial and temporal resolution, transient analysis of turbulence by means of high-fidelity methods such as Large Eddy Simulation (LES) or direct numerical simulation (DNS) involves millions of unknowns, even for simple domains.

To simulate complex geometries over longer time periods, one needs to resort to reduced-order models (ROMs) that can capture the key features of turbulent flows within a low-dimensional approximation space. Proper Orthogonal Decomposition (POD) is a common data-driven approach to construct an orthogonal basis

where

where

In a Galerkin-Projection (GP) approach to reduced-order modeling, a small subset of dominant modes forms a basis onto which high-dimensional differential equations are projected to obtain a set of lower-dimensional differential equations for cost-efficient computational analysis.
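The construction of the POD basis can be sketched via the thin singular value decomposition (SVD) of a mean-subtracted snapshot matrix; the snapshot data below are a synthetic toy example, not the turbulence data discussed here:

```python
import numpy as np

def pod_basis(snapshots, r):
    """First r POD modes of a snapshot matrix X (n_dof x n_snapshots):
    subtract the temporal mean, take the thin SVD X = U S V^T; the columns
    of U are the orthonormal POD basis, S the singular values (mode energies)."""
    mean = snapshots.mean(axis=1, keepdims=True)
    U, S, _ = np.linalg.svd(snapshots - mean, full_matrices=False)
    return U[:, :r], S

# Toy snapshots: two separable space-time modes on a 1-D grid
x = np.linspace(0.0, 1.0, 100)[:, None]
t = np.linspace(0.0, 1.0, 50)[None, :]
X = np.sin(np.pi * x) * np.cos(2 * np.pi * t) \
    + 0.5 * np.sin(2 * np.pi * x) * np.sin(4 * np.pi * t)
Phi, sigma = pod_basis(X, r=2)
```

Since the toy data contain exactly two spatial modes, the third singular value is zero to machine precision: two POD modes capture the full field.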

Instead of using GP, RNNs (Recurrent Neural Networks) were used in [

To obtain training/testing data, which were crucial to train/test neural networks, the data from transient 3-D direct numerical simulations (DNS) of two physical problems, as provided by the Johns Hopkins turbulence database [

To generate training data for LSTM/BiLSTM networks, the 3-D turbulent fluid flow domain of each physical problem was decomposed into five equidistant 2-D planes (slices), with one additional equidistant 2-D plane serving to generate testing data (Section

(1)

(2)

For both methods, variants with the original LSTM units or the BiLSTM units were implemented. Each of the employed RNNs had a single hidden layer.
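For readers unfamiliar with the LSTM unit (reviewed in detail later in the paper), a single forward step with the standard gate equations can be sketched in NumPy as follows; the sizes and random parameters are placeholders, not the trained networks of the cited study:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One forward step of a standard LSTM cell. W has shape
    (4*n_h, n_h + n_x), b has shape (4*n_h,); the four row blocks are the
    input (i), forget (f), candidate (g), and output (o) gates."""
    n_h = h.shape[0]
    z = W @ np.concatenate([h, x]) + b
    i = sigmoid(z[0 * n_h:1 * n_h])   # input gate
    f = sigmoid(z[1 * n_h:2 * n_h])   # forget gate
    g = np.tanh(z[2 * n_h:3 * n_h])   # candidate cell state
    o = sigmoid(z[3 * n_h:4 * n_h])   # output gate
    c_new = f * c + i * g             # cell state: long-term memory
    h_new = o * np.tanh(c_new)        # hidden state: short-term memory
    return h_new, c_new

# Placeholder sizes and random parameters; process a sequence of length 10
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
W = 0.1 * rng.standard_normal((4 * n_h, n_h + n_x))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((10, n_x)):
    h, c = lstm_step(x, h, c, W, b)
```

A bidirectional (BiLSTM) variant runs a second such cell over the reversed sequence and concatenates the two hidden states.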

Demonstrative results for the prediction capabilities of both the original LSTM and the BiLSTM networks are illustrated in Figure

Details of the formulation in [

Table

We assume that readers are familiar with the concepts listed in the second column on “Computational mechanics”, and briefly explain some key concepts in the third column on “Neuroscience” to connect to the fourth column “Deep learning”, which is explained in detail in subsequent sections.

See Section

Neuron spiking response such as shown in Figure ^{46}

From here on, if Eq. (

where ^{47}

Eq. (

where

It will be seen in Section

The Integrate-and-Fire model for biological neuron provides a motivation for the use of the rectified linear units (ReLU) as activation function in multilayer neural networks (or perceptrons); see Figure

Eq. (

We examine in detail the forward propagation in feedforward networks, in which the function mappings flow^{48}

There is no physical flow here, only function mappings.

in only one forward direction, from input to output. There are two ways to present the concept of deep-learning neural networks: the top-down approach versus the bottom-up approach.

The

Specifically, for a multilayer feedforward network, by top-down, we mean starting from a general description in Eq. (

In terms of block diagrams, we begin our

The ^{49}

Figure 2 in [

Unfamiliar readers when looking at the graphical representation of an artificial neural network (see, e.g., Figure

In mechanics and physics, tensors are intrinsic geometrical objects, which can be represented by infinitely many matrices of components, depending on the coordinate systems.^{50}

See, e.g., [

^{51}

See, e.g., [

| Study object | Engineering continuum | The Brain | Image recognition |
|---|---|---|---|
| Field | Computational mechanics | Computational neuroscience | Deep learning |
| Modeling | Partial Differential Equations | Biological neural networks | Artificial neural networks |
| 1 | Weak form, finite-element mesh, order of interpolation | Network architectures, two layers (input, output), several neurons per layer | Network architectures, many layers (input, hidden, output), very high number of neurons and parameters |
| 2 | Elements | Neurons, dendrites, synapses, axons | Processing units (neurons, perceptrons) |
| 3 | Nonlinear force-displacement and stress-strain (σ-ϵ) relations | Firing model, spiking model, firing rate vs input current (FI) relation, continuous stimulus and response, Volterra series, kernels of increasing orders | — |
| 4 | Linearized force-displacement and stress-strain relations (Hooke’s law) | Linear term in Volterra series, synaptic kernel 𝒦_{1}(τ) of order 1, continuous temporal weight | Many hidden layers (discrete weights and biases) |
| 5 | — | Linear combination of inputs, with input weights | Linear combination of inputs plus biases, input weights |
| 6 | — | Static nonlinearity | Activation function |
| Outputs | Displacements (solids), velocities (fluids) | Firing rate as response | Image classified (car, frog, human) |

The matrix notation used here can follow either (1) the Matlab / Octave code syntax, or (2) the more compact component convention for tensors in mechanics.

Using Matlab / Octave code syntax, the inputs to a network (to be defined soon) are gathered in an ^{52}

The inputs

Using the component convention for tensors in mechanics,^{53}

See, e.g., [

In case both indices are subscripts, then the left subscript (index

In case one index is a superscript, and the other index is a subscript, then the superscript (upper index ^{54}

See, e.g., [

With this convention (lower index designates column index, while upper index designates row index), the coefficients of array

Consider the Jacobian matrix
^{55}

For example, the coefficient

Consider the scalar function ^{56}

In [

Now consider this particular scalar function below:^{57}

Soon, it will be seen in Eq. (

^{58}

The gradients of

A fully-connected feedforward network is a chain of successive applications of functions ^{59}

To alleviate the notation, the predicted output

The notation

The quantities associated with layer

are the

The output for layer

and can be used interchangeably. In the current Section

The above chain in Eq. (^{60}

See [

^{61}

In the review paper [

A function can be graphically represented as in Figure

The multiple levels of compositions in Eq. (

revealing the structure of the

First, an affine transformation on the inputs (see Eq. (

The column matrix

is a linear combination of the inputs in ^{62}

See Eq. (

where the ^{63}

It should be noted that the use of both

and the ^{64}

Eq. (

Both the weights and the biases are collectively known as the network parameters, defined in the following matrices for layer

For simplicity and convenience, the set of all parameters in the network is denoted by ^{65}

For the convenience in further reading, wherever possible, we use the same notation as in [

Note that the set

Similar to the definition of the parameter matrix

with

The total number of parameters of a fully-connected feedforward network is then
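This count can be sketched in a few lines: each layer contributes a weight matrix of size (current width × previous width) plus one bias per neuron; the function name below is an illustrative assumption:

```python
def num_parameters(widths):
    """Total parameter count of a fully-connected feedforward network with
    layer widths [n_0, n_1, ..., n_L]: layer l contributes n_l * n_{l-1}
    weights plus n_l biases, i.e., n_l * (n_{l-1} + 1)."""
    return sum(widths[l] * (widths[l - 1] + 1) for l in range(1, len(widths)))

# Example: 2 inputs, two hidden layers of width 8, 1 output:
# (2+1)*8 + (8+1)*8 + (8+1)*1 = 24 + 72 + 9 = 105
count = num_parameters([2, 8, 8, 1])
```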

But why use a linear (additive) combination (or superposition) of inputs with weights, plus biases, as expressed in Eq. (

An activation function

Without the activation function, the neural network is simply a linear regression, and cannot learn and perform complex tasks, such as image classification, language translation, guiding a driver-less car, etc. See Figure
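This point can be verified numerically: composing two layers without activation functions collapses into a single affine map, so depth alone adds no expressive power. The random matrices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)  # layer 1
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)  # layer 2

def two_layer_no_activation(x):
    """Two affine layers with no activation function in between."""
    return W2 @ (W1 @ x + b1) + b2

# The composition collapses to a single affine map W x + b:
W = W2 @ W1
b = W2 @ b1 + b2
x = rng.standard_normal(3)
```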

An example is a linear one-layer network, without activation function, being unable to represent the seemingly simple XOR (exclusive-or) function, which brought down the first wave of AI (cybernetics), and that is described in Section

^{66}

“In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU,” [

^{67}

The notation

and depicted in Figure ^{68}

A similar relation can be applied to define the Leaky ReLU in Eq. (

^{69}

In [

^{70}

See, e.g., [

To transform an alternating current into a direct current, the first step is to rectify the alternating current by eliminating its negative parts; thus the meaning of the adjective “rectified” in

Mathematically, a periodic function remains periodic after passing through a (nonlinear) rectifier (activation function):

where

Biological neurons encode and transmit information over long distance by generating (firing) electrical pulses called action potentials or spikes with a wide range of frequencies [

The Shockley equation for a current

With the voltage across the resistance being

which is plotted in Figure

Prior to the introduction of ReLU into deep learning in 2011 (ReLU had long been widely used in neuroscience as an activation function),^{71}

See, e.g., [

^{72}

See Section

“While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multilayer neural networks. Rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero.”

The hard non-linearity of ReLU is localized at zero, but otherwise ReLU is a very simple function—identity map for positive argument, zero for negative argument—making it highly efficient for computation.
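This simple function, together with its leaky variant mentioned below, can be sketched in NumPy (the negative-side slope 0.01 is a common default, assumed here rather than taken from the text):

```python
import numpy as np

def relu(z):
    # Identity map for positive argument, zero for negative argument.
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # A small nonzero slope on the negative side avoids "dead" neurons.
    return np.where(z > 0, z, slope * z)
```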

Also, due to errors in numerical computation, it is rare to hit exactly zero, where there is a hard non-linearity in ReLU:

“In the case of

Thus, in addition to the ability to train deep networks, another advantage of using ReLU is the high efficiency in computing both the layer outputs and the gradients for use in optimizing the parameters (weights and biases) to lower cost or loss, i.e., training; see Section

The activation function ReLU more closely approximates how biological neurons work than other activation functions (e.g., logistic sigmoid, tanh, etc.), as was established through experiments some sixty years ago; ReLU had been used in neuroscience long (at least ten years) before being adopted in deep learning in 2011. Its use in deep learning is a clear influence from neuroscience; see Section

Deep-learning networks using ReLU mimic biological neural networks in the brain through a trade-off between two competing properties [

^{73}

See the definition of image “predicate” or image “feature” in Section

The block diagram for a one-layer network is given in Figure

For a multilayer neural network with

And finally, we now complete our

The XOR (exclusive-or) function played an important role in bringing down the first wave of AI, known as the cybernetics wave ([

Point | x₁ | x₂ | XOR(x₁, x₂)
---|---|---|---
1 | 0 | 0 | 0
2 | 0 | 1 | 1
3 | 1 | 0 | 1
4 | 1 | 1 | 0

The dataset or design matrix^{74}

See [

An approximation (or prediction) for the XOR function

We begin with a one-layer network to show that it cannot represent the XOR function,^{75}

This one-layer network is not the Rosenblatt perceptron in Figure

Consider the following one-layer network,^{76}

See [

with the following matrices

since it is written in [

“Model based on the

First-time learners, who have not seen the definition of Rosenblatt’s (1958) perceptron [^{77}

See Section

The MSE cost function in Eq. (

Setting the gradient of the cost function in Eq. (

from which the predicted output

and thus this one-layer network cannot represent the XOR function. Eqs. (^{78}

In least-square linear regression, the normal equations are often presented in matrix form, starting from the errors (or residuals) at the data points, gathered in the matrix

The four points in Table

“It has, in fact, been widely conceded by psychologists that there is little point in trying to ‘disprove’ any of the major learning theories in use today, since by extension, or a change in parameters, they have all proved capable of adapting to any specific empirical data. In considering this approach, one is reminded of a remark attributed to Kistiakowsky, that

So we now add a second layer, and thus more parameters in the hope to be able to represent the XOR function, as shown in Figure ^{79}

Our presentation is more detailed and more general than in [

To map the two points

For activation functions such as ReLU or Heaviside^{80}

In general, the Heaviside function is not used as activation function since its gradient is zero, and thus would not work for gradient descent. But for this XOR problem

and thus

For general activation function

with three distinct points in Eq. (

We have three equations:

for which the exact analytical solution for the parameters

Activation function | Parameters
---|---
ReLU |
Heaviside |
Sigmoid |
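As an illustrative check of the ReLU case, the well-known exact two-layer ReLU solution of the XOR problem from the deep-learning textbook literature can be verified numerically (the specific weight values below are that textbook solution, shown as a sketch):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Exact two-layer ReLU network representing XOR:
# hidden layer z = W1 @ x + b1, output y = w2 . relu(z)
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(W1 @ x + b1)
    return w2 @ h

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # outputs 0, 1, 1, 0
```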

We conjecture that any (nonlinear) function

But it was only more than sixty years later that physicists were able to plot an elephant in 2-D using a model with four complex numbers as parameters [

With nine parameters, the elephant can be made to walk (representing the XOR function), and with a billion parameters, it may even perform some acrobatic maneuver in 3-D; see Section

The concept of network depth turns out to be more complex than initially thought. While for a ^{81}

There are two viewpoints on the definition of depth, one based on the computational graph, and one based on the conceptual graph. From the computational-graph viewpoint, depth is the number of sequential instructions that must be executed in an architecture. From the conceptual-graph viewpoint, depth is the number of concept levels, going from simple concepts to more complex concepts. See also [

“There is no single correct value for the depth of an architecture,^{82}

There are several different network architectures.

For example, keeping the number of layers the same, then the “depth” of a

The lack of consensus on the boundary between “shallow” and “deep” networks is echoed in [

“At which problem depth does

The review paper [

An example of recognizing multidigit numbers in photographs of addresses, in which the test accuracy increased (or test error decreased) with increasing depth, is provided in [

But it is not clear where in [

“An image, for example, comes in the form of an array of pixel values, and the learned features in the

But the above was not a criterion for a network to be considered as “deep”. Further remarks were made on the number of model parameters (weights and biases) and the size of the training dataset for a “typical deep-learning system”, as follows [

“ In a typical

See Remark

“Recent ConvNet [convolutional neural network, or CNN]^{83}

A special type of deep network that went out of favor, and then came back in favor among the computer-vision and machine-learning communities after the spectacular success that ConvNet garnered at the 2012 ImageNet competition; see [

^{84}

A network processing “unit” is also called a “neuron”.

A neural network with 160 billion parameters was perhaps the largest in 2015 [

“Digital Reasoning, a cognitive computing company based in Franklin, Tenn., recently announced that it has trained a neural network consisting of 160 billion parameters—more than 10 times larger than previous neural networks.

The Digital Reasoning neural network easily surpassed previous records held by Google’s 11.2-billion parameter system and Lawrence Livermore National Laboratory’s 15-billion parameter system.”

As mentioned above, for general network architectures (other than feedforward networks), not only is there no consensus on the definition of depth, but there is also no consensus on how much depth a network must have to qualify as being “deep”; see [

“Deep learning can be safely regarded as the study of models that involve a greater amount of composition of either learned functions or learned concepts than traditional machine learning does.”

Figure

The architecture of a network is the number of layers (depth), the layer width (number of neurons per layer), and the connection among the neurons.^{85}

See [

One example of an architecture different from that of fully-connected feedforward networks is the convolutional neural network, which is based on the convolution integral (see Eq. (

“Convolutional networks were also some of the first neural networks to solve important commercial applications and remain at the forefront of commercial applications of deep learning today. By the end of the 1990s, this system deployed by NEC was reading over 10 percent of all the checks in the United States. Later, several OCR and handwriting recognition systems based on convolutional nets were deployed by Microsoft.” [

“Fully-connected networks were believed not to work well. It may be that the primary barriers to the success of neural networks were psychological (practitioners did not expect neural networks to work, so they did not make a serious effort to use neural networks). Whatever the case, it is fortunate that convolutional networks performed well decades ago. In many ways, they carried the torch for the rest of deep learning and paved the way to the acceptance of neural networks in general.” [

Here, we present a more recent and successful network architecture different from the fully-connected feedforward network. Residual network was introduced in [

The basic building block of residual network is shown in Figure

The identity map that jumps over a number of layers in the residual network building block in Figure

A deep residual network with more than 1,200 layers was proposed in [

It is still not clear why some architectures worked well, while others did not:

“The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.” [

Backpropagation, sometimes abbreviated as “backprop”, was a child of whom many could claim to be the father, and is used to compute the gradient of the cost function with respect to the parameters (weights and biases); see Section

Two types of cost function are discussed here: (1) the mean squared error (MSE), and (2) the maximum likelihood (probability cost).^{86}

For other types of loss function, see, e.g., (1) Section “Loss functions” in “torch.nn—PyTorch Master Documentation” (

For a given input

The factor ^{87}

There is an inconsistent use of notation in [

While the components ^{88}

In our notation,

where

Many (if not most) modern networks employed a probability cost function based on the principle of maximum likelihood, which has the form of negative log-likelihood, describing the cross-entropy between the training data with probability distribution

where

The expectations of a function ^{89}

The simplified notation

called the information content of

The product (chain) rule of conditional probabilities consists of expressing a joint probability of several random variables ^{90}

See, e.g., [

The logarithm of the products in Eq. (

The parameters ^{91}

A tilde is put on top of

where

is called the ^{92}

See, e.g., [

as in Eq. (^{93}

The normal (Gaussian) distribution of scalar random variable

with

Then summing Eq. (

and thus the minimizer

where the MSE cost function

Thus finding the minimizer of the maximum likelihood cost function in Eq. (
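In outline, with notation adapted to this Section (and written out here since the referenced equations are displayed elsewhere), the equivalence between maximum likelihood under a Gaussian model and least squares reads:

```latex
% Gaussian model with fixed variance \sigma^2 and network output \hat{y}(x;\theta):
%   p(y \mid x; \theta) = \mathcal{N}\big(y;\, \hat{y}(x;\theta),\, \sigma^2\big)
% Negative log-likelihood over m training examples:
J(\theta)
  = -\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{m} \big(y^{(i)} - \hat{y}^{(i)}\big)^2
    + \frac{m}{2}\log\big(2\pi\sigma^2\big)
```

so the minimizer over θ does not depend on σ², and coincides with the minimizer of the MSE cost.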

Remark ^{94}

See, e.g., [

In classification tasks—such as used in [^{95}

See, e.g., [

The output of the neural network is supposed to represent the probability

In case more than two categories occur in a classification problem, a neural network is trained to estimate the probability distribution over the discrete number (

For this purpose, the idea of

and is generalized then to vector-valued outputs; see also Figure

The

and is a smoothed version of the max function [^{96}

See also [

^{97}

Since the probability of

where the product rule was applied to the numerator of Eq. (

as in Eq. (^{98}

See also [

Using a different definition, the softmax function (version 2) can be written as

which is the same as Eq. (
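A numerically stable implementation of the softmax function can be sketched as follows (the max-subtraction trick is standard practice, assumed here rather than taken from the text; by shift invariance it leaves the result unchanged while preventing overflow in the exponential):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) does not change the result (shift invariance),
    # but prevents overflow in exp for large logits.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# p sums to 1 and preserves the ordering of the logits
```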

The gradient of a cost function

We focus our attention on developing backpropagation for fully-connected networks, for which an explicit derivation was not provided in [^{99}

See [

It is convenient to recall here some equations developed earlier (keeping the same equation numbers) for the computation of the gradient

Cost function

Inputs

Weighted sum of inputs and biases

Network parameters

Expanded layer outputs

Activation function

The gradient of the cost function

The above equations are valid for the last layer

Using Eq. (

where

and

which then agrees with the matrix dimension in the first expression for

with

Comparing Eq. (

is only needed to be computed once for use to compute both the gradient of the cost

and the gradient of the cost

The block diagram for backpropagation at layer
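The backpropagation relations above can be sketched for a small two-layer ReLU network with half-MSE cost, and checked against a central finite difference (all sizes and numerical values are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer net: y_hat = W2 @ relu(W1 @ x + b1) + b2,
# with half mean-squared-error cost J = 0.5 * ||y_hat - y||^2.
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)
x, y = rng.standard_normal(2), rng.standard_normal(1)

def forward():
    z1 = W1 @ x + b1              # weighted sum of inputs and biases
    h1 = np.maximum(0.0, z1)      # ReLU layer output
    return z1, h1, W2 @ h1 + b2

def cost():
    return 0.5 * np.sum((forward()[2] - y) ** 2)

def backprop():
    z1, h1, y_hat = forward()
    delta2 = y_hat - y                    # dJ/dz at the output layer
    dW2, db2 = np.outer(delta2, h1), delta2
    delta1 = (W2.T @ delta2) * (z1 > 0)   # chain rule back through ReLU
    dW1, db1 = np.outer(delta1, x), delta1
    return dW1, db1, dW2, db2

# Central finite-difference check on one weight component.
dW1 = backprop()[0]
eps = 1e-6
W1[0, 0] += eps; Jp = cost()
W1[0, 0] -= 2 * eps; Jm = cost()
W1[0, 0] += eps
fd = (Jp - Jm) / (2 * eps)
print(abs(dW1[0, 0] - fd))  # small: analytical and numerical gradients agree
```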

To demonstrate the vanishing gradient problem, a network is used in [

We note immediately that the vanishing / exploding gradient problem can be resolved using the rectified linear function (ReLU, Figure ^{100}

See [

The speed of learning of a hidden layer

The speed of learning in each of the four layers as a function of the number of epochs^{101}

An epoch is when all examples in the dataset had been used in a training session of the optimization process. For a formal definition of “epoch”, see Section

To understand the reason for the quick and significant decrease in the speed of learning, consider a network with four layers, having one scalar input ^{102}

See also [

The neuron in layer

As an example of computing the gradient, the derivative of the cost function

The back propagation procedure to compute the gradient

Whether the gradient

In other mixed cases, the problem of vanishing or exploding gradient could be alleviated by the changing of the magnitude

While the vanishing gradient problem for multilayer networks (static case) may be alleviated by weights that vary from layer to layer (the mixed cases mentioned above), this problem is especially critical in the case of Recurrent Neural Networks, since the weights stay constant for all state numbers (or “time”) in a sequence of data. See Remark

The first derivatives of the sigmoid function and hyperbolic tangent function depicted in Figure

and are less than 1 in magnitude (everywhere for the sigmoid function, and almost everywhere for the hyperbolic tangent (tanh) function), except at

The exploding gradient problem is the opposite of the vanishing gradient problem, and occurs when the magnitude of the gradient increases in subsequent multiplications, particularly at a “cliff”, which is a sharp drop in the cost function in the parameter space.^{103}

See [

The rectified linear function depicted in Figure

“For a given input only a subset of neurons are active. Computation is linear on this subset ... Because of this linearity, gradients flow well on the active paths of neurons (there is no gradient vanishing effect due to activation non-linearities of sigmoid or tanh units), and mathematical investigation is easier. Computations are also cheaper: there is no need for computing the exponential function in activations, and sparsity can be exploited.”

A problem with ReLU was that some neurons were never activated; these neurons were called “dying” or “dead”, as described in [

“However, ReLU units are at a potential disadvantage during optimization because the gradient is 0 whenever the unit is not active. This could lead to cases where a unit never activates as a gradient-based optimization algorithm will not adjust the weights of a unit that never activates initially. Further, like the vanishing gradients problem, we might expect learning to be slow when training ReL networks with constant 0 gradients.”

To remedy this “dying” or “dead” neuron problem, the Leaky ReLU, proposed in [^{104}

According to Google Scholar, [

Instead of arbitrarily fixing the slope

and thus the network adaptively learned the parameters to control the leaky part of the activation function. Using the Parametric ReLU in Eq. (

For network training, i.e., to find the optimal network parameters ^{105}

A “full batch” is a complete training set of examples; see Footnote

^{106}

A minibatch is a random subset of the training set, which is called here the “full batch”; see Footnote

Figure ^{107}

See “CIFAR-10”, Wikipedia,

Deterministic optimization methods (Section

Stochastic optimization methods (Section

First-order

Classical line search with stochasticity:

The classical (old) thinking—starting in 1992 with [^{108}

See also [

^{109}

See [

The modern thinking is exemplified by Figure

Such modern practice was the motivation for research into shallow networks with

To develop a neural-network model, a dataset governed by the same probability distribution, such as the CIFAR-10 dataset mentioned above, can typically be divided into three non-overlapping subsets called

It was suggested in [^{110}

Andrew Ng suggested the following partitions. For small datasets having less than

Examples in the training set are fed into an optimizer to find the network parameter estimate ^{111}

The word “estimate” is used here for the more general case of stochastic optimization with minibatches; see Section

^{112}

An epoch is when all examples in the dataset had been used in a training session of the optimization process. For a formal definition of “epoch”, see Section

Figure

Because of the “asymmetric U-shaped curve” of the validation error, the thinking was that if the optimization process could stop early at the global minimum of the validation error, then the generalization (test) error, i.e., the value of the cost function on the test set, would also be small, thus the name “

The difference between the test (generalization) error and the validation error is called the generalization gap, as shown in the

Even the best machine learning generalization capability nowadays still cannot compete with the generalization ability of human babies; see Section

then define the

[

The issue is how to determine the generalization loss lower bound

Moreover, the above discussion is for the

^{113}

See also “Method for early stopping in a neural network”, StackExchange, 2018.03.05,

Since it is important to monitor the validation error during training, a whole section is devoted in [

Before presenting the stochastic gradient-descent (SGD) methods in Section

“One should not lose sight of the fact that [full] batch approaches possess some intrinsic advantages. First, the use of full gradient information at each iterate opens the door for many deterministic gradient-based optimization methods that have been developed over the past decades, including not only the full gradient method, but also accelerated gradient, conjugate gradient, quasi-Newton, and inexact Newton methods, and can benefit from parallelization.” [

Once the gradient

being the gradient direction, and ^{114}

See Figure

Otherwise, the update of the whole network parameter

where

“Neural network researchers have long realized that the learning rate is reliably one of the most difficult to set hyperparameters because it significantly affects model performance.” [

In fact, it is well known in the field of optimization, where the learning rate is often mnemonically denoted by

“We can choose

Choosing an arbitrarily small ^{115}

See, e.g., [

^{116}

See also [

Line search in deep-learning training. Line search methods are not only important for use in deterministic optimization with full batch of examples,^{117}

A full batch contains all examples in the training set. There is confusion in the use of the word “batch” in terminologies such as “batch optimization” or “batch gradient descent”, which are used to mean the full training set, and not a subset of the training set; see, e.g., [

^{118}

See, e.g., [

^{119}

In [

In view of

Find a positive step length

is negative, i.e., the descent direction ^{120}

Or equivalently, the descent direction

The minimization problem in Eq. (^{121}

See, e.g., [

^{122}

See [

The method is inexact since the search for an acceptable step length would stop before a minimum is reached, once the rule is satisfied.^{123}

The book [

^{124}

See also [

where both the numerator and the denominator are negative, i.e., ^{125}

See [

A reason could be that the sector bounded by the two lines

The search for an appropriate step length that satisfies Eq. (^{126}

See [

Apparently without the knowledge of [^{127}

As of 2022.07.09, [

^{128}

All of these stochastic optimization methods are considered as part of a broader class known as derivative-free optimization methods [

Armijo’s rule is stated as follows: For ^{129}

[

where the decrease in the cost function along the descent direction

which is also known as the Armijo sufficient decrease condition, the first of the two Wolfe conditions presented below; see [^{130}

See also [

Regarding the parameters

and proved a convergence theorem. In practice, ^{131}

See [

^{132}

To satisfy the condition in Eq. (

^{133}

The inequality in Eq. (

where ^{134}

A narrow valley with the minimizer

The pseudocode for deterministic gradient descent with Armijo line search is Algorithm
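A minimal sketch of backtracking line search under the Armijo sufficient-decrease condition is given below (function and parameter names are ours; the shrink factor and sufficient-decrease constant are common defaults, not values prescribed in the text):

```python
import numpy as np

def armijo_backtracking(f, grad, theta, direction, eps0=1.0, beta=0.5, c=1e-4):
    """Backtracking line search enforcing the Armijo condition
       f(theta + eps*d) <= f(theta) + c * eps * grad(theta).d ,
    with d a descent direction (grad(theta).d < 0)."""
    f0 = f(theta)
    slope = grad(theta) @ direction      # must be negative for a descent direction
    eps = eps0
    while f(theta + eps * direction) > f0 + c * eps * slope:
        eps *= beta                      # shrink the step until the rule holds
    return eps

# Usage: quadratic f(x) = ||x||^2 with the steepest-descent direction.
f = lambda x: x @ x
grad = lambda x: 2.0 * x
theta = np.array([1.0, 1.0])
eps = armijo_backtracking(f, grad, theta, -grad(theta))
```

The search is inexact in the sense discussed above: it stops at the first acceptable step length, not at a minimizer along the direction.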

When the Hessian

and the regularized Newton method uses a descent direction based on a regularized Hessian of the form:

where ^{135}

See, e.g., [

The rule introduced in [^{136}

As of 2022.07.09, [

^{137}

The authors of [

^{138}

An earlier version of the 2017 paper [

The first Wolfe’s rule in Eq. (

The second Wolfe’s rule in Eq. (

For other variants of line search, we refer to [

To avoid confusion,^{139}

See [

In fact, as we shall see, and as mentioned in

“The learning rate may be chosen by trial and error. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism.” [

At the time of this writing, we are aware of two review papers on optimization algorithms for machine learning, and in particular deep learning, aiming particularly at experts in the field: [

Listed below are the points that distinguish the present paper from other reviews. Similar to [

Only mentioned briefly, in words, the connection between SGD with momentum and mechanics, without a detailed explanation using the equation of motion of the “heavy ball”, a name not as accurate as the original name “small heavy sphere” by Polyak (1964) [

Did not discuss recent practical add-on improvements to SGD such as step-length tuning (Section

Did not connect step-length decay to simulated annealing, and did not explain the reason for using the name “annealing”^{140}

The authors of [

Did not review an alternative to step-length decay by increasing minibatch size, which could be more efficient, as proposed in [

Did not point out that the exponential smoothing method (or running average) used in adaptive learning-rate algorithms dated back to the 1950s in the field of forecasting. None of these references acknowledged the contributions made in [

Did not discuss recent adaptive learning-rate algorithms such as ^{141}

The authors of [

Did not discuss classical line-search rules—such as [^{142}

The authors of [

The stochastic gradient descent algorithm, originally introduced by Robbins & Monro (1951a) [^{143}

See, e.g., [

“Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent (SGD). Stochastic gradient descent is an extension of the gradient descent algorithm.” [

^{144}

As of 2010.04.30, the ImageNet database contained more than 14 million images. In the cited book, 𝓂 and 𝓂^{'} denote the number of examples in the training set and in the minibatch, respectively, whereas on p. 274, 𝓂 denotes the number of examples in a minibatch. In our notation, 𝓂 is the dimension of the output array 𝓎, whereas 𝗆 (in a different font) is the minibatch size; see Footnote

“The minibatch size

Generated as in Eq. (^{145}

An epoch, or training session, τ is explicitly defined here as when the minibatches as generated in

Note that once the random index set

Unlike the iteration counter

where we wrote the random index as

The pseudocode for the standard SGD^{146}

See also [
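As a toy illustration of the minibatch SGD loop just described (the least-squares problem, minibatch size, step length, and all other constants below are ours), each epoch reshuffles the training set into random non-overlapping minibatches, and each update uses the gradient estimate from one minibatch only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Least-squares toy problem: recover w_true from noisy data via minibatch SGD.
n, batch = 256, 16
w_true = np.array([2.0, -3.0])
X = rng.standard_normal((n, 2))
y = X @ w_true + 0.01 * rng.standard_normal(n)

w, eps = np.zeros(2), 0.1
for epoch in range(50):
    perm = rng.permutation(n)            # new random minibatches each epoch
    for k in range(0, n, batch):
        idx = perm[k:k + batch]
        # gradient estimate of the mean-squared error over the minibatch
        g = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= eps * g                     # standard SGD update
print(w)  # close to w_true
```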

“Despite the prevalent use of SGD, it has known challenges and inefficiencies. First, the direction may not represent a descent direction, and second, the method is sensitive to the step-size (learning rate) which is often poorly overestimated.” [

For the above reasons, it may not be appropriate to use the norm of the gradient estimate being small as stationarity condition, i.e., where the local minimizer or saddle point is located; see the discussion in [

Despite the above problems, SGD has been brought back to the forefront as the state-of-the-art algorithm to beat, surpassing the performance of adaptive methods, as confirmed by three recent papers: [

Momentum and accelerated gradient: Improve (accelerate) convergence in narrow valleys, Section

Initial-step-length tuning: Find effective initial step length

Step-length decaying or annealing: Find an effective learning-rate schedule^{147}

See Figure

Minibatch-size increase, keeping step length fixed, equivalent to annealing, Section

Weight decay, Section

The standard update for gradient descent is Eq. (

from which the following methods are obtained (line

Standard SGD update Eq. (

SGD with classical momentum: ^{148}

Often called by the more colloquial name, the “heavy ball” method; see

SGD with fast (accelerated) gradient:^{149}

Sometimes referred to as Nesterov’s Accelerated Gradient (NAG) in the deep-learning literature.
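One update step of both variants can be sketched as follows (the look-ahead form of the accelerated gradient is one common formulation among several; step length and momentum values are illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, eps=0.02, mu=0.9, nesterov=False):
    """One parameter update of SGD with classical momentum, or with the
    accelerated (Nesterov) gradient when nesterov=True, in which the
    gradient is evaluated at the look-ahead point theta + mu*v."""
    g = grad(theta + mu * v) if nesterov else grad(theta)
    v = mu * v - eps * g          # velocity accumulates past gradients
    return theta + v, v

# Usage: narrow quadratic valley f(x) = 0.5*(x1^2 + 25*x2^2), where momentum
# damps oscillations across the valley and builds up speed along its floor.
grad = lambda th: np.array([th[0], 25.0 * th[1]])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, grad)
```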

The continuous counterpart of the parameter update Eq. (

which is the same as the update Eq. (

The choice of the momentum parameter

Figure

In their remarkable paper, the authors of [^{150}

A nice animation of various optimizers (

See Figure

For more insight into the update Eq. (

i.e., without momentum for the first term. So the effective gradient is the sum of all gradients from the beginning ^{151}

See also Section

“Momentum is a simple method for increasing the speed of learning when the objective function contains long, narrow and fairly straight ravines with a gentle but consistent gradient along the floor of the ravine and much steeper gradients up the sides of the ravine. The momentum method simulates a heavy ball rolling down a surface. The ball builds up velocity along the floor of the ravine, but not across the ravine because the opposing gradients on opposite sides of the ravine cancel each other out over time.”

In recent years, Polyak (1964) [^{152}

Polyak (1964) [

^{153}

See [

^{154}

See, e.g., [

^{155}

Or the “Times Square Ball”, Wikipedia,

For Nesterov’s fast (accelerated) gradient method, many references referred to [^{156}

Reference [

^{157}

A function

The initial step length

The following simple tuning method was proposed in [

“To tune the step sizes, we evaluated a logarithmically-spaced grid of five step sizes. If the best performance was ever at one of the extremes of the grid, we would try new grid points so that the best performance was contained in the middle of the parameters. For example, if we initially tried step sizes 2, 1, 0.5, 0.25, and 0.125 and found that 2 was the best performing, we would have tried the step size 4 to see if performance was improved. If performance improved, we would have tried 8 and so on.”

The above logarithmically-spaced grid was given by

^{158}

The last two values

In the update of the parameter ^{159}

The Avant Garde font † is used to avoid confusion with

The following learning-rate scheduling, linear with respect to ^{160}

See [_{c} in Eq. (_{†c} “should be set to roughly 1 percent the value of

where ^{161}

See [_{400} = 5%∈_{0} according to Eq. (

with
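Such a linear schedule can be sketched as follows (the interpolation over the first c iterations and the roughly-1-percent final value follow the description above; the function and argument names are ours):

```python
def linear_decay(k, eps0, c, eps_c=None):
    """Linear step-length decay: interpolate from eps0 down to eps_c over the
    first c iterations, then hold constant at eps_c (default: 1% of eps0)."""
    if eps_c is None:
        eps_c = 0.01 * eps0
    if k >= c:
        return eps_c
    a = k / c
    return (1.0 - a) * eps0 + a * eps_c
```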

Another step-length decay method proposed in [

Recall,

as an add-on to the parameter update for vanilla SGD Eq. (

as an add-on to the parameter update for SGD with momentum and accelerated gradient Eq. (

where

Figure

^{162}

Eq. (

The inequality on the left of Eq. (

In Section

The minibatch parameter update from Eq. (

where

To show that the gradient error has zero mean (average), based on the linearity of the expectation function

from

Or alternatively, the same result can be obtained with:

Next, the mean value of the “square” of the gradient error, i.e., ^{163}

See, e.g., [

where

Eq. (

where the iteration counter

Now assume the covariance matrix of any pair of single-example gradients

where ^{164}

Eq. (

The authors of [

where

where

The fluctuation factor

^{165}

In [

since their cost function was not an average, i.e., not divided by the minibatch size

It was suggested in [^{166}

See Figure

The results are shown in Figure

^{168}

“In metallurgy and materials science, annealing is a heat treatment that alters the physical and sometimes chemical properties of a material to increase its ductility and reduce its hardness, making it more workable. It involves heating a material above its recrystallization temperature, maintaining a suitable temperature for a suitable amount of time, and then allow slow cooling.” Wikipedia, ‘Annealing (metallurgy)’, Version

Even though the authors of [

with the intriguing factor ^{169}

In original notation used in [

where

The column matrix (or vector)

To obtain a differential equation, Eq. (

which shows that the derivative of

The last term ^{170}

See also “Langevin equation”, Wikipedia,

where ^{171}

For first-time learners, here a guide for further reading on a derivation of Eq. (

where

The covariance of the noise

Eq. (

where

The most famous of these nature-inspired algorithms would perhaps be simulated annealing in [

For applications of these nature-inspired algorithms, we cite the following works, without detailed review: [

Reducing, or decaying, the network parameters

where

It was written in [

In the case of weight decay with cyclic annealing, both the step length

The effectiveness of SGD with weight decay, with and without cyclic annealing, is presented in Figure
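A minimal sketch of the weight-decay add-on to the vanilla SGD update (the multiplicative shrink form and the decay constant below are illustrative assumptions):

```python
import numpy as np

def sgd_weight_decay_step(theta, grad, eps=0.1, decay=1e-2):
    """Vanilla SGD update with weight decay as an add-on:
       theta <- (1 - decay) * theta - eps * grad(theta) .
    With zero gradient, the parameters shrink geometrically toward zero."""
    return (1.0 - decay) * theta - eps * grad(theta)
```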

To have a general parameter-update equation that combines all of the above add-on improvement tricks, start with the parameter update with momentum and accelerated gradient Eq. (