I will start making a list here of the ones I have learned so far. As @marcodena said, the pros and cons are more difficult because they're mostly just heuristics learned from trying these things, but I figure at least having a list of what they are can't hurt.
First, I'll define the notation explicitly so there is no confusion:
Notation
This notation is from Nielsen's book.
A Feedforward Neural Network is many layers of neurons connected together. It takes in an input, then that input "trickles" through the network, and the network returns an output vector.
More formally, call $a^i_j$ the activation (aka output) of the $j^{th}$ neuron in the $i^{th}$ layer, where $a^1_j$ is the $j^{th}$ element of the input vector.
We can then relate the next layer's activations to the previous layer's via the following relation:
$$a^i_j = \sigma\left(\sum_k \left(w^i_{jk} \cdot a^{i-1}_k\right) + b^i_j\right)$$
where
- $\sigma$ is the activation function,
- $w^i_{jk}$ is the weight from the $k^{th}$ neuron in the $(i-1)^{th}$ layer to the $j^{th}$ neuron in the $i^{th}$ layer,
- $b^i_j$ is the bias of the $j^{th}$ neuron in the $i^{th}$ layer, and
- $a^i_j$ represents the activation value of the $j^{th}$ neuron in the $i^{th}$ layer.
Sometimes we write $z^i_j$ to represent $\sum_k \left(w^i_{jk} \cdot a^{i-1}_k\right) + b^i_j$, in other words, the activation value of a neuron before applying the activation function.
For more concise notation we can write
$$a^i = \sigma\left(w^i \times a^{i-1} + b^i\right)$$
To use this formula to compute the output of a feedforward network for some input $I \in \mathbb{R}^n$, set $a^1 = I$, then compute $a^2, a^3, \ldots, a^m$, where $m$ is the number of layers.
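As a rough illustration, here is a minimal sketch of that forward pass (using NumPy; the function and parameter names are mine), where `weights[i]` and `biases[i]` are assumed to hold $w^i$ and $b^i$ and `sigma` is whichever activation function is being used:

```python
import numpy as np

def feedforward(weights, biases, sigma, x):
    a = x                          # a^1 = I, the input vector
    for w, b in zip(weights, biases):
        a = sigma(w @ a + b)       # a^i = sigma(w^i a^{i-1} + b^i)
    return a                       # a^m, the output of the last layer
```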
Activation Functions
(in the following, we will write $\exp(x)$ instead of $e^x$ for readability)
Identity
Also known as a linear activation function.
$$a^i_j = \sigma(z^i_j) = z^i_j$$
Step
$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < 0 \\ 1 & \text{if } z^i_j > 0 \end{cases}$$
Piecewise Linear
Choose some $x_{\min}$ and $x_{\max}$, which is our "range". Everything less than this range will be 0, and everything greater than this range will be 1. Anything else is linearly interpolated in between. Formally:
$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < x_{\min} \\ m z^i_j + b & \text{if } x_{\min} \le z^i_j \le x_{\max} \\ 1 & \text{if } z^i_j > x_{\max} \end{cases}$$
Where
$$m = \frac{1}{x_{\max} - x_{\min}}$$
and
$$b = -m x_{\min} = 1 - m x_{\max}$$
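As a small sketch of how this might look in code (NumPy; names are mine), note that clipping the linear part to $[0, 1]$ reproduces all three cases:

```python
import numpy as np

def piecewise_linear(z, x_min=-1.0, x_max=1.0):
    m = 1.0 / (x_max - x_min)            # slope over the linear region
    b = -m * x_min                       # equivalently 1 - m * x_max
    return np.clip(m * z + b, 0.0, 1.0)  # 0 below x_min, 1 above x_max
```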
Sigmoid
$$a^i_j = \sigma(z^i_j) = \frac{1}{1 + \exp(-z^i_j)}$$
Complementary log-log
$$a^i_j = \sigma(z^i_j) = 1 - \exp\left(-\exp(z^i_j)\right)$$
Bipolar
$$a^i_j = \sigma(z^i_j) = \begin{cases} -1 & \text{if } z^i_j < 0 \\ 1 & \text{if } z^i_j > 0 \end{cases}$$
Bipolar Sigmoid
$$a^i_j = \sigma(z^i_j) = \frac{1 - \exp(-z^i_j)}{1 + \exp(-z^i_j)}$$
Tanh
$$a^i_j = \sigma(z^i_j) = \tanh(z^i_j)$$
LeCun's Tanh
See Efficient Backprop.
$$a^i_j = \sigma(z^i_j) = 1.7159 \tanh\left(\tfrac{2}{3} z^i_j\right)$$
Hard Tanh
$$a^i_j = \sigma(z^i_j) = \max\left(-1, \min\left(1, z^i_j\right)\right)$$
Absolute
$$a^i_j = \sigma(z^i_j) = \left| z^i_j \right|$$
Rectifier
Also known as Rectified Linear Unit (ReLU), Max, or the Ramp Function.
$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)$$
Modifications of ReLU
These are some activation functions that I have been playing with that seem to have very good performance for MNIST for mysterious reasons.
$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j) + \cos(z^i_j)$$
$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j) + \sin(z^i_j)$$
Smooth Rectifier
Also known as Smooth Rectified Linear Unit, Smooth Max, or Softplus.
$$a^i_j = \sigma(z^i_j) = \log\left(1 + \exp(z^i_j)\right)$$
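For reference, here is a minimal sketch (NumPy; the function names are mine) of several of the pointwise activations above:

```python
import numpy as np

def identity(z):        return z
def step(z):            return np.where(z < 0, 0.0, 1.0)
def sigmoid(z):         return 1.0 / (1.0 + np.exp(-z))
def bipolar_sigmoid(z): return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))
def lecun_tanh(z):      return 1.7159 * np.tanh(2.0 / 3.0 * z)
def hard_tanh(z):       return np.clip(z, -1.0, 1.0)
def relu(z):            return np.maximum(0.0, z)
def relu_plus_cos(z):   return np.maximum(0.0, z) + np.cos(z)
def softplus(z):        return np.log1p(np.exp(z))  # log(1 + exp(z))
```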
Logit
$$a^i_j = \sigma(z^i_j) = \log\left(\frac{z^i_j}{1 - z^i_j}\right)$$
Probit
$$a^i_j = \sigma(z^i_j) = \sqrt{2}\,\text{erf}^{-1}\left(2 z^i_j - 1\right)$$
Where $\text{erf}$ is the Error Function. It can't be described via elementary functions, but you can find ways of approximating its inverse at that Wikipedia page and here.
Alternatively, since $\sqrt{2}\,\text{erf}^{-1}(2x - 1)$ is exactly the inverse of the standard normal CDF, it can be expressed as
$$a^i_j = \sigma(z^i_j) = \phi^{-1}(z^i_j)$$
where $\phi$ is the Cumulative Distribution Function (CDF) of the standard normal distribution. See here for means of approximating this.
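As a quick numerical check (assuming SciPy is available), $\sqrt{2}\,\text{erf}^{-1}(2x - 1)$ and the inverse normal CDF do agree:

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import norm

x = np.linspace(0.01, 0.99, 5)
via_erf = np.sqrt(2) * erfinv(2 * x - 1)  # sqrt(2) * erf^{-1}(2x - 1)
via_cdf = norm.ppf(x)                     # inverse CDF of the standard normal
print(np.allclose(via_erf, via_cdf))      # True
```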
Cosine
See Random Kitchen Sinks.
$$a^i_j = \sigma(z^i_j) = \cos(z^i_j)$$
Softmax
Also known as the Normalized Exponential.
$$a^i_j = \frac{\exp(z^i_j)}{\sum_k \exp(z^i_k)}$$
This one is a little weird because the output of a single neuron depends on the other neurons in that layer. It can also be difficult to compute, as $z^i_j$ may be a very high value, in which case $\exp(z^i_j)$ will probably overflow. Likewise, if $z^i_j$ is a very low value, it will underflow and become 0.
To combat this, we will instead compute $\log(a^i_j)$. This gives us:
$$\log(a^i_j) = \log\left(\frac{\exp(z^i_j)}{\sum_k \exp(z^i_k)}\right)$$
$$\log(a^i_j) = z^i_j - \log\left(\sum_k \exp(z^i_k)\right)$$
Here we need to use the log-sum-exp trick:
Let's say we are computing:
$$\log\left(e^2 + e^9 + e^{11} + e^{-7} + e^{-2} + e^5\right)$$
We will first sort our exponentials by magnitude for convenience:
$$\log\left(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7}\right)$$
Then, since $e^{11}$ is our highest term, we multiply by $\frac{e^{-11}}{e^{-11}}$:
$$\log\left(\frac{e^{-11}}{e^{-11}}\left(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7}\right)\right)$$
$$\log\left(\frac{1}{e^{-11}}\left(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}\right)\right)$$
$$\log\left(e^{11}\left(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}\right)\right)$$
$$\log\left(e^{11}\right) + \log\left(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}\right)$$
$$11 + \log\left(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}\right)$$
We can then compute the expression on the right and take the log of it. It's okay to do this because that sum is very small with respect to $\log(e^{11})$, so any underflow to 0 wouldn't have been significant enough to make a difference anyway. Overflow can't happen in the expression on the right because we are guaranteed that after multiplying by $e^{-11}$, all the powers will be $\le 0$.
Formally, we call $m = \max(z^i_1, z^i_2, z^i_3, \ldots)$. Then:
$$\log\left(\sum_k \exp(z^i_k)\right) = m + \log\left(\sum_k \exp(z^i_k - m)\right)$$
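Here is a minimal sketch of this trick in code (NumPy; the function name is mine), checked against the worked example above:

```python
import numpy as np

def logsumexp(z):
    m = np.max(z)                             # m = max_k z_k
    return m + np.log(np.sum(np.exp(z - m)))  # m + log(sum_k exp(z_k - m))

print(logsumexp(np.array([2., 9., 11., -7., -2., 5.])))  # ~11.129
```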
Our softmax function then becomes:
$$a^i_j = \exp\left(\log(a^i_j)\right) = \exp\left(z^i_j - m - \log\left(\sum_k \exp(z^i_k - m)\right)\right)$$
Also as a sidenote, the derivative of the softmax function is:
$$\frac{d\sigma(z^i_j)}{dz^i_j} = \sigma'(z^i_j) = \sigma(z^i_j)\left(1 - \sigma(z^i_j)\right)$$
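Putting this together, a numerically stable softmax might be sketched like this (NumPy; names are mine):

```python
import numpy as np

def softmax(z):
    m = np.max(z)
    log_a = z - m - np.log(np.sum(np.exp(z - m)))  # log(a_j), computed safely
    return np.exp(log_a)

print(softmax(np.array([2., 9., 11., -7., -2., 5.])).sum())  # ~1.0
```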
Maxout
This one is also a little tricky. Essentially the idea is that we break up each neuron in our maxout layer into lots of sub-neurons, each of which has its own weights and biases. Then the input to a neuron goes to each of its sub-neurons instead, and each sub-neuron simply outputs its $z$ (without applying any activation function). The $a^i_j$ of that neuron is then the max of all its sub-neurons' outputs.
Formally, in a single neuron, say we have $n$ sub-neurons. Then
$$a^i_j = \max_{k \in [1,n]} s^i_{jk}$$
where
$$s^i_{jk} = a^{i-1} \bullet w^i_{jk} + b^i_{jk}$$
($\bullet$ is the dot product)
To help us think about this, consider the weight matrix $W^i$ for the $i^{th}$ layer of a neural network that is using, say, a sigmoid activation function. $W^i$ is a 2D matrix, where each column $W^i_j$ is a vector for neuron $j$ containing a weight for every neuron in the previous layer $i-1$.
If we're going to have sub-neurons, we're going to need a 2D weight matrix for each neuron, since each sub-neuron will need a vector containing a weight for every neuron in the previous layer. This means that $W^i$ is now a 3D weight matrix, where each $W^i_j$ is the 2D weight matrix for a single neuron $j$. Then $W^i_{jk}$ is a vector for sub-neuron $k$ in neuron $j$ that contains a weight for every neuron in the previous layer $i-1$.
Likewise, in a neural network that is again using, say, a sigmoid activation function, $b^i$ is a vector with a bias $b^i_j$ for each neuron $j$ in layer $i$.
To do this with sub-neurons, we need a 2D bias matrix $b^i$ for each layer $i$, where $b^i_j$ is the vector containing a bias $b^i_{jk}$ for each sub-neuron $k$ in the $j^{th}$ neuron.
Having a weight matrix $W^i_j$ and a bias vector $b^i_j$ for each neuron then makes the above expressions very clear: we simply apply each sub-neuron's weights $w^i_{jk}$ to the outputs $a^{i-1}$ from layer $i-1$, then apply their biases $b^i_{jk}$ and take the max of the results.
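As a rough sketch (NumPy; the shapes and names are my assumptions, following the description above), a maxout layer could look like:

```python
import numpy as np

def maxout_layer(W, b, a_prev):
    # W: (neurons, sub-neurons, inputs), b: (neurons, sub-neurons),
    # a_prev: (inputs,) -- the outputs a^{i-1} of the previous layer.
    s = np.einsum('jkl,l->jk', W, a_prev) + b  # s_jk = a^{i-1} . w_jk + b_jk
    return s.max(axis=1)                       # a_j = max over sub-neurons k
```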
Radial Basis Function Networks
Radial Basis Function Networks are a modification of Feedforward Neural Networks, where instead of using
$$a^i_j = \sigma\left(\sum_k \left(w^i_{jk} \cdot a^{i-1}_k\right) + b^i_j\right)$$
we have one weight $w^i_{jk}$ per node $k$ in the previous layer (as normal), and also one mean vector $\mu^i_{jk}$ and one standard deviation vector $\sigma^i_{jk}$ for each node in the previous layer.
Then we call our activation function $\rho$ to avoid getting it confused with the standard deviation vectors $\sigma^i_{jk}$. Now to compute $a^i_j$ we first need to compute one $z^i_{jk}$ for each node in the previous layer. One option is to use the Euclidean distance:
$$z^i_{jk} = \left\| a^{i-1} - \mu^i_{jk} \right\| = \sqrt{\sum_\ell \left(a^{i-1}_\ell - \mu^i_{jk\ell}\right)^2}$$
Where $\mu^i_{jk\ell}$ is the $\ell^{th}$ element of $\mu^i_{jk}$. This one does not use the $\sigma^i_{jk}$. Alternatively there is the Mahalanobis distance, which supposedly performs better:
$$z^i_{jk} = \sqrt{\left(a^{i-1} - \mu^i_{jk}\right)^T \left(\Sigma^i_{jk}\right)^{-1} \left(a^{i-1} - \mu^i_{jk}\right)}$$
where $\Sigma^i_{jk}$ is the covariance matrix, defined as:
$$\Sigma^i_{jk} = \text{diag}\left(\sigma^i_{jk}\right)$$
In other words, $\Sigma^i_{jk}$ is the diagonal matrix with $\sigma^i_{jk}$ as its diagonal elements. We define $a^{i-1}$ and $\mu^i_{jk}$ as column vectors here because that is the notation that is normally used.
This is really just saying that the Mahalanobis distance here is defined as
$$z^i_{jk} = \sqrt{\sum_\ell \frac{\left(a^{i-1}_\ell - \mu^i_{jk\ell}\right)^2}{\sigma^i_{jk\ell}}}$$
Where $\sigma^i_{jk\ell}$ is the $\ell^{th}$ element of $\sigma^i_{jk}$. Note that $\sigma^i_{jk\ell}$ must always be positive, but this is a typical requirement for standard deviations, so this isn't that surprising.
If desired, the Mahalanobis distance is general enough that the covariance matrix $\Sigma^i_{jk}$ can be defined as other matrices. For example, if the covariance matrix is the identity matrix, our Mahalanobis distance reduces to the Euclidean distance. $\Sigma^i_{jk} = \text{diag}\left(\sigma^i_{jk}\right)$ is pretty common though, and is known as the normalized Euclidean distance.
Either way, once our distance function has been chosen, we can compute $a^i_j$ via
$$a^i_j = \sum_k w^i_{jk}\, \rho\left(z^i_{jk}\right)$$
In these networks they choose to multiply by weights after applying the activation function for reasons.
This describes how to make a multi-layer Radial Basis Function network; however, usually there is only one of these neurons, and its output is the output of the network. It's drawn as multiple neurons because each mean vector $\mu^i_{jk}$ and each standard deviation vector $\sigma^i_{jk}$ of that single neuron is considered one "neuron", and then after all of these outputs there is another layer that takes the sum of those computed values times the weights, just like $a^i_j$ above. Splitting it into two layers with a "summing" vector at the end seems odd to me, but it's what they do.
Also see here.
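As a rough sketch (NumPy; the parameter names are mine), a single RBF neuron using the normalized Euclidean distance and the Gaussian $\rho$ defined below might look like:

```python
import numpy as np

def gaussian_rho(z):
    return np.exp(-0.5 * z ** 2)

def rbf_neuron(w, mu, sigma, a_prev):
    # w: (K,) weights; mu, sigma: (K, L) mean and standard deviation vectors,
    # one per node k of the previous layer; a_prev: (L,) previous activations.
    z = np.sqrt(np.sum((a_prev - mu) ** 2 / sigma, axis=1))  # one z_jk per k
    return np.sum(w * gaussian_rho(z))                       # sum_k w_jk rho(z_jk)
```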
Radial Basis Function Network Activation Functions
Gaussian
$$\rho\left(z^i_{jk}\right) = \exp\left(-\frac{1}{2}\left(z^i_{jk}\right)^2\right)$$
Multiquadratic
Choose some point $(x, y)$. Then we compute the distance from $(z^i_{jk}, 0)$ to $(x, y)$:
$$\rho\left(z^i_{jk}\right) = \sqrt{\left(z^i_{jk} - x\right)^2 + y^2}$$
This is from Wikipedia. It isn't bounded, and can be any positive value, though I am wondering if there is a way to normalize it.
When $y = 0$, this is equivalent to Absolute (with a horizontal shift of $x$).
Inverse Multiquadratic
Same as Multiquadratic, except flipped:
$$\rho\left(z^i_{jk}\right) = \frac{1}{\sqrt{\left(z^i_{jk} - x\right)^2 + y^2}}$$