Matrix form of backpropagation with batch normalization


Batch normalization has been credited with substantial performance improvements in deep neural nets. A lot of material on the internet shows how to implement it on an activation-by-activation basis. I've already implemented backprop using matrix algebra, and given that I'm working in high-level languages (while relying on Rcpp, and eventually GPUs, for dense matrix multiplication), ripping everything out and resorting to for-loops would probably slow my code substantially, in addition to being a huge pain.

The batch normalization function is

$$b(x_p) = \gamma\left(x_p - \mu_{x_p}\right)\sigma^{-1}_{x_p} + \beta$$

where

  • $x_p$ is the $p$th node, before it gets activated
  • $\gamma$ and $\beta$ are scalar parameters
  • $\mu_{x_p}$ and $\sigma_{x_p}$ are the mean and SD of $x_p$. (Note that the square root of the variance plus a fudge factor is normally used -- let's assume nonzero elements for compactness.)
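For concreteness, a minimal R sketch of this per-activation formula (my own toy example; `x`, `gamma_p`, and `beta_p` are made-up names):

# one node's pre-activations over a batch, with scalar parameters
x <- rnorm(10)
gamma_p <- 1.5
beta_p  <- -0.5
b_x <- gamma_p * (x - mean(x)) / sd(x) + beta_p   # b(x_p) as defined above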

In matrix form, batch normalization for an entire layer would be

$$b(\mathbf{X}) = \left(\gamma\otimes\mathbf{1}_N\right)\odot\left(\mathbf{X} - \mu_{\mathbf{X}}\right)\odot\sigma^{-1}_{\mathbf{X}} + \left(\beta\otimes\mathbf{1}_N\right)$$

where

  • $\mathbf{X}$ is $N\times p$
  • $\mathbf{1}_N$ is a column vector of ones
  • $\gamma$ and $\beta$ are now row $p$-vectors of the per-layer normalization parameters
  • $\mu_{\mathbf{X}}$ and $\sigma_{\mathbf{X}}$ are $N\times p$ matrices, where each column is an $N$-vector of columnwise means and standard deviations
  • $\otimes$ is the Kronecker product and $\odot$ is the elementwise (Hadamard) product
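To make the broadcasting concrete, here is a small self-contained R sketch (my own illustration; `Xpre`, `gam`, and `bet` are made-up names) that builds $b(\mathbf{X})$ with the Kronecker form and checks it against plain column-by-column scaling:

set.seed(42)
N <- 6; p <- 3
Xpre <- matrix(rnorm(N * p), N)          # pre-activations for one layer
gam  <- c(0.5, 1.0, 2.0)                 # row p-vector gamma
bet  <- c(-1, 0, 1)                      # row p-vector beta
ones <- matrix(1, N, 1)                  # the column vector 1_N

# columnwise means and sds, replicated into N x p matrices
mu  <- matrix(colMeans(Xpre), N, p, byrow = TRUE)
sig <- matrix(apply(Xpre, 2, sd), N, p, byrow = TRUE)

# Kronecker broadcast of gamma and beta, then elementwise (Hadamard) operations
bX <- (t(gam) %x% ones) * (Xpre - mu) / sig + (t(bet) %x% ones)

# the same thing, one node (column) at a time, as in the scalar definition
bX_cols <- sapply(1:p, function(j)
  gam[j] * (Xpre[, j] - mean(Xpre[, j])) / sd(Xpre[, j]) + bet[j])

all.equal(bX, bX_cols)   # TRUE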

A very simple one-layer neural net with no batch normalization and a continuous outcome is

$$y = a\left(X\Gamma_1\right)\Gamma_2 + \epsilon$$

where

  • $\Gamma_1$ is $p_1\times p_2$
  • $\Gamma_2$ is $p_2\times 1$
  • $a(\cdot)$ is the activation function

If the loss is $R = N^{-1}\sum(y - \hat{y})^2$, then the gradients (ignoring the constant $N^{-1}$ factor) are

$$\frac{\partial R}{\partial \Gamma_2} = -2V^T\hat{\epsilon} \qquad\qquad \frac{\partial R}{\partial \Gamma_1} = X^T\left(a'\left(X\Gamma_1\right)\odot -2\hat{\epsilon}\,\Gamma_2^T\right)$$

where

  • $V = a\left(X\Gamma_1\right)$
  • $\hat{\epsilon} = y - \hat{y}$
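As a sanity check on these two expressions, here is a small self-contained R sketch (my own toy data; `relu`, `relu_d`, and `num_grad` are made-up names, separate from the setup further down) that compares both gradients against central finite differences:

set.seed(2)
N <- 10; p1 <- 7; p2 <- 4
relu   <- function(v) pmax(v, 0)       # activation
relu_d <- function(v) (v > 0) * 1      # its derivative
X  <- matrix(rnorm(N * p1), N)
G1 <- matrix(rnorm(p1 * p2), p1)
G2 <- rnorm(p2)
y  <- relu(X %*% G1) %*% G2 + rnorm(N)

# sum-of-squares loss, constant 1/N dropped
loss <- function(G1, G2) sum((y - relu(X %*% G1) %*% G2)^2)

eps <- y - relu(X %*% G1) %*% G2       # epsilon-hat
V   <- relu(X %*% G1)
dG2 <- -2 * t(V) %*% eps
dG1 <- t(X) %*% (relu_d(X %*% G1) * (-2 * eps %*% t(G2)))

num_grad <- function(f, x, h = 1e-6) {
  g <- x
  for (k in seq_along(x)) {
    xp <- xm <- x; xp[k] <- xp[k] + h; xm[k] <- xm[k] - h
    g[k] <- (f(xp) - f(xm)) / (2 * h)
  }
  g
}
all.equal(c(dG2), c(num_grad(function(m) loss(G1, m), G2)), tolerance = 1e-6)
all.equal(c(dG1), c(num_grad(function(m) loss(m, G2), G1)), tolerance = 1e-6)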

Under batch normalization, the net becomes

$$y = a\left(b\left(X\Gamma_1\right)\right)\Gamma_2$$
or
$$y = a\Big(\left(\gamma\otimes\mathbf{1}_N\right)\odot\left(X\Gamma_1 - \mu_{X\Gamma_1}\right)\odot\sigma^{-1}_{X\Gamma_1} + \left(\beta\otimes\mathbf{1}_N\right)\Big)\Gamma_2$$
I have no idea how to compute the derivatives of Hadamard and Kronecker products. On the subject of Kronecker products, the literature gets fairly arcane.

Is there a practical way of computing $\partial R/\partial\gamma$, $\partial R/\partial\beta$, and $\partial R/\partial\Gamma_1$ within the matrix framework? A simple expression, without resorting to node-by-node computation?

Update 1:

I've figured out $\partial R/\partial\beta$ -- sort of. It is:

$$\mathbf{1}_N^T\left(a'\left(b\left(X\Gamma_1\right)\right)\odot -2\hat{\epsilon}\,\Gamma_2^T\right)$$
Some R code demonstrates that this is equivalent to the looping way to do it. First set up the fake data:
set.seed(1)
library(dplyr)
library(foreach)

# numbers of obs, input variables, and hidden units
N <- 10
p1 <- 7
p2 <- 4

# ReLU activation
a <- function (v) {
  v[v < 0] <- 0
  v
}
# ReLU derivative (note: because negatives are zeroed first, the second
# assignment then sets every element to 1)
ap <- function (v) {
  v[v < 0] <- 0
  v[v >= 0] <- 1
  v
}

# parameters
G1 <- matrix(rnorm(p1*p2), nrow = p1)
G2 <- rnorm(p2)
gamma <- 1:p2+1
beta <- (1:p2+1)*-1
# error
u <- rnorm(10)

# matrix batch norm function
b <- function(x, bet = beta, gam = gamma){
  xs <- scale(x)
  gk <- t(matrix(gam)) %x% matrix(rep(1, N))
  bk <- t(matrix(bet)) %x% matrix(rep(1, N))
  gk*xs+bk
}
# activation-wise batch norm function
bi <- function(x, i){
  xs <- scale(x)
  gk <- t(matrix(gamma[i]))
  bk <- t(matrix(beta[i]))
  suppressWarnings(gk*xs[,i]+bk)
}

X <- round(runif(N*p1, -5, 5)) %>% matrix(nrow = N)
# the neural net
y <- a(b(X %*% G1)) %*% G2 + u

Then compute derivatives:

# drdbeta -- the matrix way
drdb <- matrix(rep(1, N*1), nrow = 1) %*% (-2*u %*% t(G2) * ap(b(X%*%G1)))
drdb
           [,1]      [,2]    [,3]        [,4]
[1,] -0.4460901 0.3899186 1.26758 -0.09589582
# the looping way
foreach(i = 1:4, .combine = c) %do%{
  sum(-2*u*matrix(ap(bi(X[,i, drop = FALSE]%*%G1[i,], i)))*G2[i])
}
[1] -0.44609015  0.38991862  1.26758024 -0.09589582

They match. But I'm still confused, because I don't really know why this works. The MatCalc notes referenced by @Mark L. Stone say that the derivative of $\beta\otimes\mathbf{1}_N$ should be

$$\frac{\partial\, A\otimes B}{\partial\, A} = \left(I_{nq}\otimes T_{mp}\right)\left(I_n\otimes\mathrm{vec}(B)\otimes I_m\right)$$
where the subscripts $m, n$ and $p, q$ are the dimensions of $A$ and $B$, and $T$ is the commutation matrix, which is just 1 here because both inputs are vectors. I try this and get a result that doesn't seem helpful:
# playing with the kroneker derivative rule
A <- t(matrix(beta)) 
B <- matrix(rep(1, N))
diag(rep(1, ncol(A) *ncol(B))) %*% diag(rep(1, ncol(A))) %x% (B) %x% diag(nrow(A))
     [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    1    0    0    0
 [snip]
[13,]    0    1    0    0
[14,]    0    1    0    0
 [snip]
[28,]    0    0    1    0
[29,]    0    0    1    0
 [snip]
[39,]    0    0    0    1
[40,]    0    0    0    1

This isn't conformable. Clearly I'm not understanding those Kronecker derivative rules. Help with those would be great. I'm still totally stuck on the other derivatives, for $\gamma$ and $\Gamma_1$ -- those are harder because they don't enter additively like $\beta\otimes\mathbf{1}_N$ does.
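In case it helps anyone else, the rule itself can be checked numerically. Here is a self-contained sketch of my own (the `commutation()` helper is hand-rolled; `m`, `n`, `p`, `q` are arbitrary test dimensions): it builds the exact Jacobian of $\mathrm{vec}(A\otimes B)$ in $\mathrm{vec}(A)$ column by column and compares it with $\left(I_{nq}\otimes T_{mp}\right)\left(I_n\otimes\mathrm{vec}(B)\otimes I_m\right)$:

# commutation matrix: K %*% c(X) equals c(t(X)) for an m x n matrix X
commutation <- function(m, n) {
  K <- matrix(0, m * n, m * n)
  for (i in 1:m) for (j in 1:n) K[(i - 1) * n + j, (j - 1) * m + i] <- 1
  K
}

set.seed(4)
m <- 2; n <- 3; p <- 4; q <- 2
A <- matrix(rnorm(m * n), m)
B <- matrix(rnorm(p * q), p)

# exact Jacobian of vec(A %x% B) in vec(A), column by column (A %x% B is linear in A)
J_direct <- sapply(1:(m * n), function(k) {
  E <- matrix(0, m, n); E[k] <- 1
  c(E %x% B)
})

# the rule from the notes, with T_mp the commutation matrix K_{m,p}
J_rule <- (diag(n * q) %x% commutation(m, p)) %*%
  (diag(n) %x% matrix(c(B)) %x% diag(m))

all.equal(J_direct, J_rule)   # TRUE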

Update 2

Reading textbooks, I'm fairly sure that $\partial R/\partial\Gamma_1$ and $\partial R/\partial\gamma$ will require use of the $\mathrm{vec}()$ operator. But I'm apparently unable to follow the derivations well enough to translate them into code. For example, $\partial R/\partial\Gamma_1$ is going to involve taking the derivative of $w\odot X\Gamma_1$ with respect to $\Gamma_1$, where $w \equiv \left(\gamma\otimes\mathbf{1}_N\right)\odot\sigma^{-1}_{X\Gamma_1}$ (which we can treat as a constant matrix for the moment).

My instinct is to simply say "the answer is $w\odot X$", but that obviously doesn't work because $w$ isn't conformable with $X$.

I know that

$$\partial\left(A\odot B\right) = \partial A\odot B + A\odot\partial B$$

and from this, that

$$\frac{\partial\,\mathrm{vec}\left(w\odot X\Gamma_1\right)}{\partial\,\mathrm{vec}\left(\Gamma_1\right)^T} = \mathrm{vec}\left(X\Gamma_1\right)\odot I\,\frac{\partial\,\mathrm{vec}(w)}{\partial\,\mathrm{vec}\left(\Gamma_1\right)^T} + \mathrm{vec}(w)\odot I\,\frac{\partial\,\mathrm{vec}\left(X\Gamma_1\right)}{\partial\,\mathrm{vec}\left(\Gamma_1\right)^T}$$
But I'm uncertain how to evaluate this, let alone code it.
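One piece that is easy to evaluate, at least: for fixed $A$, $\partial\,\mathrm{vec}(A\odot B)/\partial\,\mathrm{vec}(B)^T = \mathrm{diag}\left(\mathrm{vec}(A)\right)$, which is what a term like $\mathrm{vec}(X\Gamma_1)\odot I$ collapses to if it is read as a diagonal matrix. A tiny self-contained check (my own, with a made-up $A$):

set.seed(3)
m <- 3; n <- 2
A <- matrix(rnorm(m * n), m)

# exact Jacobian of vec(A * B) in vec(B), column by column (the map is linear in B)
J <- sapply(1:(m * n), function(k) {
  E <- matrix(0, m, n); E[k] <- 1
  c(A * E)
})
all.equal(J, diag(c(A)))   # TRUE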

Update 3

Making progress here. I woke up at 2AM last night with this idea. Math is not good for sleep.

Here is $\partial R/\partial\Gamma_1$, after some notational sugar:

  • $w \equiv \left(\gamma\otimes\mathbf{1}_N\right)\odot\sigma^{-1}_{X\Gamma_1}$
  • $\text{"stub"} \equiv a'\left(b\left(X\Gamma_1\right)\right)\odot -2\hat{\epsilon}\,\Gamma_2^T$

Here's what you have after you get to the end of the chain rule:

$$\frac{\partial R}{\partial\Gamma_1} = \frac{\partial\,w\odot X\Gamma_1}{\partial\,\Gamma_1}\left(\text{"stub"}\right)$$
Start by doing this the looping way -- $i$ and $j$ will subscript columns and $I$ is a conformable identity matrix:
$$\frac{\partial R}{\partial\Gamma_{ij}} = \left(w_j\odot X_i\right)^T\left(\text{"stub"}_j\right)$$
$$\frac{\partial R}{\partial\Gamma_{ij}} = \left(\left(I\odot w_j\right)X_i\right)^T\left(\text{"stub"}_j\right)$$
$$\frac{\partial R}{\partial\Gamma_{ij}} = X_i^T\left(I\odot w_j\right)\left(\text{"stub"}_j\right)$$
tl;dr you're basically pre-multiplying the stub by the batchnorm scale factors. This should be equivalent to:
RΓ=XT("stub"w)

And, in fact it is:

stub <- (-2*u %*% t(G2) * ap(b(X%*%G1)))
w <- t(matrix(gamma)) %x% matrix(rep(1, N)) * (apply(X%*%G1, 2, sd) %>% t %x% matrix(rep(1, N)))
drdG1 <- t(X) %*% (stub*w)

loop_drdG1 <- drdG1*NA
for (i in 1:7){
  for (j in 1:4){
    loop_drdG1[i,j] <- t(X[,i]) %*% diag(w[,j]) %*% (stub[,j])
  }
}

> loop_drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965
> drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965

Update 4

Here, I think, is $\partial R/\partial\gamma$. First

  • $\widetilde{X\Gamma} \equiv \left(X\Gamma - \mu_{X\Gamma}\right)\odot\sigma^{-1}_{X\Gamma}$
  • $\tilde{\gamma} \equiv \gamma\otimes\mathbf{1}_N$

Similar to before, the chain rule gets you as far as

$$\frac{\partial R}{\partial\tilde{\gamma}} = \frac{\partial\,\tilde{\gamma}\odot\widetilde{X\Gamma}}{\partial\,\tilde{\gamma}}\left(\text{"stub"}\right)$$
Looping gives you
$$\frac{\partial R}{\partial\tilde{\gamma}_i} = \left(\widetilde{X\Gamma}\right)_i^T\left(I\odot\tilde{\gamma}_i\right)\left(\text{"stub"}_i\right)$$
Which, like before, is basically pre-multiplying the stub. It should therefore be equivalent to:
$$\frac{\partial R}{\partial\tilde{\gamma}} = \left(\widetilde{X\Gamma}\right)^T\left(\text{"stub"}\odot\tilde{\gamma}\right)$$

It sort of matches:

drdg <- t(scale(X %*% G1)) %*% (stub * t(matrix(gamma)) %x% matrix(rep(1, N)))

loop_drdg <- foreach(i = 1:4, .combine = c) %do% {
  t(scale(X %*% G1)[,i]) %*% (stub[,i, drop = F] * gamma[i])  
}

> drdg
           [,1]      [,2]       [,3]       [,4]
[1,]  0.8580574 -1.125017  -4.876398  0.4611406
[2,] -4.5463304  5.960787  25.837103 -2.4433071
[3,]  2.0706860 -2.714919 -11.767849  1.1128364
[4,] -8.5641868 11.228681  48.670853 -4.6025996
> loop_drdg
[1]   0.8580574   5.9607870 -11.7678486  -4.6025996

The diagonal of the first is the same as the vector from the second. But since the derivative here is with respect to a matrix -- albeit one with a particular structure -- shouldn't the output be a similar matrix with the same structure? Should I just take the diagonal of the matrix approach and treat it as $\partial R/\partial\gamma$? I'm not sure.
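A quick programmatic restatement of that observation, reusing the objects above:

all.equal(diag(drdg), loop_drdg)   # TRUE: the diagonal of the matrix version is the looped vector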

It seems that I have answered my own question but I am unsure whether I am correct. At this point I will accept an answer that rigorously proves (or disproves) what I've sort of hacked together.

while(not_answered){
  print("Bueller?")
  Sys.sleep(1)
}

Chapter 9, section 14 of "Matrix Differential Calculus with Applications in Statistics and Econometrics" by Magnus and Neudecker, 3rd edition (janmagnus.nl/misc/mdc2007-3rdedition), covers differentials of Kronecker products and concludes with an exercise on the differential of the Hadamard product. "Notes on Matrix Calculus" by Paul L. Fackler (www4.ncsu.edu/~pfackler/MatCalc.pdf) has a lot of material on differentiating Kronecker products.
Mark L. Stone

Thanks for the references. I've found those MatCalc notes before, but they don't cover the Hadamard product, and anyway I'm never certain whether a rule from non-matrix calculus does or doesn't apply to the matrix case. Product rules, chain rules, etc. I'll look into the book. I'd accept an answer that points me to all of the ingredients I need to pencil it out myself...
generic_user

Why are you doing this? Why not use frameworks such as Keras/TensorFlow? It's a waste of productive time to implement these low-level algorithms that you could spend solving actual problems.
Aksakal

More precisely, I'm fitting networks that exploit known parametric structure -- both in terms of linear-in-parameters representations of input data, as well as longitudinal/panel structure. Established frameworks are so heavily optimized as to be beyond my ability to hack/modify. Plus math is helpful generally. Plenty of codemonkeys have no idea what they're doing. Likewise learning enough Rcpp to implement it efficiently is useful.
generic_user

@MarkL.Stone not only is it theoretically sound, it's practically easy! A more or less mechanical process! &%#$!
generic_user

Answers:



Not a complete answer, but to demonstrate what I suggested in my comment: if

$$b(X) = \left(X - e_N\mu_X^T\right)\Gamma\,\Sigma_X^{-1/2} + e_N\beta^T$$

where $\Gamma = \mathrm{diag}(\gamma)$, $\Sigma_X^{-1/2} = \mathrm{diag}\left(\sigma_{X_1}^{-1},\sigma_{X_2}^{-1},\ldots\right)$ and $e_N$ is a vector of ones, then by the chain rule

$$\nabla_\beta R = \left[-2\hat{\epsilon}^T\left(\Gamma_2^T\otimes I\right)J_X(a)\left(I\otimes e_N\right)\right]^T$$

Noting that $-2\hat{\epsilon}^T\left(\Gamma_2^T\otimes I\right) = \mathrm{vec}\left(-2\hat{\epsilon}\,\Gamma_2^T\right)^T$ and $J_X(a) = \mathrm{diag}\left(\mathrm{vec}\left(a'\left(b\left(X\Gamma_1\right)\right)\right)\right)$, we see that

$$\nabla_\beta R = \left(I\otimes e_N^T\right)\mathrm{vec}\left(a'\left(b\left(X\Gamma_1\right)\right)\odot -2\hat{\epsilon}\,\Gamma_2^T\right) = e_N^T\left(a'\left(b\left(X\Gamma_1\right)\right)\odot -2\hat{\epsilon}\,\Gamma_2^T\right)$$

via the identity $\mathrm{vec}(AXB) = \left(B^T\otimes A\right)\mathrm{vec}(X)$. Similarly,

$$\nabla_\gamma R = \left[-2\hat{\epsilon}^T\left(\Gamma_2^T\otimes I\right)J_X(a)\left(\Sigma_{X\Gamma_1}^{-1/2}\otimes\left(X\Gamma_1 - e_N\mu_{X\Gamma_1}^T\right)\right)K\right]^T = K^T\mathrm{vec}\left(\left(X\Gamma_1 - e_N\mu_{X\Gamma_1}^T\right)^T W\,\Sigma_{X\Gamma_1}^{-1/2}\right) = \mathrm{diag}\left(\left(X\Gamma_1 - e_N\mu_{X\Gamma_1}^T\right)^T W\,\Sigma_{X\Gamma_1}^{-1/2}\right)$$
where $W = a'\left(b\left(X\Gamma_1\right)\right)\odot -2\hat{\epsilon}\,\Gamma_2^T$ (the "stub") and $K$ is an $Np\times p$ binary matrix that selects the columns of the Kronecker product corresponding to the diagonal elements of a square matrix. This follows from the fact that $d\Gamma_{ij} = 0$ for $i \neq j$. Unlike the first gradient, this expression is not equivalent to the one you derived: since $b$ is a linear function of $\gamma_i$, there should not be a factor of $\gamma_i$ in the gradient. I leave the gradient of $\Gamma_1$ to the OP, but I will say that deriving it with a fixed $w$ creates the "explosion" the writers of the article seek to avoid. In practice you will also need to find the Jacobians of $\Sigma_X$ and $\mu_X$ with respect to $X$ and use the product rule.
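For what it's worth, here is a rough numerical check of the $\nabla_\gamma R$ expression above -- my own sketch, not part of the answer. It reuses `X`, `G1`, `G2`, `y`, `beta`, `gamma`, `a` and `b` from the question's code, recomputes the ReLU derivative as an explicit 0/1 indicator, standardizes with `scale()` (folding the diagonal $\Sigma^{-1/2}_{X\Gamma_1}$ into it is my simplification), and drops the constant $N^{-1}$, as that code also does. If the closed form is right, the comparison should come out essentially equal (up to finite-difference noise; the ReLU kinks could in principle spoil it at isolated points):

eps_hat <- y - a(b(X %*% G1)) %*% G2                     # residual
W       <- (-2 * eps_hat %*% t(G2)) * ((b(X %*% G1) >= 0) * 1)

# closed form: diag((X G1 - e mu')' W Sigma^{-1/2}) = diag(scale(X G1)' W)
drdg_closed <- diag(t(scale(X %*% G1)) %*% W)

# central finite differences of the loss in gamma
r_gamma <- function(g) sum((y - a(b(X %*% G1, bet = beta, gam = g)) %*% G2)^2)
h <- 1e-6
drdg_fd <- sapply(seq_along(gamma), function(j) {
  gp <- gm <- gamma
  gp[j] <- gp[j] + h
  gm[j] <- gm[j] - h
  (r_gamma(gp) - r_gamma(gm)) / (2 * h)
})

all.equal(drdg_closed, drdg_fd, tolerance = 1e-4)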