Backpropagation with Softmax / Cross-Entropy


40

I am trying to understand how backpropagation works for a softmax/cross-entropy output layer.

The cross-entropy error function is

$$E(t,o) = -\sum_j t_j \log o_j$$

with $t$ and $o$ as the target and output at neuron $j$, respectively. The sum is over each neuron in the output layer. $o_j$ itself is the result of the softmax function:

$$o_j = \mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_j e^{z_j}}$$

Again, the sum is over each neuron in the output layer, and $z_j$ is the input to neuron $j$:

$$z_j = \sum_i w_{ij} o_i + b$$

That is the sum over all neurons in the previous layer, with their corresponding outputs $o_i$ and weights $w_{ij}$ towards neuron $j$, plus a bias $b$.
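For concreteness, here is a minimal NumPy sketch of the forward pass just described (the names x, w, b, t and the layer sizes are my own illustrative choices; the question writes the previous-layer outputs as $o_i$ and uses a single bias $b$, while the sketch uses one bias per output neuron):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability (does not change the result)
    e = np.exp(z)
    return e / e.sum()

x = np.array([0.5, -1.2, 2.0])     # outputs o_i of the previous layer (3 neurons)
w = np.random.randn(3, 4)          # w[i, j] connects previous neuron i to output neuron j
b = np.zeros(4)
t = np.array([0.0, 1.0, 0.0, 0.0]) # one-hot target

z = x @ w + b                      # z_j = sum_i w_ij * x_i + b_j
o = softmax(z)                     # o_j
E = -np.sum(t * np.log(o))         # cross-entropy error
print(E)
```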

Now, in order to update a weight $w_{ij}$ that connects a neuron $j$ in the output layer with a neuron $i$ in the previous layer, I need to calculate the partial derivative of the error function using the chain rule:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial z_j} \frac{\partial z_j}{\partial w_{ij}}$$

with $z_j$ as the input to neuron $j$.

The last term is quite simple. Since there is only one weight between $i$ and $j$, the derivative is:

$$\frac{\partial z_j}{\partial w_{ij}} = o_i$$

The first term is the derivative of the error function with respect to the output $o_j$:

$$\frac{\partial E}{\partial o_j} = \frac{-t_j}{o_j}$$

The middle term, the derivative of the softmax function with respect to its input $z_j$, is harder:

$$\frac{\partial o_j}{\partial z_j} = \frac{\partial}{\partial z_j} \frac{e^{z_j}}{\sum_j e^{z_j}}$$

Let's say we have three output neurons corresponding to the classes $a, b, c$. Then $o_b = \mathrm{softmax}(b)$ is:

$$o_b = \frac{e^{z_b}}{\sum e^z} = \frac{e^{z_b}}{e^{z_a} + e^{z_b} + e^{z_c}}$$

and its derivative, using the quotient rule, is:

$$\frac{\partial o_b}{\partial z_b} = \frac{e^{z_b}\sum e^z - (e^{z_b})^2}{(\sum e^z)^2} = \frac{e^{z_b}}{\sum e^z} - \frac{(e^{z_b})^2}{(\sum e^z)^2}$$
$$= \mathrm{softmax}(b) - \mathrm{softmax}^2(b) = o_b - o_b^2 = o_b(1 - o_b)$$

Back to the middle term for backpropagation, this means:

$$\frac{\partial o_j}{\partial z_j} = o_j(1 - o_j)$$

Putting it all together, I get

$$\frac{\partial E}{\partial w_{ij}} = \frac{-t_j}{o_j} \cdot o_j(1 - o_j) \cdot o_i = -t_j(1 - o_j) \cdot o_i$$

which means that, if the target for this class is $t_j = 0$, then I will never update the weight for it. That does not sound right.
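To make that suspicion concrete, here is a small finite-difference check (a sketch under my own toy numbers; x stands for the previous-layer outputs that I call $o_i$ above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def error(w):
    o = softmax(x @ w + b)
    return -np.sum(t * np.log(o))

rng = np.random.default_rng(0)
x = rng.normal(size=3)             # previous-layer outputs (the o_i in the question)
w = rng.normal(size=(3, 4))
b = np.zeros(4)
t = np.array([0.0, 1.0, 0.0, 0.0])

i, j, eps = 0, 2, 1e-6             # pick a weight whose output neuron has t_j = 0
w_plus, w_minus = w.copy(), w.copy()
w_plus[i, j] += eps
w_minus[i, j] -= eps
numeric = (error(w_plus) - error(w_minus)) / (2 * eps)

o = softmax(x @ w + b)
question_formula = -t[j] * (1 - o[j]) * x[i]   # the result derived above: 0 whenever t_j = 0
print(numeric, question_formula)               # these disagree, confirming the suspicion
# the correct value turns out to be x[i] * (o[j] - t[j]), as the answers below show
```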

Investigating this, I found that people give two variants for the derivative of the softmax, one for $i = j$ and another for $i \ne j$, as here or here.

But I can't make sense of that. Also, I am not even sure whether this is the cause of my error, which is why I am posting all of my calculations. I hope someone can clarify where I am missing something or going wrong.


The links you provided calculate the derivative relative to the inputs, while you are calculating the derivative relative to the weights.
Jenkar

Answers:


35

Note: I am not a backprop expert, but now, having read up a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (including a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!

I noticed a couple of inconsistencies in your derivation. I do not really do neural networks, so what follows may be wrong. However, here is how I would go about the problem.

First, you have to take account of the summation in $E$, and you cannot assume each term depends on only one weight. So, taking the gradient of $E$ with respect to component $k$ of $z$, we have

$$E = -\sum_j t_j \log o_j \implies \frac{\partial E}{\partial z_k} = -\sum_j t_j \frac{\partial \log o_j}{\partial z_k}$$

Then, expressing $o_j$ as

$$o_j = \frac{1}{\Omega} e^{z_j}, \quad \Omega = \sum_i e^{z_i} \implies \log o_j = z_j - \log\Omega$$

we have

$$\frac{\partial \log o_j}{\partial z_k} = \delta_{jk} - \frac{1}{\Omega}\frac{\partial \Omega}{\partial z_k}$$

where $\delta_{jk}$ is the Kronecker delta. Then the gradient of the softmax denominator is

$$\frac{\partial \Omega}{\partial z_k} = \sum_i e^{z_i}\delta_{ik} = e^{z_k}$$

which gives

$$\frac{\partial \log o_j}{\partial z_k} = \delta_{jk} - o_k$$

or, expanding the log,

$$\frac{\partial o_j}{\partial z_k} = o_j(\delta_{jk} - o_k)$$

Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($= 1$ only when $k = j$).

So the gradient with respect to $z$ is

$$\frac{\partial E}{\partial z_k} = \sum_j t_j(o_k - \delta_{jk}) = o_k\left(\sum_j t_j\right) - t_k \implies \frac{\partial E}{\partial z_k} = o_k\tau - t_k$$

where $\tau = \sum_j t_j$ is constant (for a given target vector $t$).

This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is one-hot, we have $\tau = 1$ (as noted in your first link).
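A quick numeric sanity check of this expression (my own sketch): it builds the full softmax Jacobian $\frac{\partial o_j}{\partial z_k} = o_j(\delta_{jk} - o_k)$, chains it with $\frac{\partial E}{\partial o_j} = -t_j/o_j$, and compares the result against $o_k\tau - t_k$ for a target that is deliberately not one-hot:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
z = rng.normal(size=4)
t = rng.uniform(size=4)                 # deliberately not one-hot, so tau != 1
o = softmax(z)
tau = t.sum()

# dE/do_j = -t_j / o_j, and the softmax Jacobian do_j/dz_k = o_j * (delta_jk - o_k)
dE_do = -t / o
jacobian = np.diag(o) - np.outer(o, o)  # J[j, k] = o_j * (delta_jk - o_k)
dE_dz_chain = jacobian.T @ dE_do        # sum_j dE/do_j * do_j/dz_k

dE_dz_closed = o * tau - t              # the closed form derived above
print(np.allclose(dE_dz_chain, dE_dz_closed))   # True
```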

A second inconsistency, if I understand correctly, is that the "o" that is the input to $z$ seems unlikely to be the "o" that is the output of the softmax. I would think that it makes more sense that this is actually "further back" in the network architecture.

Calling this vector $y$, we then have

$$z_k = \sum_i w_{ik} y_i + b_k \implies \frac{\partial z_k}{\partial w_{pq}} = \sum_i y_i \frac{\partial w_{ik}}{\partial w_{pq}} = \sum_i y_i \delta_{ip}\delta_{kq} = \delta_{kq} y_p$$

Finally, to get the gradient of $E$ with respect to the weight matrix $w$, we use the chain rule

$$\frac{\partial E}{\partial w_{pq}} = \sum_k \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}} = \sum_k (o_k\tau - t_k)\delta_{kq} y_p = y_p(o_q\tau - t_q)$$

giving the final expression (assuming a one-hot $t$, i.e. $\tau = 1$)

$$\frac{\partial E}{\partial w_{ij}} = y_i(o_j - t_j)$$

where $y$ is the input on the lowest level (of your example).

So this shows a second difference from your result: the "$o_i$" should presumably be from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).

Hopefully this helps. Does this result seem more consistent?
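In practice this is exactly the gradient one would plug into a training step. A minimal sketch of one-layer gradient descent using the result above (toy sizes and learning rate are my own; the bias gradient $o_j - t_j$ follows from the same argument with $\partial z_j/\partial b_j = 1$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
y = rng.normal(size=3)                 # previous-layer outputs (the "y" above)
w = 0.1 * rng.normal(size=(3, 4))      # w[i, j]: weight from y_i to output neuron j
b = np.zeros(4)
t = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot target
lr = 0.1

def cross_entropy(w, b):
    o = softmax(y @ w + b)
    return -np.sum(t * np.log(o))

print(cross_entropy(w, b))             # error before training
for step in range(200):
    o = softmax(y @ w + b)             # forward pass
    grad_w = np.outer(y, o - t)        # dE/dw_ij = y_i * (o_j - t_j)
    grad_b = o - t                     # dE/db_j = o_j - t_j, since dz_j/db_j = 1
    w -= lr * grad_w                   # gradient-descent update
    b -= lr * grad_b
print(cross_entropy(w, b))             # error after training: much smaller
```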

Update: In response to a query from the OP in the comments, here is an expansion of the first step. First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So to fully write out the gradient with the full chain rule, we have

$$\frac{\partial E}{\partial w_{pq}} = \sum_i \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial w_{pq}}$$
and
$$\frac{\partial o_i}{\partial w_{pq}} = \sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}$$
so
$$\frac{\partial E}{\partial w_{pq}} = \sum_i \left[\frac{\partial E}{\partial o_i}\left(\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}\right)\right]$$
In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.
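To illustrate that advice, here is a hedged sketch that evaluates the weight gradient with the full summations written out (explicit loops over every index, Kronecker deltas included) and checks it against the compact result $y_p(o_q\tau - t_q)$ derived above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
y = rng.normal(size=3)                 # previous-layer outputs (the answer's y)
W = rng.normal(size=(3, 4))            # W[i, k]: weight from y_i to output neuron k
b = np.zeros(4)
t = np.array([0.0, 0.0, 1.0, 0.0])     # one-hot, so tau = 1

z = y @ W + b
o = softmax(z)
tau = t.sum()

n_in, n_out = W.shape
grad_full = np.zeros_like(W)
for p in range(n_in):
    for q in range(n_out):
        # dE/dw_pq = sum_i dE/do_i * sum_k do_i/dz_k * dz_k/dw_pq
        total = 0.0
        for i in range(n_out):
            for k in range(n_out):
                dE_doi = -t[i] / o[i]
                doi_dzk = o[i] * ((i == k) - o[k])      # o_i * (delta_ik - o_k)
                dzk_dwpq = y[p] * (k == q)              # delta_kq * y_p
                total += dE_doi * doi_dzk * dzk_dwpq
        grad_full[p, q] = total

grad_compact = np.outer(y, o * tau - t)                 # y_p * (o_q * tau - t_q)
print(np.allclose(grad_full, grad_compact))             # True
```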

I am not certain how the "Backprop/AutoDiff" community does these problems, but I find any time I try to take shortcuts, I am liable to make errors. So I end up doing as here, writing everything out in terms of summations with full subscripting, and always introducing new subscripts for every derivative. (Similar to my answer here ... I hope I am at least giving correct results in the end!)
GeoMatt22

I personally find that your writing everything out makes it much easier to follow. The results look correct to me.
Jenkar

Although I'm still trying to fully understand each of your steps, I got some valuable insights that helped me with the overall picture. I guess I need to read more on the topic of derivatives and sums. But taking your advice to take account of the summation in $E$, I came up with this:
micha

For two outputs $o_{j1} = \frac{e^{z_{j1}}}{\Omega}$ and $o_{j2} = \frac{e^{z_{j2}}}{\Omega}$ with
$$\Omega = e^{z_{j1}} + e^{z_{j2}}$$
the cross-entropy error is
$$E = -(t_1\log o_{j1} + t_2\log o_{j2}) = -\left(t_1(z_{j1} - \log\Omega) + t_2(z_{j2} - \log\Omega)\right)$$
Then the derivative is
$$\frac{\partial E}{\partial z_{j1}} = -\left(t_1 - t_1\frac{e^{z_{j1}}}{\Omega} - t_2\frac{e^{z_{j1}}}{\Omega}\right) = -t_1 + o_{j1}(t_1 + t_2)$$
which conforms with your result... taking into account that you didn't have the minus sign before the error sum.
micha
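A quick numeric check of the two-output expression in the comment above (a sketch with arbitrary, not necessarily one-hot, targets):

```python
import numpy as np

z1, z2 = 0.7, -0.3
t1, t2 = 0.2, 0.8                       # arbitrary targets, not necessarily one-hot

def E(z1, z2):
    Omega = np.exp(z1) + np.exp(z2)
    o1, o2 = np.exp(z1) / Omega, np.exp(z2) / Omega
    return -(t1 * np.log(o1) + t2 * np.log(o2))

eps = 1e-6
numeric = (E(z1 + eps, z2) - E(z1 - eps, z2)) / (2 * eps)

Omega = np.exp(z1) + np.exp(z2)
o1 = np.exp(z1) / Omega
analytic = -t1 + o1 * (t1 + t2)         # the expression derived in the comment
print(np.isclose(numeric, analytic))    # True
```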

But a further question I have is: instead of
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial z_j}\frac{\partial z_j}{\partial w_{ij}}$$
which is what you are generally introduced to with backpropagation, you calculated
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial z_j}\frac{\partial z_j}{\partial w_{ij}}$$
as if to cancel out the $o_j$. Why does this way lead to the right result?
micha

12

While @GeoMatt22's answer is correct, I personally found it very useful to reduce the problem to a toy example and draw a picture:

[Image: graphical model of the toy network.]

I then defined the operations each node was computing, treating the h's and w's as inputs to a "network" (t is a one-hot vector representing the class label of the data point):

$$L = -t_1\log o_1 - t_2\log o_2$$
$$o_1 = \frac{\exp(y_1)}{\exp(y_1) + \exp(y_2)}$$
$$o_2 = \frac{\exp(y_2)}{\exp(y_1) + \exp(y_2)}$$
$$y_1 = w_{11}h_1 + w_{21}h_2 + w_{31}h_3$$
$$y_2 = w_{12}h_1 + w_{22}h_2 + w_{32}h_3$$

Say I want to calculate the derivative of the loss with respect to $w_{21}$. I can just use my picture to trace back the path from the loss to the weight I'm interested in (the second column of $w$'s is removed for clarity):

[Image: graphical model with the backward path from the loss to $w_{21}$ highlighted.]

Then I can just calculate the desired derivatives. Note that there are two paths from the loss to $w_{21}$ that pass through $y_1$ (one via $o_1$ and one via $o_2$), so I need to sum the derivatives that flow along each of them.

$$\frac{\partial L}{\partial o_1} = -\frac{t_1}{o_1}$$
$$\frac{\partial L}{\partial o_2} = -\frac{t_2}{o_2}$$
$$\frac{\partial o_1}{\partial y_1} = \frac{\exp(y_1)}{\exp(y_1)+\exp(y_2)} - \left(\frac{\exp(y_1)}{\exp(y_1)+\exp(y_2)}\right)^2 = o_1(1 - o_1)$$
$$\frac{\partial o_2}{\partial y_1} = -\frac{\exp(y_2)\exp(y_1)}{(\exp(y_1)+\exp(y_2))^2} = -o_2 o_1$$
$$\frac{\partial y_1}{\partial w_{21}} = h_2$$

Finally, putting the chain rule together:

$$\frac{\partial L}{\partial w_{21}} = \frac{\partial L}{\partial o_1}\frac{\partial o_1}{\partial y_1}\frac{\partial y_1}{\partial w_{21}} + \frac{\partial L}{\partial o_2}\frac{\partial o_2}{\partial y_1}\frac{\partial y_1}{\partial w_{21}} = -\frac{t_1}{o_1}\left[o_1(1-o_1)\right]h_2 + \frac{-t_2}{o_2}(-o_2 o_1)h_2 = h_2(t_2 o_1 - t_1 + t_1 o_1) = h_2\left(o_1(t_1+t_2) - t_1\right) = h_2(o_1 - t_1)$$

Note that in the last step, $t_1 + t_2 = 1$ because the vector $t$ is a one-hot vector.
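For completeness, a small finite-difference check of this worked example (my own arbitrary numbers for h, w and a one-hot t):

```python
import numpy as np

h = np.array([0.2, -0.4, 1.1])
W = np.array([[0.3, -0.2],
              [0.5,  0.1],
              [-0.7, 0.4]])             # W[i, j]: weight from h_i to y_j (so w21 is W[1, 0])
t = np.array([1.0, 0.0])                # one-hot class label

def loss(W):
    y = h @ W                            # y_1, y_2
    o = np.exp(y) / np.exp(y).sum()      # softmax
    return -np.sum(t * np.log(o))

eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 0] += eps                      # perturb w21
W_minus[1, 0] -= eps
numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)

y = h @ W
o = np.exp(y) / np.exp(y).sum()
analytic = h[1] * (o[0] - t[0])          # h2 * (o1 - t1), the result above
print(np.isclose(numeric, analytic))     # True
```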


This is what finally cleared this up for me! Excellent and Elegant explanation!!!!
SantoshGupta7

I’m glad you both enjoyed and benefited from reading my post! It was also helpful for me to write it out and explain it.
Vivek Subramanian

@VivekSubramanian should it be
$$-\frac{t_1}{o_1}\left[o_1(1-o_1)\right]h_2 + \frac{-t_2}{o_2}(-o_2 o_1)h_2$$
instead?
koryakinp

You’re right - it was a typo! I will make the change.
Vivek Subramanian

The thing I do not understand here is that you also assign logits (unscaled scores) to some neurons ($o$ is the softmaxed logits, i.e. the predictions, and $y$ is the logits in your case). However, this is not normally the case, is it? Look at this picture (o_out1 is the prediction and o_in1 is the logits): how, in that case, can you find the partial derivative of $o_2$ with respect to $y_1$?
ARAT

6

In place of the $\{o_i\}$, I want a letter whose uppercase is visually distinct from its lowercase. So let me substitute $\{y_i\}$. Also, let's use the variable $\{p_i\}$ to designate the $\{o_i\}$ from the previous layer.

Let $Y$ be the diagonal matrix whose diagonal equals the vector $y$, i.e.

$$Y = \mathrm{Diag}(y)$$

Using this new matrix variable and the Frobenius inner product, we can calculate the gradient of $E$ with respect to $W$.

$$z = Wp + b, \quad dz = dW\,p$$
$$y = \mathrm{softmax}(z), \quad dy = (Y - yy^T)\,dz$$
$$E = -t:\log(y)$$
$$\begin{aligned}
dE &= -t:Y^{-1}dy \\
   &= -t:Y^{-1}(Y - yy^T)\,dz \\
   &= -t:(I - 1y^T)\,dz \\
   &= -t:(I - 1y^T)\,dW\,p \\
   &= (y1^T - I)\,t\,p^T : dW \\
   &= \left((1^Tt)\,yp^T - tp^T\right) : dW
\end{aligned}$$
$$\frac{\partial E}{\partial W} = (1^Tt)\,yp^T - tp^T$$
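Here is a quick NumPy check of this matrix result (a sketch with my own toy numbers; 1 denotes the all-ones vector, and p, y, t are as defined in this answer):

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.normal(size=3)                  # previous-layer outputs (the answer's p)
W = rng.normal(size=(4, 3))             # z = W p + b
b = np.zeros(4)
t = rng.uniform(size=4)                 # arbitrary target vector (not necessarily one-hot)

def E(W):
    z = W @ p + b
    y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax
    return -np.sum(t * np.log(y))

# Finite-difference gradient of E with respect to W
eps = 1e-6
grad_numeric = np.zeros_like(W)
for a in range(W.shape[0]):
    for c in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[a, c] += eps
        Wm[a, c] -= eps
        grad_numeric[a, c] = (E(Wp) - E(Wm)) / (2 * eps)

z = W @ p + b
y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
grad_closed = t.sum() * np.outer(y, p) - np.outer(t, p)   # (1^T t) y p^T - t p^T
print(np.allclose(grad_numeric, grad_closed))             # True
```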

6

Here is one of the cleanest and best-written notes I have come across on the web, which explains the calculation of derivatives in the backpropagation algorithm with a cross-entropy loss function.


In the given PDF, how did equation 22 become equation 23? That is, how did the summation over $k \ne i$ get a negative sign? Shouldn't it get a positive sign? According to my understanding, $\sum_{\text{all } k} F_n = F_n\big|_{k=i} + \sum_{k \ne i} F_n$ should hold.
faizan

1

Here's a link explaining the softmax and its derivative.

It explains the reason for treating the cases $i = j$ and $i \ne j$ separately.


It is recommended to provide a minimal, stand-alone answer in case that link gets broken in the future. Otherwise, this might no longer help other users.
luchonacho

0

Other answers have provided the correct way of calculating the derivative, but they do not point out where you have gone wrong. In fact, $t_j$ is always 1 in your last equation, because you have assumed that $o_j$ is the output node whose target is 1; the $o_j$ of the other nodes take a different form as probability functions and thus lead to a different form of derivative, which is why other people treat $i = j$ and $i \ne j$ differently.
