Backpropagation with Softmax / Cross-Entropy


40

I am trying to understand how backpropagation works for a softmax/cross-entropy output layer.

The cross-entropy error function is

$$E(t,o) = -\sum_j t_j \log o_j$$

with $t$ and $o$ as the target and output at neuron $j$, respectively. The sum is over each neuron in the output layer. $o_j$ itself is the result of the softmax function:

$$o_j = \mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_j e^{z_j}}$$

Again, the sum is over each neuron in the output layer, and $z_j$ is the input to neuron $j$:

$$z_j = \sum_i w_{ij} o_i + b$$

That is the sum over all neurons in the previous layer, with their corresponding outputs $o_i$ and weights $w_{ij}$ towards neuron $j$, plus a bias $b$.
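For concreteness, here is a minimal NumPy sketch of the forward pass just described (the names x, w, b, t and the layer sizes are my own illustrative choices; the question writes the previous-layer outputs as $o_i$ and uses a single bias $b$, while the sketch uses one bias per output neuron):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability (does not change the result)
    e = np.exp(z)
    return e / e.sum()

x = np.array([0.5, -1.2, 2.0])     # outputs o_i of the previous layer (3 neurons)
w = np.random.randn(3, 4)          # w[i, j] connects previous neuron i to output neuron j
b = np.zeros(4)
t = np.array([0.0, 1.0, 0.0, 0.0]) # one-hot target

z = x @ w + b                      # z_j = sum_i w_ij * x_i + b_j
o = softmax(z)                     # o_j
E = -np.sum(t * np.log(o))         # cross-entropy error
print(E)
```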

Now, in order to update a weight $w_{ij}$ that connects a neuron $j$ in the output layer with a neuron $i$ in the previous layer, I need to calculate the partial derivative of the error function using the chain rule:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial z_j} \frac{\partial z_j}{\partial w_{ij}}$$

with $z_j$ as the input to neuron $j$.

The last term is quite simple. Since there is only one weight between $i$ and $j$, the derivative is:

$$\frac{\partial z_j}{\partial w_{ij}} = o_i$$

The first term is the derivative of the error function with respect to the output $o_j$:

$$\frac{\partial E}{\partial o_j} = \frac{-t_j}{o_j}$$

The middle term, the derivative of the softmax function with respect to its input $z_j$, is harder:

$$\frac{\partial o_j}{\partial z_j} = \frac{\partial}{\partial z_j} \frac{e^{z_j}}{\sum_j e^{z_j}}$$

Let's say we have three output neurons corresponding to the classes $a, b, c$. Then $o_b = \mathrm{softmax}(b)$ is:

$$o_b = \frac{e^{z_b}}{\sum e^z} = \frac{e^{z_b}}{e^{z_a} + e^{z_b} + e^{z_c}}$$

and its derivative, using the quotient rule, is:

$$\frac{\partial o_b}{\partial z_b} = \frac{e^{z_b}\sum e^z - (e^{z_b})^2}{(\sum e^z)^2} = \frac{e^{z_b}}{\sum e^z} - \frac{(e^{z_b})^2}{(\sum e^z)^2}$$
$$= \mathrm{softmax}(b) - \mathrm{softmax}^2(b) = o_b - o_b^2 = o_b(1 - o_b)$$

Back to the middle term for backpropagation, this means:

$$\frac{\partial o_j}{\partial z_j} = o_j(1 - o_j)$$

Putting it all together, I get

$$\frac{\partial E}{\partial w_{ij}} = \frac{-t_j}{o_j} \cdot o_j(1 - o_j) \cdot o_i = -t_j(1 - o_j) \cdot o_i$$

which means that, if the target for this class is $t_j = 0$, then I will never update the weight for it. That does not sound right.
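To make that suspicion concrete, here is a small finite-difference check (a sketch under my own toy numbers; x stands for the previous-layer outputs that I call $o_i$ above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def error(w):
    o = softmax(x @ w + b)
    return -np.sum(t * np.log(o))

rng = np.random.default_rng(0)
x = rng.normal(size=3)             # previous-layer outputs (the o_i in the question)
w = rng.normal(size=(3, 4))
b = np.zeros(4)
t = np.array([0.0, 1.0, 0.0, 0.0])

i, j, eps = 0, 2, 1e-6             # pick a weight whose output neuron has t_j = 0
w_plus, w_minus = w.copy(), w.copy()
w_plus[i, j] += eps
w_minus[i, j] -= eps
numeric = (error(w_plus) - error(w_minus)) / (2 * eps)

o = softmax(x @ w + b)
question_formula = -t[j] * (1 - o[j]) * x[i]   # the result derived above: 0 whenever t_j = 0
print(numeric, question_formula)               # these disagree, confirming the suspicion
# the correct value turns out to be x[i] * (o[j] - t[j]), as the answers below show
```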

Investigating this, I found that people give two variants for the derivative of the softmax, one for $i = j$ and another for $i \ne j$, as here or here.

But I can't make sense of that. Also, I am not even sure whether this is the cause of my error, which is why I am posting all of my calculations. I hope someone can clarify where I am missing something or going wrong.


The links you provided calculate the derivative relative to the inputs, while you are calculating the derivative relative to the weights.
Jenkar

Answers:


35

Note: I am not a backprop expert, but now, having read up a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (including a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!

I noticed a couple of inconsistencies in your derivation. I do not really do neural networks, so what follows may be wrong. However, here is how I would go about the problem.

First, you have to take account of the summation in $E$, and you cannot assume each term depends on only one weight. So, taking the gradient of $E$ with respect to component $k$ of $z$, we have

$$E = -\sum_j t_j \log o_j \implies \frac{\partial E}{\partial z_k} = -\sum_j t_j \frac{\partial \log o_j}{\partial z_k}$$

Then, expressing $o_j$ as

$$o_j = \frac{1}{\Omega} e^{z_j}, \quad \Omega = \sum_i e^{z_i} \implies \log o_j = z_j - \log\Omega$$

we have

$$\frac{\partial \log o_j}{\partial z_k} = \delta_{jk} - \frac{1}{\Omega}\frac{\partial \Omega}{\partial z_k}$$

where $\delta_{jk}$ is the Kronecker delta. Then the gradient of the softmax denominator is

$$\frac{\partial \Omega}{\partial z_k} = \sum_i e^{z_i}\delta_{ik} = e^{z_k}$$

which gives

$$\frac{\partial \log o_j}{\partial z_k} = \delta_{jk} - o_k$$

or, expanding the log,

$$\frac{\partial o_j}{\partial z_k} = o_j(\delta_{jk} - o_k)$$

Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($= 1$ only when $k = j$).

So the gradient with respect to $z$ is

$$\frac{\partial E}{\partial z_k} = \sum_j t_j(o_k - \delta_{jk}) = o_k\left(\sum_j t_j\right) - t_k \implies \frac{\partial E}{\partial z_k} = o_k\tau - t_k$$

where $\tau = \sum_j t_j$ is constant (for a given target vector $t$).

This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is one-hot, we have $\tau = 1$ (as noted in your first link).
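A quick numeric sanity check of this expression (my own sketch): it builds the full softmax Jacobian $\frac{\partial o_j}{\partial z_k} = o_j(\delta_{jk} - o_k)$, chains it with $\frac{\partial E}{\partial o_j} = -t_j/o_j$, and compares the result against $o_k\tau - t_k$ for a target that is deliberately not one-hot:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
z = rng.normal(size=4)
t = rng.uniform(size=4)                 # deliberately not one-hot, so tau != 1
o = softmax(z)
tau = t.sum()

# dE/do_j = -t_j / o_j, and the softmax Jacobian do_j/dz_k = o_j * (delta_jk - o_k)
dE_do = -t / o
jacobian = np.diag(o) - np.outer(o, o)  # J[j, k] = o_j * (delta_jk - o_k)
dE_dz_chain = jacobian.T @ dE_do        # sum_j dE/do_j * do_j/dz_k

dE_dz_closed = o * tau - t              # the closed form derived above
print(np.allclose(dE_dz_chain, dE_dz_closed))   # True
```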

A second inconsistency, if I understand correctly, is that the "o" that is the input to $z$ seems unlikely to be the "o" that is the output of the softmax. I would think that it makes more sense that this is actually "further back" in the network architecture.

Calling this vector $y$, we then have

$$z_k = \sum_i w_{ik} y_i + b_k \implies \frac{\partial z_k}{\partial w_{pq}} = \sum_i y_i \frac{\partial w_{ik}}{\partial w_{pq}} = \sum_i y_i \delta_{ip}\delta_{kq} = \delta_{kq} y_p$$

Finally, to get the gradient of $E$ with respect to the weight matrix $w$, we use the chain rule

$$\frac{\partial E}{\partial w_{pq}} = \sum_k \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}} = \sum_k (o_k\tau - t_k)\delta_{kq} y_p = y_p(o_q\tau - t_q)$$

giving the final expression (assuming a one-hot $t$, i.e. $\tau = 1$)

$$\frac{\partial E}{\partial w_{ij}} = y_i(o_j - t_j)$$

where $y$ is the input on the lowest level (of your example).

So this shows a second difference from your result: the "$o_i$" should presumably be from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).

Hopefully this helps. Does this result seem more consistent?
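In practice this is exactly the gradient one would plug into a training step. A minimal sketch of one-layer gradient descent using the result above (toy sizes and learning rate are my own; the bias gradient $o_j - t_j$ follows from the same argument with $\partial z_j/\partial b_j = 1$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
y = rng.normal(size=3)                 # previous-layer outputs (the "y" above)
w = 0.1 * rng.normal(size=(3, 4))      # w[i, j]: weight from y_i to output neuron j
b = np.zeros(4)
t = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot target
lr = 0.1

def cross_entropy(w, b):
    o = softmax(y @ w + b)
    return -np.sum(t * np.log(o))

print(cross_entropy(w, b))             # error before training
for step in range(200):
    o = softmax(y @ w + b)             # forward pass
    grad_w = np.outer(y, o - t)        # dE/dw_ij = y_i * (o_j - t_j)
    grad_b = o - t                     # dE/db_j = o_j - t_j, since dz_j/db_j = 1
    w -= lr * grad_w                   # gradient-descent update
    b -= lr * grad_b
print(cross_entropy(w, b))             # error after training: much smaller
```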

Update: In response to a query from the OP in the comments, here is an expansion of the first step. First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So to fully write out the gradient with the full chain rule, we have

$$\frac{\partial E}{\partial w_{pq}} = \sum_i \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial w_{pq}}$$
and
$$\frac{\partial o_i}{\partial w_{pq}} = \sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}$$
so
$$\frac{\partial E}{\partial w_{pq}} = \sum_i \left[\frac{\partial E}{\partial o_i}\left(\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}\right)\right]$$
In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.
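To illustrate that advice, here is a hedged sketch that evaluates the weight gradient with the full summations written out (explicit loops over every index, Kronecker deltas included) and checks it against the compact result $y_p(o_q\tau - t_q)$ derived above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(2)
y = rng.normal(size=3)                 # previous-layer outputs (the answer's y)
W = rng.normal(size=(3, 4))            # W[i, k]: weight from y_i to output neuron k
b = np.zeros(4)
t = np.array([0.0, 0.0, 1.0, 0.0])     # one-hot, so tau = 1

z = y @ W + b
o = softmax(z)
tau = t.sum()

n_in, n_out = W.shape
grad_full = np.zeros_like(W)
for p in range(n_in):
    for q in range(n_out):
        # dE/dw_pq = sum_i dE/do_i * sum_k do_i/dz_k * dz_k/dw_pq
        total = 0.0
        for i in range(n_out):
            for k in range(n_out):
                dE_doi = -t[i] / o[i]
                doi_dzk = o[i] * ((i == k) - o[k])      # o_i * (delta_ik - o_k)
                dzk_dwpq = y[p] * (k == q)              # delta_kq * y_p
                total += dE_doi * doi_dzk * dzk_dwpq
        grad_full[p, q] = total

grad_compact = np.outer(y, o * tau - t)                 # y_p * (o_q * tau - t_q)
print(np.allclose(grad_full, grad_compact))             # True
```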

I am not certain how the "Backprop/AutoDiff" community does these problems, but I find any time I try to take shortcuts, I am liable to make errors. So I end up doing as here, writing everything out in terms of summations with full subscripting, and always introducing new subscripts for every derivative. (Similar to my answer here ... I hope I am at least giving correct results in the end!)
GeoMatt22

I personally find that your writing everything out makes it much easier to follow. The results look correct to me.
Jenkar

Although I'm still trying to fully understand each of your steps, I got some valuable insights that helped me with the overall picture. I guess I need to read more on the topic of derivatives and sums. But taking your advice to take account of the summation in $E$, I came up with this:
micha

For two outputs $o_{j1} = \frac{e^{z_{j1}}}{\Omega}$ and $o_{j2} = \frac{e^{z_{j2}}}{\Omega}$ with
$$\Omega = e^{z_{j1}} + e^{z_{j2}}$$
the cross-entropy error is
$$E = -(t_1\log o_{j1} + t_2\log o_{j2}) = -\left(t_1(z_{j1} - \log\Omega) + t_2(z_{j2} - \log\Omega)\right)$$
Then the derivative is
$$\frac{\partial E}{\partial z_{j1}} = -\left(t_1 - t_1\frac{e^{z_{j1}}}{\Omega} - t_2\frac{e^{z_{j1}}}{\Omega}\right) = -t_1 + o_{j1}(t_1 + t_2)$$
which conforms with your result... taking into account that you didn't have the minus sign before the error sum.
micha
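A quick numeric check of the two-output expression in the comment above (a sketch with arbitrary, not necessarily one-hot, targets):

```python
import numpy as np

z1, z2 = 0.7, -0.3
t1, t2 = 0.2, 0.8                       # arbitrary targets, not necessarily one-hot

def E(z1, z2):
    Omega = np.exp(z1) + np.exp(z2)
    o1, o2 = np.exp(z1) / Omega, np.exp(z2) / Omega
    return -(t1 * np.log(o1) + t2 * np.log(o2))

eps = 1e-6
numeric = (E(z1 + eps, z2) - E(z1 - eps, z2)) / (2 * eps)

Omega = np.exp(z1) + np.exp(z2)
o1 = np.exp(z1) / Omega
analytic = -t1 + o1 * (t1 + t2)         # the expression derived in the comment
print(np.isclose(numeric, analytic))    # True
```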

But a further question I have is: instead of
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial z_j}\frac{\partial z_j}{\partial w_{ij}}$$
which is what you are generally introduced to with backpropagation, you calculated
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial z_j}\frac{\partial z_j}{\partial w_{ij}}$$
as if to cancel out the $o_j$. Why does this way lead to the right result?
micha

12

While @GeoMatt22's answer is correct, I personally found it very useful to reduce the problem to a toy example and draw a picture:

[Image: graphical model of the toy network.]

I then defined the operations each node was computing, treating the h's and w's as inputs to a "network" (t is a one-hot vector representing the class label of the data point):

$$L = -t_1\log o_1 - t_2\log o_2$$
$$o_1 = \frac{\exp(y_1)}{\exp(y_1) + \exp(y_2)}$$
$$o_2 = \frac{\exp(y_2)}{\exp(y_1) + \exp(y_2)}$$
$$y_1 = w_{11}h_1 + w_{21}h_2 + w_{31}h_3$$
$$y_2 = w_{12}h_1 + w_{22}h_2 + w_{32}h_3$$

Say I want to calculate the derivative of the loss with respect to $w_{21}$. I can just use my picture to trace back the path from the loss to the weight I'm interested in (the second column of $w$'s is removed for clarity):

[Image: graphical model with the backward path from the loss to $w_{21}$ highlighted.]

Then I can just calculate the desired derivatives. Note that there are two paths from the loss to $w_{21}$ that pass through $y_1$ (one via $o_1$ and one via $o_2$), so I need to sum the derivatives that flow along each of them.

$$\frac{\partial L}{\partial o_1} = -\frac{t_1}{o_1}$$
$$\frac{\partial L}{\partial o_2} = -\frac{t_2}{o_2}$$
$$\frac{\partial o_1}{\partial y_1} = \frac{\exp(y_1)}{\exp(y_1)+\exp(y_2)} - \left(\frac{\exp(y_1)}{\exp(y_1)+\exp(y_2)}\right)^2 = o_1(1 - o_1)$$
$$\frac{\partial o_2}{\partial y_1} = -\frac{\exp(y_2)\exp(y_1)}{(\exp(y_1)+\exp(y_2))^2} = -o_2 o_1$$
$$\frac{\partial y_1}{\partial w_{21}} = h_2$$

Finally, putting the chain rule together:

$$\frac{\partial L}{\partial w_{21}} = \frac{\partial L}{\partial o_1}\frac{\partial o_1}{\partial y_1}\frac{\partial y_1}{\partial w_{21}} + \frac{\partial L}{\partial o_2}\frac{\partial o_2}{\partial y_1}\frac{\partial y_1}{\partial w_{21}} = -\frac{t_1}{o_1}\left[o_1(1-o_1)\right]h_2 + \frac{-t_2}{o_2}(-o_2 o_1)h_2 = h_2(t_2 o_1 - t_1 + t_1 o_1) = h_2\left(o_1(t_1+t_2) - t_1\right) = h_2(o_1 - t_1)$$

Note that in the last step, $t_1 + t_2 = 1$ because the vector $t$ is a one-hot vector.
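For completeness, a small finite-difference check of this worked example (my own arbitrary numbers for h, w and a one-hot t):

```python
import numpy as np

h = np.array([0.2, -0.4, 1.1])
W = np.array([[0.3, -0.2],
              [0.5,  0.1],
              [-0.7, 0.4]])             # W[i, j]: weight from h_i to y_j (so w21 is W[1, 0])
t = np.array([1.0, 0.0])                # one-hot class label

def loss(W):
    y = h @ W                            # y_1, y_2
    o = np.exp(y) / np.exp(y).sum()      # softmax
    return -np.sum(t * np.log(o))

eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 0] += eps                      # perturb w21
W_minus[1, 0] -= eps
numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)

y = h @ W
o = np.exp(y) / np.exp(y).sum()
analytic = h[1] * (o[0] - t[0])          # h2 * (o1 - t1), the result above
print(np.isclose(numeric, analytic))     # True
```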


This is what finally cleared this up for me! Excellent and Elegant explanation!!!!
SantoshGupta7

I’m glad you both enjoyed and benefited from reading my post! It was also helpful for me to write it out and explain it.
Vivek Subramanian

@VivekSubramanian should it be
$$-\frac{t_1}{o_1}\left[o_1(1-o_1)\right]h_2 + \frac{-t_2}{o_2}(-o_2 o_1)h_2$$
instead?
koryakinp

You’re right - it was a typo! I will make the change.
Vivek Subramanian

The thing I do not understand here is that you also assign logits (unscaled scores) to some neurons ($o$ is the softmaxed logits, i.e. the predictions, and $y$ is the logits in your case). However, this is not normally the case, is it? Look at this picture (o_out1 is the prediction and o_in1 is the logits): how, in that case, can you find the partial derivative of $o_2$ with respect to $y_1$?
ARAT

6

In place of the $\{o_i\}$, I want a letter whose uppercase is visually distinct from its lowercase. So let me substitute $\{y_i\}$. Also, let's use the variable $\{p_i\}$ to designate the $\{o_i\}$ from the previous layer.

Let $Y$ be the diagonal matrix whose diagonal equals the vector $y$, i.e.

$$Y = \mathrm{Diag}(y)$$

Using this new matrix variable and the Frobenius inner product, we can calculate the gradient of $E$ with respect to $W$.

$$z = Wp + b, \quad dz = dW\,p$$
$$y = \mathrm{softmax}(z), \quad dy = (Y - yy^T)\,dz$$
$$E = -t:\log(y)$$
$$\begin{aligned}
dE &= -t:Y^{-1}dy \\
   &= -t:Y^{-1}(Y - yy^T)\,dz \\
   &= -t:(I - 1y^T)\,dz \\
   &= -t:(I - 1y^T)\,dW\,p \\
   &= (y1^T - I)\,t\,p^T : dW \\
   &= \left((1^Tt)\,yp^T - tp^T\right) : dW
\end{aligned}$$
$$\frac{\partial E}{\partial W} = (1^Tt)\,yp^T - tp^T$$
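Here is a quick NumPy check of this matrix result (a sketch with my own toy numbers; 1 denotes the all-ones vector, and p, y, t are as defined in this answer):

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.normal(size=3)                  # previous-layer outputs (the answer's p)
W = rng.normal(size=(4, 3))             # z = W p + b
b = np.zeros(4)
t = rng.uniform(size=4)                 # arbitrary target vector (not necessarily one-hot)

def E(W):
    z = W @ p + b
    y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax
    return -np.sum(t * np.log(y))

# Finite-difference gradient of E with respect to W
eps = 1e-6
grad_numeric = np.zeros_like(W)
for a in range(W.shape[0]):
    for c in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[a, c] += eps
        Wm[a, c] -= eps
        grad_numeric[a, c] = (E(Wp) - E(Wm)) / (2 * eps)

z = W @ p + b
y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
grad_closed = t.sum() * np.outer(y, p) - np.outer(t, p)   # (1^T t) y p^T - t p^T
print(np.allclose(grad_numeric, grad_closed))             # True
```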

6

Here is one of the cleanest and best-written notes I have come across on the web, which explains the calculation of derivatives in the backpropagation algorithm with a cross-entropy loss function.


In the given PDF, how did equation 22 become equation 23? That is, how did the summation over $k \ne i$ get a negative sign? Shouldn't it get a positive sign? According to my understanding, $\sum_{\text{all } k} F_n = F_n\big|_{k=i} + \sum_{k \ne i} F_n$ should hold.
faizan

1

Here's a link explaining the softmax and its derivative.

It explains the reason for treating the cases $i = j$ and $i \ne j$ separately.


It is recommended to provide a minimal, stand-alone answer in case that link gets broken in the future. Otherwise, this might no longer help other users.
luchonacho

0

Other answers have provided the correct way of calculating the derivative, but they do not point out where you have gone wrong. In fact, $t_j$ is always 1 in your last equation, because you have assumed that $o_j$ is the output node whose target is 1; the $o_j$ of the other nodes take a different form as probability functions and thus lead to a different form of derivative, which is why other people treat $i = j$ and $i \ne j$ differently.
