In 1946, geophysicist and Bayesian statistician Harold Jeffreys introduced what we today call the Kullback-Leibler divergence, and discovered that for two distributions that are "infinitely close" (let's hope that the Math SE guys don't see this ;-) we can write their Kullback-Leibler divergence as a quadratic form whose coefficients are given by the elements of the Fisher information matrix. He interpreted this quadratic form as the element of length of a Riemannian manifold, with the Fisher information playing the role of the Riemannian metric. From this geometrization of the statistical model, he derived what we now call the Jeffreys prior as the measure naturally induced by the Riemannian metric. This measure can be interpreted as an intrinsically uniform distribution on the manifold, although, in general, it is not a finite measure.
To write a rigorous proof, you'll need to spell out all the regularity conditions and take care of the order of the error terms in the Taylor expansions. Here is a brief sketch of the argument.
The symmetrized Kullback-Leibler divergence between two densities f and g is defined as
$$D[f,g]=\int \big(f(x)-g(x)\big)\,\log\!\left(\frac{f(x)}{g(x)}\right)dx.$$
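If you want to play with this definition numerically, here is a minimal sketch (the two normal densities and the grid-based integration are my own arbitrary choices, not anything from Jeffreys):

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

def symmetrized_kl(f, g, x):
    """Numerically approximate D[f, g] = ∫ (f(x) - g(x)) log(f(x)/g(x)) dx on a grid."""
    fx, gx = f(x), g(x)
    return trapezoid((fx - gx) * np.log(fx / gx), x)

# Arbitrary example: two normal densities that are not "infinitely close".
x = np.linspace(-12.0, 12.0, 20001)
f = norm(loc=0.0, scale=1.0).pdf
g = norm(loc=0.5, scale=1.2).pdf

print(symmetrized_kl(f, g, x))  # strictly positive; it is 0 only when f = g a.e.
```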
If we have a family of densities parameterized by $\theta=(\theta_1,\dots,\theta_k)$, then
$$D[p(\cdot\mid\theta),p(\cdot\mid\theta+\Delta\theta)]=\int \big(p(x\mid\theta)-p(x\mid\theta+\Delta\theta)\big)\,\log\!\left(\frac{p(x\mid\theta)}{p(x\mid\theta+\Delta\theta)}\right)dx,$$
in which $\Delta\theta=(\Delta\theta_1,\dots,\Delta\theta_k)$. Introducing the notation
$$\Delta p(x\mid\theta)=p(x\mid\theta+\Delta\theta)-p(x\mid\theta),$$
some simple algebra gives
$$D[p(\cdot\mid\theta),p(\cdot\mid\theta+\Delta\theta)]=\int \frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\,\log\!\left(1+\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\right)p(x\mid\theta)\,dx.$$
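Spelled out, the "simple algebra" is just the observation that, with this notation, $p(x\mid\theta)-p(x\mid\theta+\Delta\theta)=-\Delta p(x\mid\theta)$ and
$$\log\!\left(\frac{p(x\mid\theta)}{p(x\mid\theta+\Delta\theta)}\right)=\log\!\left(\frac{p(x\mid\theta)}{p(x\mid\theta)+\Delta p(x\mid\theta)}\right)=-\log\!\left(1+\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\right),$$
so the two minus signs cancel, and multiplying and dividing the integrand by $p(x\mid\theta)$ gives the expression above.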
Using the Taylor expansion for the natural logarithm, we have
$$\log\!\left(1+\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\right)\approx\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)},$$
and therefore
$$D[p(\cdot\mid\theta),p(\cdot\mid\theta+\Delta\theta)]\approx\int\left(\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\right)^{\!2} p(x\mid\theta)\,dx.$$
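Regarding the order of the error terms mentioned above: writing $u=\Delta p(x\mid\theta)/p(x\mid\theta)$, which is of order $\|\Delta\theta\|$, the full expansion is
$$\log(1+u)=u-\frac{u^2}{2}+\frac{u^3}{3}-\cdots,$$
so the terms dropped from the integrand $u\,\log(1+u)\,p(x\mid\theta)$ are of third and higher order in $\Delta\theta$.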
But
$$\frac{\Delta p(x\mid\theta)}{p(x\mid\theta)}\approx\frac{1}{p(x\mid\theta)}\sum_{i=1}^k \frac{\partial p(x\mid\theta)}{\partial\theta_i}\,\Delta\theta_i=\sum_{i=1}^k \frac{\partial \log p(x\mid\theta)}{\partial\theta_i}\,\Delta\theta_i.$$
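Squaring this sum and collecting terms,
$$\left(\sum_{i=1}^k \frac{\partial \log p(x\mid\theta)}{\partial\theta_i}\,\Delta\theta_i\right)^{\!2}=\sum_{i,j=1}^k \frac{\partial \log p(x\mid\theta)}{\partial\theta_i}\,\frac{\partial \log p(x\mid\theta)}{\partial\theta_j}\,\Delta\theta_i\,\Delta\theta_j,$$
and we can integrate this against $p(x\mid\theta)$ term by term.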
Hence
$$D[p(\cdot\mid\theta),p(\cdot\mid\theta+\Delta\theta)]\approx \sum_{i,j=1}^k g_{ij}\,\Delta\theta_i\,\Delta\theta_j,$$
in which
$$g_{ij}=\int \frac{\partial \log p(x\mid\theta)}{\partial\theta_i}\,\frac{\partial \log p(x\mid\theta)}{\partial\theta_j}\,p(x\mid\theta)\,dx.$$
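As a numerical sanity check (my addition, not part of Jeffreys' argument), one can verify the approximation for a concrete family. The sketch below assumes $p(x\mid\theta)$ is the normal density with $\theta=(\mu,\sigma)$, for which the Fisher information matrix is $\operatorname{diag}(1/\sigma^2,\,2/\sigma^2)$:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

# Compare D[p(.|theta), p(.|theta + dtheta)] with the quadratic form
# sum_ij g_ij dtheta_i dtheta_j for a normal family with theta = (mu, sigma).
mu, sigma = 0.0, 1.0
dmu, dsigma = 0.01, -0.02

x = np.linspace(-15.0, 15.0, 40001)
p = norm(loc=mu, scale=sigma).pdf(x)
q = norm(loc=mu + dmu, scale=sigma + dsigma).pdf(x)

divergence = trapezoid((p - q) * np.log(p / q), x)

g = np.diag([1.0 / sigma**2, 2.0 / sigma**2])  # Fisher information of N(mu, sigma^2) in (mu, sigma)
dtheta = np.array([dmu, dsigma])
quadratic_form = dtheta @ g @ dtheta

print(divergence, quadratic_form)  # the two values nearly agree; they differ at higher order in dtheta
```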
This is the original paper:
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, 186, 453–461.