Memperkirakan ukuran populasi dari frekuensi duplikat dan unik sampel

Ada layanan web di mana saya dapat meminta informasi tentang item acak. Untuk setiap permintaan, setiap item memiliki peluang yang sama untuk dikembalikan.

Saya dapat terus meminta barang dan mencatat jumlah duplikat dan unik. Bagaimana saya bisa menggunakan data ini untuk memperkirakan jumlah item?

probability population coupon-collector-problem

— hoju
sumber

Yang ingin Anda perkirakan bukanlah ukuran sampel, tetapi ukuran populasi (jumlah total item unik yang dikembalikan oleh layanan web).

— GaBorgulya

Jawaban:

Ini pada dasarnya adalah varian dari masalah pengumpul kupon.

Jika ada item total dan Anda telah mengambil ukuran sampel dengan penggantian maka kemungkinan telah diidentifikasi item unik adalah $n$ $s$ $u$ di manamemberikannomor Stirling dari jenis kedua

P r (U = u | n, s) = \frac{S_{2} (s, u) n!}{(n - u)! n^{s}}

$Pr(U=u|n,s) = \frac{S_2(s,u) n! }{ (n-u)! n^s }$

S_{2} (s, u)

$S_2(s,u)$

Sekarang semua yang Anda butuhkan adalah distribusi sebelumnya untuk , berlaku teorema Bayes, dan mendapatkan distribusi posterior untuk . $Pr(N=n)$ $N$

— Henry
sumber

Ini tampaknya kehilangan beberapa informasi karena tidak memperhitungkan frekuensi item yang diamati 2, 3, 4, ... kali.

— Whuber

@whuber: Tampaknya tidak menggunakan informasi, tetapi jika Anda menyelidiki lebih lanjut, Anda akan menemukan bahwa jumlah item unik adalah statistik yang cukup. Misalnya, jika Anda mengambil sampel dengan penggantian 4 item dari populasi

, probabilitas mendapatkan 3 item satu dan 1 item lainnya adalah

n

$n$

bahwa mendapatkan 2 masing-masing dari dua item, tidak peduli apa

adalah, jadi mengetahui frekuensi rinci tidak memberikan informasi yang lebih berguna tentang populasi daripada hanya mengetahui ada dua item unik yang ditemukan dalam sampel.

\frac{4}{3}

$\frac{4}{3}$

n

$n$

— Henry

Poin menarik tentang kecukupan jumlah barang unik. Jadi frekuensi dapat berfungsi sebagai pemeriksaan pada asumsi (independensi dan probabilitas yang sama), tetapi sebaliknya tidak perlu.

— whuber

Saya sudah memberikan saran berdasarkan angka Stirling dari jenis kedua dan metode Bayesian.

Bagi mereka yang menemukan angka Stirling terlalu besar atau metode Bayesian terlalu sulit, metode yang lebih kasar mungkin untuk digunakan

E [U | n, s] = n (1 - {(1 - \frac{1}{n})}^{s})

$E[U|n,s] = n\left( 1- \left(1-\frac{1}{n}\right)^s\right)$

v a r [U | n, s] = n {(1 - \frac{1}{n})}^{s} + n^{2} (1 - \frac{1}{n}) {(1 - \frac{2}{n})}^{s} - n^{2} {(1 - \frac{1}{n})}^{2 s}

$var[U|n,s] = n\left(1-\frac{1}{n}\right)^s + n^2 \left(1-\frac{1}{n}\right)\left(1-\frac{2}{n}\right)^s - n^2\left(1-\frac{1}{n}\right)^{2s}$

dan menghitung kembali menggunakan metode numerik.

$s=300$ $U = 265$ $\hat{n} \approx 1180$

$U$ $n$

— Henry
sumber

(+1) It's interesting to note that if the ratio

s / n

$s/n$ is very small then essentially no information about

n

$n$ can be present and so one can't expect to do very well estimating

n

$n$ . If

s / n

$s/n$ is very large then

U

$U$ is a good estimator. So, we need something that works in a mid-range.

— cardinal

Also

1 - (1 - 1 / n)^{s} \geq (1 - f_{k} (s / n)) / f_{k} (s / n)

$1 - (1-1/n)^s \geq (1-f_k(s/n)) / f_k(s/n)$ where

f_{k} (x) = \sum_{i = 0}^{k} x^{i} / i!

$f_k(x) = \sum_{i=0}^k x^i/i!$ is the

k

$k$ th-order Taylor series approximation to

e^{x}

$e^x$ . Using

k = 1

$k=1$ gives an estimator

\tilde{n} = \frac{s}{s - U} U

$\tilde{n} = \frac{s}{s-U} U$ . A continuity correction for small

s

$s$ can be obtained by adding a constant (like 1) in the denominator. This estimator doesn't do as well for the example as numerically solving for

\hat{n}

$\hat{n}$ as you've done, though.

— cardinal

You can use the capture-recapture method, also implemented as the Rcapture R package.

Here is an example, coded in R. Let's assume that the web service has N=1000 items. We will make n=300 requests. Generate a random sample where, numbering the elements from 1 to k, where k is how many different items we saw.

N = 1000; population = 1:N # create a population of the integers from 1 to 1000
n = 300 # number of requests
set.seed(20110406)
observation = as.numeric(factor(sample(population, size=n,
  replace=TRUE))) # a random sample from the population, renumbered
table(observation) # a table useful to see, not discussed
k = length(unique(observation)) # number of unique items seen
(t = table(table(observation)))

The result of the simulation is

  1   2   3 
234  27   4

thus among the 300 requests there were 4 items seen 3 times, 27 items seen twice, and 234 items seen only once.

Now estimate N from this sample:

require(Rcapture)
X = data.frame(t)
X[,1]=as.numeric(X[,1])
desc=descriptive(X, dfreq=TRUE, dtype="nbcap", t=300)
desc # useful to see, not discussed
plot(desc) # useful to see, not discussed
cp=closedp.0(X, dfreq=TRUE, dtype="nbcap", t=300, trace=TRUE)
cp

The result:

Number of captured units: 265 

Abundance estimations and model fits:
                  abundance       stderr      deviance   df           AIC
M0**                  265.0          0.0  2.297787e+39  298  2.297787e+39
Mh Chao              1262.7        232.5  7.840000e-01    9  5.984840e+02
Mh Poisson2**         265.0          0.0  2.977883e+38  297  2.977883e+38
Mh Darroch**          553.9         37.1  7.299900e+01  297  9.469900e+01
Mh Gamma3.5**  5644623606.6  375581044.0  5.821861e+05  297  5.822078e+05

 ** : The M0 model did not converge
 ** : The Mh Poisson2 model did not converge
 ** : The Mh Darroch model did not converge
 ** : The Mh Gamma3.5 model did not converge
Note: 9 eta parameters has been set to zero in the Mh Chao model

Thus only the Mh Chao model converged, it estimated $\hat{N}$ =1262.7.

EDIT: To check the reliability of the above method I ran the above code on 10000 generated samples. The Mh Chao model converged every time. Here is the summary:

> round(quantile(Nhat, c(0, 0.025, 0.25, 0.50, 0.75, 0.975, 1)), 1)
    0%   2.5%    25%    50%    75%  97.5%   100% 
 657.2  794.6  941.1 1034.0 1144.8 1445.2 2162.0 
> mean(Nhat)
[1] 1055.855
> sd(Nhat)
[1] 166.8352

— GaBorgulya
sumber

It seems some justification for the use of capture-recapture models is needed, because this is not a standard capture-recapture experiment. (Possibly it can be viewed as 300 capture events, but the call to closedp does not seem to indicate that.)

— whuber

@whuber Yes, I did view the example as 300 capture events. How do you mean that "the call to closedp does not seem to indicate that"? I appreciate (constructive) criticism and I'm happy to correct (or delete if necessary) my answer if it turns out to be wrong.

— GaBorgulya

this seems a reasonable approach. However I won't be using R so need to understand the maths behind it. The wiki page covers a 2 event situation - how do I apply it to this case?

— hoju

@Ga I see: You created a 300 x 300 matrix for the data! The inefficiency of this code fooled me: it would be simpler and more direct to use `closedp.0(Y, dfreq=TRUE, dtype="nbcap", t=300)' where Y is the frequency matrix {{1,234}, {2,27}, {3,4}} (which you computed twice and actually displayed!). More to the point, the convergence failures are alarming, suggesting there are problems with the underlying code or models. (An exhaustive search of the docs for "M0" turns up no references or description for this method...)

— whuber

@whuber I simplified the code following your suggestion (dfreq=TRUE, dtype="nbcap", t=300). Thanks again.

— GaBorgulya