(Since this approach is independent of the other solutions posted here, including one I have posted myself, I am offering it as a separate response.)
You can compute the exact distribution in seconds (or less) provided the sum of the p's is small.
We have seen suggestions that the distribution might be approximately Gaussian (in some scenarios) or Poisson (in other scenarios). Either way, we know that its mean μ is the sum of the p_i and that its variance σ² is the sum of p_i(1 − p_i). The distribution will therefore be concentrated within a few standard deviations of its mean, say z SDs with z between 4 and 6 or thereabouts. Consequently we need only compute the probability that the sum X equals (an integer) k for k = μ − zσ through k = μ + zσ. When most of the p_i are small, σ² is approximately equal to (but slightly less than) μ, so to be conservative we can perform the computation for k in the interval [μ − z√μ, μ + z√μ]. For example, when the sum of the p_i equals 9 and we choose z = 6 in order to cover the tails well, the computation would need to cover k in [9 − 6√9, 9 + 6√9] = [0, 27] (after truncating at 0), which is just 28 values.
The distribution is computed recursively. Let f_i be the distribution of the sum of the first i of these Bernoulli variables. For any j from 0 through i + 1, the sum of the first i + 1 variables can equal j in two mutually exclusive ways: the sum of the first i variables equals j and the (i + 1)st is 0, or else the sum of the first i variables equals j − 1 and the (i + 1)st is 1. Therefore
f_{i+1}(j) = f_i(j) (1 − p_{i+1}) + f_i(j − 1) p_{i+1}.
We only need to carry out this computation for integral j in the interval from max(0, μ − z√μ) to μ + z√μ.
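As a concrete illustration of this recursion restricted to the truncated window, here is a minimal Python/NumPy sketch (the function name and NumPy vectorization are my own; the post's actual implementations, in Mathematica and R, appear below):

```python
import numpy as np

def poisson_binomial_pmf(p, z=6):
    """Probabilities that the sum of independent Bernoulli(p[i]) variables
    equals 0, 1, ..., ceiling(mu + z*sqrt(mu)), computed via the recursion
    f_{i+1}(j) = f_i(j)(1 - p_{i+1}) + f_i(j - 1) p_{i+1}."""
    p = np.asarray(p, dtype=float)
    mu = p.sum()
    size = int(np.ceil(mu + z * np.sqrt(mu))) + 1
    f = np.zeros(size)
    f[0] = 1.0                        # the empty sum is 0 with probability 1
    for q in p:
        # One step of the recursion, applied to all j at once.
        f[1:] = f[1:] * (1.0 - q) + f[:-1] * q
        f[0] *= 1.0 - q
    return f
```

For instance, `poisson_binomial_pmf([0.5, 0.5])` puts mass 0.25, 0.5, 0.25 on the values 0, 1, 2, as it should for two fair coin flips.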
When most of the p_i are tiny (but the 1 − p_i are still distinguishable from 1 with reasonable precision), this approach is not plagued by the huge accumulation of floating point roundoff errors that afflicted the solution I previously posted, so extended-precision computation is not required. For example, a double-precision calculation for an array of 2^16 probabilities p_i = 1/(i + 1) (μ = 10.6676, requiring calculations for probabilities of sums between 0 and 31) took 0.1 seconds with Mathematica 8 and 1-2 seconds with Excel 2002 (both obtained the same answers). Repeating it with quadruple precision (in Mathematica) took about 2 seconds but did not change any answer by more than 3 × 10^−15. Terminating the distribution at z = 6 SDs into the upper tail lost only 3.6 × 10^−8 of the total probability.
Another calculation for an array of 40,000 double precision random values between 0 and 0.001 (μ=19.9093) took 0.08 seconds with Mathematica.
This algorithm is parallelizable. Just break the set of p_i into disjoint subsets of approximately equal size, one per processor. Compute the distribution for each subset, then convolve the results (using an FFT if you like, although this speedup is probably unnecessary) to obtain the full answer. This makes it practical to use even when μ gets large, when you need to look far out into the tails (large z), and/or when n is large.
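That divide-and-convolve strategy can be sketched in Python/NumPy as follows (illustrative names of my own; for simplicity each chunk's full, untruncated distribution is computed, and the chunks are processed serially where a parallel implementation would farm them out):

```python
import numpy as np

def chunk_pmf(p):
    """Full distribution of the sum of Bernoulli(p[i]) variables (length len(p) + 1)."""
    f = np.zeros(len(p) + 1)
    f[0] = 1.0
    for q in p:
        f[1:] = f[1:] * (1.0 - q) + f[:-1] * q
        f[0] *= 1.0 - q
    return f

def pmf_by_convolution(p, parts=4):
    """Split p into `parts` subsets (one per processor, in principle),
    compute each subset's distribution, and convolve the results."""
    chunks = np.array_split(np.asarray(p, dtype=float), parts)
    result = np.array([1.0])
    for c in chunks:   # these per-chunk computations could run in parallel
        result = np.convolve(result, chunk_pmf(c))
    return result
```

The convolved result agrees with a single-pass computation over the whole array, since the sum over all variables is the sum of the independent per-chunk sums.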
The timing for an array of n variables with m processors scales as O(n(μ + z√μ)/m). Mathematica's speed is on the order of a million operations per second. For example, with m = 1 processor, n = 20000 variates, a total probability of μ = 100, and going out to z = 6 standard deviations into the upper tail, n(μ + z√μ)/m = 3.2 million: figure a couple of seconds of computing time. If you compile this you might speed up the performance by two orders of magnitude.
Incidentally, in these test cases, graphs of the distribution clearly showed some positive skewness: they aren't normal.
For the record, here is a Mathematica solution:
pb[p_, z_] := Module[
{\[Mu] = Total[p]},
Fold[#1 - #2 Differences[Prepend[#1, 0]] &,
Prepend[ConstantArray[0, Ceiling[\[Mu] + Sqrt[\[Mu]] z]], 1], p]
]
(NB The color coding applied by this site is meaningless for Mathematica code. In particular, the gray stuff is not comments: it's where all the work is done!)
An example of its use is
pb[RandomReal[{0, 0.001}, 40000], 8]
Edit
An R solution is ten times slower than Mathematica in this test case (perhaps I have not coded it optimally), but it still executes quickly (about one second):
pb <- function(p, z) {
  mu <- sum(p)
  # Start with the distribution of an empty sum: all mass at 0.
  x <- c(1, rep(0, ceiling(mu + sqrt(mu) * z)))
  # One step of the recursion, updating x in place for probability v:
  # x[j] - v * (x[j] - x[j-1]) equals x[j] * (1 - v) + x[j-1] * v.
  f <- function(v) {x <<- x - v * diff(c(0, x))}
  sapply(p, f)
  x
}
y <- pb(runif(40000, 0, 0.001), 8)
plot(y)