The Gibbs sampler discussed at http://bit.ly/IWhJ52 and also on http://dirk.eddelbuettel.com/blog/2011/07/14/ has been implemented in several languages.
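For readers who haven't followed those links, here is a minimal sketch in current Julia of the bivariate sampler they discuss, assuming the usual full conditionals from that example, x | y ~ Gamma(3, 1/(y^2 + 4)) (shape/scale) and y | x ~ N(1/(x + 1), 1/sqrt(2(x + 1))). It is only an illustration, not the SimpleGibbs.jl file shown earlier in this post, and the name gibbs_sketch is mine.

# A minimal sketch, not the SimpleGibbs.jl from the post; assumes the
# standard full conditionals for this example and current Julia with
# Distributions.jl (Gamma is parameterized by shape and scale).
using Distributions

function gibbs_sketch(N::Int, thin::Int)
    mat = zeros(N, 2)
    x = 0.0
    y = 0.0
    for i in 1:N
        for _ in 1:thin
            x = rand(Gamma(3.0, 1.0 / (y * y + 4.0)))
            y = rand(Normal(1.0 / (x + 1.0), 1.0 / sqrt(2.0 * (x + 1.0))))
        end
        mat[i, 1] = x        # store only every `thin`-th draw
        mat[i, 2] = y
    end
    mat
end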
Dirk kindly provides an R script in the examples section of the Rcpp package to time versions in native R and in R calling compiled C++ code. On my desktop computer with a 4-core processor, timing 10 replications, each producing 20000 samples with a thinning of 200, gives
                test replications elapsed  relative user.self sys.self
4   GSLGibbs(N, thn)           10   8.338  1.000000     8.336    0.000
3  RcppGibbs(N, thn)           10  13.285  1.593308    13.285    0.000
2    RCgibbs(N, thn)           10 369.843 44.356320   369.327    0.032
1     Rgibbs(N, thn)           10 473.511 56.789518   472.754    0.044
Using the Julia source file SimpleGibbs.jl shown above, I get
julia> sum([@elapsed JGibbs1(20000, 200) for i=1:10])
11.385021209716797
julia> sum([@elapsed JGibbs1(20000, 200) for i=1:10])
11.430732011795044
julia> sum([@elapsed JGibbs2(20000, 200) for i=1:10])
26.221017122268677
julia> sum([@elapsed JGibbs2(20000, 200) for i=1:10])
26.195765256881714
julia> sum([@elapsed dJGibbs1(20000, 200) for i=1:10])
5.465957164764404
julia> sum([@elapsed dJGibbs1(20000, 200) for i=1:10])
3.997025966644287
julia> sum([@elapsed dJGibbs1(20000, 200) for i=1:10])
4.039087772369385
julia> sum([@elapsed dJGibbs2(20000, 200) for i=1:10])
3.6576080322265625
julia> sum([@elapsed dJGibbs2(20000, 200) for i=1:10])
3.7252891063690186
julia> sum([@elapsed dJGibbs2(20000, 200) for i=1:10])
3.7631633281707764
when starting Julia as
julia -p 4
Things to notice: apart from minor syntactic differences, the JGibbs2 function is essentially the same as Dirk's and Darren's R functions, including the use of the same random number generators. Ten replications take about 26 seconds for this Julia version, as opposed to 370 seconds for RCgibbs (byte-compiled) and 474 seconds for Rgibbs. Using Julia's own random number generators, the JGibbs1 version takes about 11.4 seconds. (R's rgamma is known to be slow.)
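How a Julia function can share R's generators deserves a brief illustration. The standalone Rmath library exposes R's samplers at the C level, and Julia can reach them with ccall; this is only a hedged sketch assuming libRmath is installed and findable, and the wrapper names are mine, so the actual mechanism in SimpleGibbs.jl may differ.

# Hedged sketch: call R's samplers through the standalone Rmath library.
# Assumes libRmath is on the library search path; C-level signatures are
#   double rgamma(double shape, double scale);
#   double rnorm(double mu, double sigma);
# set_seed initializes the standalone library's RNG state.
ccall((:set_seed, "libRmath"), Cvoid, (Cuint, Cuint), 123, 456)
rgamma_R(shape, scale) = ccall((:rgamma, "libRmath"), Float64, (Float64, Float64), shape, scale)
rnorm_R(mu, sigma) = ccall((:rnorm, "libRmath"), Float64, (Float64, Float64), mu, sigma)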
Compared with the compiled code called from Rcpp and RcppGSL, the Julia versions are within a factor of 2 (JGibbs1 should be compared to GSLGibbs, and JGibbs2 to RcppGibbs).
Finally, the versions that use distributed arrays on 4 cores are the fastest by far. Using 4 cores provides a speedup of about 3x even when converting the distributed array back to a regular array. The dJGibbs2 function is slightly faster still because it leaves the result as a distributed array, which would be the natural approach if further distributed processing were to be done.
Oh, and notice that creating the distributed version from the original is a one-liner.
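To give the flavor of that one-liner, here is a hedged sketch against today's Distributed and DistributedArrays.jl APIs (the darray syntax Julia had in 2012 differs); dgibbs_sketch, the chunk layout, and the reuse of gibbs_sketch from the earlier sketch are all my assumptions, not the code from SimpleGibbs.jl.

# Hedged sketch with the current DistributedArrays.jl API. Assumes
# gibbs_sketch (from the sketch near the top) is defined on every worker,
# e.g. via @everywhere include("gibbs_sketch.jl")  (hypothetical file).
using Distributed
addprocs(4)                      # or start Julia with `julia -p 4`
@everywhere using Distributions, DistributedArrays

# Each worker runs an independent chain over its own block of rows:
dgibbs_sketch(N, thin) =
    DArray(I -> gibbs_sketch(length(I[1]), thin), (N, 2), workers(), [nworkers(), 1])

# Gathering the result back into an ordinary Array is the extra step
# that dJGibbs1 pays for and dJGibbs2 skips:
# convert(Array, dgibbs_sketch(20000, 200))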