[R-sig-ME] parallel MCMCglmm, RNGstreams, starting values & priors

Mon Aug 25 18:14:14 CEST 2014

Hi Ruben,

You will need to provide over-dispersed starting values for  
multiple-chain convergence diagnostics to be useful (GLMM are so  
simple I am generally happy if the output of a single run looks  
reasonable).

With non-Gaussian data everything is Gibbs sampled conditional on the  
latent variables, so you only need to pass them:

l<-rnorm(200, -1, sqrt(1))
t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
y<-rpois(200,exp(l)-t)+1
# generate zero-truncated data with an intercept of -1

dat<-data.frame(y=y)
set.seed(1)
m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l))
# use true latent variable as starting values
set.seed(1)
m2<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=rnorm(200)))
# use some very bad starting values

plot(mcmc.list(m1$Sol, m2$Sol))
# not identical despite the same seed because of different starting  
values but clearly sampling the same posterior distribution:

gelman.diag(mcmc.list(m1$Sol, m2$Sol))

Cheers,

Jarrod

Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014  
18:00:08 +0200:

> Dear Jarrod,
>
> thanks for the quick reply. Please, don't waste time looking into  
> doMPI – I am happy that I
> get the expected result, when I specify that reproducible seed,  
> whyever that may be.
> I'm pretty sure that is the deciding factor, because I tested it  
> explicitly, I just have no idea
> how/why it interacts with the choice of family.
>
> That said, is setting up different RNG streams for my workers (now  
> that it works) __sufficient__
> so that I get independent chains and can use gelman.diag() for  
> convergence diagnostics?
> Or should I still tinker with the starting values myself?
> I've never found a worked example of supplying starting values and  
> am thus a bit lost.
>
> Sorry for sending further questions, I hope someone else takes pity while
> you're busy with lectures.
>
> Best wishes
>
> Ruben
>
>
>
> On 25 Aug 2014, at 17:29, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>
>> Hi Ruben,
>>
>> I do not think the issue is with the starting values, because even  
>> if the same starting values were used the chains would still differ  
>> because of the randomness in the Markov Chain (if I interpret your  
>> `identical' test correctly). This just involves a call to  
>> GetRNGstate() in the C++ code (L 871 ofMCMCglmm.cc) so I think for  
>> some reason doMPI/foreach is not doing what you expect. I am not  
>> familiar with doMPI and am in the middle of writing lectures so  
>> haven't got time to look into it carefully. Outside of the context  
>> of doMPI I get the behaviour I expect:
>>
>>
>> l<-rnorm(200, -1, sqrt(1))
>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>> y<-rpois(200,exp(l)-t)+1
>> # generate zero-truncated data with an intercept of -1
>>
>> dat<-data.frame(y=y)
>> set.seed(1)
>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>> set.seed(2)
>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>> set.seed(2)
>> m3<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>
>> plot(mcmc.list(m1$Sol, m2$Sol))
>> # different, as expected
>> plot(mcmc.list(m2$Sol, m3$Sol))
>> # the same, as expected
>>
>>
>>
>>
>>
>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014  
>> 16:58:06 +0200:
>>
>>> Dear list,
>>>
>>> sorry for bumping my old post, I hope to elicit a response with a  
>>> more focused question:
>>>
>>> When does MCMCglmm automatically start from different values when  
>>> using doMPI/foreach?
>>>
>>> I have done some tests with models of varying complexity. For  
>>> example, the script in my last
>>> post (using "zapoisson") yielded 40 identical chains:
>>>> identical(mcmclist[1], mcmclist[30])
>>> TRUE
>>>
>>> A simpler (?) model (using "ztpoisson" and no specified prior),  
>>> however, yielded different chains
>>> and I could use them to calculate gelman.diag()
>>>
>>> Changing my script to the version below, i.e. seeding foreach  
>>> using .options.mpi=list( seed= 1337)
>>> so as to make RNGstreams reproducible (or so I  thought), led to  
>>> different chains even for the
>>> "zapoisson" model.
>>>
>>> In no case have I (successfully) tried to supplant the default of  
>>> MCMCglmm's "start" argument.
>>> Is starting my models from different RNGsubstreams inadequate  
>>> compared to manipulating
>>> the start argument explicitly? If so, is there any worked example  
>>> of explicit starting value manipulation
>>> in parallel computation?
>>> I've browsed the MCMCglmm source to understand how the default  
>>> starting values are generated,
>>> but didn't find any differences with respect to RNG for the two  
>>> families "ztpoisson" and "zapoisson"
>>> (granted, I did not dig very deep).
>>>
>>> Best regards,
>>>
>>> Ruben Arslan
>>>
>>>
>>> # bsub -q mpi -W 12:00 -n 41 -R np20 mpirun -H localhost -n 41 R  
>>> --slave -f  
>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_children_parallel.R"
>>>
>>> library(doMPI)
>>> cl <-  
>>> startMPIcluster(verbose=T,workdir="/usr/users/rarslan/rpqa/rpqa_main/")
>>> registerDoMPI(cl)
>>> Children_mcmc1 = foreach(i=1:clusterSize(cl),.options.mpi =  
>>> list(seed=1337) ) %dopar% {
>>> 	library(MCMCglmm);library(data.table)
>>> 	load("/usr/users/rarslan/rpqa/rpqa1.rdata")
>>>
>>> 	nitt = 130000; thin = 100; burnin = 30000
>>> 	prior.m5d.2 = list(
>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>> 	)
>>>
>>> 	rpqa.1 = na.omit(rpqa.1[spouses>0, list(idParents, children,  
>>> male, urban, spouses, paternalage.mean, paternalage.factor)])
>>> 	(m1 = MCMCglmm( children ~ trait * (male + urban + spouses +  
>>> paternalage.mean + paternalage.factor),
>>> 						rcov=~us(trait):units,
>>> 						random=~us(trait):idParents,
>>> 						family="zapoisson",
>>> 						prior = prior.m5d.2,
>>> 						data=rpqa.1,
>>> 						pr = F, saveX = F, saveZ = F,
>>> 						nitt=nitt,thin=thin,burnin=burnin))
>>> }
>>>
>>> library(coda)
>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>> save(Children_mcmc1,mcmclist, file =  
>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_mcmc_kids_za.rdata")
>>> closeCluster(cl)
>>> mpi.quit()
>>>
>>>
>>>
>>> On 04 Aug 2014, at 20:25, Ruben Arslan <rubenarslan at gmail.com> wrote:
>>>
>>>> Dear list,
>>>>
>>>> would someone be willing to share her or his efforts in  
>>>> parallelising a MCMCglmm analysis?
>>>>
>>>> I had something viable using harvestr that seemed to properly initialise
>>>> the starting values from different random number streams (which  
>>>> is desirable,
>>>> as far as I could find out), but I ended up being unable to use  
>>>> harvestr, because
>>>> it uses an old version of plyr, where parallelisation works only  
>>>> for multicore, not for
>>>> MPI.
>>>>
>>>> I pasted my working version, that does not do anything about  
>>>> starting values or RNG
>>>> at the end of this email. I can try to fumble further in the dark  
>>>> or try to update harvestr,
>>>> but maybe someone has gone through all this already.
>>>>
>>>> I'd also appreciate any tips for elegantly post-processing such  
>>>> parallel data, as some of my usual
>>>> extraction functions and routines are hampered by the fact that  
>>>> some coda functions
>>>> do not aggregate results over chains. (What I get from a  
>>>> single-chain summary in MCMCglmm
>>>> is a bit more comprehensive, than what I managed to cobble  
>>>> together with my own extraction
>>>> functions).
>>>>
>>>> The reason I'm parallelising my analyses is that I'm having  
>>>> trouble getting a good effective
>>>> sample size for any parameter having to do with the many zeroes  
>>>> in my data.
>>>> Any pointers are very appreciated, I'm quite inexperienced with MCMCglmm.
>>>>
>>>> Best wishes
>>>>
>>>> Ruben
>>>>
>>>> # bsub -q mpi-short -W 2:00 -n 42 -R np20 mpirun -H localhost -n  
>>>> 41 R --slave -f "rpqa/rpqa_main/rpqa_children_parallel.r"
>>>> library(doMPI)
>>>> cl <- startMPIcluster()
>>>> registerDoMPI(cl)
>>>> Children_mcmc1 = foreach(i=1:40) %dopar% {
>>>> 	library(MCMCglmm)
>>>> 	load("rpqa1.rdata")
>>>>
>>>> 	nitt = 40000; thin = 100; burnin = 10000
>>>> 	prior = list(
>>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>> 	)
>>>>
>>>> 	MCMCglmm( children ~ trait -1 + at.level(trait,1):male +  
>>>> at.level(trait,1):urban + at.level(trait,1):spouses +  
>>>> at.level(trait,1):paternalage.mean +  
>>>> at.level(trait,1):paternalage.factor,
>>>> 		rcov=~us(trait):units,
>>>> 		random=~us(trait):idParents,
>>>> 		family="zapoisson",
>>>> 		prior = prior,
>>>> 		data=rpqa.1,
>>>> 		pr = F, saveX = T, saveZ = T,
>>>> 		nitt=nitt,thin=thin,burnin=burnin)
>>>> }
>>>>
>>>> library(coda)
>>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>> save(Children_mcmc1,mcmclist, file = "rpqa_mcmc_kids_za.rdata")
>>>> closeCluster(cl)
>>>> mpi.quit()
>>>>
>>>>
>>>> --
>>>> Ruben C. Arslan
>>>>
>>>> Georg August University G�ttingen
>>>> Biological Personality Psychology and Psychological Assessment
>>>> Georg Elias M�ller Institute of Psychology
>>>> Go�lerstr. 14
>>>> 37073 G�ttingen
>>>> Germany
>>>> Tel.: +49 551 3920704
>>>> https://psych.uni-goettingen.de/en/biopers/team/arslan
>>>>
>>>>
>>>>
>>>
>>>
>>> 	[[alternative HTML version deleted]]
>>>
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>
>

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.