[R-sig-ME] parallel MCMCglmm, RNGstreams, starting values & priors

Mon Aug 25 18:29:36 CEST 2014

Hi Ruben,

Sorry  - I was wrong when I said that everything is Gibbs sampled  
conditional on the latent variables. The location effects (fixed and  
random effects) are also sampled conditional on the (co)variance  
components so you should add them to the starting values. In the case  
where the true values are used:

m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l,R=1))

Cheers,

Jarrod

Quoting Jarrod Hadfield <j.hadfield at ed.ac.uk> on Mon, 25 Aug 2014  
17:14:14 +0100:

> Hi Ruben,
>
> You will need to provide over-dispersed starting values for  
> multiple-chain convergence diagnostics to be useful (GLMM are so  
> simple I am generally happy if the output of a single run looks  
> reasonable).
>
> With non-Gaussian data everything is Gibbs sampled conditional on  
> the latent variables, so you only need to pass them:
>
> l<-rnorm(200, -1, sqrt(1))
> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
> y<-rpois(200,exp(l)-t)+1
> # generate zero-truncated data with an intercept of -1
>
> dat<-data.frame(y=y)
> set.seed(1)
> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l))
> # use true latent variable as starting values
> set.seed(1)
> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=rnorm(200)))
> # use some very bad starting values
>
> plot(mcmc.list(m1$Sol, m2$Sol))
> # not identical despite the same seed because of different starting  
> values but clearly sampling the same posterior distribution:
>
> gelman.diag(mcmc.list(m1$Sol, m2$Sol))
>
> Cheers,
>
> Jarrod
>
> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014  
> 18:00:08 +0200:
>
>> Dear Jarrod,
>>
>> thanks for the quick reply. Please, don't waste time looking into  
>> doMPI – I am happy that I
>> get the expected result, when I specify that reproducible seed,  
>> whyever that may be.
>> I'm pretty sure that is the deciding factor, because I tested it  
>> explicitly, I just have no idea
>> how/why it interacts with the choice of family.
>>
>> That said, is setting up different RNG streams for my workers (now  
>> that it works) __sufficient__
>> so that I get independent chains and can use gelman.diag() for  
>> convergence diagnostics?
>> Or should I still tinker with the starting values myself?
>> I've never found a worked example of supplying starting values and  
>> am thus a bit lost.
>>
>> Sorry for sending further questions, I hope someone else takes pity while
>> you're busy with lectures.
>>
>> Best wishes
>>
>> Ruben
>>
>>
>>
>> On 25 Aug 2014, at 17:29, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>>
>>> Hi Ruben,
>>>
>>> I do not think the issue is with the starting values, because even  
>>> if the same starting values were used the chains would still  
>>> differ because of the randomness in the Markov Chain (if I  
>>> interpret your `identical' test correctly). This just involves a  
>>> call to GetRNGstate() in the C++ code (L 871 ofMCMCglmm.cc) so I  
>>> think for some reason doMPI/foreach is not doing what you expect.  
>>> I am not familiar with doMPI and am in the middle of writing  
>>> lectures so haven't got time to look into it carefully. Outside of  
>>> the context of doMPI I get the behaviour I expect:
>>>
>>>
>>> l<-rnorm(200, -1, sqrt(1))
>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>> y<-rpois(200,exp(l)-t)+1
>>> # generate zero-truncated data with an intercept of -1
>>>
>>> dat<-data.frame(y=y)
>>> set.seed(1)
>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>> set.seed(2)
>>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>> set.seed(2)
>>> m3<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>
>>> plot(mcmc.list(m1$Sol, m2$Sol))
>>> # different, as expected
>>> plot(mcmc.list(m2$Sol, m3$Sol))
>>> # the same, as expected
>>>
>>>
>>>
>>>
>>>
>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014  
>>> 16:58:06 +0200:
>>>
>>>> Dear list,
>>>>
>>>> sorry for bumping my old post, I hope to elicit a response with a  
>>>> more focused question:
>>>>
>>>> When does MCMCglmm automatically start from different values when  
>>>> using doMPI/foreach?
>>>>
>>>> I have done some tests with models of varying complexity. For  
>>>> example, the script in my last
>>>> post (using "zapoisson") yielded 40 identical chains:
>>>>> identical(mcmclist[1], mcmclist[30])
>>>> TRUE
>>>>
>>>> A simpler (?) model (using "ztpoisson" and no specified prior),  
>>>> however, yielded different chains
>>>> and I could use them to calculate gelman.diag()
>>>>
>>>> Changing my script to the version below, i.e. seeding foreach  
>>>> using .options.mpi=list( seed= 1337)
>>>> so as to make RNGstreams reproducible (or so I  thought), led to  
>>>> different chains even for the
>>>> "zapoisson" model.
>>>>
>>>> In no case have I (successfully) tried to supplant the default of  
>>>> MCMCglmm's "start" argument.
>>>> Is starting my models from different RNGsubstreams inadequate  
>>>> compared to manipulating
>>>> the start argument explicitly? If so, is there any worked example  
>>>> of explicit starting value manipulation
>>>> in parallel computation?
>>>> I've browsed the MCMCglmm source to understand how the default  
>>>> starting values are generated,
>>>> but didn't find any differences with respect to RNG for the two  
>>>> families "ztpoisson" and "zapoisson"
>>>> (granted, I did not dig very deep).
>>>>
>>>> Best regards,
>>>>
>>>> Ruben Arslan
>>>>
>>>>
>>>> # bsub -q mpi -W 12:00 -n 41 -R np20 mpirun -H localhost -n 41 R  
>>>> --slave -f  
>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_children_parallel.R"
>>>>
>>>> library(doMPI)
>>>> cl <-  
>>>> startMPIcluster(verbose=T,workdir="/usr/users/rarslan/rpqa/rpqa_main/")
>>>> registerDoMPI(cl)
>>>> Children_mcmc1 = foreach(i=1:clusterSize(cl),.options.mpi =  
>>>> list(seed=1337) ) %dopar% {
>>>> 	library(MCMCglmm);library(data.table)
>>>> 	load("/usr/users/rarslan/rpqa/rpqa1.rdata")
>>>>
>>>> 	nitt = 130000; thin = 100; burnin = 30000
>>>> 	prior.m5d.2 = list(
>>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>> 	)
>>>>
>>>> 	rpqa.1 = na.omit(rpqa.1[spouses>0, list(idParents, children,  
>>>> male, urban, spouses, paternalage.mean, paternalage.factor)])
>>>> 	(m1 = MCMCglmm( children ~ trait * (male + urban + spouses +  
>>>> paternalage.mean + paternalage.factor),
>>>> 						rcov=~us(trait):units,
>>>> 						random=~us(trait):idParents,
>>>> 						family="zapoisson",
>>>> 						prior = prior.m5d.2,
>>>> 						data=rpqa.1,
>>>> 						pr = F, saveX = F, saveZ = F,
>>>> 						nitt=nitt,thin=thin,burnin=burnin))
>>>> }
>>>>
>>>> library(coda)
>>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>> save(Children_mcmc1,mcmclist, file =  
>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_mcmc_kids_za.rdata")
>>>> closeCluster(cl)
>>>> mpi.quit()
>>>>
>>>>
>>>>
>>>> On 04 Aug 2014, at 20:25, Ruben Arslan <rubenarslan at gmail.com> wrote:
>>>>
>>>>> Dear list,
>>>>>
>>>>> would someone be willing to share her or his efforts in  
>>>>> parallelising a MCMCglmm analysis?
>>>>>
>>>>> I had something viable using harvestr that seemed to properly initialise
>>>>> the starting values from different random number streams (which  
>>>>> is desirable,
>>>>> as far as I could find out), but I ended up being unable to use  
>>>>> harvestr, because
>>>>> it uses an old version of plyr, where parallelisation works only  
>>>>> for multicore, not for
>>>>> MPI.
>>>>>
>>>>> I pasted my working version, that does not do anything about  
>>>>> starting values or RNG
>>>>> at the end of this email. I can try to fumble further in the  
>>>>> dark or try to update harvestr,
>>>>> but maybe someone has gone through all this already.
>>>>>
>>>>> I'd also appreciate any tips for elegantly post-processing such  
>>>>> parallel data, as some of my usual
>>>>> extraction functions and routines are hampered by the fact that  
>>>>> some coda functions
>>>>> do not aggregate results over chains. (What I get from a  
>>>>> single-chain summary in MCMCglmm
>>>>> is a bit more comprehensive, than what I managed to cobble  
>>>>> together with my own extraction
>>>>> functions).
>>>>>
>>>>> The reason I'm parallelising my analyses is that I'm having  
>>>>> trouble getting a good effective
>>>>> sample size for any parameter having to do with the many zeroes  
>>>>> in my data.
>>>>> Any pointers are very appreciated, I'm quite inexperienced with MCMCglmm.
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Ruben
>>>>>
>>>>> # bsub -q mpi-short -W 2:00 -n 42 -R np20 mpirun -H localhost -n  
>>>>> 41 R --slave -f "rpqa/rpqa_main/rpqa_children_parallel.r"
>>>>> library(doMPI)
>>>>> cl <- startMPIcluster()
>>>>> registerDoMPI(cl)
>>>>> Children_mcmc1 = foreach(i=1:40) %dopar% {
>>>>> 	library(MCMCglmm)
>>>>> 	load("rpqa1.rdata")
>>>>>
>>>>> 	nitt = 40000; thin = 100; burnin = 10000
>>>>> 	prior = list(
>>>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>>> 	)
>>>>>
>>>>> 	MCMCglmm( children ~ trait -1 + at.level(trait,1):male +  
>>>>> at.level(trait,1):urban + at.level(trait,1):spouses +  
>>>>> at.level(trait,1):paternalage.mean +  
>>>>> at.level(trait,1):paternalage.factor,
>>>>> 		rcov=~us(trait):units,
>>>>> 		random=~us(trait):idParents,
>>>>> 		family="zapoisson",
>>>>> 		prior = prior,
>>>>> 		data=rpqa.1,
>>>>> 		pr = F, saveX = T, saveZ = T,
>>>>> 		nitt=nitt,thin=thin,burnin=burnin)
>>>>> }
>>>>>
>>>>> library(coda)
>>>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>>> save(Children_mcmc1,mcmclist, file = "rpqa_mcmc_kids_za.rdata")
>>>>> closeCluster(cl)
>>>>> mpi.quit()
>>>>>
>>>>>
>>>>> --
>>>>> Ruben C. Arslan
>>>>>
>>>>> Georg August University G�ttingen
>>>>> Biological Personality Psychology and Psychological Assessment
>>>>> Georg Elias M�ller Institute of Psychology
>>>>> Go�lerstr. 14
>>>>> 37073 G�ttingen
>>>>> Germany
>>>>> Tel.: +49 551 3920704
>>>>> https://psych.uni-goettingen.de/en/biopers/team/arslan
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> 	[[alternative HTML version deleted]]
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>
>>
>
>
>
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.