[R-sig-ME] parallel MCMCglmm, RNGstreams, starting values & priors
Jarrod Hadfield
j.hadfield at ed.ac.uk
Tue Aug 26 13:04:34 CEST 2014
Hi Ruben,
There are 400 liabilities in a zapoisson model (2 per datum). This
code should work:
g <- sample(letters[1:10], size = 200, replace = TRUE)
pred <- rnorm(200)
l1 <- rnorm(200, -1, sqrt(1))
l2 <- rnorm(200, -1, sqrt(1))
y <- VGAM::rzapois(200, exp(l1), exp(-exp(l2)))
# generate zero-altered data with an intercept of -1 (because the
# intercept and variance are the same for both processes this is just
# standard Poisson)
dat <- data.frame(y = y, g = g, pred = pred)
start.1 <- list(liab = c(l1, l2), R = list(R1 = diag(2)), G = list(G1 = diag(2)))
prior.1 <- list(R = list(V = diag(2), nu = 1.002, fix = 2),
                G = list(G1 = list(V = diag(2), nu = 2, alpha.mu = c(0, 0),
                                   alpha.V = diag(2) * 1000)))
m1 <- MCMCglmm(y ~ trait + pred:trait, random = ~us(trait):g,
               family = "zapoisson", rcov = ~idh(trait):units, data = dat,
               prior = prior.1, start = start.1)
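
For the parallel-chain use case, the same pattern as in the ztpoisson
example further down the thread should carry over here: start a second
chain from over-dispersed liabilities and compare the chains with
gelman.diag(). A minimal sketch (the rnorm(400) starting liabilities
and set.seed(2) are just illustrative choices):

library(coda)
set.seed(2)
start.2 <- list(liab = rnorm(400), R = list(R1 = diag(2)), G = list(G1 = diag(2)))
m2 <- MCMCglmm(y ~ trait + pred:trait, random = ~us(trait):g,
               family = "zapoisson", rcov = ~idh(trait):units, data = dat,
               prior = prior.1, start = start.2)
# chains started from different liabilities but sampling the same posterior
gelman.diag(mcmc.list(m1$Sol, m2$Sol))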
However, there are two bugs in the current version of MCMCglmm that
cause an error message where the documentation implies the call should
be fine:
a) it should be possible to have R=diag(2) rather than R =
list(R1=diag(2)). This bug cropped up when I implemented
block-diagonal R structures. It can be fixed by inserting:

if(!is.list(start$R)){
  start$R <- list(R1 = start$R)
}

on L514 of MCMCglmm.R, below

if(!is.list(prior$R[[1]])){
  prior$R <- list(R1 = prior$R)
}
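
With that patch in place, the list(R1 = ...) wrapper used in start.1
above should no longer be necessary; something like this (an untested
sketch) should then be accepted directly:

# with the patch, start$R can be a matrix rather than list(R1 = ...)
start.1b <- list(liab = c(l1, l2), R = diag(2), G = list(G1 = diag(2)))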
b) rcov=~trait:units models for zi/za models will return an error when
passing starting values. To fix this, insert

if(diagR==3){
  if(dim(start)[1]!=1){
    stop("V is the wrong dimension for some start$G/start$R elements")
  }
  start <- diag(sum(nfl)) * start[1]
}

on L90 of priorformat.R, below

if(is.matrix(start)==FALSE){
  start <- as.matrix(start)
}
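
To make the intent of that branch concrete: the diagR==3 case appears
to expect a single (scalar) residual starting value, which the inserted
code expands to the diagonal matrix used internally. A self-contained
base-R sketch of just that expansion (nfl and the 0.5 are made-up
stand-ins, not the actual values inside priorformat.R):

nfl <- c(1, 1)            # hypothetical stand-in for the dimensions used internally
start <- as.matrix(0.5)   # the scalar residual starting variance the user supplied
if (dim(start)[1] != 1) {
  stop("V is the wrong dimension for some start$G/start$R elements")
}
start <- diag(sum(nfl)) * start[1]
start                     # a 2 x 2 diagonal matrix with 0.5 on the diagonal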
I will put these in the new version.
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014
21:52:30 +0200:
> Hi Jarrod,
>
> thanks for these pointers.
>
>>> You will need to provide over-dispersed starting values for
>>> multiple-chain convergence diagnostics to be useful (GLMM are so
>>> simple I am generally happy if the output of a single run looks
>>> reasonable).
>
> Oh, I would be happy with single chains, but since computation would
> take weeks that way, I wanted to parallelise, and I planned to use
> multi-chain convergence as a check that my parallelisation was proper
> and that the combined chains are as informative as a single long chain.
> There don't seem to be any such checks built in – I was analysing my
> 40 chains for a bit longer than I'd like to admit before I noticed
> they were identical (effectiveSize and summary.mcmc.list did not yell
> at me for this).
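>
> A minimal sketch of such a sanity check before trusting gelman.diag(),
> using the Children_mcmc1 list produced by the script further down
> (the check itself is only illustrative):
>
> library(coda)
> mcmclist <- mcmc.list(lapply(Children_mcmc1, function(x) x$Sol))
> # flag chains that are bit-for-bit identical to the first chain
> dupes <- sapply(mcmclist[-1], function(ch) identical(ch, mcmclist[[1]]))
> if (any(dupes)) stop("some chains are identical - check the RNG seeding")
> gelman.diag(mcmclist)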
>
>>> # use some very bad starting values
> I get that these values are bad, but that is the goal for my
> multi-chain aim, right?
>
> I can apply this to my zero-truncated model, but am again getting
> stuck with the zero-altered one.
> Maybe I need only specify the Liab values for this?
> At least I'm getting nowhere with specifying R and G starting values
> here. When I got an error, I always
> went to the MCMCglmm source to understand why the checks failed, but
> I didn't always understand
> what was being checked and couldn't get it to work.
>
> Here's a failing example:
>
> l<-rnorm(200, -1, sqrt(1))
> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
> g = sample(letters[1:10], size = 200, replace = T)
> pred = rnorm(200)
> y<-rpois(200,exp(l)-t)
> y[1:40] = 0
> # generate zero-altered data with an intercept of -1
>
> dat<-data.frame(y=y, g = g, pred = pred)
> set.seed(1)
> start_true = list(Liab=l, R = 1, G = 1 )
> m1<-MCMCglmm(y~1 + pred,random = ~ g,
> family="zapoisson",rcov=~us(trait):units, data=dat, start= start_true)
>
> # use true latent variable as starting values
> set.seed(1)
> # use some very bad starting values
> start_rand = list(Liab=rnorm(200), R = rpois(1, 1)+1, G = rpois(1, 1)+1 )
> m2<-MCMCglmm(y~1 + pred,random = ~ g,rcov=~us(trait):units,
> family="zapoisson", data=dat, start = start_rand)
>
> Best,
>
> Ruben
>
> On 25 Aug 2014, at 18:29, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>
>> Hi Ruben,
>>
>> Sorry - I was wrong when I said that everything is Gibbs sampled
>> conditional on the latent variables. The location effects (fixed
>> and random effects) are also sampled conditional on the
>> (co)variance components so you should add them to the starting
>> values. In the case where the true values are used:
>>
>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l,R=1))
>>
>> Cheers,
>>
>> Jarrod
>>
>>
>>
>> Quoting Jarrod Hadfield <j.hadfield at ed.ac.uk> on Mon, 25 Aug 2014
>> 17:14:14 +0100:
>>
>>> Hi Ruben,
>>>
>>> You will need to provide over-dispersed starting values for
>>> multiple-chain convergence diagnostics to be useful (GLMM are so
>>> simple I am generally happy if the output of a single run looks
>>> reasonable).
>>>
>>> With non-Gaussian data everything is Gibbs sampled conditional on
>>> the latent variables, so you only need to pass them:
>>>
>>> l<-rnorm(200, -1, sqrt(1))
>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>> y<-rpois(200,exp(l)-t)+1
>>> # generate zero-truncated data with an intercept of -1
>>>
>>> dat<-data.frame(y=y)
>>> set.seed(1)
>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l))
>>> # use true latent variable as starting values
>>> set.seed(1)
>>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat,
>>> start=list(Liab=rnorm(200)))
>>> # use some very bad starting values
>>>
>>> plot(mcmc.list(m1$Sol, m2$Sol))
>>> # not identical despite the same seed because of different
>>> # starting values, but clearly sampling the same posterior
>>> # distribution:
>>>
>>> gelman.diag(mcmc.list(m1$Sol, m2$Sol))
>>>
>>> Cheers,
>>>
>>> Jarrod
>>>
>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014
>>> 18:00:08 +0200:
>>>
>>>> Dear Jarrod,
>>>>
>>>> thanks for the quick reply. Please don't waste time looking into
>>>> doMPI – I am happy that I get the expected result when I specify
>>>> that reproducible seed, whatever the reason may be.
>>>> I'm pretty sure that is the deciding factor, because I tested it
>>>> explicitly; I just have no idea how or why it interacts with the
>>>> choice of family.
>>>>
>>>> That said, is setting up different RNG streams for my workers
>>>> (now that it works) __sufficient__
>>>> so that I get independent chains and can use gelman.diag() for
>>>> convergence diagnostics?
>>>> Or should I still tinker with the starting values myself?
>>>> I've never found a worked example of supplying starting values
>>>> and am thus a bit lost.
>>>>
>>>> Sorry for sending further questions, I hope someone else takes pity while
>>>> you're busy with lectures.
>>>>
>>>> Best wishes
>>>>
>>>> Ruben
>>>>
>>>>
>>>>
>>>> On 25 Aug 2014, at 17:29, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>>>>
>>>>> Hi Ruben,
>>>>>
>>>>> I do not think the issue is with the starting values, because
>>>>> even if the same starting values were used the chains would
>>>>> still differ because of the randomness in the Markov Chain (if I
>>>>> interpret your `identical' test correctly). This just involves a
>>>>> call to GetRNGstate() in the C++ code (L871 of MCMCglmm.cc), so I
>>>>> think for some reason doMPI/foreach is not doing what you
>>>>> expect. I am not familiar with doMPI and am in the middle of
>>>>> writing lectures so haven't got time to look into it carefully.
>>>>> Outside of the context of doMPI I get the behaviour I expect:
>>>>>
>>>>>
>>>>> l<-rnorm(200, -1, sqrt(1))
>>>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>>>> y<-rpois(200,exp(l)-t)+1
>>>>> # generate zero-truncated data with an intercept of -1
>>>>>
>>>>> dat<-data.frame(y=y)
>>>>> set.seed(1)
>>>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>> set.seed(2)
>>>>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>> set.seed(2)
>>>>> m3<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>>
>>>>> plot(mcmc.list(m1$Sol, m2$Sol))
>>>>> # different, as expected
>>>>> plot(mcmc.list(m2$Sol, m3$Sol))
>>>>> # the same, as expected
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014
>>>>> 16:58:06 +0200:
>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> sorry for bumping my old post, I hope to elicit a response with
>>>>>> a more focused question:
>>>>>>
>>>>>> When does MCMCglmm automatically start from different values
>>>>>> when using doMPI/foreach?
>>>>>>
>>>>>> I have done some tests with models of varying complexity. For
>>>>>> example, the script in my last
>>>>>> post (using "zapoisson") yielded 40 identical chains:
>>>>>>> identical(mcmclist[1], mcmclist[30])
>>>>>> TRUE
>>>>>>
>>>>>> A simpler (?) model (using "ztpoisson" and no specified prior),
>>>>>> however, yielded different chains
>>>>>> and I could use them to calculate gelman.diag()
>>>>>>
>>>>>> Changing my script to the version below, i.e. seeding foreach
>>>>>> using .options.mpi = list(seed = 1337) so as to make the RNG
>>>>>> streams reproducible (or so I thought), led to different chains
>>>>>> even for the "zapoisson" model.
>>>>>>
>>>>>> In no case have I (successfully) tried to override the default
>>>>>> of MCMCglmm's "start" argument.
>>>>>> Is starting my models from different RNG substreams inadequate
>>>>>> compared to manipulating the start argument explicitly? If so,
>>>>>> is there any worked example of explicit starting value
>>>>>> manipulation in parallel computation?
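>>>>>>
>>>>>> One possible shape for such a worked example, a sketch only: it
>>>>>> reuses the simulated dat and prior.1 from Jarrod's reply at the top
>>>>>> of this thread (which may need to be loaded on, or exported to, the
>>>>>> workers, e.g. via .export or a load() call as in the scripts below),
>>>>>> and each worker draws its own over-dispersed starting liabilities
>>>>>> from its RNG stream:
>>>>>>
>>>>>> library(doMPI); library(coda)
>>>>>> cl <- startMPIcluster()
>>>>>> registerDoMPI(cl)
>>>>>> chains <- foreach(i = 1:clusterSize(cl),
>>>>>>   .options.mpi = list(seed = 1337)) %dopar% {
>>>>>>   library(MCMCglmm)
>>>>>>   # over-dispersed starting liabilities, 2 per datum for zapoisson;
>>>>>>   # the list(R1 = ...) wrapper matches the workaround used above
>>>>>>   start.i <- list(liab = rnorm(2 * nrow(dat)),
>>>>>>                   R = list(R1 = diag(2)), G = list(G1 = diag(2)))
>>>>>>   MCMCglmm(y ~ trait + pred:trait, random = ~us(trait):g,
>>>>>>            family = "zapoisson", rcov = ~idh(trait):units,
>>>>>>            data = dat, prior = prior.1, start = start.i)
>>>>>> }
>>>>>> gelman.diag(mcmc.list(lapply(chains, function(x) x$Sol)))
>>>>>> closeCluster(cl)
>>>>>>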
>>>>>> I've browsed the MCMCglmm source to understand how the default
>>>>>> starting values are generated,
>>>>>> but didn't find any differences with respect to RNG for the two
>>>>>> families "ztpoisson" and "zapoisson"
>>>>>> (granted, I did not dig very deep).
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Ruben Arslan
>>>>>>
>>>>>>
>>>>>> # bsub -q mpi -W 12:00 -n 41 -R np20 mpirun -H localhost -n 41
>>>>>> R --slave -f
>>>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_children_parallel.R"
>>>>>>
>>>>>> library(doMPI)
>>>>>> cl <-
>>>>>> startMPIcluster(verbose=T,workdir="/usr/users/rarslan/rpqa/rpqa_main/")
>>>>>> registerDoMPI(cl)
>>>>>> Children_mcmc1 = foreach(i=1:clusterSize(cl),.options.mpi =
>>>>>> list(seed=1337) ) %dopar% {
>>>>>> library(MCMCglmm);library(data.table)
>>>>>> load("/usr/users/rarslan/rpqa/rpqa1.rdata")
>>>>>>
>>>>>> nitt = 130000; thin = 100; burnin = 30000
>>>>>> prior.m5d.2 = list(
>>>>>> R = list(V = diag(c(1,1)), nu = 0.002),
>>>>>> G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>>>> )
>>>>>>
>>>>>> rpqa.1 = na.omit(rpqa.1[spouses>0, list(idParents, children,
>>>>>> male, urban, spouses, paternalage.mean, paternalage.factor)])
>>>>>> (m1 = MCMCglmm( children ~ trait * (male + urban + spouses +
>>>>>> paternalage.mean + paternalage.factor),
>>>>>> rcov=~us(trait):units,
>>>>>> random=~us(trait):idParents,
>>>>>> family="zapoisson",
>>>>>> prior = prior.m5d.2,
>>>>>> data=rpqa.1,
>>>>>> pr = F, saveX = F, saveZ = F,
>>>>>> nitt=nitt,thin=thin,burnin=burnin))
>>>>>> }
>>>>>>
>>>>>> library(coda)
>>>>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>>>> save(Children_mcmc1,mcmclist, file =
>>>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_mcmc_kids_za.rdata")
>>>>>> closeCluster(cl)
>>>>>> mpi.quit()
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04 Aug 2014, at 20:25, Ruben Arslan <rubenarslan at gmail.com> wrote:
>>>>>>
>>>>>>> Dear list,
>>>>>>>
>>>>>>> would someone be willing to share her or his efforts in
>>>>>>> parallelising a MCMCglmm analysis?
>>>>>>>
>>>>>>> I had something viable using harvestr that seemed to properly
>>>>>>> initialise
>>>>>>> the starting values from different random number streams
>>>>>>> (which is desirable,
>>>>>>> as far as I could find out), but I ended up being unable to
>>>>>>> use harvestr, because
>>>>>>> it uses an old version of plyr, where parallelisation works
>>>>>>> only for multicore, not for
>>>>>>> MPI.
>>>>>>>
>>>>>>> I pasted my working version, which does not do anything about
>>>>>>> starting values or RNG, at the end of this email. I can try to
>>>>>>> fumble further in the dark or try to update harvestr, but maybe
>>>>>>> someone has gone through all this already.
>>>>>>>
>>>>>>> I'd also appreciate any tips for elegantly post-processing
>>>>>>> such parallel data, as some of my usual
>>>>>>> extraction functions and routines are hampered by the fact
>>>>>>> that some coda functions
>>>>>>> do not aggregate results over chains. (What I get from a
>>>>>>> single-chain summary in MCMCglmm is a bit more comprehensive
>>>>>>> than what I managed to cobble together with my own extraction
>>>>>>> functions.)
>>>>>>>
>>>>>>> The reason I'm parallelising my analyses is that I'm having
>>>>>>> trouble getting a good effective
>>>>>>> sample size for any parameter having to do with the many
>>>>>>> zeroes in my data.
>>>>>>> Any pointers are very appreciated, I'm quite inexperienced
>>>>>>> with MCMCglmm.
>>>>>>>
>>>>>>> Best wishes
>>>>>>>
>>>>>>> Ruben
>>>>>>>
>>>>>>> # bsub -q mpi-short -W 2:00 -n 42 -R np20 mpirun -H localhost
>>>>>>> -n 41 R --slave -f "rpqa/rpqa_main/rpqa_children_parallel.r"
>>>>>>> library(doMPI)
>>>>>>> cl <- startMPIcluster()
>>>>>>> registerDoMPI(cl)
>>>>>>> Children_mcmc1 = foreach(i=1:40) %dopar% {
>>>>>>> library(MCMCglmm)
>>>>>>> load("rpqa1.rdata")
>>>>>>>
>>>>>>> nitt = 40000; thin = 100; burnin = 10000
>>>>>>> prior = list(
>>>>>>> R = list(V = diag(c(1,1)), nu = 0.002),
>>>>>>> G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>>>>> )
>>>>>>>
>>>>>>> MCMCglmm( children ~ trait -1 + at.level(trait,1):male +
>>>>>>> at.level(trait,1):urban + at.level(trait,1):spouses +
>>>>>>> at.level(trait,1):paternalage.mean +
>>>>>>> at.level(trait,1):paternalage.factor,
>>>>>>> rcov=~us(trait):units,
>>>>>>> random=~us(trait):idParents,
>>>>>>> family="zapoisson",
>>>>>>> prior = prior,
>>>>>>> data=rpqa.1,
>>>>>>> pr = F, saveX = T, saveZ = T,
>>>>>>> nitt=nitt,thin=thin,burnin=burnin)
>>>>>>> }
>>>>>>>
>>>>>>> library(coda)
>>>>>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>>>>> save(Children_mcmc1,mcmclist, file = "rpqa_mcmc_kids_za.rdata")
>>>>>>> closeCluster(cl)
>>>>>>> mpi.quit()
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ruben C. Arslan
>>>>>>>
>>>>>>> Georg August University Göttingen
>>>>>>> Biological Personality Psychology and Psychological Assessment
>>>>>>> Georg Elias Müller Institute of Psychology
>>>>>>> Goßlerstr. 14
>>>>>>> 37073 Göttingen
>>>>>>> Germany
>>>>>>> Tel.: +49 551 3920704
>>>>>>> https://psych.uni-goettingen.de/en/biopers/team/arslan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.