Cognitive Psychology and Emotions
Subjective notes of a psychology graduate student
Posts in the "R" category

Running Julia’s MixedModels from R

30 June 2015, data analysis

Today I had to rerun some scripts to test additional factors with mixed regression. For those of you who haven't been using it, mixed regression (aka mixed-effects regression, hierarchical modeling, or multilevel modeling) is a new sexy way to analyze your data when you have repeated measures (see this or this for an intro). Despite its awesomeness, mixed regression has one major drawback: the computations are heavy, and your average PC is slow at them. I am used to doing statistics interactively, and leaving R to do its stuff even for 10 minutes is frustrating. Leaving it for a couple of hours is definitely too much.

So today, while lme4 was doing its work, I decided that it's finally time to test Julia's magic powers. If mixed models are new and sexy, then Julia is a kind of unkempt but smart and agile teenager (I hope my attempts at metaphors are not ruined by my "far from perfect" English and you get the point). It is a three-year-old programming language designed to be a bit like R, but much faster. What's more interesting, Douglas Bates, one of the main authors of the lme4 package in R, has been working for quite some time on a MixedModels package for Julia. The package is still quite underdeveloped and lacks some features of lme4, but the speed, the speed is already amazing. Given that you can run Julia from R with rjulia, one can use the best from both worlds. I made some tests of how fast MixedModels is compared to lme4.
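Before the numbers, here is a minimal sketch of the plumbing, i.e. what calling MixedModels through rjulia roughly looks like. The function names (julia_init, julia_eval, r2j, j2r) are from memory of the rjulia API, so treat them as assumptions and check the rjulia README; the Julia-side call, fit(lmm(...)), is the same one you'll see my wrapper generate further below.

library(rjulia)
julia_init()                          # start the embedded Julia
julia_eval("using MixedModels")
# push the data to Julia under the name `mm` (r2j's signature is an assumption)
r2j(ds_data[correct == 1, ], "mm")
julia_eval("mod_fit = fit(lmm(lrt ~ targetDist + (1|participant), mm))")
j2r("fixef(mod_fit)")                 # pull the fixed effects back into R

And now the results: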

> ptm <- proc.time()
> lmer_with_julia(lrt~poly(targetDist,2)+poly(distr_shift,2)*prev_dsdf+(1|participant),ds_data[correct==1&dsdf==15,])
 
                                     Estimate   StdError           Z
(Intercept)                        6.69235953 0.03986305 167.8837895
poly(targetDist, 2)1              -0.93507318 0.49478109  -1.8898725
poly(targetDist, 2)2               0.67376146 0.45051646   1.4955313
poly(distr_shift, 2)1             -0.13266734 0.63263396  -0.2097063
poly(distr_shift, 2)2              3.86257519 0.68643291   5.6270251
prev_dsdf10                        0.02124134 0.02187113   0.9712045
poly(distr_shift, 2)1:prev_dsdf10 -0.40239178 0.94527044  -0.4256896
poly(distr_shift, 2)2:prev_dsdf10 -0.59574970 1.10967725  -0.5368675
> proc.time() - ptm
   user  system elapsed 
   0.27    0.07    0.33 
> 
> ptm <- proc.time()
> mod_fit<-lmer(lrt~poly(targetDist,2)+poly(distr_shift,2)*prev_dsdf+(1+poly(targetDist,2)+poly(distr_shift,2)*prev_dsdf|participant),ds_data[correct==1&dsdf==15,])
Warning messages:
1: In commonArgs(par, fn, control, environment()) :
  maxfun < 10 * length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, getStart(start, rho$lower, rho$pp),  :
  convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 5.46484 (tol = 0.002, component 15)
4: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge: degenerate  Hessian with 1 negative eigenvalues
> data.frame(Estimate=fixef(mod_fit), StdError = sqrt(diag(vcov(mod_fit))), Z = fixef(mod_fit)/sqrt(diag(vcov(mod_fit))))
                                     Estimate   StdError           Z
(Intercept)                        6.69229926 0.04190106 159.7167138
poly(targetDist, 2)1              -0.94220751 0.49613863  -1.8990811
poly(targetDist, 2)2               0.67036440 0.45581822   1.4706837
poly(distr_shift, 2)1             -0.13513381 0.63298649  -0.2134861
poly(distr_shift, 2)2              3.86885453 0.71171753   5.4359410
prev_dsdf10                        0.02137927 0.02213446   0.9658818
poly(distr_shift, 2)1:prev_dsdf10 -0.38675995 0.98129074  -0.3941339
poly(distr_shift, 2)2:prev_dsdf10 -0.58599673 1.11212726  -0.5269152
> proc.time() - ptm
   user  system elapsed 
  23.68    0.03   23.99

So here MixedModels takes 0.33 s and lme4 takes 23.99 s. Not bad. In addition, lme4 has trouble with convergence, while MixedModels seemingly doesn't (I sent a question to r-sig-mixed-models about this, but there's no answer yet).
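One standard lme4 knob worth trying here (I haven't rerun the comparison with it; this is just the documented way to address the maxfun warning above) is to give the bobyqa optimizer a larger evaluation budget:

# rerun the lme4 fit with a raised bobyqa evaluation limit
mod_fit <- lmer(lrt ~ poly(targetDist, 2) + poly(distr_shift, 2) * prev_dsdf +
                  (1 + poly(targetDist, 2) + poly(distr_shift, 2) * prev_dsdf | participant),
                ds_data[correct == 1 & dsdf == 15, ],
                control = lmerControl(optimizer = "bobyqa",
                                      optCtrl = list(maxfun = 2e5)))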

> ptm <- proc.time()
> lmer_with_julia(lrt~poly(targetDist,2)+poly(distr_shift,2)*prev_dsdf+(1|participant),ds_data[correct==1,])
 
                                     Estimate   StdError           Z
(Intercept)                        6.64043269 0.03552565 186.9193793
poly(targetDist, 2)1              -1.13069412 0.45134566  -2.5051623
poly(targetDist, 2)2               1.06636970 0.51338074   2.0771517
poly(distr_shift, 2)1              0.07246496 0.68719409   0.1054505
poly(distr_shift, 2)2              5.96863602 0.88864270   6.7165757
prev_dsdf10                       -0.04445310 0.01722907  -2.5801220
prev_dsdf15                       -0.03616479 0.02084635  -1.7348263
poly(distr_shift, 2)1:prev_dsdf10 -1.04818681 0.94317549  -1.1113380
poly(distr_shift, 2)2:prev_dsdf10 -1.22641312 0.94836666  -1.2931846
poly(distr_shift, 2)1:prev_dsdf15 -0.68364579 1.08519143  -0.6299771
poly(distr_shift, 2)2:prev_dsdf15 -1.06375276 1.00142954  -1.0622343
> proc.time() - ptm
   user  system elapsed 
   1.06    0.88    1.59 
> 
> ptm <- proc.time()
> mod_fit<-lmer(lrt~poly(targetDist,2)+poly(distr_shift,2)*prev_dsdf+(1+poly(targetDist,2)+poly(distr_shift,2)*prev_dsdf|participant),ds_data[correct==1,])
Warning messages:
1: In commonArgs(par, fn, control, environment()) :
  maxfun < 10 * length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, getStart(start, rho$lower, rho$pp),  :
  convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 17.0124 (tol = 0.002, component 58)
4: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge: degenerate  Hessian with 12 negative eigenvalues
> data.frame(Estimate=fixef(mod_fit), StdError = sqrt(diag(vcov(mod_fit))), Z = fixef(mod_fit)/sqrt(diag(vcov(mod_fit))))
                                     Estimate   StdError            Z
(Intercept)                        6.64061439 0.03584556 185.25625436
poly(targetDist, 2)1              -1.10335265 0.43205622  -2.55372471
poly(targetDist, 2)2               1.07790118 0.47686934   2.26037008
poly(distr_shift, 2)1              0.05756552 0.64110292   0.08979139
poly(distr_shift, 2)2              5.98318507 0.75230028   7.95318738
prev_dsdf10                       -0.04464860 0.01754927  -2.54418516
prev_dsdf15                       -0.03645163 0.02094288  -1.74052671
poly(distr_shift, 2)1:prev_dsdf10 -1.01415019 0.90588970  -1.11950736
poly(distr_shift, 2)2:prev_dsdf10 -1.24803222 0.90816160  -1.37424024
poly(distr_shift, 2)1:prev_dsdf15 -0.65142933 0.92908060  -0.70115480
poly(distr_shift, 2)2:prev_dsdf15 -1.12188710 0.89307721  -1.25620393
> proc.time() - ptm
   user  system elapsed 
 152.00    0.03  152.43

Again, MixedModels wins hands down, and the difference is considerable.

> ptm <- proc.time()
> lmer_with_julia(lrt~poly(targetDist,2)*dsdf+poly(distr_shift,2)*prev_dsdf+(1|participant),ds_data[correct==1,])
[1] "mod_fit = fit(lmm(lrt~a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p+(a+b+c+d+e+f+g+h+i+j+k+l+m+n+o+p|participant),mm))"
                                  Estimate StdError      Z
(Intercept)                           6.47     0.03 211.68
poly(targetDist, 2)1                 -0.40     0.66  -0.61
poly(targetDist, 2)2                  0.66     0.62   1.06
dsdf10                                0.12     0.02   6.41
dsdf15                                0.22     0.02   9.80
poly(distr_shift, 2)1                 0.12     0.69   0.18
poly(distr_shift, 2)2                 5.87     0.90   6.50
prev_dsdf10                           0.02     0.02   1.22
prev_dsdf15                           0.07     0.02   3.27
poly(targetDist, 2)1:dsdf10          -0.57     1.02  -0.56
poly(targetDist, 2)2:dsdf10           0.77     1.06   0.72
poly(targetDist, 2)1:dsdf15          -1.38     0.93  -1.47
poly(targetDist, 2)2:dsdf15           0.47     0.95   0.49
poly(distr_shift, 2)1:prev_dsdf10    -1.08     0.93  -1.16
poly(distr_shift, 2)2:prev_dsdf10    -1.15     1.04  -1.11
poly(distr_shift, 2)1:prev_dsdf15    -0.76     1.10  -0.70
poly(distr_shift, 2)2:prev_dsdf15    -0.95     0.98  -0.97
> proc.time() - ptm
   user  system elapsed 
  10.12    5.72   11.89 
> 
> ptm <- proc.time()
> mod_fit<-lmer(lrt~poly(targetDist,2)*dsdf+poly(distr_shift,2)*prev_dsdf+(1+poly(targetDist,2)*dsdf+poly(distr_shift,2)*prev_dsdf|participant),ds_data[correct==1,])
Warning messages:
1: In commonArgs(par, fn, control, environment()) :
  maxfun < 10 * length(par)^2 is not recommended.
2: In optwrap(optimizer, devfun, getStart(start, rho$lower, rho$pp),  :
  convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 18.1075 (tol = 0.002, component 132)
4: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge: degenerate  Hessian with 39 negative eigenvalues
> res<-data.frame(Estimate=fixef(mod_fit), StdError = sqrt(diag(vcov(mod_fit))), Z = fixef(mod_fit)/sqrt(diag(vcov(mod_fit))))
> round(res, 2)
                                  Estimate StdError      Z
(Intercept)                           6.47     0.03 216.25
poly(targetDist, 2)1                 -0.40     0.62  -0.65
poly(targetDist, 2)2                  0.67     0.63   1.06
dsdf10                                0.12     0.02   6.52
dsdf15                                0.22     0.02  10.16
poly(distr_shift, 2)1                 0.10     0.62   0.16
poly(distr_shift, 2)2                 5.91     0.68   8.73
prev_dsdf10                           0.02     0.02   1.29
prev_dsdf15                           0.07     0.02   3.50
poly(targetDist, 2)1:dsdf10          -0.60     0.87  -0.69
poly(targetDist, 2)2:dsdf10           0.79     0.87   0.91
poly(targetDist, 2)1:dsdf15          -1.34     0.89  -1.52
poly(targetDist, 2)2:dsdf15           0.55     0.88   0.62
poly(distr_shift, 2)1:prev_dsdf10    -1.03     0.87  -1.18
poly(distr_shift, 2)2:prev_dsdf10    -1.16     0.87  -1.33
poly(distr_shift, 2)1:prev_dsdf15    -0.73     0.88  -0.83
poly(distr_shift, 2)2:prev_dsdf15    -1.02     0.88  -1.17
> proc.time() - ptm
   user  system elapsed 
1030.20    0.18 1036.91

And now there's a really huge difference. Note that the estimates are more or less the same, which is good: it suggests that the results provided by MixedModels can be trusted. I added lmer_with_julia to my functions on GitHub, if you want to test it. It's not very elaborate: currently it uses the right-hand side of the formula to create the maximal random-effects structure, and it also assumes that you're using data.table.
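For the curious, the formula trick boils down to something like the sketch below. This is a simplified illustration, not the actual code on GitHub; among other things, the real function has to strip the random term you pass in, which I skip here.

# reuse the fixed-effects terms as by-participant random slopes
maximal_formula <- function(formula, group = "participant") {
  rhs <- attr(terms(formula), "term.labels")   # fixed-effects terms
  random <- sprintf("(1 + %s | %s)", paste(rhs, collapse = " + "), group)
  reformulate(c(rhs, random), response = all.vars(formula)[1])
}
maximal_formula(lrt ~ poly(targetDist, 2) * dsdf)
# lrt ~ poly(targetDist, 2) + dsdf + poly(targetDist, 2):dsdf +
#     (1 + poly(targetDist, 2) + dsdf + poly(targetDist, 2):dsdf | participant)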

Video lectures on statistics and R in Russian

6 February 2014, data analysis

The Bioinformatics Institute has posted video recordings of the statistics and R course that Anatoly Karpov and I taught. The idea of the course was to try to explain statistics to biologists the way it is usually taught to humanities students: with simple examples and a minimum of math. The course turned out very condensed; we had only 14 sessions for theory and practice from scratch. I hope that, despite the compression, we still managed to explain something.

If the video doesn't play for you, here is a direct link: http://www.youtube.com/watch?v=jE2bgVl2xPk&feature=share&list=PLjKdf6AHvR-GOGfUrUJ8c6La_Fq5s2O0U

R: fast reading and combining of csv files

28 July 2013, data analysis

The crude (slow) version:

# requires plyr for rbind.fill()
library(plyr)

data <- data.frame()
for (i in file_list) {
  data <- rbind.fill(data, read.csv(i, stringsAsFactors = FALSE))
}

The fast version:

data <- do.call("rbind.fill",
                lapply(file_list,
                       FUN = function(i) read.csv(i, stringsAsFactors = FALSE)))

stringsAsFactors matters here: if the first file does not contain all the levels of a factor, combining the files can produce garbage.
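By the way, file_list above is just a character vector of file paths; one hypothetical way to build it:

# collect all csv files in the working directory; adjust path and pattern to taste
file_list <- list.files(pattern = "\\.csv$", full.names = TRUE)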

Google Developers have published a series of videos on R

12 July 2013, data analysis

The videos cover the basics of working with R and various programming matters, such as writing functions. The first video is below; there are 21 of them at the moment, and you can see the full list by clicking the playlist button.

Grouped forest plots using ggplot2

19 July 2012, data analysis

I decided to write this post in English, as I need more writing practice.

As I'm currently doing some meta-analytic work, I needed a proper plot of the results of the analysis. The existing solutions are hard to customize, so I decided to roll something myself. Basically, a forest plot is an errorbar plot of effect sizes, plus a list of studies, plus some other stuff (usually the numeric effect sizes).
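The core of that idea in ggplot2 looks something like the minimal sketch below; the study names and numbers are made up for illustration, and the grouped version discussed in the rest of the post builds on the same skeleton.

library(ggplot2)

# hypothetical meta-analytic results: one effect size and SE per study
meta <- data.frame(
  study = c("Smith 2001", "Lee 2004", "Kim 2008"),
  es    = c(0.42, 0.18, 0.55),
  se    = c(0.10, 0.08, 0.15)
)

ggplot(meta, aes(x = es, y = study)) +
  geom_point() +
  geom_errorbarh(aes(xmin = es - 1.96 * se, xmax = es + 1.96 * se),
                 height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +  # line of no effect
  xlab("Effect size (95% CI)") + ylab(NULL)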