good-code ex 2: removing repeats, split-apply-combine

Have a look at the ChickWeight dataset on weights of chickens under different diets. (This contains different values compared to chickwts; R clearly needs two built-in datasets related to chicken weights!) Use, str, head, summary, and any other functions you need to understand the dataset.

data(ChickWeight)
# explore ChickWeight here
str(ChickWeight)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Diet
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time"
##   ..$ y: chr "Body weight"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(gm)"
head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
summary(ChickWeight)
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

Here’s some repetitive code to calculate the mean chick weight for each diet.

mean(ChickWeight$weight[ChickWeight$Diet == 1])
## [1] 102.6455
mean(ChickWeight$weight[ChickWeight$Diet == 2])
## [1] 122.6167
mean(ChickWeight$weight[ChickWeight$Diet == 3])
## [1] 142.95
mean(ChickWeight$weight[ChickWeight$Diet == 4])
## [1] 135.2627

Rewrite the code in a way that eliminates the duplication.

(There are lots of ways of doing this in R; if you are feeling confident, see how many different solutions you can find.)

I rather like the dplyr approach, since the code is very explicit about what is happening. This version use the standard-evaluation versions, which are best for programming with.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ChickWeight %>% 
  group_by_(~ Diet) %>% 
  summarise_(MeanWeight = ~ mean(weight))
## Source: local data frame [4 x 2]
## 
##   Diet MeanWeight
## 1    1   102.6455
## 2    2   122.6167
## 3    3   142.9500
## 4    4   135.2627

The data.table solution is more compact, though it requires converting the dataset into a data table, and uses some magic commands like ..

library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, last
as.data.table(as.data.frame(ChickWeight))[ 
  j  = .(MeanWeight = mean(weight)),
  by = .(Diet)
]
##    Diet MeanWeight
## 1:    1   102.6455
## 2:    2   122.6167
## 3:    3   142.9500
## 4:    4   135.2627

The plyr solution is a little more complicated.

library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
ddply(
  ChickWeight,
  .(Diet),
  summarize,
  MeanWeight = mean(weight)
)
##   Diet MeanWeight
## 1    1   102.6455
## 2    2   122.6167
## 3    3   142.9500
## 4    4   135.2627

The base R solution of tapply requires the use of with to access columns in the data frame, and it returns the result as a named vector rather than a data frame, which is usually less convenient to work with.

with(
  ChickWeight,
  tapply(weight, Diet, mean)
)
##        1        2        3        4 
## 102.6455 122.6167 142.9500 135.2627

What other methods did you think of?

Since chickens grow over time [citation needed], it’s a good idea to get the mean for each diet at each time point. Here, our repetitive code becomes very tedious.

mean(ChickWeight$weight[ChickWeight$Diet == 1 & ChickWeight$Time == 0])
## [1] 41.4
mean(ChickWeight$weight[ChickWeight$Diet == 2 & ChickWeight$Time == 0])
## [1] 40.7
mean(ChickWeight$weight[ChickWeight$Diet == 3 & ChickWeight$Time == 0])
## [1] 40.8
mean(ChickWeight$weight[ChickWeight$Diet == 4 & ChickWeight$Time == 0])
## [1] 41
mean(ChickWeight$weight[ChickWeight$Diet == 1 & ChickWeight$Time == 2])
## [1] 47.25
mean(ChickWeight$weight[ChickWeight$Diet == 2 & ChickWeight$Time == 2])
## [1] 49.4
mean(ChickWeight$weight[ChickWeight$Diet == 3 & ChickWeight$Time == 2])
## [1] 50.4
mean(ChickWeight$weight[ChickWeight$Diet == 4 & ChickWeight$Time == 2])
## [1] 51.8
# ...and so on until day 21

Extend your solution(s) from question 3 to group by both Diet and Time.

The first three solutions extend very nicely to multiple grouping conditions.

ChickWeight %>% 
  group_by_(~ Diet, ~ Time) %>% 
  summarise_(MeanWeight = ~ mean(weight))
## Source: local data frame [48 x 3]
## Groups: Diet
## 
##    Diet Time MeanWeight
## 1     1    0   41.40000
## 2     1    2   47.25000
## 3     1    4   56.47368
## 4     1    6   66.78947
## 5     1    8   79.68421
## 6     1   10   93.05263
## 7     1   12  108.52632
## 8     1   14  123.38889
## 9     1   16  144.64706
## 10    1   18  158.94118
## ..  ...  ...        ...
as.data.table(as.data.frame(ChickWeight))[ 
  j  = .(MeanWeight = mean(weight)),
  by = .(Diet, Time)
]
##     Diet Time MeanWeight
##  1:    1    0   41.40000
##  2:    1    2   47.25000
##  3:    1    4   56.47368
##  4:    1    6   66.78947
##  5:    1    8   79.68421
##  6:    1   10   93.05263
##  7:    1   12  108.52632
##  8:    1   14  123.38889
##  9:    1   16  144.64706
## 10:    1   18  158.94118
## 11:    1   20  170.41176
## 12:    1   21  177.75000
## 13:    2    0   40.70000
## 14:    2    2   49.40000
## 15:    2    4   59.80000
## 16:    2    6   75.40000
## 17:    2    8   91.70000
## 18:    2   10  108.50000
## 19:    2   12  131.30000
## 20:    2   14  141.90000
## 21:    2   16  164.70000
## 22:    2   18  187.70000
## 23:    2   20  205.60000
## 24:    2   21  214.70000
## 25:    3    0   40.80000
## 26:    3    2   50.40000
## 27:    3    4   62.20000
## 28:    3    6   77.90000
## 29:    3    8   98.40000
## 30:    3   10  117.10000
## 31:    3   12  144.40000
## 32:    3   14  164.50000
## 33:    3   16  197.40000
## 34:    3   18  233.10000
## 35:    3   20  258.90000
## 36:    3   21  270.30000
## 37:    4    0   41.00000
## 38:    4    2   51.80000
## 39:    4    4   64.50000
## 40:    4    6   83.90000
## 41:    4    8  105.60000
## 42:    4   10  126.00000
## 43:    4   12  151.40000
## 44:    4   14  161.80000
## 45:    4   16  182.00000
## 46:    4   18  202.90000
## 47:    4   20  233.88889
## 48:    4   21  238.55556
##     Diet Time MeanWeight
ddply(
  ChickWeight,
  .(Diet, Time),
  summarize,
  MeanWeight = mean(weight)
)
##    Diet Time MeanWeight
## 1     1    0   41.40000
## 2     1    2   47.25000
## 3     1    4   56.47368
## 4     1    6   66.78947
## 5     1    8   79.68421
## 6     1   10   93.05263
## 7     1   12  108.52632
## 8     1   14  123.38889
## 9     1   16  144.64706
## 10    1   18  158.94118
## 11    1   20  170.41176
## 12    1   21  177.75000
## 13    2    0   40.70000
## 14    2    2   49.40000
## 15    2    4   59.80000
## 16    2    6   75.40000
## 17    2    8   91.70000
## 18    2   10  108.50000
## 19    2   12  131.30000
## 20    2   14  141.90000
## 21    2   16  164.70000
## 22    2   18  187.70000
## 23    2   20  205.60000
## 24    2   21  214.70000
## 25    3    0   40.80000
## 26    3    2   50.40000
## 27    3    4   62.20000
## 28    3    6   77.90000
## 29    3    8   98.40000
## 30    3   10  117.10000
## 31    3   12  144.40000
## 32    3   14  164.50000
## 33    3   16  197.40000
## 34    3   18  233.10000
## 35    3   20  258.90000
## 36    3   21  270.30000
## 37    4    0   41.00000
## 38    4    2   51.80000
## 39    4    4   64.50000
## 40    4    6   83.90000
## 41    4    8  105.60000
## 42    4   10  126.00000
## 43    4   12  151.40000
## 44    4   14  161.80000
## 45    4   16  182.00000
## 46    4   18  202.90000
## 47    4   20  233.88889
## 48    4   21  238.55556

tapply gets a bit weird when you pass it multiple grouping conditions. It returns a matrix. You’ll likely need to do dome manipulating of it to get the result into the form that you want.

with(
  ChickWeight,
  tapply(weight, list(Diet, Time), mean)
)
##      0     2        4        6         8        10       12       14
## 1 41.4 47.25 56.47368 66.78947  79.68421  93.05263 108.5263 123.3889
## 2 40.7 49.40 59.80000 75.40000  91.70000 108.50000 131.3000 141.9000
## 3 40.8 50.40 62.20000 77.90000  98.40000 117.10000 144.4000 164.5000
## 4 41.0 51.80 64.50000 83.90000 105.60000 126.00000 151.4000 161.8000
##         16       18       20       21
## 1 144.6471 158.9412 170.4118 177.7500
## 2 164.7000 187.7000 205.6000 214.7000
## 3 197.4000 233.1000 258.9000 270.3000
## 4 182.0000 202.9000 233.8889 238.5556