Have a look at the ChickWeight
dataset on weights of chickens under different diets. (This contains different values compared to chickwts
; R clearly needs two built-in datasets related to chicken weights!) Use, str
, head
, summary
, and any other functions you need to understand the dataset.
data(ChickWeight)
# explore ChickWeight here
str(ChickWeight)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Diet
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time"
## ..$ y: chr "Body weight"
## - attr(*, "units")=List of 2
## ..$ x: chr "(days)"
## ..$ y: chr "(gm)"
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
Here’s some repetitive code to calculate the mean chick weight for each diet.
mean(ChickWeight$weight[ChickWeight$Diet == 1])
## [1] 102.6455
mean(ChickWeight$weight[ChickWeight$Diet == 2])
## [1] 122.6167
mean(ChickWeight$weight[ChickWeight$Diet == 3])
## [1] 142.95
mean(ChickWeight$weight[ChickWeight$Diet == 4])
## [1] 135.2627
Rewrite the code in a way that eliminates the duplication.
(There are lots of ways of doing this in R; if you are feeling confident, see how many different solutions you can find.)
I rather like the dplyr
approach, since the code is very explicit about what is happening. This version use the standard-evaluation versions, which are best for programming with.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ChickWeight %>%
group_by_(~ Diet) %>%
summarise_(MeanWeight = ~ mean(weight))
## Source: local data frame [4 x 2]
##
## Diet MeanWeight
## 1 1 102.6455
## 2 2 122.6167
## 3 3 142.9500
## 4 4 135.2627
The data.table
solution is more compact, though it requires converting the dataset into a data table, and uses some magic commands like .
.
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:dplyr':
##
## between, last
as.data.table(as.data.frame(ChickWeight))[
j = .(MeanWeight = mean(weight)),
by = .(Diet)
]
## Diet MeanWeight
## 1: 1 102.6455
## 2: 2 122.6167
## 3: 3 142.9500
## 4: 4 135.2627
The plyr
solution is a little more complicated.
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
ddply(
ChickWeight,
.(Diet),
summarize,
MeanWeight = mean(weight)
)
## Diet MeanWeight
## 1 1 102.6455
## 2 2 122.6167
## 3 3 142.9500
## 4 4 135.2627
The base R solution of tapply
requires the use of with
to access columns in the data frame, and it returns the result as a named vector rather than a data frame, which is usually less convenient to work with.
with(
ChickWeight,
tapply(weight, Diet, mean)
)
## 1 2 3 4
## 102.6455 122.6167 142.9500 135.2627
What other methods did you think of?
Since chickens grow over time [citation needed], it’s a good idea to get the mean for each diet at each time point. Here, our repetitive code becomes very tedious.
mean(ChickWeight$weight[ChickWeight$Diet == 1 & ChickWeight$Time == 0])
## [1] 41.4
mean(ChickWeight$weight[ChickWeight$Diet == 2 & ChickWeight$Time == 0])
## [1] 40.7
mean(ChickWeight$weight[ChickWeight$Diet == 3 & ChickWeight$Time == 0])
## [1] 40.8
mean(ChickWeight$weight[ChickWeight$Diet == 4 & ChickWeight$Time == 0])
## [1] 41
mean(ChickWeight$weight[ChickWeight$Diet == 1 & ChickWeight$Time == 2])
## [1] 47.25
mean(ChickWeight$weight[ChickWeight$Diet == 2 & ChickWeight$Time == 2])
## [1] 49.4
mean(ChickWeight$weight[ChickWeight$Diet == 3 & ChickWeight$Time == 2])
## [1] 50.4
mean(ChickWeight$weight[ChickWeight$Diet == 4 & ChickWeight$Time == 2])
## [1] 51.8
# ...and so on until day 21
Extend your solution(s) from question 3 to group by both Diet
and Time
.
The first three solutions extend very nicely to multiple grouping conditions.
ChickWeight %>%
group_by_(~ Diet, ~ Time) %>%
summarise_(MeanWeight = ~ mean(weight))
## Source: local data frame [48 x 3]
## Groups: Diet
##
## Diet Time MeanWeight
## 1 1 0 41.40000
## 2 1 2 47.25000
## 3 1 4 56.47368
## 4 1 6 66.78947
## 5 1 8 79.68421
## 6 1 10 93.05263
## 7 1 12 108.52632
## 8 1 14 123.38889
## 9 1 16 144.64706
## 10 1 18 158.94118
## .. ... ... ...
as.data.table(as.data.frame(ChickWeight))[
j = .(MeanWeight = mean(weight)),
by = .(Diet, Time)
]
## Diet Time MeanWeight
## 1: 1 0 41.40000
## 2: 1 2 47.25000
## 3: 1 4 56.47368
## 4: 1 6 66.78947
## 5: 1 8 79.68421
## 6: 1 10 93.05263
## 7: 1 12 108.52632
## 8: 1 14 123.38889
## 9: 1 16 144.64706
## 10: 1 18 158.94118
## 11: 1 20 170.41176
## 12: 1 21 177.75000
## 13: 2 0 40.70000
## 14: 2 2 49.40000
## 15: 2 4 59.80000
## 16: 2 6 75.40000
## 17: 2 8 91.70000
## 18: 2 10 108.50000
## 19: 2 12 131.30000
## 20: 2 14 141.90000
## 21: 2 16 164.70000
## 22: 2 18 187.70000
## 23: 2 20 205.60000
## 24: 2 21 214.70000
## 25: 3 0 40.80000
## 26: 3 2 50.40000
## 27: 3 4 62.20000
## 28: 3 6 77.90000
## 29: 3 8 98.40000
## 30: 3 10 117.10000
## 31: 3 12 144.40000
## 32: 3 14 164.50000
## 33: 3 16 197.40000
## 34: 3 18 233.10000
## 35: 3 20 258.90000
## 36: 3 21 270.30000
## 37: 4 0 41.00000
## 38: 4 2 51.80000
## 39: 4 4 64.50000
## 40: 4 6 83.90000
## 41: 4 8 105.60000
## 42: 4 10 126.00000
## 43: 4 12 151.40000
## 44: 4 14 161.80000
## 45: 4 16 182.00000
## 46: 4 18 202.90000
## 47: 4 20 233.88889
## 48: 4 21 238.55556
## Diet Time MeanWeight
ddply(
ChickWeight,
.(Diet, Time),
summarize,
MeanWeight = mean(weight)
)
## Diet Time MeanWeight
## 1 1 0 41.40000
## 2 1 2 47.25000
## 3 1 4 56.47368
## 4 1 6 66.78947
## 5 1 8 79.68421
## 6 1 10 93.05263
## 7 1 12 108.52632
## 8 1 14 123.38889
## 9 1 16 144.64706
## 10 1 18 158.94118
## 11 1 20 170.41176
## 12 1 21 177.75000
## 13 2 0 40.70000
## 14 2 2 49.40000
## 15 2 4 59.80000
## 16 2 6 75.40000
## 17 2 8 91.70000
## 18 2 10 108.50000
## 19 2 12 131.30000
## 20 2 14 141.90000
## 21 2 16 164.70000
## 22 2 18 187.70000
## 23 2 20 205.60000
## 24 2 21 214.70000
## 25 3 0 40.80000
## 26 3 2 50.40000
## 27 3 4 62.20000
## 28 3 6 77.90000
## 29 3 8 98.40000
## 30 3 10 117.10000
## 31 3 12 144.40000
## 32 3 14 164.50000
## 33 3 16 197.40000
## 34 3 18 233.10000
## 35 3 20 258.90000
## 36 3 21 270.30000
## 37 4 0 41.00000
## 38 4 2 51.80000
## 39 4 4 64.50000
## 40 4 6 83.90000
## 41 4 8 105.60000
## 42 4 10 126.00000
## 43 4 12 151.40000
## 44 4 14 161.80000
## 45 4 16 182.00000
## 46 4 18 202.90000
## 47 4 20 233.88889
## 48 4 21 238.55556
tapply
gets a bit weird when you pass it multiple grouping conditions. It returns a matrix. You’ll likely need to do dome manipulating of it to get the result into the form that you want.
with(
ChickWeight,
tapply(weight, list(Diet, Time), mean)
)
## 0 2 4 6 8 10 12 14
## 1 41.4 47.25 56.47368 66.78947 79.68421 93.05263 108.5263 123.3889
## 2 40.7 49.40 59.80000 75.40000 91.70000 108.50000 131.3000 141.9000
## 3 40.8 50.40 62.20000 77.90000 98.40000 117.10000 144.4000 164.5000
## 4 41.0 51.80 64.50000 83.90000 105.60000 126.00000 151.4000 161.8000
## 16 18 20 21
## 1 144.6471 158.9412 170.4118 177.7500
## 2 164.7000 187.7000 205.6000 214.7000
## 3 197.4000 233.1000 258.9000 270.3000
## 4 182.0000 202.9000 233.8889 238.5556