Chapter 12 STAT 4520/7520 - Homework 3 Spring 2022 Due: March 30, 2022
12.1 1) The cake dataset in the lme4 package contains data from an experiment conducted to determine the effect of recipe and baking temperature on chocolate cake quality. Fifteen batches of cake mix were prepared for each recipe. Each batch was sufficient for six cakes, and each of the six cakes was baked at a different, randomly assigned temperature. As an indicator of quality, the breakage angle of the cake was recorded.
Note: temperature is randomly assigned; batches/replicates are nested within recipe.
It contains the variables:
• replicate: batch number, from 1 to 15
• recipe: recipe used, with three levels A, B, and C
• temperature: factor giving the temperature at which the cake was baked: 175, 185, 195, 205, 215, or 225
• angle: a numeric vector giving the angle at which the cake broke
• temp: numeric value of the baking temperature (degrees F)
The data can be loaded with the command:
data(cake,package = "lme4")
Additionally, we will only use the numeric version of temperature.
#library(dplyr)
#library(tidyverse)
# The minus sign inside select() drops the temperature column,
# keeping only the numeric temp variable.
cake <- cake |> dplyr::select(-temperature)
12.1.1 a. Of the explanatory variables recipe, temp, and replicate, which are fixed and which are random? Explain. Is there any nesting in these variables?
The recipe and temp variables are fixed effects, since we are genuinely interested in their particular levels. Replicate is a random effect, since the batches are a random sample of possible batches, and the individual replications are not of interest themselves.
The replicate (batch) is nested within recipe, since each batch is made with a single recipe.
# Note: a random effect refers to levels sampled from a larger population,
# not to random assignment of treatments in the experiment.
# (Compare the S39 egg example: samples are nested within technician when
# each sample is handled by only one technician.) Here, replicates are
# nested within recipe because each batch uses exactly one recipe, while
# temperature is crossed with batch (not nested), since every temperature
# appears within every batch.
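A quick cross-tabulation makes the implicit nesting visible (a sketch, assuming the cake data are loaded as above): the batch labels 1 to 15 are reused under every recipe, so the 45 physical batches are identified by the recipe:replicate combinations.
# Each recipe/replicate cell contains the 6 cakes of one batch;
# the same labels appear under every recipe, so labels alone do not
# identify batches.
with(cake, table(recipe, replicate))
# Number of distinct batches: 3 recipes x 15 replicates = 45
nlevels(interaction(cake$recipe, cake$replicate, drop = TRUE))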
12.1.2 b. Fit a linear model with no random effects for breaking angle against the interaction between recipe and temperature. Which variables are significant? What is the t value for temp, the temperature variable? What is the RMSE of this model?
lm_model <- lm(angle ~ recipe*temp, cake)
summary(lm_model)
##
## Call:
## lm(formula = angle ~ recipe * temp, data = cake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.759 -5.279 -1.448 3.019 27.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.379365 9.656850 0.246 0.80557
## recipeB -3.649206 13.656848 -0.267 0.78952
## recipeC -1.941270 13.656848 -0.142 0.88707
## temp 0.153714 0.048109 3.195 0.00157 **
## recipeB:temp 0.010857 0.068037 0.160 0.87334
## recipeC:temp 0.002095 0.068037 0.031 0.97546
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.795 on 264 degrees of freedom
## Multiple R-squared: 0.1159, Adjusted R-squared: 0.0992
## F-statistic: 6.925 on 5 and 264 DF, p-value: 4.285e-06
anova(lm_model)
## Analysis of Variance Table
##
## Response: angle
## Df Sum Sq Mean Sq F value Pr(>F)
## recipe 2 135.1 67.54 1.1117 0.3305
## temp 1 1966.7 1966.71 32.3709 3.377e-08 ***
## recipe:temp 2 1.7 0.87 0.0143 0.9858
## Residuals 264 16039.4 60.76
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The temperature variable is significant, with a t value of 3.195. The RMSE is 7.79487, the square root of the MSE of 60.76:
sqrt(60.76)
## [1] 7.79487
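Equivalently, the residual standard error can be read directly off the fitted model instead of computed by hand:
# summary()$sigma and sigma() both return the residual standard error (RMSE)
summary(lm_model)$sigma
sigma(lm_model)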
12.1.3 c. Fit a mixed model to the data which takes into account the replicates of the recipes. Examine the variance components. Is there variation in the batches made with each recipe? How does the Residual variance component compare to the MSE from (b)? What is the t-value for temperature?
# Test non-nested structure.
#library(lme4)
#mixed_model <- lmer(angle ~ recipe*temp + (1 | replicate), cake)
#summary(mixed_model)
#anova(mixed_model)
# Test the nested structure
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'lme4'
## The following object is masked _by_ '.GlobalEnv':
##
## cake
mixed_model <- lmer(angle ~ recipe*temp + (1 | recipe) + (1 | recipe:replicate), cake)
summary(mixed_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: angle ~ recipe * temp + (1 | recipe) + (1 | recipe:replicate)
## Data: cake
##
## REML criterion at convergence: 1703.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4599 -0.5909 -0.0967 0.6103 2.6578
##
## Random effects:
## Groups Name Variance Std.Dev.
## recipe:replicate (Intercept) 41.7674 6.4628
## recipe (Intercept) 0.5087 0.7132
## Residual 20.8862 4.5701
## Number of obs: 270, groups: recipe:replicate, 45; recipe, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.379365 5.945735 0.400
## recipeB -3.649206 8.408539 -0.434
## recipeC -1.941270 8.408539 -0.231
## temp 0.153714 0.028208 5.449
## recipeB:temp 0.010857 0.039891 0.272
## recipeC:temp 0.002095 0.039891 0.053
##
## Correlation of Fixed Effects:
## (Intr) recipB recipC temp rcpB:t
## recipeB -0.707
## recipeC -0.707 0.500
## temp -0.949 0.671 0.671
## recipeB:tmp 0.671 -0.949 -0.474 -0.707
## recipeC:tmp 0.671 -0.474 -0.949 -0.707 0.500
anova(mixed_model)
## Analysis of Variance Table
## npar Sum Sq Mean Sq F value
## recipe 2 8.89 4.45 0.2129
## temp 1 1966.71 1966.71 94.1631
## recipe:temp 2 1.74 0.87 0.0417
Yes, there is variation in the batches made with each recipe: a large share of the total variance (about 66%) comes from the recipe:replicate (batch) component, as computed below.
41.7674 /(41.7674 +20.8862 + 0.5087)
## [1] 0.661271
The residual variance component is much smaller now, about 20.9 compared with the MSE of 60.76 from (b), because the batch-to-batch variation has been absorbed by the random effect. The t value for temperature is 5.449, more significant than before.
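The variance proportions can also be computed programmatically from the fit; a short sketch using lme4's VarCorr():
# Proportion of total variance attributable to each component
vc <- as.data.frame(VarCorr(mixed_model))
setNames(vc$vcov / sum(vc$vcov), vc$grp)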
# Note: the grouping factor is recipe:replicate rather than replicate alone
# because the batch labels 1-15 are reused across recipes. The ANOVA table
# reports only the fixed effects; the random effects appear as variance
# components in the summary, not as rows of the ANOVA table.
12.1.4 d. Test for a recipe and temperature effect.
library(MCMCglmm)
## Loading required package: coda
## Loading required package: ape
summary(MCMCglmm(angle ~ recipe*temp, random=~replicate, data=cake, verbose=FALSE))$solutions
## post.mean l-95% CI u-95% CI eff.samp pMCMC
## (Intercept) 2.144956864 -8.77697344 15.19327939 1000 0.718
## recipeB -3.325454537 -17.45415797 14.74041624 1000 0.740
## recipeC -2.040549300 -19.46512878 11.36814295 1000 0.828
## temp 0.154883860 0.10071669 0.21322479 1000 0.001
## recipeB:temp 0.009252442 -0.06566437 0.09287612 1000 0.856
## recipeC:temp 0.002601202 -0.06744861 0.08693204 1000 0.960
# How can the nested random structure be specified here? See the sketch below.
There is no significant main effect or interaction effect for recipe. Temperature is significant at the 99% level, with pMCMC = 0.001.
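One way to encode the nested batch structure in MCMCglmm is to build the 45-level batch factor explicitly and use it as the random grouping; a sketch under that assumption (not run here):
# Hypothetical nested specification: batch = recipe-replicate combination
cake$batch <- interaction(cake$recipe, cake$replicate, drop = TRUE)
nested_mcmc <- MCMCglmm(angle ~ recipe*temp, random = ~ batch,
                        data = cake, verbose = FALSE)
summary(nested_mcmc)$solutions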
12.1.5 e. Create a QQ plot of the residuals and a residual vs predicted plot for the linear model in (b) and the mixed model in (c), and describe how well the model assumptions are satisfied for each.
qqnorm(residuals(lm_model),main="QQ plot of Residuals - Linear model")
qqline(residuals(lm_model))
qqnorm(residuals(mixed_model),main="QQ plot of Residuals - Mixed model")
qqline(residuals(mixed_model))
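The question also asks for residual vs predicted plots; a sketch using fitted() and residuals() for both models:
# Residuals vs fitted values, to check the constant-variance assumption
plot(fitted(lm_model), residuals(lm_model), xlab = "Fitted", ylab = "Residuals",
     main = "Residuals vs Fitted - Linear model")
abline(h = 0, lty = 2)
plot(fitted(mixed_model), residuals(mixed_model), xlab = "Fitted", ylab = "Residuals",
     main = "Residuals vs Fitted - Mixed model")
abline(h = 0, lty = 2)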
The mixed model satisfies the normality assumption much better, as its points lie closer to the reference line on the QQ plot.
12.1.6 f. (7520 only) Examine the BLUPs from the mixed model in (c). Do you notice any patterns?
library(lme4)
ranef(mixed_model)$replicate
## NULL
ranef(mixed_model)$recipe
## (Intercept)
## A -3.077238e-14
## B -4.667144e-14
## C -2.051492e-14
The recipe BLUPs are all essentially zero (on the order of 1e-14): because recipe also appears as a fixed effect, the fixed-effect estimates absorb the recipe means and leave nothing for the recipe random effects to explain.
# ranef(mixed_model)$replicate returns NULL because the grouping factor is
# named recipe:replicate, not replicate; see the extraction below.
# The nested structure (replicate within recipe) is the intended
# specification for this problem.
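The batch-level BLUPs live under the recipe:replicate grouping factor; a sketch for inspecting them (backticks are needed because of the colon in the name):
# Extract and examine the 45 batch-level BLUPs
batch_blups <- ranef(mixed_model)$`recipe:replicate`
head(batch_blups)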
12.2 2) The purpose of random effects is to remove variation from known sources. In this problem, we will use this idea and see how it can work with a popular technique for predictive modeling, the random forest. Random forests were covered in the prerequisite course and can be fit using the randomForest function in the randomForest library. The file Problem-2.csv contains a simulated dataset. There is a nonlinear relationship between six numeric explanatory variables x1, . . . , x6 and the response y, in addition to a grouping effect according to the ID variable, which has 100 levels.
set.seed(1)
Problem2 <- read.csv("E:/Cloud/OneDrive - University of Missouri/Mizzou_PhD/Class plan/Applied Stats Model II/HW3/Problem-2.csv", stringsAsFactors = TRUE)
12.2.1 a. Read in the dataset, and split the data into a training and test set, using 95% of the data for the training set. Set a seed of 1 for consistency of results.
train = sample(1:nrow(Problem2), nrow(Problem2)*0.95)
Problem2.train = Problem2[train,]
Problem2.test = Problem2[-train,]
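A quick sanity check of the split sizes (the counts assume the dataset used here, with 3325 training and 175 test rows):
# Confirm the 95/5 split
nrow(Problem2.train)
nrow(Problem2.test)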
12.2.2 b. Fit a random forest model to the training data using y as the response with x1, . . . x6 and ID as the explanatory variables. Use the model to predict y in the test set and compute the test root mean squared error ( RMSE ).
#install.packages("randomForest")
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
RF_model = randomForest(y~x1+x2+x3+x4+x5+x6+ID, data=Problem2, subset=train, importance=TRUE)
Problem2.test$y_RF <- predict(RF_model, newdata=Problem2.test)
# 175 is the number of rows in the test set
sqrt(sum((Problem2.test$y_RF - Problem2.test$y)^2)/175)
## [1] 4.503223
The test RMSE is approximately 4.50.
12.2.3 c. Fit the mixed model y_ij = µ + ID_i + ε_ij to the training data. This model does not use the x_i at all, and has a random intercept based on ID. After fitting the model, extract the BLUPs for ID and add them to the constant µ̂, the estimate of the overall intercept. What do these values represent? Store them in a data frame along with a column storing the IDs.
P2_mixed_model <- lmer(y ~ 1 + (1 | ID), Problem2.train)
summary(P2_mixed_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ 1 + (1 | ID)
## Data: Problem2.train
##
## REML criterion at convergence: 21249.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9799 -0.6991 -0.0011 0.7052 3.2476
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 7.018 2.649
## Residual 32.786 5.726
## Number of obs: 3325, groups: ID, 100
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 24.5945 0.2829 86.92
#P2_ID_BLUP = ranef(P2_mixed_model)$ID
library(tidyverse)
# From the lecture notes (S23), predict() gives BLUP-based predictions, e.g.
# predict(mixed_model, newdata=data.frame(operator="a"))
Problem2.test$BLUP <- predict(P2_mixed_model, newdata=Problem2.test)
# Overall intercept estimate plus each ID's BLUP
DF_BLUPx <- fixef(P2_mixed_model) + ranef(P2_mixed_model)$ID
DF_BLUPx$ID <- 1:nrow(DF_BLUPx)
These values represent the predicted mean response for each ID group: the overall intercept estimate µ̂ plus that ID's BLUP, before any information from the x variables is used.
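A quick look at the stored values (the first column keeps the name "(Intercept)" because it comes from the fixed-effect intercept):
# First few intercept-plus-BLUP values, one row per ID
head(DF_BLUPx)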
12.2.4 d. Add a column to the test dataset containing the values obtained in part c, joined to the test dataset by matching the IDs. Note: with the dplyr package loaded, you can quickly join by ID using a line like: Data.Test <- left_join(Data.Test, int_plus_blups, by = "ID"). Examine the first few rows and describe what the new column represents.
Problem2.test <- left_join(Problem2.test, DF_BLUPx, by = "ID")
The new column represents, for each test observation, the predicted group mean for its ID (intercept plus BLUP) carried over from the training fit.
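Examining the first few rows (a sketch; the joined column is named "(Intercept)" and should match the BLUP column computed earlier with predict()):
# Inspect the joined column alongside the response and predict()-based BLUPs
head(Problem2.test[, c("ID", "y", "BLUP", "(Intercept)")])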
12.2.5 e. Compute the residuals of the mixed model fit in (c) using the residuals() function, store them in a column of the training data.
Problem2.train$residualx = residuals(P2_mixed_model)
12.2.6 f. Fit another random forest model to the training data, using the residuals from part (e) as the response and only x1, . . . x6 as the explanatory variables. Use this model to predict into the test data and store these predictions. What do these predictions represent?
# No subset argument is needed here: Problem2.train is already the training subset.
RF_residuals_model = randomForest(residualx~x1+x2+x3+x4+x5+x6, data=Problem2.train, importance=TRUE)
Problem2.test$y_RF_residual <- predict(RF_residuals_model, newdata=Problem2.test)
These predictions represent the part of y explained by x1, . . . , x6 after the ID-level (group) variation has been removed by the mixed model.
12.2.7 g. Add the predictions in (f) to the values joined to the test data in (d). Consider these values test predictions, and compute the test RMSE. Compare this to the test RMSE computed in part (b). What do you observe?
sqrt(sum((Problem2.test$BLUP + Problem2.test$y_RF_residual - Problem2.test$y)^2)/175)
## [1] 3.608027
(3.6 - 4.5)/4.5
## [1] -0.2
The test RMSE is reduced from 4.5 to 3.6, a 20% improvement. Removing the ID-level variation with the mixed model before fitting the random forest substantially improves the predictions.
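As a cleaner alternative to hard-coding the test-set size, a small helper function (rmse is a hypothetical name, not from the assignment) computes both test RMSEs:
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(Problem2.test$y, Problem2.test$y_RF)                               # RF alone, part (b)
rmse(Problem2.test$y, Problem2.test$BLUP + Problem2.test$y_RF_residual) # mixed model + RF, part (g)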