Chapter 9 1 1) Modeling the Number of Claims

9.1 a. (5 points) Create a table to present the distribution of the number of claims by computing the proportion of each possibility. What percent of the policies have 0 claims?

Plot_DF

## # A tibble: 4 x 4
##   ClaimNb count etotal proportion
##     <dbl> <int>  <int>      <dbl>
## 1       0 42685  44999   0.949   
## 2       1  2224  44999   0.0494  
## 3       2    84  44999   0.00187 
## 4       3     6  44999   0.000133

The result shows that about 95% of the policies have 0 claims.

9.2 b. (6 points) Create a Poisson regression to predict the number of claims in the year using only the variables for vehicle age, driver age, bonusMalus, and density. What is the deviance? Does is differ significantly from the null deviance?

Poisson_Mod <- glm(ClaimNb ~ VehAge + DrivAge + BonusMalus + Density, family=poisson, Insurance)
summary(Poisson_Mod)

## 
## Call:
## glm(formula = ClaimNb ~ VehAge + DrivAge + BonusMalus + Density, 
##     family = poisson, data = Insurance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1306  -0.3375  -0.3039  -0.2816   4.3804  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.9772852  0.1330830 -37.400  < 2e-16 ***
## VehAge       0.0005833  0.0035970   0.162    0.871    
## DrivAge      0.0110918  0.0014819   7.485 7.17e-14 ***
## BonusMalus   0.0196450  0.0010601  18.532  < 2e-16 ***
## Density      0.0529343  0.0111655   4.741 2.13e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 14381  on 44998  degrees of freedom
## Residual deviance: 14049  on 44994  degrees of freedom
## AIC: 18744
## 
## Number of Fisher Scoring iterations: 6

# S14

pchisq(summary(Poisson_Mod)$deviance,Poisson_Mod$df.residual, lower.tail = FALSE)

## [1] 1

The deviance is 14049. It is not statistically different from the null model.

9.3 c. (4 points) Generate predictions on the response scale µ and round them to the nearest count. Create a table as in part a and comment on how similar or dissimilar this result is.

Predicted_Means <- predict(Poisson_Mod,type = "response")
#X2 <- sum((Insurance$ClaimNb - Predicted_Means)ˆ2/Predicted_Means)
Insurance$Predicted_Class <- round(Predicted_Means,0)

#count(Insurance, Predicted_Class)

The result is not very similar. There are way more 0 then the actual data.

9.4 d. (3 points) Fit a negative binomial model to the data and generate predictions as in part c. Did this solve the problem?

#S26
NegBinom_Mod <- MASS::glm.nb(ClaimNb ~ VehAge + DrivAge + BonusMalus + Density,Insurance)
# Convert back to glm() style object
NegBinom_Mod_glm <- MASS::glm.convert(NegBinom_Mod)

summary(NegBinom_Mod_glm, dispersion=1)

## 
## Call:
## stats::glm(formula = ClaimNb ~ VehAge + DrivAge + BonusMalus + 
##     Density, data = Insurance, family = negative.binomial(theta = 2.89746020004541, 
##     link = log))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0799  -0.3360  -0.3026  -0.2805   4.1113  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.9864644  0.1358338 -36.710  < 2e-16 ***
## VehAge       0.0005578  0.0036355   0.153    0.878    
## DrivAge      0.0111508  0.0015060   7.404 1.32e-13 ***
## BonusMalus   0.0197329  0.0010930  18.053  < 2e-16 ***
## Density      0.0531295  0.0112863   4.707 2.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.8975) family taken to be 1)
## 
##     Null deviance: 13623  on 44998  degrees of freedom
## Residual deviance: 13298  on 44994  degrees of freedom
## AIC: 18738
## 
## Number of Fisher Scoring iterations: 1

Predicted_Means_NB <- predict(NegBinom_Mod,type = "response")

Insurance$Predicted_Class_NB <- round(Predicted_Means_NB,0)

#count(Insurance, Predicted_Class_NB)

The negative Binomial result is quite similar to the Possion. Thus, it still has the issue of over-predicting 0s.

9.5 e. (5 points) Now fit a zero inflation Poisson model (ZIP) and compute the fitted proportion of zero counts as on slide 38 of the Poisson Notes. Does this help?

#S34
#install.packages(pscl)
library(pscl)

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

ZIP_Mod <- zeroinfl(ClaimNb ~ VehAge + DrivAge + BonusMalus + Density,Insurance)
summary(ZIP_Mod)

## 
## Call:
## zeroinfl(formula = ClaimNb ~ VehAge + DrivAge + BonusMalus + Density, 
##     data = Insurance)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7772 -0.2379 -0.2141 -0.1965 14.0577 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.350206   0.244515 -17.791  < 2e-16 ***
## VehAge       0.055968   0.009113   6.141 8.17e-10 ***
## DrivAge      0.006835   0.002815   2.428   0.0152 *  
## BonusMalus   0.015519   0.001764   8.796  < 2e-16 ***
## Density      0.011674   0.019009   0.614   0.5391    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.24320    0.90679   0.268  0.78855    
## VehAge       0.20267    0.02203   9.198  < 2e-16 ***
## DrivAge     -0.02088    0.01168  -1.789  0.07368 .  
## BonusMalus  -0.01744    0.00691  -2.524  0.01160 *  
## Density     -0.17969    0.06359  -2.826  0.00471 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 33 
## Log-likelihood: -9333 on 10 Df

#S38 
Pred_zero <- predict(ZIP_Mod,type="zero")
Pred_Count <- predict(ZIP_Mod,type="count")
Total_Prob0 <- Pred_zero + (1-Pred_zero)*exp(-Pred_Count)
mean(Total_Prob0)

## [1] 0.9487571

#count(Pred_zero)

Now it helps. The proportion of 0 is close to the actual data. Moreover, there are more other count (i.e., >0) as well.

9.6 f. (8 points) Interpret the signs of the coefficient estimates out of the ZIP model, both the count portion as well as the zero inflated portion!

data.frame(exp.coef = exp(coef(ZIP_Mod)))

##                     exp.coef
## count_(Intercept) 0.01290416
## count_VehAge      1.05756403
## count_DrivAge     1.00685853
## count_BonusMalus  1.01563970
## count_Density     1.01174223
## zero_(Intercept)  1.27532174
## zero_VehAge       1.22467152
## zero_DrivAge      0.97933313
## zero_BonusMalus   0.98271136
## zero_Density      0.83552783

#??? why it cannot run as S36

The higher the driver age and bonus miles and density reduce the likelihood of fire a claim. The increase in vehicle age increases the likelihood to fire a claim.

The higher the driver age and bonus miles and density increases the number of claims being fired once they fire for a claim. The vehicle age also increases the number of claim to be fired.