Frank's second book
1 Selling and buying process
1.1 Salespeople skill/perception
1.2 Communication: intraorganizational
1.3 Communication: interorganizational (FLE interactions)
1.4 Sales-marketing interface
1.5 Firm-level impact on salespeople
1.6 Salesperson trait orientation
1.7 Salesperson non-sales activity/service
1.8 B2G selling
2 Literature Review Note
2.1 Persuasion Knowledge Model review
2.1.1 Summary of the PKM in online environments
3 Database marketing substantive domain
3.1 Key Issues
3.1.1 Data Privacy
3.1.2 Customer lifetime value (LTV)
3.2 Method
3.2.1 RFM
3.2.2 Market basket analysis
3.2.3 Collaborative filtering
3.2.4 Cluster analysis
3.2.5 Decision trees
3.2.6 Machine learning
3.3 Sub-substantive areas
3.3.1 Acquisition
3.3.2 Retention/churn management
3.3.3 Cross-selling and up-selling
3.3.4 Reward program
3.3.5 Multichannel customer
4 Data Scientist Job
4.1 Business Strategy Track (a.k.a. Marketing Analytics)
4.1.1 Database marketing
4.1.2 Programming
4.1.3 Statistics
4.1.4 Visualization
4.1.5 Automation
4.2 Consumer Insight Track (a.k.a. Marketing Research)
4.3 Optimization Track (a.k.a. Operational Research)
4.3.1 Model optimization
4.3.2 Macro-level models
4.3.3 Academic paper implementation
5 Marketing Strategy PhD skills
5.1 Reading
5.1.1 Meta-skills to gain
5.1.2 Reading purpose
5.1.3 How to train
5.1.4 Suggested reading (optional)
5.2 Writing
6 Applied Stats Model II - Exam 1
6.1 1. (7 points) In a Poisson regression problem with one explanatory variable x, the estimate of the β
6.2 2. (5 points) In a generalized linear model, why is the null deviance typically larger than the residual deviance?
6.3 3. (5 points) Can standard linear regression ever be used when the data are counts? Explain why or why not.
6.4 4. Take the following random-effects model
6.4.1
6.4.2 (3 points) Find the intraclass correlation coefficient.
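For reference, one standard setup (an assumption here, since the model statement itself is not reproduced above): under a one-way random-intercept model, the intraclass correlation is the share of total variance attributable to the random effect,

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad \alpha_i \sim N(0, \sigma^2_{\alpha}), \quad \varepsilon_{ij} \sim N(0, \sigma^2_{\varepsilon}), \qquad \rho_{\text{ICC}} = \frac{\sigma^2_{\alpha}}{\sigma^2_{\alpha} + \sigma^2_{\varepsilon}}.$$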
6.4.3 (2 points) Which level of the random effect α will have the largest predicted value?
6.4.4 (2 points) Predict y for level 6 of the random effect.
6.5 5. Utilizing the below code and output, answer the following:
6.5.1 • (3 points) Why are the residual and null deviance values the same?
6.5.2 • (2 points) Based on these deviance values, does the model "fit" well? Does this make sense?
6.6 6. Take the data given by Problem6Data.csv, which contains a continuous response y and two predictors x1 and x2.
6.6.1 a. (5 points) Read the data into R and make some plots between the response and predictors. Describe the patterns and comment on behaviors you observe with exact 0s.
6.6.2 b. (5 points) Make a histogram of the response variable and comment on the shape. Are the values strictly positive?
6.6.3 c. (4 points) State the range that the index parameter p, for use in the Tweedie distribution, must be contained in. Why is this the case?
6.6.4 d. (6 points) Find an optimal value of p for use in the Tweedie distribution, using a log link.
6.6.5 e. (5 points) Again using a log link, fit a Tweedie distribution using the index parameter value you obtained in part d.
6.6.6 f. (6 points) For each of the values of x2 given by the R vector x2_vals <- seq(0, 15, by = .01), use the compound Poisson-Gamma structure to predict the probability of obtaining an exact 0 response when x1 is at its mean value.
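A minimal R sketch for parts d-f, assuming the tweedie and statmod packages (the file and variable names come from the question itself; the grid for p is an assumption):

library(statmod)   # tweedie() family for glm()
library(tweedie)   # tweedie.profile() for choosing p

dat <- read.csv("Problem6Data.csv")

# Part d: profile the index parameter p over (1, 2) with a log link (link.power = 0)
prof <- tweedie.profile(y ~ x1 + x2, data = dat,
                        p.vec = seq(1.1, 1.9, by = 0.05), link.power = 0)
p.opt <- prof$p.max

# Part e: fit the Tweedie GLM with the chosen p
fit <- glm(y ~ x1 + x2, data = dat,
           family = tweedie(var.power = p.opt, link.power = 0))

# Part f: compound Poisson-Gamma gives P(Y = 0) = exp(-lambda),
# with lambda = mu^(2 - p) / (phi * (2 - p))
phi <- summary(fit)$dispersion
newdat <- data.frame(x1 = mean(dat$x1), x2 = seq(0, 15, by = .01))
mu <- predict(fit, newdata = newdat, type = "response")
p0 <- exp(-mu^(2 - p.opt) / (phi * (2 - p.opt)))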
6.7 7. The ships dataset in the MASS package provides the number of incidents (incidents) resulting in ship damage as a function of total months of service (service) and ship type (type), as well as:
• Year of ship construction (year), broken into 5-year increments, with the variable's value indicating the beginning of the period; e.g., year=60 refers to 1960-1964.
• Period of operation (period), with realizations:
– 60: 1960-1974
– 75: 1975-1979
The data can be loaded with: data(ships, package="MASS")
6.7.1 a. (4 points) Use group_by and summarise to compute the average months of service for each combination of year of construction and period of operation. Explain why there is a 0 in the result.
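A minimal dplyr sketch for part a; the comment gives the standard reading of the design:

library(dplyr)
data(ships, package = "MASS")

ships %>%
  group_by(year, period) %>%
  summarise(avg_service = mean(service), .groups = "drop")
# A 0 appears because ships constructed in 1975-1979 cannot have
# operated during the 1960-1974 period.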
6.7.2 b. (3 points) Create a new filtered dataset by removing the rows where service=0.
6.7.3 c. (10 points) Fit both a Poisson rate model and a negative binomial rate model using number of incidents per service month as the response and all other variables as predictors. Compare which model is better using AIC.
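One way to set up the two rate models (counts with log(service) as an offset); treating year and period as factors is an assumption, since the dataset stores them as integers:

library(MASS)   # glm.nb
ships2 <- subset(ships, service > 0)

fit.pois <- glm(incidents ~ type + factor(year) + factor(period) + offset(log(service)),
                family = poisson, data = ships2)
fit.nb <- glm.nb(incidents ~ type + factor(year) + factor(period) + offset(log(service)),
                 data = ships2)
AIC(fit.pois, fit.nb)   # lower AIC indicates the preferred model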
6.7.4 d. (7 points) Using the better model selected in part c, assess the significance of each term using drop1(). Which terms are insignificant? Remove the insignificant variables from the model in c and refit.
6.7.5 e. (6 points) Statistically compare the residual deviance in your final model from d to the null deviance.
6.7.6 f. (4 points) Compare the AIC from the reduced model in d to the corresponding value from c. Is the change you see expected? Also examine the residual deviance of the model in d; what does it say about the model's fit to the data?
6.7.7 g. (6 points) Make a plot of the deviance residuals vs. the predicted values (link scale). How is the fit?
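A sketch of parts d, e, and g, assuming fit.pois was preferred in part c and fit.final denotes the refit after dropping insignificant terms (both object names hypothetical):

drop1(fit.pois, test = "Chisq")   # part d: term-wise deviance tests

# Part e: likelihood ratio test of the final model against the null model
pchisq(fit.final$null.deviance - deviance(fit.final),
       df = fit.final$df.null - fit.final$df.residual, lower.tail = FALSE)

# Part g: deviance residuals vs. predictions on the link scale
plot(predict(fit.final, type = "link"),
     residuals(fit.final, type = "deviance"),
     xlab = "Linear predictor", ylab = "Deviance residual")
abline(h = 0, lty = 2)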
7 Question 1
7.1 a. (8 points) Create plots to examine how launch speed and angle may affect the probability of a home run and describe your findings.
7.2 b. (9 points) Fit a logistic regression model with home_run as the response and all other variables as predictors. Conduct a deviance test to assess if this model is better than the null model.
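A sketch for part b, assuming the batted-ball data live in a data frame called batted (hypothetical name) with home_run coded 0/1:

fit.full <- glm(home_run ~ ., family = binomial, data = batted)

# Deviance (likelihood ratio) test against the null model
pchisq(fit.full$null.deviance - deviance(fit.full),
       df = fit.full$df.null - fit.full$df.residual, lower.tail = FALSE)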
7.3 c. (8 points) Conduct deviance tests with the drop1() function to assess the significance of each individual variable and report the results. Compare the p-values to those obtained from summary().
7.4 d. (6 points) Fit a smaller model after removing all variables which are insignificant using α = 0.05. Compare this model to the larger model; are they significantly different? What are the implications of this with regard to model selection? Until the end of this question, use the smaller model for all analysis.
7.5 e. (7 points) How does the launch speed after the ball is hit affect the odds of a home run occurring? Provide a confidence interval for this value.
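With a logit link, a coefficient exponentiates to a multiplicative effect on the odds. A sketch assuming the reduced model from part d is fit.small and the variable is named launch_speed:

exp(coef(fit.small)["launch_speed"])      # odds multiplier per unit of launch speed
exp(confint(fit.small, "launch_speed"))   # profile-likelihood 95% CI on the odds scale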
7.6 f. (5 points) Using the deviance residuals, make a binned residual vs. fitted probability plot and comment on the fit of the model.
7.7 g. (4 points) Using a probability of 0.5 as the threshold for predicting an observation yielding a home run, create a table classifying the predictions against the observed values. Describe your findings. What is the misclassification rate?
7.8 h. (6 points) Using probability thresholds from 0.005 to 0.995, obtain the sensitivities and specificities of the resulting predictions. Create an ROC plot and comment on the effectiveness of the model's ability to correctly classify the response. As we vary the threshold to determine classifications, is the inverse relationship between sensitivity and specificity strongly evident?
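A sketch of part h's threshold sweep, reusing the object names assumed in the earlier sketches:

thresholds <- seq(0.005, 0.995, by = 0.005)
phat <- predict(fit.small, type = "response")
sens <- spec <- numeric(length(thresholds))
for (i in seq_along(thresholds)) {
  pred <- as.numeric(phat > thresholds[i])
  sens[i] <- mean(pred[batted$home_run == 1] == 1)   # true positive rate
  spec[i] <- mean(pred[batted$home_run == 0] == 0)   # true negative rate
}
plot(1 - spec, sens, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)   # reference line for a no-skill classifier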
7.9 i. (5 points) Produce a plot of the sensitivity and specificity against the threshold. Is there a threshold for classification you would recommend that provides a good balance between the two? Make another confusion matrix using this cutoff; how does the result compare to the previous one? Consider the types of errors you observe.
7.10 j. (3 points) Consider a logistic model with only launch_angle and launch_speed being used to predict the probability of a home run. What is the AIC of this model?
7.11 k. (11 points) Create a dummy variable which is 1 if launch_angle is between 20 and 40 degrees and use this variable in your model instead of the raw value for launch_angle. Then, complete the following:
1. Compare the AIC of this model to the model in part j; which model is better?
2. What does the coefficient of your dummy variable mean? Interpret the value.
3. Interpret the value of the intercept by converting to a probability. Does this result make sense?
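A sketch for part k (the dummy-variable name sweet_spot is hypothetical); plogis() converts the intercept's log-odds to a probability:

batted$sweet_spot <- as.numeric(batted$launch_angle >= 20 & batted$launch_angle <= 40)
fit.dummy <- glm(home_run ~ launch_speed + sweet_spot, family = binomial, data = batted)
AIC(fit.dummy)
plogis(coef(fit.dummy)["(Intercept)"])   # P(home run) at launch_speed = 0, sweet_spot = 0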
8 Question 2
8.1 a. (5 points) Make a plot or set of tables showing the distribution of LSD use between genders and interpret. You can use code similar to that on slide 84 of the course notes.
8.2 b. (9 points) Fit a proportional odds model using LSD as the response variable and the other variables listed above as predictors. Using drop1(), test if the variables are significant or insignificant and describe your results.
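A sketch for part b using MASS::polr, assuming the survey data frame is called drug (hypothetical) and LSD is an ordered factor:

library(MASS)
fit.po <- polr(LSD ~ ., data = drug, Hess = TRUE)   # proportional odds model
drop1(fit.po, test = "Chisq")                       # term-wise deviance tests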
8.3 c. (6 points) Interpret the values of the intercepts θ_j.
8.4 d. (10 points) Print the coefficient table and interpret the values of the significant personality characteristics.
8.5 e. (7520 only) (6 points) Explore interacting some of the categorical demographic variables with the personality measurements and report your findings. Does the nature of any personality characteristic's effect on LSD usage change according to your analysis?
9 1) Modeling the Number of Claims
9.1 a. (5 points) Create a table to present the distribution of the number of claims by computing the proportion of each possibility. What percent of the policies have 0 claims?
9.2 b. (6 points) Create a Poisson regression to predict the number of claims in the year using only the variables for vehicle age, driver age, bonusMalus, and density. What is the deviance? Does it differ significantly from the null deviance?
9.3 c. (4 points) Generate predictions on the response scale µ and round them to the nearest count. Create a table as in part a and comment on how similar or dissimilar this result is.
9.4 d. (3 points) Fit a negative binomial model to the data and generate predictions as in part c. Did this solve the problem?
9.5 e. (5 points) Now fit a zero-inflated Poisson (ZIP) model and compute the fitted proportion of zero counts as on slide 38 of the Poisson notes. Does this help?
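A sketch of parts b-e, assuming the policy data frame is called freq and the columns follow the usual French motor-claims naming (ClaimNb, VehAge, DrivAge, BonusMalus, Density; all of these names are assumptions):

library(MASS)   # glm.nb
library(pscl)   # zeroinfl

fit.pois <- glm(ClaimNb ~ VehAge + DrivAge + BonusMalus + Density,
                family = poisson, data = freq)
fit.nb <- glm.nb(ClaimNb ~ VehAge + DrivAge + BonusMalus + Density, data = freq)

# ZIP: same regressors in the count and zero-inflation parts by default
fit.zip <- zeroinfl(ClaimNb ~ VehAge + DrivAge + BonusMalus + Density,
                    data = freq, dist = "poisson")
mean(predict(fit.zip, type = "prob")[, 1])   # fitted proportion of zero counts
mean(freq$ClaimNb == 0)                      # observed proportion of zeros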
9.6 f. (8 points) Interpret the signs of the coefficient estimates from the ZIP model, both the count portion as well as the zero-inflation portion!
10 2) Modeling the Average Claim Payout
10.1 a. (3 points) In this problem we will focus on modeling the positive continuous variable AvgClaimAmount. Attempt to fit a Gamma regression with a log link, predicting AvgClaimAmount as a function of all feature variables in the dataset. What happens when you attempt to fit this model?
10.2 b. (5 points) Create a new dataset consisting of the rows that correspond to strictly positive realizations of AvgClaimAmount. Make a histogram of this variable on the standard and log scale and describe your findings.
10.3 c. (4 points) Fit the Gamma regression proposed in part a to the filtered dataset. Do you notice anything strange? How many iterations did it take glm() to find this result?
10.4 d. (3 points) To the glm() function, add the argument control = list(maxit = 500). What do you think this will do? Fit the model and examine the summary output to check the number of iterations needed for convergence.
10.5 e. (3 points) From the model in d, interpret the parameter estimate for a vehicle's age.
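A sketch of parts c-e on the filtered data (the data frame pos and the coefficient name VehAge are assumptions):

pos <- subset(freq, AvgClaimAmount > 0)
fit.gamma <- glm(AvgClaimAmount ~ ., data = pos, family = Gamma(link = "log"),
                 control = list(maxit = 500))   # raises the IWLS iteration limit
summary(fit.gamma)   # check "Number of Fisher Scoring iterations"

# With a log link, exp(beta) is the multiplicative effect on the mean claim amount
exp(coef(fit.gamma)["VehAge"])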
11 3. Tweedie
11.1 a. (8 points) Using the entire dataset, create a profile likelihood of the index values p for use in a glm utilizing the Tweedie distribution to predict PurePremium directly. Make a plot of the profile likelihood and select the best value. (7520 - 3 points) For graduate students, I expect some exploration to find the best value, as these data are quite poorly behaved. Look at the "Value" section of ?tweedie.profile for some hints, and I would suggest narrowing the search space for p and evaluating it on a somewhat fine grid.
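A sketch of the narrowed, fine-grid search the hint suggests (the formula's predictors and the grid endpoints are assumptions):

library(tweedie)
prof <- tweedie.profile(PurePremium ~ VehAge + DrivAge + BonusMalus + Density,
                        data = freq, link.power = 0,
                        p.vec = seq(1.2, 1.8, by = 0.01), do.plot = TRUE)
prof$p.max   # index value maximizing the profile likelihood (see ?tweedie.profile, "Value")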
11.2 b. Fit the Tweedie GLM model using the optimal power value you computed from the previous question. The file Insurance_test.csv contains a few additional observations which were not part of the data used to fit the previous models. We can read it in and format it using syntax similar to when we started:
NewDataPoints <- read.csv("…./Insurance_test.csv")
NewDataPoints$VehBrand <- as.factor(NewDataPoints$VehBrand)
NewDataPoints$VehPower <- as.factor(NewDataPoints$VehPower)
NewDataPoints$VehGas <- as.factor(NewDataPoints$VehGas)
NewDataPoints$Region <- as.factor(NewDataPoints$Region)
NewDataPoints$Area <- as.factor(NewDataPoints$Area)
11.3 c. (7 points) Use your Poisson model from 1b and your final Gamma model from 2d to generate respective predictions for the number of claims in a year as well as the average amount per claim. Multiply these together and call this product purepremium_prod. Similarly, use your Tweedie model from 3b to directly predict the PurePremium value and store this in purepremium_TW.
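A sketch of part c; the model object names (fit.pois from 1b, fit.gamma from 2d, fit.tw from 3b) carry over from the earlier sketches and are all hypothetical:

pred.n   <- predict(fit.pois, newdata = NewDataPoints, type = "response")    # claims per year (1b)
pred.sev <- predict(fit.gamma, newdata = NewDataPoints, type = "response")   # average amount per claim (2d)
purepremium_prod <- pred.n * pred.sev

purepremium_TW <- predict(fit.tw, newdata = NewDataPoints, type = "response")   # direct prediction (3b)
cbind(purepremium_prod, purepremium_TW)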
12 STAT 4520/7520 - Homework 3, Spring 2022. Due: March 30, 2022
12.1 1) The cake dataset in the lme4 package contains data from an experiment conducted to determine the effect of recipe and baking temperature on chocolate cake quality. Fifteen batches of cake mix for each recipe were prepared. Each batch was sufficient for six cakes, and each of the six cakes was baked at a different, randomly assigned temperature. As an indicator of quality, the breakage angle of the cake was recorded.
12.1.1 a. Of the explanatory variables recipe, temp, and replicate, which are fixed and which are random? Explain. Is there any nesting in these variables?
12.1.2 b. Fit a linear model with no random effects for breakage angle against the interaction between recipe and temperature. Which variables are significant? What is the t-value for temp, the temperature variable? What is the RMSE of this model?
12.1.3 c. Fit a mixed model to the data which takes into account the replicates of the recipes. Examine the variance components. Is there variation in the batches made with each recipe? How does the residual variance component compare to the MSE from (b)? What is the t-value for temperature?
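A sketch of parts b and c; the nesting term (1 | recipe:replicate) is one plausible specification for batches (replicates) within recipes:

library(lme4)
data(cake, package = "lme4")

fit.lm <- lm(angle ~ recipe * temp, data = cake)   # part b: fixed effects only
summary(fit.lm)

fit.mm <- lmer(angle ~ recipe * temp + (1 | recipe:replicate), data = cake)   # part c
summary(fit.mm)   # variance components and t-values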
12.1.4 d. Test for a recipe and temperature effect.
12.1.5 e. Create a QQ plot of the residuals and a residual vs. predicted plot for the linear model in (b) and the mixed model in (d), and describe how well the model assumptions are satisfied for each.
12.1.6 f. (7520 only) Examine the BLUPs from the mixed model in (d). Do you notice any patterns?
12.2 2) The purpose of random effects is to remove variation from known sources. In this problem, we will use this idea and see how it can work with a popular technique for predictive modeling, the random forest. Random forests were covered in the prerequisite course and can be fit using the randomForest function in the randomForest library. Problem-2.csv contains a simulated dataset. There is a nonlinear relationship between 6 numeric explanatory variables x1, ..., x6 and the response y, in addition to a grouping effect according to the ID variable, which has 100 levels.
12.2.1 a. Read in the dataset and split the data into a training and test set, using 95% of the data for the training set. Set a seed of 1 for consistency of results.
12.2.2 b. Fit a random forest model to the training data using y as the response with x1, ..., x6 and ID as the explanatory variables. Use the model to predict y in the test set and compute the test root mean squared error (RMSE).
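A sketch of parts a and b; ID is left numeric here because randomForest cannot split on factors with more than 53 levels (variable names follow the question; the split mechanics are an assumption):

library(randomForest)
dat <- read.csv("Problem-2.csv")

set.seed(1)   # for reproducible splitting
train.idx <- sample(nrow(dat), floor(0.95 * nrow(dat)))
Data.Train <- dat[train.idx, ]
Data.Test  <- dat[-train.idx, ]

rf1 <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6 + ID, data = Data.Train)
sqrt(mean((Data.Test$y - predict(rf1, Data.Test))^2))   # test RMSE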
12.2.3 c. Fit the mixed model y_ij = µ + ID_i + ε_ij to the training data. This model does not use the x's at all and has a random intercept based on ID. After fitting the model, extract the BLUPs for ID and add them to the constant µ̂, the estimate of the overall intercept. What do these values represent? Store them in a data frame along with a column storing the IDs.
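A sketch of part c; the data frame name int_plus_blups matches the join shown in part d, while the column name group_pred is hypothetical:

library(lme4)
mm <- lmer(y ~ 1 + (1 | ID), data = Data.Train)   # random intercept only; ID is coerced to a factor

# Overall intercept plus each ID's BLUP = the model's predicted group mean
int_plus_blups <- data.frame(
  ID = as.numeric(rownames(ranef(mm)$ID)),
  group_pred = fixef(mm)[1] + ranef(mm)$ID[, 1]
)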
12.2.4 d. Add a column to the test dataset containing the values obtained in part c, joined to the test dataset by matching the IDs. Note: with the dplyr package loaded, you can quickly join by ID using a line like:
Data.Test <- left_join(Data.Test, int_plus_blups, by = "ID")
Examine the first few rows and describe what the new column represents.
12.2.5 e. Compute the residuals of the mixed model fit in (c) using the residuals() function and store them in a column of the training data.
12.2.6 f. Fit another random forest model to the training data, using the residuals from part (e) as the response and only x1, ..., x6 as the explanatory variables. Use this model to predict into the test data and store these predictions. What do these predictions represent?
12.2.7 g. Add the predictions in (f) to the values joined to the test data in (d). Consider these values test predictions and compute the test RMSE. Compare this to the test RMSE computed in part (b). What do you observe?
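A sketch of parts d-g, continuing the objects above (group_pred is the hypothetical column joined in part d):

library(dplyr)
Data.Test <- left_join(Data.Test, int_plus_blups, by = "ID")   # part d

Data.Train$resid <- residuals(mm)   # part e
rf2 <- randomForest(resid ~ x1 + x2 + x3 + x4 + x5 + x6, data = Data.Train)   # part f

# Part g: group mean (intercept + BLUP) plus RF-predicted residual
pred.final <- Data.Test$group_pred + predict(rf2, Data.Test)
sqrt(mean((Data.Test$y - pred.final)^2))   # compare with the RMSE from part (b)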
13 Applied Stats Model II Lecture Note
13.1 List of Questions
13.1.1 HW1
13.1.2 HW2
13.1.3 Midterm
13.2 Class content overview
13.3 GLM-CategoricalData_SP2022
13.4 GLM-CountData_SP2022
13.5 GLMs_AdvancedTopics_SP2022
13.6 MM-RandomEffects_Spring2022