1. Motivation

This project helps Emil City target homeowners who qualify for a home repair tax credit program. The credit has been offered for close to twenty years, and while the Department of Housing and Community Development (HCD) proactively reaches out to eligible homeowners every year, uptake is woefully inadequate: typically only 11% of eligible homeowners take the credit.
The consensus at HCD is that the low conversion rate stems from the agency reaching out to eligible homeowners at random. Unfortunately, we do not know the cost/benefit of previous campaigns, but we should assume it was poor. To move toward a more targeted campaign, this project builds a decision-making analytic that can better allocate limited outreach resources.

options(scipen=10000000)

library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

housing <- read.csv("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter6/housingSubsidy.csv")

2. Data Visualizations and Exploration

housing  %>%
  dplyr::select(y,unemploy_rate,inflation_rate, spent_on_repairs, age, cons.price.idx, cons.conf.idx) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(y, value, fill=y)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="enter the program", y="Value", 
           title = "Feature associations with the likelihood of take the Subsidy") +
      theme(legend.position = "none")

These mean-value bar plots are hard to interpret on their own. The only clear signal is that the inflation rate may have some influence on the outcome.

housing  %>%
  dplyr::select(y,unemploy_rate,inflation_rate, spent_on_repairs, age, cons.price.idx, campaign) %>%
  gather(Variable, value, -y) %>%
  ggplot() + 
  geom_density(aes(value, color=y), fill = "transparent") + 
  facet_wrap(~Variable, scales = "free") +
  scale_fill_manual(values = palette2) +
  labs(title = "Feature distributions enter vs. no enter",
        subtitle = "(continous outcomes)")

The density plots suggest that age, inflation_rate, spent_on_repairs, unemploy_rate, and several other features separate the two outcomes around particular values. For example, whether a homeowner has spent less or more than $5,100 on repairs appears to be associated with whether they enter the program.
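
As a quick check of that apparent cutoff, the short sketch below (an exploratory check only, assuming y is coded "yes"/"no" as the cross-validation output later shows) tabulates take-up on either side of $5,100 spent on repairs.

housing %>%
  # split homeowners at the apparent $5,100 repair-spending cutoff
  mutate(repair_split = ifelse(spent_on_repairs < 5100, "under 5,100", "5,100 or more")) %>%
  group_by(repair_split) %>%
  # take-up rate (% entering the program) within each side of the split
  summarize(n = n(), pct_enter = 100 * mean(y == "yes"))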

housing %>%
    dplyr::select(y, job, marital, education, mortgage) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="Click", y="Value",
             title = "Feature associations with the likelihood of enter the program",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

housing %>%
    dplyr::select(y,taxbill_in_phl, contact, poutcome,month) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="Click", y="Value",
             title = "Feature associations with the likelihood of enter the program",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Common sense suggests that age, job, marital status, and education level influence how people respond to outreach (mailers, phone calls, and information/counseling sessions at the HCD offices). Some of the other features in this dataset look more like marketing-analysis variables. This project first uses all features to build a model.

3. Split data into a 65/35 training/test set.

housing <-
  housing %>%
  dplyr::filter(education !="illiterate")

set.seed(42)
trainIndex <- createDataPartition(housing$y, p = .65,
                                  list = FALSE,
                                  times = 1)
housingTrain <- housing[ trainIndex,]
housingTest  <- housing[-trainIndex,]
housingModel <- glm(y_numeric ~ .,
                  data=housingTrain %>% 
                    dplyr::select( y_numeric,unemploy_rate, cons.conf.idx,marital,age,job,
                                   marital,education,taxLien,mortgage,day_of_week,campaign,
                                   pdays,previous,cons.price.idx,poutcome,
                                   contact,month,taxbill_in_phl),
                  family="binomial" (link="logit"))
testProbs1 <- data.frame(Outcome = as.factor(housingTest$y_numeric),
                        Probs = predict(housingModel, housingTest, type= "response"))

ggplot(testProbs1, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Enter the Program", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

The Sensitivity (True Positive Rate) of the all-features model is very low.
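
As a quick check of that claim, the minimal sketch below computes the test-set sensitivity of the all-features model at an assumed 50% cutoff, using the testProbs1 data frame built above.

testProbs1 %>%
  # classify as a positive prediction when the predicted probability exceeds 0.5
  mutate(predOutcome = ifelse(Probs > 0.5, 1, 0)) %>%
  # sensitivity = share of observed positives that were predicted positive
  filter(Outcome == "1") %>%
  summarize(sensitivity = mean(predOutcome))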

4. Engineer new features.

a. Interpret new features.

There are many job, education, and marital categories in the dataset, and most have only a handful of observations. So an "Enter Average" column was created for each of these features, replacing the categorical variable with a continuous one (the percentage of homeowners in each category who entered the program).
For age, repair cost, campaign, taxLien, and unemployment rate, new "Grade" columns were created to form meaningful categorical variables, with cut points based on the visualizations in Section 2.

#job
housing_job <- 
  housing %>% 
  group_by(job) %>% 
  summarize(totEnters = sum(y_numeric), 
            n = n(), 
            jobEnterAvg = 100*(totEnters/n)) %>%
  dplyr::select(-n, -totEnters) 
housing <- right_join(housing, housing_job)
#education
housing_education <- 
  housing %>% 
  group_by(education) %>% 
  summarize(totEnters = sum(y_numeric), 
            n = n(), 
            eduEnterAvg = 100*(totEnters/n)) %>%
  dplyr::select(-n, -totEnters) 
housing <- right_join(housing, housing_education)
#marital
housing_marital <- 
  housing %>% 
  group_by(marital) %>% 
  summarize(totEnters = sum(y_numeric), 
            n = n(), 
            marEnterAvg = 100*(totEnters/n)) %>%
  dplyr::select(-n, -totEnters) 
housing <- right_join(housing, housing_marital)
#campaign
housing_campaign <- 
  housing %>% 
  group_by(campaign) %>% 
  summarize(totEnters = sum(y_numeric), 
            n = n(), 
            campaignEnterAvg = 100*(totEnters/n)) %>%
  dplyr::select(-n, -totEnters) 
housing <- right_join(housing, housing_campaign)
#season
housing <- 
  housing %>% 
  mutate(season = case_when(month == "apr"|month == "mar"| month == "may"  ~ "Spring",
                            month == "aug"|month == "jun"| month == "jul"  ~ "Summer",
                            month == "sep"|month == "oct"| month == "nov"  ~ "Fall",
                            month == "dec"  ~ "Winter"))
housing<-
  housing %>%
  mutate(ageGrade = case_when(age < 31 ~ "30-",
                              age > 30 & age < 56 ~ "31-55",
                              age > 55 ~ "55+")) %>%
  mutate(repairGrade = case_when(spent_on_repairs < 5100 ~ "5100-",
                                 spent_on_repairs > 5100 ~ "5100+")) %>%
  mutate(inflationGrade = case_when(inflation_rate < 2  ~ "2-",
                                    inflation_rate >= 2 ~ "2+")) %>%
  mutate(unemployGrade = case_when(unemploy_rate < 0 ~ "0-",
                                   unemploy_rate> 0 ~ "0+")) %>%
  mutate(taxLienGrade = case_when(taxLien == "no" ~ 0,
                                  taxLien == "unknown" ~ 0,
                                  taxLien == "yes" ~ 1)) %>%
  mutate(campaignGrade = case_when(campaign < 5 ~ "5-",
                                   campaign > 4 ~ "5+"))
set.seed(42)
trainIndex <- createDataPartition(housing$y, p = .65,
                                  list = FALSE,
                                  times = 1)
housingTrain2 <- housing[ trainIndex,]
housingTest2  <- housing[-trainIndex,]

housingModel2 <- glm(y_numeric ~ .,
                  data=housingTrain2 %>% 
                    dplyr::select(y_numeric, contact,poutcome,cons.price.idx,
                                  cons.conf.idx,inflation_rate,campaignGrade,
                                  taxLienGrade,unemployGrade,repairGrade,ageGrade,
                                  marEnterAvg,eduEnterAvg,jobEnterAvg,month),
                  family="binomial" (link="logit"))

b. Show a regression summary for both regressions.

Summary of the all-features model

summary(housingModel)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingTrain %>% dplyr::select(y_numeric, unemploy_rate, 
##         cons.conf.idx, marital, age, job, marital, education, 
##         taxLien, mortgage, day_of_week, campaign, pdays, previous, 
##         cons.price.idx, poutcome, contact, month, taxbill_in_phl))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0460  -0.4031  -0.3283  -0.2623   2.9576  
## 
## Coefficients:
##                                 Estimate  Std. Error z value           Pr(>|z|)
## (Intercept)                  -123.201033   19.259004  -6.397 0.0000000001583954
## unemploy_rate                  -0.652220    0.083842  -7.779 0.0000000000000073
## cons.conf.idx                   0.033696    0.019269   1.749           0.080346
## maritalmarried                  0.347779    0.261528   1.330           0.183586
## maritalsingle                   0.549085    0.297897   1.843           0.065299
## maritalunknown                  0.694174    1.252778   0.554           0.579505
## age                             0.023347    0.008427   2.771           0.005594
## jobblue-collar                 -0.496360    0.281348  -1.764           0.077695
## jobentrepreneur                -0.213794    0.434602  -0.492           0.622769
## jobhousemaid                   -0.713347    0.544357  -1.310           0.190047
## jobmanagement                  -0.187270    0.289753  -0.646           0.518079
## jobretired                     -0.765569    0.393457  -1.946           0.051685
## jobself-employed               -0.749024    0.451598  -1.659           0.097195
## jobservices                    -0.175944    0.300973  -0.585           0.558828
## jobstudent                      0.315307    0.429378   0.734           0.462745
## jobtechnician                  -0.077180    0.240852  -0.320           0.748630
## jobunemployed                   0.334214    0.393826   0.849           0.396085
## jobunknown                     -0.212211    0.793790  -0.267           0.789208
## educationbasic.6y              -0.062181    0.422584  -0.147           0.883018
## educationbasic.9y              -0.078182    0.321457  -0.243           0.807842
## educationhigh.school           -0.199638    0.315384  -0.633           0.526733
## educationprofessional.course   -0.117045    0.347107  -0.337           0.735965
## educationuniversity.degree     -0.087695    0.314391  -0.279           0.780293
## educationunknown               -0.199361    0.426140  -0.468           0.639907
## taxLienunknown                 -0.063691    0.221339  -0.288           0.773535
## taxLienyes                     -9.718028  324.744136  -0.030           0.976127
## mortgageunknown                -0.118137    0.522579  -0.226           0.821151
## mortgageyes                    -0.076442    0.144889  -0.528           0.597783
## day_of_weekmon                  0.122430    0.224620   0.545           0.585715
## day_of_weekthu                 -0.006836    0.236110  -0.029           0.976902
## day_of_weektue                  0.146929    0.233992   0.628           0.530055
## day_of_weekwed                  0.200650    0.236997   0.847           0.397199
## campaign                       -0.078360    0.041855  -1.872           0.061184
## pdays                           0.001486    0.001162   1.279           0.200800
## previous                        0.067540    0.194588   0.347           0.728521
## cons.price.idx                  1.279710    0.208807   6.129 0.0000000008861620
## poutcomenonexistent             0.278796    0.325362   0.857           0.391512
## poutcomesuccess                 3.178336    1.151165   2.761           0.005763
## contacttelephone               -0.765198    0.262037  -2.920           0.003498
## monthaug                        0.117902    0.404866   0.291           0.770889
## monthdec                        0.758828    0.650976   1.166           0.243744
## monthjul                       -0.014885    0.373986  -0.040           0.968252
## monthjun                        0.131615    0.350146   0.376           0.707001
## monthmar                        1.892476    0.497091   3.807           0.000141
## monthmay                       -0.271675    0.305262  -0.890           0.373481
## monthnov                       -0.417239    0.370392  -1.126           0.259962
## monthoct                       -0.064352    0.468008  -0.138           0.890635
## monthsep                        0.470126    0.501346   0.938           0.348385
## taxbill_in_phlyes               0.102595    0.200510   0.512           0.608882
##                                 
## (Intercept)                  ***
## unemploy_rate                ***
## cons.conf.idx                .  
## maritalmarried                  
## maritalsingle                .  
## maritalunknown                  
## age                          ** 
## jobblue-collar               .  
## jobentrepreneur                 
## jobhousemaid                    
## jobmanagement                   
## jobretired                   .  
## jobself-employed             .  
## jobservices                     
## jobstudent                      
## jobtechnician                   
## jobunemployed                   
## jobunknown                      
## educationbasic.6y               
## educationbasic.9y               
## educationhigh.school            
## educationprofessional.course    
## educationuniversity.degree      
## educationunknown                
## taxLienunknown                  
## taxLienyes                      
## mortgageunknown                 
## mortgageyes                     
## day_of_weekmon                  
## day_of_weekthu                  
## day_of_weektue                  
## day_of_weekwed                  
## campaign                     .  
## pdays                           
## previous                        
## cons.price.idx               ***
## poutcomenonexistent             
## poutcomesuccess              ** 
## contacttelephone             ** 
## monthaug                        
## monthdec                        
## monthjul                        
## monthjun                        
## monthmar                     ***
## monthmay                        
## monthnov                        
## monthoct                        
## monthsep                        
## taxbill_in_phlyes               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.5  on 2677  degrees of freedom
## Residual deviance: 1454.8  on 2629  degrees of freedom
## AIC: 1552.8
## 
## Number of Fisher Scoring iterations: 11

Summary of the new-features model

summary(housingModel2)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = housingTrain2 %>% dplyr::select(y_numeric, contact, 
##         poutcome, cons.price.idx, cons.conf.idx, inflation_rate, 
##         campaignGrade, taxLienGrade, unemployGrade, repairGrade, 
##         ageGrade, marEnterAvg, eduEnterAvg, jobEnterAvg, month))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0248  -0.4027  -0.3441  -0.2715   2.8988  
## 
## Coefficients:
##                       Estimate Std. Error z value       Pr(>|z|)    
## (Intercept)         -109.17469   22.07043  -4.947 0.000000755019 ***
## contacttelephone      -0.66407    0.26031  -2.551       0.010738 *  
## poutcomenonexistent    0.26294    0.22137   1.188       0.234919    
## poutcomesuccess        1.86126    0.30401   6.122 0.000000000922 ***
## cons.price.idx         1.18328    0.24346   4.860 0.000001172121 ***
## cons.conf.idx          0.12251    0.03626   3.378       0.000729 ***
## inflation_rate         1.19789    0.77490   1.546       0.122137    
## campaignGrade5+       -0.91294    0.33369  -2.736       0.006221 ** 
## taxLienGrade          -9.76453  324.74389  -0.030       0.976013    
## unemployGrade0+       -2.32711    0.86217  -2.699       0.006952 ** 
## repairGrade5100+      -4.74109    2.50584  -1.892       0.058488 .  
## ageGrade31-55          0.18375    0.20171   0.911       0.362305    
## ageGrade55+            0.21196    0.29889   0.709       0.478236    
## marEnterAvg            0.04118    0.04973   0.828       0.407546    
## eduEnterAvg            0.01509    0.03275   0.461       0.644970    
## jobEnterAvg            0.02592    0.01718   1.508       0.131472    
## monthaug               0.05169    0.39056   0.132       0.894700    
## monthdec               1.21328    0.66762   1.817       0.069170 .  
## monthjul               0.39140    0.38798   1.009       0.313069    
## monthjun               0.51349    0.34646   1.482       0.138311    
## monthmar               1.83401    0.49443   3.709       0.000208 ***
## monthmay              -0.10656    0.30750  -0.347       0.728947    
## monthnov              -0.19695    0.50009  -0.394       0.693708    
## monthoct              -0.02842    0.46215  -0.062       0.950961    
## monthsep               0.24442    0.51170   0.478       0.632897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.5  on 2677  degrees of freedom
## Residual deviance: 1468.3  on 2653  degrees of freedom
## AIC: 1518.3
## 
## Number of Fisher Scoring iterations: 11

c. Cross-validation and faceted plots of ROC, Sensitivity, and Specificity.

Cross-validation of the all-features model

testProbs1 <- 
  testProbs1 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs1$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs1$predOutcome, testProbs1$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1261  123
##          1   22   34
##                                              
##                Accuracy : 0.8993             
##                  95% CI : (0.8826, 0.9144)   
##     No Information Rate : 0.891              
##     P-Value [Acc > NIR] : 0.1656             
##                                              
##                   Kappa : 0.2778             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.21656            
##             Specificity : 0.98285            
##          Pos Pred Value : 0.60714            
##          Neg Pred Value : 0.91113            
##              Prevalence : 0.10903            
##          Detection Rate : 0.02361            
##    Detection Prevalence : 0.03889            
##       Balanced Accuracy : 0.59971            
##                                              
##        'Positive' Class : 1                  
## 
auc(testProbs1$Outcome, testProbs1$Probs)
## Area under the curve: 0.79
ctrl1 <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit1 <- train(y~ .,
                  data=housing %>% 
                    dplyr::select(y, unemploy_rate, cons.conf.idx,marital,age,job,marital,education,
                                  taxLien,mortgage,day_of_week, campaign,pdays,previous,
                                  cons.price.idx,poutcome,contact,month,taxbill_in_phl), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl1)

cvFit1
## Generalized Linear Model 
## 
## 4118 samples
##   17 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4076, 4077, 4077, 4077, 4078, ... 
## Resampling results:
## 
##   ROC        Sens       Spec 
##   0.7671145  0.9822673  0.221

Cross-validation of the new-features model

testProbs2 <- data.frame(Outcome = as.factor(housingTest2$y_numeric),
                        Probs = predict(housingModel2, housingTest2, type= "response"))


testProbs2 <- 
  testProbs2 %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs2$predOutcome, testProbs2$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1260  120
##          1   23   37
##                                                
##                Accuracy : 0.9007               
##                  95% CI : (0.8841, 0.9157)     
##     No Information Rate : 0.891                
##     P-Value [Acc > NIR] : 0.1261               
##                                                
##                   Kappa : 0.2987               
##                                                
##  Mcnemar's Test P-Value : 0.0000000000000009914
##                                                
##             Sensitivity : 0.23567              
##             Specificity : 0.98207              
##          Pos Pred Value : 0.61667              
##          Neg Pred Value : 0.91304              
##              Prevalence : 0.10903              
##          Detection Rate : 0.02569              
##    Detection Prevalence : 0.04167              
##       Balanced Accuracy : 0.60887              
##                                                
##        'Positive' Class : 1                    
## 
auc(testProbs2$Outcome, testProbs2$Probs)
## Area under the curve: 0.79
ctrl2 <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit2 <- train(y~ .,
                  data=housing %>% 
                    dplyr::select(y,contact,poutcome,cons.price.idx,
                                  cons.conf.idx,inflation_rate,campaignGrade,
                                  taxLienGrade,unemployGrade,repairGrade,ageGrade,
                                  marEnterAvg,eduEnterAvg,jobEnterAvg,month), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl2)

cvFit2
## Generalized Linear Model 
## 
## 4118 samples
##   14 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4077, 4077, 4076, 4077, 4078, 4077, ... 
## Resampling results:
## 
##   ROC        Sens       Spec  
##   0.7846678  0.9849925  0.2315

ROC, Sensitivity, and Specificity of the all-features model

dplyr::select(cvFit1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

ROC, Sensitivity, and Specificity of the new-features model

dplyr::select(cvFit2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines")

Comparing ROC, Sensitivity, and Specificity across the two models, the one with the engineered features performs slightly better. The True Positive Rate for both models is low. This may be caused by the small number of positives in the original dataset (very few homeowners enter the program), or by features that are only weakly related to the outcome we are trying to model.
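
The class imbalance is easy to see directly; a minimal sketch using the housing data loaded above:

housing %>%
  # count homeowners by outcome; entering the program is a rare event
  count(y) %>%
  mutate(share = n / sum(n))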

5. ROC curve for the new model

ggplot(testProbs2, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  geom_abline(slope = 0, intercept = 1, size = 0.5, color = 'black') +
  labs(title = "ROC Curve - enterModel")

A better model pulls the yellow curve toward the black reference line at y = 1, but a curve that hugs that line too closely suggests overfitting. The ROC curve is usually most useful for comparing models fit to the same dependent variable.
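
To make that comparison concrete, the minimal sketch below (assuming the testProbs1 and testProbs2 data frames built above) overlays the test-set ROC curves of the two models.

bind_rows(
  mutate(testProbs1, model = "All features"),
  mutate(testProbs2, model = "Engineered features")) %>%
  # one ROC curve per model, on the same axes
  ggplot(aes(d = as.numeric(Outcome), m = Probs, colour = model)) +
  geom_roc(n.cuts = 50, labels = FALSE) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curves - all-features vs. engineered-features model")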

6. Cost/Benefit Analysis

a. The cost/benefit equation for each confusion matrix outcome.

  1. True Positive - We correctly predicted the homeowner would take the credit; marketing resources were allocated, and 25% of these homeowners took the credit (a quick arithmetic check follows this list).

    [0.25(66000-5000-2850) - 0.75(2850)] * count = [0.25(58150)-0.75(2850)] * count = [14537.5 - 2137.5] * count = 12400 * count

  2. True Negative - We correctly predicted the homeowner would not take the credit; no marketing resources were allocated, and no credit was allocated.

    0

  3. False Positive - We incorrectly predicted the homeowner would take the credit; marketing resources were allocated, but no credit was allocated.

    -2850 * count

  4. False Negative - We predicted that a homeowner would not take the credit but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, we ‘0 out’ this category, assuming the cost/benefit of this is $0.

    0
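
As a quick arithmetic check of the true-positive value used above:

# expected value per true positive: 25% take the credit (66,000 benefit less the 5,000 credit
# and 2,850 marketing cost); the other 75% only incur the 2,850 marketing cost
0.25 * (66000 - 5000 - 2850) - 0.75 * 2850   # 12400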

b. Cost/Benefit Table

cost_benefit_table <-
   testProbs2 %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", 0 ,
               ifelse(Variable == "True_Positive", Count * 12400 ,
               ifelse(Variable == "False_Negative",0,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted no enter",
              "We correctly predicted an enter",
              "We predicted no enter but customer entered",
              "We predicted homeowner enter and homeowner did not")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table") %>% kable_styling()%>%
        footnote(general_title = "\n", general = "Table 6.1")
Cost/Benefit Table
Variable         Count   Revenue   Description
True_Negative     1260         0   We correctly predicted no enter
True_Positive       37    458800   We correctly predicted an enter
False_Negative     120         0   We predicted no enter but the homeowner entered
False_Positive      23    -65550   We predicted the homeowner would enter but they did not

Table 6.1
iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      data %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Revenue =
               ifelse(Variable == "True_Negative", 0,
               ifelse(Variable == "True_Positive",12400 * Count,
               ifelse(Variable == "False_Negative", 0,
               ifelse(Variable == "False_Positive", -2850 * Count, 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }

return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs2)

c. Confusion matrix outcomes for each Threshold.

whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Revenue by confusion matrix type and threshold",
       y = "Revenue") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

d. Threshold as a function of Total_Revenue and Total_Count_of_Credits.

whichThreshold_count <- 
whichThreshold %>% 
    group_by(Threshold) %>% 
    filter(.,Variable == "True_Positive") %>%
    summarize(total_credits = (sum(Count))* 5000 * 0.25)

  ggplot(whichThreshold_count)+
  geom_line(aes(x = Threshold, y = total_credits))+
    labs(title = "Total Credits Applied By Threshold For Test Sample")

A higher threshold means fewer positive predictions, which leads to fewer true positives (and fewer false positives). The 25% of true-positive homeowners who actually take the credit therefore also shrinks, resulting in lower total credit spending for the program.

whichThreshold_revenue <- 
whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue))

  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Revenue))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[1,1]))+
    labs(title = "Model Revenues By Threshold For Test Sample",
         subtitle = "Vertical Line Denotes Optimal Threshold")

Revenue is highest at a threshold of 0.13 for this model.
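
A small check using the whichThreshold_revenue table above pulls that revenue-maximizing threshold directly (it is the same value used for the vertical line in the plot).

whichThreshold_revenue %>%
  # threshold with the highest total revenue on the test sample
  arrange(-Revenue) %>%
  head(1)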

e. Table of the Total_Revenue and Total_Count_of_Credits allocated for 50%_Threshold and Optimal_Threshold.

total_threshold <- merge(x = whichThreshold_count, y = whichThreshold_revenue, by = "Threshold", all = TRUE)
my_threshold <- total_threshold %>%
  filter((Threshold < 0.51 & Threshold > 0.491) |(Threshold < 0.139 & Threshold > 0.129))

kable(my_threshold) %>%
  kable_styling() %>%
  footnote(general_title = "\n",
           general = "Table 6.2")
Threshold   total_credits   Revenue
     0.13          123750    888450
     0.50           46250    393250

Table 6.2

7. Conclusion

This model should not be put into production. The features in this dataset feel better suited to marketing analysis; additional features describing other aspects of these homeowners, along with more information about past campaigns, might make the model more accurate. The dataset also contains only a small number of homeowners who entered the program. Rare events are difficult to predict and simulate, but predicting them well can be extremely valuable.
The data-level approach involves resampling to reduce the class imbalance. The two commonly used sampling techniques include over-sampling and under-sampling. The algorithmic-level approach leverages machine-learning algorithms that are modified to accommodate imbalanced data. It compensates for the skew by assigning weights to respective classes, introducing biases and penalty constants. Ensemble methods involve a mixture-of-experts approach. These methods combine algorithmic and data approaches to incorporate different misclassification costs for each class in the learning phase. The two most popular ensemble-learning algorithms are Boosting and Bagging. Ensemble approaches rely on combining a large number of relatively weak and simple models to obtain a stronger ensemble prediction. (https://www.mu-sigma.com/our-musings/blog/rare-event-modeling)
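
As an illustration of the data-level approach, the minimal sketch below (a sketch only, not part of the analysis above) re-fits the engineered-features model with caret's built-in down-sampling by setting sampling = "down" in trainControl; up-sampling ("up") or SMOTE ("smote", which requires an additional package) could be swapped in the same way.

ctrlDown <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         # under-sample the majority ("no") class within each fold
                         sampling = "down")

cvFitDown <- train(y ~ .,
                   data = housing %>% 
                     dplyr::select(y, contact, poutcome, cons.price.idx,
                                   cons.conf.idx, inflation_rate, campaignGrade,
                                   taxLienGrade, unemployGrade, repairGrade, ageGrade,
                                   marEnterAvg, eduEnterAvg, jobEnterAvg, month),
                   method = "glm", family = "binomial",
                   metric = "ROC", trControl = ctrlDown)

cvFitDown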
In conclusion, rare events are hard to predict but valuable, and dealing with them well requires the additional effort outlined above.