Variable features

The dependent variable hypothesized in the article is the number of dog bite incidents per zipcode. The predictor variables include the year and month of the incident, characteristics of the dog, and the area where the incident occurred. The year is measured as an interval-level variable, with values from 1 (corresponding to 2015) to 7 (corresponding to 2021). The area of the incident is divided into variables ranging from 1 to 5, representing the five boroughs of New York City. The gender of the dog is a binary variable, with 1 indicating female and 2 indicating male. Similarly, the sterilization status of the dog is also a binary variable, with 0 indicating unsterilized and 1 indicating sterilized.

data <- data.frame(
  `Variable` = c("dog_bite_incidents", "year", "month", "gender", "spay_neuter", "borough"),
  `Range` = c(
    "$\\mathbb{N}$",
    "$\\{1, 2, 3, 4, 5, 6, 7\\}$",
    "$\\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\\}$",
    "$\\{1, 2\\}$",
    "$\\{0, 1\\}$",
    "$\\{1, 2, 3, 4, 5\\}$"
  )
)

knitr::kable(data, format = "pipe", align = "l", caption = "Variable Implications")
Variable Implications
Variable Range
dog_bite_incidents \(\mathbb{N}\)
year \(\{1, 2, 3, 4, 5, 6, 7\}\)
month \(\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\)
gender \(\{1, 2\}\)
spay_neuter \(\{0, 1\}\)
borough \(\{1, 2, 3, 4, 5\}\)

Negative Binomial Regression Analysis

To measure the combined effects of year, dog background characteristics (i.e., gender and whether neutered) and the Borough they belong to on the incidence of dog bites, we first conduct a negative binomial regression analysis using records of dog bite incidents in New York.

The negative binomial regression model is a type of count model designed for dependent variables that can only take non-negative integer values. From the Spatial Correlations section, the study area meets the three prerequisites for negative binomial regression: the dog bite incidents are not independent, exhibiting spatial clustering; the dependent variable shows overdispersion; and the frequency of dog bite incidents is sufficiently low. Let the dependent variable follows a negative binomial distribution. The model equation is as follows:

\[\ln ({\hat y_i}) = {\beta _0} + {\beta _1}{X_{i1}} + {\beta _2}{X_{i2}} + \cdots {\beta _m}{X_{im}} + \xi\]

Here, \({y_i}\) represents the frequency of dog bite incidents within a year, \({\beta _m}\) is the coefficient, \(X_m\) are the explanatory variables, and \(\xi\) represents the heterogeneity across observations.

In negative binomial regression, where the conditional expectation \(E(Y|X) = e^{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_m X_{im}}\), the interpretation of the model’s coefficients is as follows: when the explanatory variable \({X_k}\) increases by one unit, the average occurrence rate of dog bite incidents becomes \({e^{{\beta _k}}}\) times what it was originally.

\[ \frac{E(Y|X^*)}{E(Y|X)} = \frac{e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_k (X_{ik} + 1) + \cdots + \beta_m X_{im}}}{e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \cdots + \beta_m X_{im}}} = e^{\beta_k} \]

Regularized Model (Lasso)

Initially, the paper employs Lasso regression on five primary variables, aiming to use regularization to reduce the number of variables in the model and thus avoid overfitting.

# lasso

x = model.matrix(bite_count ~ year + month + gender + spay_neuter + borough, data = bites_by_group)
y = bites_by_group |> pull(bite_count)

cv_model = cv.glmnet(x, y, family = "poisson", alpha = 1)  # Lasso:alpha = 1

cv_results <- with(cv_model, 
  data.frame(
  log_lambda = log(lambda),
  mean_mse = cvm,
  lower_mse = cvlo,
  upper_mse = cvup
))

best_lambda <- cv_model$lambda.min
log_best_lambda <- log(best_lambda)

\(\lambda\) is the regularization parameter in Lasso regression, controlling variable selection and the sparsity of the model. Finding the optimal \(\lambda\) value is crucial as it balances model complexity and predictive performance. The ideal \(\lambda\) value, derived from cross-validation, minimizes model error. For this study, the optimal \(\lambda\) has been determined to be 0.005529689.

ggplot(cv_results, aes(x = log_lambda, y = mean_mse)) +
  geom_line(color = "blue", size = 1) +
  geom_ribbon(aes(ymin = lower_mse, ymax = upper_mse), alpha = 0.2, fill = "blue") +
  geom_vline(xintercept = log_best_lambda, linetype = "dashed", color = "red") +
  labs(
    title = "Cross-Validation for Lasso Regression",
    x = "Log(Lambda)",
    y = "Mean Cross-Validated Error"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

lasso_coefficients <- coef(cv_model, s = "lambda.min")

lasso_coefficients_df <- as.data.frame(as.matrix(lasso_coefficients)) %>%
  rownames_to_column("Variable") %>%
  rename(Coefficient = 1) %>%
  filter(Coefficient != 0)

kable(
  lasso_coefficients_df,
  col.names = c("Term", "Coefficient"),
  caption = "Cross-Validation Results for Lasso Regression",
  align = "c",
  digits = 4
) |> 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = TRUE
  ) |> 
  scroll_box(height = "300px")
Cross-Validation Results for Lasso Regression
Term Coefficient
X.Intercept. 1.1070
X.Intercept..1 0.0000
year2016 0.0589
year2017 0.1471
year2018 0.0711
year2019 0.0949
year2020 -0.2819
year2021 -0.2845
month2 -0.0167
month3 0.0000
month4 0.1036
month5 0.1568
month6 0.2253
month7 0.1832
month8 0.2052
month9 0.1186
month10 0.0678
month11 0.0588
month12 -0.0010
genderM 0.8880
spay_neuterNeutered -0.1034
boroughBrooklyn 0.2990
boroughManhattan 0.2263
boroughQueens 0.4942
boroughStaten.Island -0.5573

In summary, the Lasso regression model selected the following variables:

  • Time factor: year and month
  • Individual characteristics: gender and spay_neuter
  • Regional characteristics: borough

Therefore, these variables need to be included in the subsequent negative binomial regression analysis.

Main effects model

After employing Lasso regression for feature selection, the study identified several key variables that influence the occurrence of dog bite incidents. These variables include the year, month, gender, spay_neuter, and borough. These factors will be incorporated into a negative binomial generalized linear model (glm.nb) to predict the number of bite incidents.

Results

The results of the main effects model fitting are presented in the following table.

The dispersion parameter, \(\theta = 25.18\), indicates that the data exhibits overdispersion, meaning the variance is much larger than the mean. This justifies the use of the negative binomial regression model over a Poisson regression.

The deviance of the null model (a model with no predictors, only the intercept) is \(3914.5\). This represents the deviance of the fitted model with predictors included is \(1549.7\). The significant reduction in deviance indicates that the included predictors substantially improve model fit.

kable(
  model_summary,
  col.names = c("Term", "Estimate", "SE", "Statistic", "P-value", "Significance"),
  caption = "Summary of Negative Binomial Regression Results"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  scroll_box(height = "400px")
Summary of Negative Binomial Regression Results
Term Estimate SE Statistic P-value Significance
(Intercept) 1.0819840 0.0594554 18.1982535 0.0000000 ***
year2016 0.0697408 0.0415739 1.6775135 0.0934421
year2017 0.1535577 0.0409067 3.7538482 0.0001741 ***
year2018 0.0752355 0.0415347 1.8113915 0.0700803
year2019 0.0942075 0.0413851 2.2763627 0.0228243
year2020 -0.2791303 0.0454262 -6.1446913 0.0000000 ***
year2021 -0.2870956 0.0458795 -6.2576004 0.0000000 ***
month2 0.0025297 0.0587213 0.0430799 0.9656378
month3 0.0299763 0.0582325 0.5147697 0.6067140
month4 0.1274838 0.0571182 2.2319298 0.0256196
month5 0.1807138 0.0565828 3.1937937 0.0014042 **
month6 0.2447988 0.0559689 4.3738342 0.0000122 ***
month7 0.2019422 0.0565467 3.5712485 0.0003553 ***
month8 0.2276408 0.0563495 4.0397994 0.0000535 ***
month9 0.1429531 0.0571015 2.5034925 0.0122974
month10 0.0912030 0.0577714 1.5786867 0.1144079
month11 0.0888914 0.0577247 1.5399200 0.1235799
month12 0.0185023 0.0588556 0.3143674 0.7532420
genderM 0.8851206 0.0245208 36.0966638 0.0000000 ***
spay_neuterNeutered -0.1122779 0.0228317 -4.9176281 0.0000009 ***
boroughBrooklyn 0.3051114 0.0361136 8.4486684 0.0000000 ***
boroughManhattan 0.2376617 0.0365134 6.5088955 0.0000000 ***
boroughQueens 0.4981809 0.0349699 14.2459962 0.0000000 ***
boroughStaten Island -0.5517553 0.0446767 -12.3499477 0.0000000 ***

Year: Some years (e.g., 2017, 2019) show statistically significant coefficients, suggesting that the number of bite incidents varies significantly across years. year2021 (Coefficient = -0.28710, p < 0.001) indicates a significant decrease in bite counts compared to year2015.

Month: Some months show no significant effects on bite counts (e.g., month2: p = 0.97; month3: p = 0.61). Several months (e.g., May, June, August) have significant coefficients, indicating seasonality in bite incidents.

Gender: Male dogs (genderM) are associated with a significantly higher bite count, as indicated by the large positive coefficient.

Spay/Neuter: Neutered dogs are associated with a lower bite count, with a statistically significant negative coefficient.

Boroughs: Bite counts vary across boroughs, with Staten Island showing a significant negative association compared to the reference borough.

From the coefficient of fitting results, this study draws the following conclusions:

  • Different years have significantly impacted bite counts, particularly after 2020, where a notable reduction in incidents was observed. This might reflect external factors, such as the COVID-19 pandemic.

  • Monthly effects highlight significant seasonal variation, with summer months (e.g., June) showing higher bite counts.

  • Males are significantly more likely to be involved in bite incidents than females.

  • Neutered animals have significantly lower bite risks compared to non-neutered ones.

  • Brooklyn has significantly higher bite counts, while Staten Island has significantly fewer bites compared to Bronx. These differences might reflect variations in dog management policies, population density, or dog ownership practices across regions.

Evaluation

After obtaining the fitted results, it is crucial to perform a multicollinearity test on the model to ensure the reliability of the regression coefficients. Multicollinearity can severely affect the interpretability of the model and can lead to overinflated standard errors, making it difficult to discern the true effect of each predictor.

vif_results <- vif(model) %>%
  as.data.frame() %>%
  rownames_to_column("Predictor") %>%
  rename(
    GVIF = GVIF,
    DF = Df,
    `GVIF^(1/(2*DF))` = `GVIF^(1/(2*Df))`
  )

kable(
  vif_results,
  col.names = c("Predictor", "GVIF", "DF", "Scaled GVIF"),
  caption = "Multicollinearity Check (GVIF)"
) |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)
Multicollinearity Check (GVIF)
Predictor GVIF DF Scaled GVIF
year 1.004775 6 1.000397
month 1.002312 11 1.000105
gender 1.003175 1 1.001586
spay_neuter 1.001316 1 1.000658
borough 1.005447 4 1.000679

Generalized Variance Inflation Factor (GVIF) measures the multicollinearity in the regression model. A GVIF value greater than 10 generally indicates a high multicollinearity issue. All GVIF values are close to 1, suggesting no significant multicollinearity.

Scaled GVIF adjusts GVIF based on the degrees of freedom to make it comparable across variables. It is defined as \(GVIF^{(1/(2*Df))}\). As shown in the table above, the GVIF and scaled GVIF are all close to 1, indicating that there is no significant multicollinearity in the final model.

As shown in the table above, results does not exhibit multicollinearity among the predictors, as all GVIF values and their scaled counterparts are well below the threshold of concern (commonly 10 or higher). This indicates that all predictors can be reliably used in regression model without the risk of biased coefficients caused by multicollinearity.

residuals_sim = simulateResiduals(model)
plot(residuals_sim)

Q-Q Plot:

  • KS Test p-value (0.7167): Indicates that the residuals do not significantly deviate from a uniform distribution. This suggests that the model residuals are well-behaved.
  • Dispersion Test p-value (0.936): Suggests there is no significant over- or under-dispersion in the model, further supporting the appropriateness of the negative binomial distribution.
  • Outlier Test p-value (0.19983): Shows that there are no significant outliers in the residuals, meaning that most predictions align with the observed data.

The residuals closely follow the expected line, indicating a good fit in terms of normality for the main effects-only model.

Residuals vs Predicted:

  • The residuals appear to be well-distributed, with most points clustering around the 0.5 quantile line.
  • The red curves (quantile deviations) show minor deviations, particularly near the tails (e.g., quantiles close to 0 and 1). However, these deviations are not substantial enough to indicate severe model misspecification.
  • The presence of small red asterisks (*) at the upper quantiles suggests slight issues with model fit in extreme predictions, but these deviations are minor.

This shows some spread and potential systematic patterns, indicating the main effects model might not fully capture the variability in the data.

Model with interaction

To further increase the precision of the model fit, interaction terms between variables were added.

Results

model_all_interactions <- glm.nb(
  formula = bite_count ~ 
    year + month + gender + spay_neuter + borough +
    year * month + year * gender + year * spay_neuter + year * borough +
    month * gender + month * spay_neuter + month * borough +
    gender * spay_neuter + gender * borough + spay_neuter * borough +
    year * month * gender + year * month * spay_neuter + year * month * borough + 
    year * gender * spay_neuter + year * gender * borough + year * spay_neuter * borough +
    month * gender * spay_neuter + month * gender * borough + month * spay_neuter * borough +
    gender * spay_neuter * borough,
  data = bites_by_group
)

model_stepwise <- stepAIC(model_all_interactions, direction = "both")

After using stepwise reduction to simplify the model complexity, the variables of final negative binomial regression model are as follow:

data <- data.frame(
  `Main Effect` = c("year", "month", "gender", "spay_neuter", "borough", " "),
  `Double Interaction Effect` = c(
    "year * spay_neuter",
    "year * borough",
    "month * spay_neuter",
    "gender * spay_neuter",
    "gender * borough",
    "spay_neuter * borough"
  )
)

knitr::kable(data, format = "pipe", align = "l", caption = "Final Negative Binomial Regression Model", escape = FALSE)
Final Negative Binomial Regression Model
Main.Effect Double.Interaction.Effect
year year * spay_neuter
month year * borough
gender month * spay_neuter
spay_neuter gender * spay_neuter
borough gender * borough
spay_neuter * borough

The final model with interaction is detailed in the following table. Firstly, the intercept is 3.83 with a p-value near zero, indicating that the baseline level of the response variable (on a logarithmic scale) is significant when other variables are not considered. For the main effect variables, the years 2016, 2018, and 2019 show a significant increase in the log change of event counts compared to 2015; male dogs, compared to female dogs, exhibit stronger aggression, with a 10.59 times increase in the incidence of dog bite events, calculated as with \(e^{2.36}\). The coefficient for neutering, 0.3960, indicates a significant reduction in the log count of bite incidents for neutered dogs compared to unneutered dogs. Geographically, the coefficients for Manhattan and Staten Island are 0.72 and 0.48, respectively, showing significantly lower event counts compared to the Bronx. Regarding interaction terms, the years 2018, 2020, and 2021 show a significant reduction in interaction with the neutering status.

filter_model = glm.nb(bite_count ~ year + month + gender + spay_neuter + 
    borough + year:spay_neuter + year:borough + month:spay_neuter + 
    gender:spay_neuter + gender:borough + spay_neuter:borough, 
    data = bites_by_group)

filter_model_summary <- tidy(filter_model, exponentiate = TRUE, conf.int = TRUE)

kable(
  filter_model_summary,
  caption = "Summary of Negative Binomial Regression (with interactions) Results"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  scroll_box(height = "400px")
Summary of Negative Binomial Regression (with interactions) Results
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 3.8266218 0.0917147 14.6321455 0.0000000 3.1897224 4.5699845
year2016 1.3033610 0.0901036 2.9404628 0.0032772 1.0929752 1.5562867
year2017 1.1385007 0.0922903 1.4054815 0.1598781 0.9504021 1.3649459
year2018 1.3118411 0.0908169 2.9887793 0.0028009 1.0984502 1.5684738
year2019 1.3122458 0.0905443 3.0011837 0.0026893 1.0994102 1.5681645
year2020 0.8228914 0.1047046 -1.8617234 0.0626421 0.6693712 1.0093955
year2021 0.8059884 0.1072938 -2.0102369 0.0444061 0.6520723 0.9933690
month2 0.9729195 0.0695666 -0.3946429 0.6931064 0.8487826 1.1149947
month3 1.0105442 0.0684963 0.1531328 0.8782936 0.8835430 1.1557926
month4 0.9934017 0.0684907 -0.0966578 0.9229981 0.8685643 1.1361738
month5 1.0680211 0.0672828 0.9780735 0.3280380 0.9361209 1.2187538
month6 1.1690766 0.0657587 2.3755679 0.0175220 1.0278987 1.3302616
month7 1.2098715 0.0654067 2.9127615 0.0035825 1.0645334 1.3757713
month8 1.1509696 0.0664863 2.1147913 0.0344477 1.0104735 1.3114496
month9 1.0335144 0.0680631 0.4843299 0.6281517 0.9044289 1.1811049
month10 1.0959521 0.0673646 1.3601127 0.1737943 0.9604423 1.2508199
month11 1.0178300 0.0683075 0.2587259 0.7958467 0.8902585 1.1637148
month12 0.9873063 0.0689063 -0.1853956 0.8529188 0.8625001 1.1300826
genderM 2.3603576 0.0546184 15.7238806 0.0000000 2.1222253 2.6290269
spay_neuterNeutered 0.3960162 0.1090476 -8.4944592 0.0000000 0.3196063 0.4900996
boroughBrooklyn 1.1484679 0.1007295 1.3742626 0.1693602 0.9433984 1.4003372
boroughManhattan 0.7189159 0.1083180 -3.0466848 0.0023138 0.5815085 0.8892388
boroughQueens 1.0052349 0.1004848 0.0519610 0.9585598 0.8261307 1.2250977
boroughStaten Island 0.4792828 0.1295018 -5.6791829 0.0000000 0.3711956 0.6168309
year2016:spay_neuterNeutered 1.0612336 0.0738846 0.8043893 0.4211722 0.9181842 1.2266608
year2017:spay_neuterNeutered 1.1261719 0.0722120 1.6454912 0.0998686 0.9775818 1.2974768
year2018:spay_neuterNeutered 0.8361551 0.0738241 -2.4238853 0.0153555 0.7234554 0.9662810
year2019:spay_neuterNeutered 0.8999070 0.0732977 -1.4388429 0.1501950 0.7794484 1.0389210
year2020:spay_neuterNeutered 0.7093584 0.0832625 -4.1242377 0.0000372 0.6023798 0.8349059
year2021:spay_neuterNeutered 0.7033678 0.0842494 -4.1765924 0.0000296 0.5961124 0.8294237
year2016:boroughBrooklyn 0.6696158 0.1136113 -3.5300282 0.0004155 0.5356558 0.8362826
year2017:boroughBrooklyn 0.8016684 0.1146289 -1.9284860 0.0537947 0.6401252 1.0033867
year2018:boroughBrooklyn 0.6156245 0.1168101 -4.1530513 0.0000328 0.4893635 0.7736536
year2019:boroughBrooklyn 0.6430267 0.1158853 -3.8103961 0.0001387 0.5120816 0.8066385
year2020:boroughBrooklyn 0.9194226 0.1308903 -0.6418304 0.5209833 0.7116333 1.1889894
year2021:boroughBrooklyn 0.9169286 0.1340716 -0.6468607 0.5177221 0.7053947 1.1933845
year2016:boroughManhattan 0.7672008 0.1219690 -2.1727381 0.0298000 0.6038442 0.9741476
year2017:boroughManhattan 1.0056926 0.1214035 0.0467566 0.9627072 0.7926301 1.2758780
year2018:boroughManhattan 1.0764166 0.1200012 0.6136403 0.5394531 0.8507054 1.3618503
year2019:boroughManhattan 1.0374713 0.1197084 0.3072990 0.7586158 0.8203777 1.3117924
year2020:boroughManhattan 1.3188623 0.1365701 2.0265749 0.0427059 1.0096615 1.7248937
year2021:boroughManhattan 1.2422379 0.1397488 1.5521743 0.1206205 0.9452097 1.6350671
year2016:boroughQueens 0.8140961 0.1136471 -1.8097847 0.0703292 0.6512614 1.0169129
year2017:boroughQueens 1.0867730 0.1135860 0.7325963 0.4638047 0.8696893 1.3576659
year2018:boroughQueens 1.0448140 0.1131393 0.3874768 0.6984032 0.8367859 1.3040097
year2019:boroughQueens 1.0220010 0.1127442 0.1930255 0.8469390 0.8191320 1.2745219
year2020:boroughQueens 1.1261820 0.1306208 0.9097565 0.3629509 0.8722475 1.4557937
year2021:boroughQueens 1.2916043 0.1315466 1.9452047 0.0517504 0.9988859 1.6732506
year2016:boroughStaten Island 0.8644796 0.1431569 -1.0172580 0.3090307 0.6528959 1.1446386
year2017:boroughStaten Island 0.9866068 0.1425223 -0.0946074 0.9246267 0.7462658 1.3050651
year2018:boroughStaten Island 0.8321089 0.1483069 -1.2392683 0.2152461 0.6218356 1.1124636
year2019:boroughStaten Island 0.7854488 0.1468059 -1.6450292 0.0999638 0.5887784 1.0471310
year2020:boroughStaten Island 0.8911374 0.1789898 -0.6439284 0.5196219 0.6254889 1.2624986
year2021:boroughStaten Island 0.7210547 0.1929227 -1.6951883 0.0900397 0.4913490 1.0477969
month2:spay_neuterNeutered 1.0975279 0.1046014 0.8896655 0.3736455 0.8942084 1.3475758
month3:spay_neuterNeutered 1.0590365 0.1038588 0.5522843 0.5807536 0.8640731 1.2983755
month4:spay_neuterNeutered 1.3564416 0.1010573 3.0167522 0.0025550 1.1130392 1.6541988
month5:spay_neuterNeutered 1.3064944 0.0998785 2.6767279 0.0074345 1.0745154 1.5895806
month6:spay_neuterNeutered 1.2478170 0.0984299 2.2492726 0.0244952 1.0291517 1.5138497
month7:spay_neuterNeutered 1.0553692 0.0998428 0.5397557 0.5893656 0.8679186 1.2837758
month8:spay_neuterNeutered 1.2270280 0.0991555 2.0633748 0.0390770 1.0105739 1.4907576
month9:spay_neuterNeutered 1.2860456 0.1009214 2.4927527 0.0126757 1.0555295 1.5678926
month10:spay_neuterNeutered 1.0214410 0.1025915 0.2067852 0.8361776 0.8354704 1.2491730
month11:spay_neuterNeutered 1.1718216 0.1024908 1.5470612 0.1218485 0.9587419 1.4329166
month12:spay_neuterNeutered 1.0804447 0.1049678 0.7371092 0.4610559 0.8795844 1.3274456
genderM:spay_neuterNeutered 0.8800375 0.0453079 -2.8204976 0.0047949 0.8052402 0.9617512
genderM:boroughBrooklyn 1.1720073 0.0714328 2.2219198 0.0262887 1.0187251 1.3479804
genderM:boroughManhattan 1.1329946 0.0725463 1.7211669 0.0852205 0.9826828 1.3059767
genderM:boroughQueens 1.1693112 0.0684173 2.2861904 0.0222431 1.0222794 1.3367837
genderM:boroughStaten Island 0.9069992 0.0917203 -1.0642550 0.2872132 0.7582574 1.0864392
spay_neuterNeutered:boroughBrooklyn 2.4562537 0.0693940 12.9497913 0.0000000 2.1451226 2.8158263
spay_neuterNeutered:boroughManhattan 3.1397417 0.0704595 16.2382769 0.0000000 2.7364237 3.6070334
spay_neuterNeutered:boroughQueens 2.6530410 0.0670062 14.5614399 0.0000000 2.3279835 3.0273936
spay_neuterNeutered:boroughStaten Island 2.7436374 0.0868963 11.6148176 0.0000000 2.3149684 3.2546430

Evaluation

Additionally, the quality of the fitted model can be assessed through residual analysis and diagnostics, as illustrated in the following four plots.

  • Residual vs. Fitted Plot:

This graph helps to detect non-linearity, unequal error variances, and outliers. Ideally, the residuals should be randomly dispersed around the horizontal axis, meaning the red dashed line (a smooth curve) should be close to a horizontal straight line.

  • Q-Q Plot:

This plot is used to check whether the residuals conform to the distribution assumed by the model.

  • Scale-Location Plot:

This plot shows the spread of residuals versus fitted values and helps to check for homoscedasticity (constant spread of residuals). A model that meets this assumption will show a horizontal line with randomly spread points.

  • Residuals vs. Leverage Plot:

This plot is critical for identifying observations with high leverage values—points that have a significant impact on the model’s fit. Ideally, no points should exceed the threshold set by Cook’s distance, indicated by the dashed lines.

residuals_sim = simulateResiduals(model_stepwise)
plot(model_stepwise)

From the analysis of the four diagnostic plots, the fitted model exhibits several shortcomings. Firstly, the tails of the residual distribution slightly deviate from the assumed negative binomial distribution. This deviation suggests that the residuals might not be perfectly modeled, indicating potential issues with the model fit or the distributional assumptions. Secondly, the distribution of points in the Scale-Location plot appears striated, and the red smooth line shows a non-horizontal trend. This pattern indicates that the variance of the residuals may change with fitted values, suggesting the presence of heteroscedasticity. This non-constant variance can affect the reliability of the model’s standard errors and confidence intervals.Lastly, the Residuals vs Leverage Plot reveals the presence of points with high leverage, which could be overly influencing the model’s predictions. These high leverage points warrant further investigation to determine whether they should be removed or adjusted to improve model accuracy and robustness.

Prediction

Cross-validation of the final model is depicted in the following illustration, where the red dashed line represents the ideal prediction scenario (i.e., predicted values equal actual values). If points fall near this line, it indicates that the model predicts well. Points located above the diagonal line suggest that the model underestimates the actual values; conversely, if points are below the diagonal line, the model overestimates the actual values.

# Predicted vs Actual Bite Count
inter_predict = bites_by_group |> 
  mutate(predicted = predict(model_stepwise, type = "response"))

ggplot(inter_predict, aes(x = predicted, y = bite_count)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Predicted vs. Actual Bite Count",
       x = "Predicted Bite Count",
       y = "Actual Bite Count") +
  theme_minimal()

Overall, while the scatter plot shows a slight systematic deviation, the final negative binomial regression model still accurately predicts low and medium value areas. However, the error is larger in high-value areas, where actual values tend to fall below the predicted values. This may indicate a tendency of the model to overfit in regions with high incident counts.

Generalized Linear Mixed Model (GLMM)

To address the model fitting issue, a generalized linear mixed model is employed. The Generalized Linear Mixed Model (GLMM) is an extension of the Generalized Linear Model (GLM) and the Linear Mixed Model (LMM). GLMM consists of three main components: random effects, fixed effects, and a link function. Fixed effects represent the overall level parameters consistent across all observations, while random effects account for individual variability or group-level differences that may influence the response variable. The link function connects the linear predictor to the mean of the response variable distribution, allowing the dependent variable to deviate from a normal distribution and enabling modeling of various types of data, including binary, count, and continuous outcomes.

Let the response variable \(y\) belongs to the exponential family, its conditional distribution can be expressed as follows:

\[ f(y|b) = \prod_{i=1}^n f(y_i|\eta_i) \]

Where \({\eta _i} = {x_i}^ \top \beta + {z_i}^ \top b\). Additionally, the random effect \(b\) is assumed to be distributed according to \(b \sim {\rm N}(0,\Sigma )\).

Therefore the GLMM is structured as follows:

\[ g(\mathbf{E}(y)) = \mathbf{X}\vec{\beta} + \mathbf{Z}\vec{b} + \varepsilon \]

In the model, \(g( \cdot )\) represents the link function, which connects the expected value of the response variable to the linear predictor. Here, the logarithm function \(g(\mu ) = \ln (\mu )\) is used, which is appropriate for the Poisson distribution. \(\vec{\mathrm{E}}(y)\) denotes the vector of expected values for the response variable, and \(\varepsilon\) represents the error term, capturing random noise or the unexplained variance. \({\rm X}\vec \beta + {\rm Z}\vec b\) represents the linear combination of fixed and random effects, \({\rm X}\vec \beta\) is used to explain global trends, and \({\rm Z}\vec b\) is utilized to capture differences among observations. In this context, \({\rm X}\) is the design matrix for fixed effects, which is of order \(n \times p\), where \(n\) is the number of observations and \(p\) is the number of fixed effect variables; \(\vec \beta\) is the vector of coefficients for fixed effects, with each element representing the regression coefficient of a fixed effect variable. Similarly, \({\rm Z}\) also serves as the design matrix for random effects, of order \(n \times q\), where \(q\) is the number of random effect variables. \(\vec b\) is the vector of coefficients for random effects, representing inter-group deviations used to explain random differences between groups. The model assumes \(b \sim {\rm N}(0,\Sigma )\), thereby the random effects are normally distributed with a mean of \(0\) and a covariance matrix \(\Sigma\).

For the dog bite incidents dataset, the response variable \(y\) represents the number of dog bite incidents per zipcode per month. Fixed effects in the model include year, month, gender, and sterilization status. Random effects are attributed to different boroughs (5 boroughs in total). Thus, for all response variables, \({\rm X}\) is a matrix containing year, month, gender, and sterilization status; \(\vec \beta\) represents the coefficients for the fixed effects; \({\rm X}\) indicates the matrix for the borough each response variable belongs to; and \(\vec b\) denotes the deviation for each borough (random effects). This setup allows the model to adjust for the influence of both environmental conditions and borough-specific variations in the occurrence of dog bites.

Results

glmm_model = glmer.nb(
  bite_count ~ year + month + gender + spay_neuter + (1 | borough), 
  data = bites_by_group
)
summary(glmm_model)
glmm_model_summary <-
  broom.mixed::tidy(glmm_model) %>%
  mutate(
    significance = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      TRUE ~ ""
    )
  ) |> 
  dplyr::select(-effect, -group)

kable(
  glmm_model_summary,
  col.names = c("Term", "Estimate", "SE", "Statistic", "P-value", "Significance"),
  caption = "Summary of Negative Binomial Regression Results"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  scroll_box(height = "400px")
Summary of Negative Binomial Regression Results
Term Estimate SE Statistic P-value Significance
(Intercept) 1.1804665 0.1699854 6.9445164 0.0000000 ***
year2016 0.0698070 0.0416932 1.6743026 0.0940711
year2017 0.1535296 0.0409953 3.7450517 0.0001804 ***
year2018 0.0752759 0.0416131 1.8089477 0.0704591
year2019 0.0941456 0.0414242 2.2727192 0.0230431
year2020 -0.2788938 0.0455420 -6.1238848 0.0000000 ***
year2021 -0.2868691 0.0459453 -6.2437123 0.0000000 ***
month2 0.0024859 0.0588592 0.0422342 0.9663120
month3 0.0299497 0.0584208 0.5126548 0.6081928
month4 0.1274342 0.0572598 2.2255456 0.0260446
month5 0.1806481 0.0567229 3.1847452 0.0014488 **
month6 0.2446528 0.0560720 4.3631934 0.0000128 ***
month7 0.2018245 0.0566339 3.5636679 0.0003657 ***
month8 0.2276003 0.0564774 4.0299325 0.0000558 ***
month9 0.1429054 0.0572333 2.4968912 0.0125287
month10 0.0911307 0.0578934 1.5741114 0.1154617
month11 0.0889304 0.0579177 1.5354601 0.1246709
month12 0.0184990 0.0590100 0.3134894 0.7539089
genderM 0.8848816 0.0245647 36.0225275 0.0000000 ***
spay_neuterNeutered -0.1123005 0.0229100 -4.9018117 0.0000009 ***
sd__(Intercept) 0.3600968 NA NA NA

Random Effects:

Borough: Variance: 0.1297, standard Deviation: 0.3601. Indicates moderate variability between boroughs in the baseline bite count.

Fixed Effects:

  • Year: 2017 has positive coefficient (0.15336, p < 0.001), suggesting a higher bite count compared to 2015. While 2020 and 2021 have strong negative coefficients (-0.27884 and -0.28669, p < 0.001), indicating a substantial reduction in bite counts. This could relate to external factors like the COVID-19 pandemic.
  • Month: Positive coefficient (0.12734, p = 0.026), suggesting higher bite counts in April. In May, June and July, all p < 0.001, maybe there is a seasonal peak during summer.
  • Gender: Male dogs have strong positive coefficient (0.88482, p < 0.001), indicating male dogs have a significantly higher bite count compared to female dogs.
  • Spay/Neuter Status: Dogs who are neutered have have significantly lower bite counts (-0.11231, p < 0.001).

Evaluation

residuals_glmm_sim = simulateResiduals(glmm_model)
plot(residuals_glmm_sim)

Q-Q Plot:

  • Kolmogorov-Smirnov (KS) Test: p-value = 0.05845: Indicates that the residual distribution is not significantly different from the expected distribution (n.s. means not significant). The model fits the overall data distribution well.
  • Dispersion Test: p-value = 0.896: Suggests no evidence of overdispersion or underdispersion in the residuals. This confirms the appropriateness of using the negative binomial model for handling count data with dispersion.
  • Outlier Test: p-value = 0.00056: Indicates significant deviations due to outliers. This suggests that there are specific data points that the model fails to fit well.

Residuals vs. Predicted:

  • The red lines (quantile deviations) highlight significant deviations at certain ranges of predictions, particularly near the lower predicted values (around the 0.25 quantile). This indicates a slight mismatch between the model predictions and observed data for certain ranges.
  • Despite the overall good fit, systematic patterns (e.g., slight curvature) in the residuals suggest possible room for improvement in model specification.

Model comparison

model_comparison <- data.frame(
  Model = c("model", "filter_model", "glmm_model"),
  AIC = c(7403.4, 7045.8, 7428.7),
  Null_Deviance = c(3914.5, 4944.9, NA),
  Residual_Deviance = c(1549.7, 1437.1, 1551),
  Random_Effects = c("None", "None", "Borough"),
  Description = c(
    "Negative binomial regression with main effects only",
    "Negative binomial regression with main effects and interaction terms",
    "Negative binomial mixed model with random intercept for Borough"
  )
)

model_comparison %>%
  kbl(
    caption = "Model Comparison Table",
    align = "lcccccc",  
    col.names = c("Model", "AIC", "Null Deviance", 
                  "Residual Deviance", "Random Effects", "Description")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
    full_width = TRUE
  ) 
Model Comparison Table
Model AIC Null Deviance Residual Deviance Random Effects Description
model 7403.4 3914.5 1549.7 None Negative binomial regression with main effects only
filter_model 7045.8 4944.9 1437.1 None Negative binomial regression with main effects and interaction terms
glmm_model 7428.7 NA 1551.0 Borough Negative binomial mixed model with random intercept for Borough

Performance:

The filter_model has the lowest AIC and Residual Deviance, suggesting the best fit among the three. However, its lack of random effects might overlook some unobserved variability (e.g., between boroughs).

Complexity:

glmm_model introduces random effects, which makes it more flexible and suitable for accounting for regional differences, but this increases its AIC and Residual Deviance slightly.

Interpretability:

model is the simplest and easiest to interpret but underperforms compared to the other two models.

Residual Diagnosis:

model VS glmm_model: The KS test and dispersion test results in both models are similar, suggesting both models handle overall distribution and dispersion well. However, the outlier test is significant in glmm model, whlie the outlier test is insignificant in the model, indicating that this alternative model has difficulty capturing certain extreme values or specific subsets of the data.