Blog Post

Filling in the data gaps in insurance models

Insurance Consulting and Technology
Insurer Solutions

By Gary Wang | March 7, 2022

In a market where rich data is a valuable competitive edge, extracting information from ordered numeric and categorical data could be another step up for insurers.

Insurance modelers can benefit from a different encoding approach, one that retains order and proximity as part of the information when converting data. Let me clarify why we think this is so useful to a modeling team.

Insurance data is often naturally ordered. Consider a common categorical variable such as protection class in homeowners insurance data. Although the classes may be labelled one to 10, the labels are really names of levels. We do not expect the effect of class three versus class one to be twice as pronounced as that of class two versus class one. Nor do we expect, a priori, a linear relationship when reviewing the experience from class one to class 10.

The same holds for hazard group in workers compensation data, which ranges from A to G. There is a certain progression as we move from group A toward group G, but the strongest relation we expect with the response is ordinal in nature. Going into the modeling exercise, we do not expect a strong linear relation to hold.

One-hot encoding

The traditional one-hot encoding approach creates a set of binary variables, one per level, setting the variable that matches the level of the original variable to one and all the others to zero. As a result, the model fits each level independently, and the default for each parameter is a neutral 1.0. This can have undesirable consequences in some instances; for example, when the model shows a series of surcharges but comes across a level with low credibility. Dropping the level reverts its parameter to 1.0, while keeping it produces a volatile result that depends on the experience.
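As a rough illustration of the behavior described above, the sketch below builds one-hot columns for a four-level variable and shows what happens to a multiplicative relativity when one level's column is dropped. The helper name `one_hot_dummies`, the four levels and the coefficient values are all hypothetical, and a log-link GLM is assumed so that a zero coefficient corresponds to a factor of 1.0.

```python
import numpy as np

def one_hot_dummies(levels, n_levels):
    """One-hot encoding for levels coded 1..n_levels (hypothetical helper)."""
    levels = np.asarray(levels)
    # Column j is 1 where the observation's level equals j + 1, else 0.
    return (levels[:, None] == np.arange(1, n_levels + 1)).astype(int)

# One observation per level, for illustration only.
X = one_hot_dummies([1, 2, 3, 4], n_levels=4)

# Hypothetical log-scale coefficients showing a series of surcharges.
beta = np.log([1.00, 1.10, 1.25, 1.40])

# Dropping level 3 is equivalent to forcing its coefficient to zero.
beta_dropped = beta.copy()
beta_dropped[2] = 0.0

# Level 3 falls back to the neutral factor 1.0, below its neighbors.
relativities = np.exp(X @ beta_dropped)
print(relativities)
```

In this sketch the dropped level ends up at 1.0 while the levels on either side carry surcharges of 1.10 and 1.40, which is the discontinuity the modeler would normally have to repair by hand.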

At this point, the modeler typically steps in and manually chooses a neighboring level to group the level with. By taking a different encoding approach, however, we can make this the default choice. That is, where the model finds a level not credible, dropping it causes the resulting prediction to take on the same parameter value as the level next to it.
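One common way to get this behavior is to encode each level with cumulative indicators, so that each coefficient represents the step from one level to the next. The sketch below assumes that construction; the helper name `ordinal_dummies` and the coefficient values are hypothetical, and the exact construction used by the authors may differ in detail.

```python
import numpy as np

def ordinal_dummies(levels, n_levels):
    """Cumulative indicators for levels coded 1..n_levels (hypothetical helper)."""
    levels = np.asarray(levels)
    # The column for threshold t (t = 2..n_levels) is 1 when level >= t.
    return (levels[:, None] >= np.arange(2, n_levels + 1)).astype(int)

# One observation per level, for illustration only.
X = ordinal_dummies([1, 2, 3, 4], n_levels=4)

# Hypothetical log-scale steps between adjacent levels
# (1 -> 2, 2 -> 3, 3 -> 4), on a log-link scale.
steps = np.log([1.10, 1.25 / 1.10, 1.40 / 1.25])

# Dropping the 2 -> 3 step is equivalent to forcing it to zero.
steps_dropped = steps.copy()
steps_dropped[1] = 0.0

# Level 3 now shares level 2's relativity instead of reverting to 1.0.
relativities = np.exp(X @ steps_dropped)
print(relativities)
```

Here the coefficients are simply zeroed rather than re-estimated, so the value for level 4 also shifts; in practice the model would be refit after the column is dropped.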

Such an approach has great utility to modelers. Even with numeric data, such as age of policyholder or the number of years the policy has been with the company, we expect smooth behavior but need to wrestle with the shape of the curve or how to handle the tails. A consistent approach can be achieved by converting these variables to ordered categorical variables and using this encoding approach as the starting point.
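Converting a numeric variable to an ordered categorical one amounts to binning it. A minimal sketch, assuming hypothetical bin edges for policyholder age and using NumPy's `digitize`:

```python
import numpy as np

# Hypothetical policyholder ages and bin edges, for illustration only.
ages = np.array([18, 25, 39, 40, 75])
edges = np.array([25, 40, 60])

# np.digitize returns 0 for values below the first edge and len(edges)
# for values at or above the last edge; adding 1 gives levels 1..4.
age_level = np.digitize(ages, edges) + 1
print(age_level)
```

The resulting integer levels keep their order, so they can be passed to an ordinal encoding in the same way as protection class or hazard group.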

Potential tweaks

With a small tweak to the data processing, we can further convert the model structure for these variables to a piecewise linear fit, rather than a stepwise fit. This is a transformation that modelers sometimes use to define the relationship between explanatory and response variables, and it fits within the same encoding framework. While modelers are still expected to leverage their experience and expertise in molding models to the desired final structure, we believe this encoding approach provides more intuitive model fits for the modeler to work with.
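One plausible reading of this tweak is to replace each step indicator with a hinge term that grows linearly past its threshold, so that each coefficient becomes a change in slope rather than a jump. The sketch below assumes that reading; the helper name `hinge_basis`, the knots and the slope values are hypothetical.

```python
import numpy as np

def hinge_basis(x, knots):
    """Hinge terms max(x - knot, 0), one column per knot (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x[:, None] - np.asarray(knots, dtype=float), 0.0)

# Hypothetical knots and sample values, for illustration only.
knots = [30, 50]
X = hinge_basis([20, 30, 40, 50, 60], knots)

# Hypothetical slope changes at each knot, on the linear-predictor scale.
slope_changes = np.array([0.01, -0.02])

# The fitted curve is continuous and linear between knots.
linear_predictor = X @ slope_changes
print(linear_predictor)
```

With this basis, dropping a knot leaves the line from the neighboring segment to continue unchanged, which mirrors the way a dropped step takes on its neighbor's value in the stepwise version.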

This technique can overcome a significant shortcoming of traditional regression analysis and improve the accuracy of, and confidence in, the generalized linear models (GLMs) used widely in insurance pricing and underwriting. Recognizing this, WTW is working with Hirokazu Iwasawa, one of the authors of the 2021 Hachmeister Prize paper, “AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques.” In keeping with the authors’ nomenclature, we say that this alternative encoding approach produces O-dummy variables (O for ordinal), whereas the traditional one-hot encoding approach produces U-dummy variables (U for usual).

In a market where rich data is becoming a valuable competitive currency, extracting the full depth of information from ordered numeric and categorical data could be another step up for insurers that are looking for an all-important market edge.


Associate Director – Insurance Consulting and Technology
