Filling in the data gaps in insurance models

Insurance modelers can benefit from an encoding approach that retains order and proximity as part of the information when data is converted for modeling. Here is why we think this is so useful to a modeling team.

Insurance data is often naturally ordered. Consider a common categorical variable such as protection class in homeowners insurance data. Although the classes are labelled one to 10, the labels are really names of levels, not quantities. We do not expect the effect of class three versus class one to be twice as pronounced as the effect of class two versus class one. Nor do we expect, a priori, a linear relationship in the experience from class one to class 10.

The expectation is similar for hazard group in workers compensation data, which ranges from A to G. There is a definite progression from group A toward group G, but the relationship we expect with the response is ordinal in nature. Going into the modeling exercise, we do not expect a strong linear relation to hold.

One-hot encoding

The traditional one-hot encoding approach creates a set of binary variables, one per level, setting the variable that matches the level of the original variable to one and all the others to zero. As a result, the model fits each level independently, and the default for each parameter is neutral at 1.0. This can have undesirable consequences in some instances; for example, when the model shows a progression of surcharges but comes across a level with low credibility. Dropping the level reverts its parameter to 1.0, while keeping the level produces a volatile result that depends on the limited experience.
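As a rough illustration of this fallback behavior, the sketch below builds one-hot columns for a 10-level protection class and applies a multiplicative (log-link) structure. The relativities, the helper name `one_hot`, and the choice of class 7 as the low-credibility level are all hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def one_hot(levels, n_levels):
    """Return a 0/1 matrix with one column per level (levels are 1-based)."""
    levels = np.asarray(levels)
    out = np.zeros((levels.size, n_levels), dtype=int)
    out[np.arange(levels.size), levels - 1] = 1
    return out

# Hypothetical relativities for protection classes 1..10 (illustrative only).
relativity = np.array([1.00, 1.05, 1.10, 1.15, 1.20, 1.25, 1.30, 1.35, 1.40, 1.45])
beta = np.log(relativity)        # log-link coefficients, one per level

beta_dropped = beta.copy()
beta_dropped[6] = 0.0            # drop class 7: its coefficient falls back to 0

X = one_hot([6, 7, 8], n_levels=10)
pred = np.exp(X @ beta_dropped)  # relativities for classes 6, 7 and 8
print(pred)                      # class 7 sits at 1.0 between two surcharges
```

In this toy setup, class 7 reverts to the neutral 1.0 even though its neighbors carry surcharges of 1.25 and 1.35, which is the discontinuity described above.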

At this point, the modeler typically steps in and manually chooses a neighboring level to group the level with. However, by taking a different encoding approach, we can make this the default choice. That is, when the model finds a level not credible, dropping it causes that level to take on the same parameter value as the level next to it.
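One way to obtain this default is a cumulative ("thermometer") encoding, in which each column flags whether an observation lies above a given level, so each coefficient represents the step from one level to the next. The sketch below is a minimal illustration under that assumption; the helper name `o_dummy`, the relativities, and the exact indicator convention are hypothetical and may differ from the definition used in any particular paper or tool.

```python
import numpy as np

def o_dummy(levels, n_levels):
    """Cumulative encoding: column j is 1 when the level exceeds j + 1."""
    levels = np.asarray(levels)
    thresholds = np.arange(1, n_levels)          # 1, 2, ..., n_levels - 1
    return (levels[:, None] > thresholds[None, :]).astype(int)

# Hypothetical relativities for protection classes 1..10 (illustrative only).
relativity = np.array([1.00, 1.05, 1.10, 1.15, 1.20, 1.25, 1.30, 1.35, 1.40, 1.45])
step = np.diff(np.log(relativity))   # step[j]: change from class j + 1 to class j + 2

step_dropped = step.copy()
step_dropped[5] = 0.0                # drop the step from class 6 to class 7

X = o_dummy([6, 7], n_levels=10)
pred = np.exp(X @ step_dropped)      # relativities for classes 6 and 7
print(pred)                          # both classes now share the class 6 value
```

With the step removed, class 7 inherits the class 6 relativity instead of reverting to 1.0. In practice the model would be refit after dropping the term, so the remaining steps would adjust rather than stay fixed as they do here.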

Such an approach has great utility to modelers. Even where we have numeric data, such as age of policyholder or years the policy has been with the company, we expect smooth behavior but need to wrestle with the shape of the curve and how to handle the tails. Consistency in approach can be achieved by converting these variables to ordered categorical variables and using this encoding approach as the starting point.
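A minimal sketch of that conversion, assuming hypothetical band edges for policyholder age: the numeric values are first assigned to ordered bands, and the bands are then expressed as cumulative indicator columns.

```python
import numpy as np

age = np.array([19, 27, 45, 63, 82])     # hypothetical policyholder ages
edges = np.array([25, 35, 50, 65])       # hypothetical band edges (assumption)

# Ordered band index 1..5: under 25, 25-34, 35-49, 50-64, 65 and over.
band = np.digitize(age, edges) + 1

# Cumulative indicators: column j is 1 when age has reached edges[j].
O = (age[:, None] >= edges[None, :]).astype(int)

print(band)
print(O)
```

Each row of `O` contains one more leading 1 than the row for the band below it, so neighboring bands differ by exactly one column. Sparse tail bands can then fall back to their neighbor rather than needing a separate manual grouping.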

Potential tweaks

With a small tweak to the data processing, we can further convert the model structure for these variables to a piecewise linear fit, rather than a stepwise fit. Piecewise linear transformations are something modelers already use to define the relationships between explanatory and response variables, and the encoding framework accommodates them. While modelers are still expected to leverage their experience and expertise in molding models to the desired final structure, we believe this encoding approach provides more intuitive model fits for the modeler to work with.
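The article does not spell out the tweak, so the following is only one plausible reading: replace each step indicator with a hinge term, max(x - knot, 0), so that each coefficient becomes a change in slope rather than a jump in level. The knots, coefficients, and helper name `hinge_basis` are hypothetical.

```python
import numpy as np

def hinge_basis(x, knots):
    """Column j is max(x - knots[j], 0): zero below the knot, linear above it."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    return np.maximum(x[:, None] - knots[None, :], 0.0)

tenure = np.array([0, 2, 5, 10])          # hypothetical years with the company
knots = [0, 3, 7]                         # hypothetical knot locations

H = hinge_basis(tenure, knots)
slope_change = np.array([0.02, -0.01, -0.01])   # hypothetical coefficients

linear_predictor = H @ slope_change       # continuous, piecewise linear in tenure
print(linear_predictor)
```

Under this construction, dropping a hinge term leaves the slope unchanged through that knot, which mirrors the way dropping a step term leaves the level unchanged in the stepwise version.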

This technique can overcome a significant shortcoming of traditional regression analysis and improve the accuracy of, and confidence in, the generalized linear models (GLMs) used widely in insurance pricing and underwriting. Recognizing this, WTW is working with Hirokazu Iwasawa, one of the authors of the 2021 Hachmeister Prize paper, “AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques.” In keeping with the authors’ nomenclature, we refer to this alternative encoding approach as producing O-dummy variables (O for ordinal), whereas the traditional one-hot encoding approach produces U-dummy variables (U for usual).

In a market where rich data is becoming a valuable competitive currency, extracting the full depth of information from ordered numeric and categorical data could be another step up for insurers that are looking for an all-important market edge.

Author

Gary Wang

Associate Director – Insurance Consulting and Technology

