Insurers are sitting on an untapped goldmine: their text data. Through normal business operations, they collect vast quantities of unstructured text data, including claim notes, loss control reports, customer feedback and more. Natural language processing (NLP), a fast-growing subfield of artificial intelligence (AI), can help insurers harness this information and make the most of data that they already have.
Text data can provide nuance and fill gaps in standard structured data. For example, a commercial insurer might categorize a business according to its industry: service, retail, manufacturing, etc. This categorization is useful but limited. A written description of the company’s operations might provide additional information to differentiate this business and the risks it presents and faces. Quantifying this internal knowledge can add consistency to the way it is applied.
While insurers seem to recognize the potential in these text sources, realizing that potential has proven more challenging than expected, as illustrated in Willis Towers Watson’s 2019/2020 P&C Advanced Analytics Survey. In both personal and commercial lines, the percentage of companies taking advantage of unstructured claim and underwriting information lagged expectations in 2019. However, insurers remained optimistic when looking ahead to 2021.
Use of nontraditional data sources

Personal | Expected for 2019 (in 2017) | Actual for 2019 | Expected for 2021
---|---|---|---
Unstructured internal claim information | 66% | 38% | 69%
Unstructured internal underwriting information | 50% | 18% | 67%

Commercial | Expected for 2019 (in 2017) | Actual for 2019 | Expected for 2021
---|---|---|---
Unstructured internal claim information | 91% | 53% | 81%
Unstructured internal underwriting information | 63% | 16% | 66%

In order to justify that optimism, insurers will need to overcome the challenges inherent to unstructured text data:
- “Junk” words like “the” and “to”
- Polysemy: words with more than one meaning
- Synonyms: different words with the same meaning
- Negation: terms like “no” or “denies” that reverse the meaning of what follows
- Abbreviations, which often differ by industry and region
- The volume of text to analyze
However, with advances in AI, NLP is now capable of reading, deciphering, and understanding language and gaining valuable insights from it. In addition to processing and cleaning up text to make it more readable for machines, NLP provides various approaches to derive features that add value to predictive models and other analytical efforts.
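To make the cleaning step concrete, here is a minimal sketch in Python that tackles three of the challenges listed above: junk words, abbreviations and negation. The word lists and the `clean_note` function are hypothetical and deliberately tiny; a real project would maintain much larger lists tailored to the line of business and region. Polysemy and synonyms are not handled here and generally call for more sophisticated methods.

```python
import re

# Hypothetical, illustrative lookup tables (assumptions, not a standard list).
STOP_WORDS = {"the", "to", "a", "an", "of", "and", "is", "was", "has", "for", "in", "on"}
ABBREVIATIONS = {"clmt": "claimant", "atty": "attorney", "mva": "motor vehicle accident"}
NEGATIONS = {"no", "not", "denies", "without"}


def clean_note(note):
    """Lowercase a note, expand abbreviations, drop junk words, mark negated terms."""
    tokens = re.findall(r"[a-z]+", note.lower())

    # Expand abbreviations; one abbreviation may expand to several words.
    expanded = []
    for tok in tokens:
        expanded.extend(ABBREVIATIONS.get(tok, tok).split())

    cleaned = []
    negate_next = False
    for tok in expanded:
        if tok in NEGATIONS:
            negate_next = True  # flag the next meaningful word as negated
            continue
        if tok in STOP_WORDS:
            continue
        cleaned.append(f"not_{tok}" if negate_next else tok)
        negate_next = False
    return cleaned


print(clean_note("Clmt denies the injury and has no atty."))
# ['claimant', 'not_injury', 'not_attorney']
```

Prefixing a negated word with `not_` is only one simple convention; it keeps "injury" and "denies the injury" from looking identical to a downstream model.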
Actuaries can look to this technology for a range of techniques, from simple to highly complex, that unlock this previously underutilized data source. Simple techniques include word or phrase counts. For example, how many times is the word “litigation” mentioned in a claim’s notes? More complex techniques include topic modeling, an unsupervised, clustering-style approach that attempts to identify the themes of a document. For example, a claim’s notes might include topics like “surgery” or “slip and fall.”
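As an illustration of the simplest technique, the sketch below counts how often a word or phrase appears across a claim's notes. The notes, the `count_term` helper and the feature names are all made up for this example.

```python
import re


def count_term(notes, term):
    """Count whole-word, case-insensitive occurrences of `term` across notes."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    return sum(len(pattern.findall(note)) for note in notes)


# Hypothetical notes for a single claim
claim_notes = [
    "Claimant retained counsel; litigation expected.",
    "Litigation hold placed on file. Awaiting medical records.",
    "Adjuster spoke with claimant regarding surgery.",
]

features = {
    "litigation_mentions": count_term(claim_notes, "litigation"),
    "surgery_mentions": count_term(claim_notes, "surgery"),
    "slip_and_fall_mentions": count_term(claim_notes, "slip and fall"),
}
print(features)
# {'litigation_mentions': 2, 'surgery_mentions': 1, 'slip_and_fall_mentions': 0}
```

Because the pattern matches whole words only, variants such as "litigated" are not counted; whether to fold such variants together is a modeling choice.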
NLP features have proven valuable in both underwriting and claims.
- In underwriting, topic modeling performed on loss control reports can yield predictive features for frequency and severity models, such as information about a company’s safety programs or use of protective equipment.
- In claims, text features can capture information about litigation or expensive medical procedures.
As insurers increasingly turn to predictive models to support various business decisions, text features can serve as key explanatory variables in these models.
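To show how topic features might be derived, the sketch below fits latent Dirichlet allocation (LDA), one widely used topic model, to a handful of invented loss control report snippets. LDA is chosen here only for illustration; the article does not say which topic model is used in practice. The sampler is a toy written with the standard library so the mechanics are visible, and production work would normally rely on an established library instead. The function name `fit_lda`, the hyperparameters and the sample reports are all assumptions.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (doc_topic, top_words), where
    doc_topic[d] holds the estimated topic proportions of document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic_n = [[0] * n_topics for _ in docs]  # tokens of doc d in topic k
    topic_word_n = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_n = [0] * n_topics  # total tokens assigned to topic k

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic_n[d][k] += 1
            topic_word_n[k][w] += 1
            topic_n[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic_n[d][k] -= 1
                topic_word_n[k][w] -= 1
                topic_n[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic_n[d][t] + alpha)
                    * (topic_word_n[t][w] + beta) / (topic_n[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic_n[d][k] += 1
                topic_word_n[k][w] += 1
                topic_n[k] += 1

    # Smoothed topic proportions per document; each row sums to 1.
    doc_topic = [
        [(n + alpha) / (len(doc) + n_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic_n)
    ]
    top_words = [
        sorted(vocab, key=lambda w: -topic_word_n[t][w])[:3]
        for t in range(n_topics)
    ]
    return doc_topic, top_words


# Hypothetical, already-tokenized loss control report snippets
reports = [
    "safety training program documented annual safety training".split(),
    "protective equipment gloves goggles protective equipment required".split(),
    "safety program training records reviewed".split(),
    "workers wear goggles gloves protective equipment".split(),
]

doc_topic, top_words = fit_lda(reports, n_topics=2)
for t, words in enumerate(top_words):
    print(f"topic {t}: {words}")
for row in doc_topic:
    print([round(p, 2) for p in row])
```

Each row of `doc_topic` can be appended to a policy's structured data as candidate explanatory variables for a frequency or severity model. Topic numbering is arbitrary and the result depends on the random seed, so the topics need to be reviewed and labeled by someone who knows the business.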
Internal text data remains a potential goldmine for insurers, one that does not require purchasing and validating data from a third-party vendor. While there are challenges, NLP gives actuaries and data scientists effective tools for putting text data to use. As the field continues to innovate, insurers will have more and more techniques at their disposal to capitalize on this indispensable resource.
To find out more, join Liam and Yelena Kropivnitskaya’s presentation at the Casualty Actuarial Society’s Ratemaking, Product and Modeling Virtual Seminar on March 16, 10:30 a.m. ET.