Propensity to claim in motor and home insurance


The Client

The client is a major player in the insurance industry.

The Challenge

To use Machine Learning on the client’s data in order to predict ‘propensity to claim’ in Motor and Home insurance.

The Constraints

The data required considerable clean-up. Like many organisations, the client had a large number of data stores and some historic issues with its data collection, so the data had to be amalgamated and cleaned before any modelling could begin.

With the advent of the GDPR, there were constraints on the breadth of personal data the client could share with us. To work within these, we one-time encrypted personal details such as name and postcode, and the client kept the encryption keys. This allowed us to compare encrypted values (such as postcodes) to determine whether they were the same, without ever knowing the underlying values.
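
The idea of matching values without seeing them can be illustrated with a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) in place of the encryption step, since the scheme actually used is not described here; the key, the function name `pseudonymise` and the sample postcodes are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held only by the client; never shared with the analysts.
CLIENT_KEY = b"example-key-held-by-the-client"


def pseudonymise(value: str, key: bytes = CLIENT_KEY) -> str:
    """Return a deterministic keyed token for a personal data field.

    Assumption: HMAC-SHA256 is used purely for illustration.
    """
    # Normalise so that 'ab1 2cd' and 'AB12CD' map to the same token.
    normalised = "".join(value.split()).upper()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


token_a = pseudonymise("AB1 2CD")
token_b = pseudonymise("ab12cd")
token_c = pseudonymise("ZZ9 9ZZ")

# Equal inputs give equal tokens, so records can be matched on the token alone.
same_postcode = hmac.compare_digest(token_a, token_b)
different_postcode = not hmac.compare_digest(token_a, token_c)
```

Because the token is deterministic for a given key, two records with the same postcode can be linked, while anyone without the key cannot recompute or reverse the token.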

The Method

Our approach involved building several competing models: Random Forest, Support Vector Machine and Stochastic Gradient Descent Classifier. Each was then tuned extensively in our lab to extract the best results from the data.
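
A comparison of this kind might look like the following Python sketch. It assumes scikit-learn, a synthetic imbalanced dataset standing in for the client's data, default hyperparameters and ROC AUC as the comparison metric; none of these choices is confirmed by the project itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for policy data: roughly 10% of rows are "claims".
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# The three competing model families; SVM and SGD get feature scaling.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "sgd": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
}


def ranking_scores(model, features):
    """Return a continuous score per row so policies can be ranked."""
    if hasattr(model, "decision_function"):
        return model.decision_function(features)
    return model.predict_proba(features)[:, 1]


scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, ranking_scores(model, X_test))
```

A ranking metric is used here because claims are the rare class, so plain accuracy would reward a model that never predicts a claim.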

A good example of such an adjustment is our use of Natural Language Processing on the occupation of the insured person. We analysed how often occupations appeared in similar sentences, such as “…so and so became ill and went to the <vet> / <doctor> / <nurse>…”, which then allowed us to group occupations into clusters by linguistic similarity.

This in turn allowed us to group ‘Chauffeur’ and ‘Driver’ into the same occupation cluster, so that the model could learn from the combined records of both occupations.

We managed to scale down several thousand occupations to just 100 clusters, an approach which significantly improved accuracy.
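
The grouping of occupations by the contexts they appear in can be sketched in a few lines of Python. This is a deliberately simplified illustration using raw co-occurrence counts and a greedy threshold rule, not the language model used on the project; the toy sentences, the stop-word list and the 0.5 threshold are all assumptions.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus; a real system would learn from a large text collection.
SENTENCES = [
    "the chauffeur drove the client to the airport",
    "the driver drove the client to the airport",
    "the nurse treated the patient at the clinic",
    "the doctor treated the patient at the clinic",
    "the vet treated the animal at the clinic",
]
OCCUPATIONS = ["chauffeur", "driver", "nurse", "doctor", "vet"]
STOP_WORDS = {"the", "to", "at"}


def context_vector(occupation):
    """Count the non-stop words that share a sentence with the occupation."""
    counts = Counter()
    for sentence in SENTENCES:
        words = sentence.split()
        if occupation in words:
            counts.update(
                w for w in words if w != occupation and w not in STOP_WORDS
            )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(occupations, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    vectors = {o: context_vector(o) for o in occupations}
    clusters = []
    for occ in occupations:
        for group in clusters:
            if cosine(vectors[occ], vectors[group[0]]) >= threshold:
                group.append(occ)
                break
        else:
            clusters.append([occ])
    return clusters


clusters = cluster(OCCUPATIONS)
# → [['chauffeur', 'driver'], ['nurse', 'doctor', 'vet']]
```

In a model, the cluster index would then replace the raw occupation as a feature, so that rare occupations share evidence with similar, more common ones.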

The Outcome

While contractual constraints prevent us from divulging too much detail, we were able to show a clear and consistent capacity to predict claims at considerably higher rates than the industry standard. The model was trained on 1.3 million records, then tested on 628,000.

The client has chosen to put the model live, and that work is currently ongoing.