In the assessment of car insurance claims, the claim rate presents a highly skewed probability distribution, which is typically modeled with the Tweedie distribution. The traditional approach to obtaining a Tweedie regression model involves training on a centralized dataset; when the data are provided by multiple parties, training a privacy-preserving Tweedie regression model without exchanging raw data becomes a challenge. To address this issue, this study introduces a novel vertical federated learning-based Tweedie regression algorithm for multi-party auto insurance rate setting across data silos. The algorithm keeps sensitive data local and uses privacy-preserving techniques to perform intersection operations between the two data-holding parties. After the shared entities are determined, the participants train the model locally on the shared entity data to obtain the intermediate parameters of the local generalized linear model. Homomorphic encryption is introduced to exchange and update these intermediate parameters, so that the parties collaboratively complete the joint training of the car insurance rate-setting model. Performance tests on two publicly available datasets show that the proposed federated Tweedie regression algorithm can effectively produce Tweedie regression models that leverage the value of both parties' data without exchanging it. The assessment results of the scheme approach those of a Tweedie regression model learned from centralized data, and outperform Tweedie regression models learned independently by a single party.
In recent years, there has been growing interest in the analysis of vehicle insurance data. Currently, many property and casualty insurance companies face a high combined cost ratio, with motor insurance accounting for a significant portion of overall costs. In this context, usage-based insurance (UBI) for vehicles has emerged as a competitive product in the commercial vehicle insurance market. UBI premiums are determined by specific vehicle usage behavior and the corresponding level of risk. Insurers collect data during the underwriting cycle to extract appropriate risk-type parameters for the different driving behaviors and habits of insured vehicles. These parameters are then used to adjust the traditional commercial vehicle insurance premium for the next cycle, ultimately determining differentiated premiums for the insured vehicles. However, there is currently no clear standard for the differentiated premium adjustment mechanism of vehicle UBI products: the risk type can only be judged from a single driving parameter (e.g., mileage, driving speed), or a comprehensive risk type can be determined from multiple driving parameters [
In the motor insurance industry, there are numerous individual risks that must be classified according to their characteristics, with rates determined for each risk category based on this classification. The development of risk-based rate-setting models for motor insurance can be divided into three stages: initial rate-setting models, the popularity of generalized linear models (GLMs), and the emergence of extended classes. Early actuarial models for motor insurance rate setting used additive and multiplicative models, the former assuming an additive relationship between rate factors and the latter a multiplicative relationship. Since the late 20th century, GLMs [
The applications of GLMs in the car insurance field include risk assessment, claims prediction, premium pricing, and loss fitting. These applications help car insurance companies better manage and control risks and improve business efficiency and profitability. The development of GLMs in the car insurance field therefore provides more accurate and reliable modeling tools for insurance companies.
Traditional motor insurance pricing depends only on fixed factors such as the driver's age and gender and the vehicle's mileage and price. In practice, however, dynamic data on users and vehicles also affect motor insurance pricing. In the auto insurance claims process, insurance companies have an urgent need for external data because of their limited insight into policyholder information and the low quality of the information they collect. Insurers are therefore beginning to work with external data vendors to fuse internal and external data and develop motor insurance risk control models using machine learning algorithms.
Risk control models are statistical models that are used to estimate the risk associated with an event or situation. In the context of car insurance, risk control models can be used to predict the likelihood of a claim and determine an appropriate premium. These models are often based on various factors such as driver age, driving record, vehicle make and model, and geographical location.
In the auto insurance risk control scenario, joint modelling refers to a project in which an insurer and an external data vendor collaborate: the insurer provides samples with risk performance labels to the data vendor, the vendor matches them with its feature data to develop a model, and the insurer then accesses the model to make risk strategies. With tightening regulations on personal data privacy and insurers' increasing reliance on external data, joint modelling is gaining importance.
However, in recent years, countries around the world have increasingly attached importance to data privacy protection, and laws and regulations for privacy protection have been introduced successively [
To overcome the challenges brought by data privacy protection, many new technologies and algorithms have emerged, such as federated learning and homomorphic encryption. Federated learning (FL) [
Federated learning is widely used in scenarios that require data privacy protection, such as healthcare, financial services, and military fields. In regression problems, federated learning can be used to predict numerical target variables, such as predicting stock prices or disease incidence rates [
To address the above issues, a Tweedie generalized linear regression-based joint modelling scheme for federated learning car insurance rate setting is proposed. The scheme addresses the joint modelling of car insurance rate setting while taking into account the privacy protection of user and vehicle data. All sensitive data is stored locally at the institution to which it belongs, and encryption-based user ID alignment ensures that the participants align their common user samples without any flow of raw data. The experimental results show that the scheme performs well for the quantitative analysis of car insurance pricing variables and user risks.
Federated learning is essentially a cryptographic distributed machine learning framework that enables data sharing and joint modelling on the basis of data privacy, security, and legal compliance. The core idea is that when multiple data sources participate in model training, only the intermediate parameters of the model are exchanged for joint training; the raw data never flow and can be kept local. This approach balances data privacy protection against data sharing and analysis, i.e., a “data available but not visible” model of data use.
Vertical federated learning, i.e., sample-aligned federated learning, is suitable for scenarios where participants have a large overlap in user space but little or no overlap in feature space, as shown in
The mainstream federated learning frameworks currently available include FATE (Federated AI Technology Enabler) by WeBank, PySyft by OpenMined, PaddleFL (Paddle Federated Learning) by Baidu, FedML by USC, and TFF (TensorFlow Federated) by Google [
PySyft separates private data from model training using federated learning, differential privacy, and cryptographic computation in major deep learning frameworks such as PyTorch and TensorFlow. PaddleFL is an open-source federated learning framework based on PaddlePaddle, offering many federated learning strategies and their applications in computer vision, natural language processing, and recommendation. FedML is an open research library and benchmark that facilitates the development of new federated learning algorithms and fair performance comparisons, supporting three computational paradigms (distributed training, mobile on-device training, and standalone simulation) so that users can experiment in different system environments. TFF is mainly used for horizontal federated learning scenarios, especially on Android mobile devices; with TFF, developers can train shared global models across multiple participating clients.
FATE is an open-source project initiated by the AI division of WeBank and the world's first industrial-grade federated learning framework, providing a reliable and secure computing framework for the federated learning ecosystem. By the end of 2021, more than 1,000 companies and 200 research institutions had joined the FATE open-source ecosystem, with a large number of mainstream participants and major community contributors. The FATE project uses secure multi-party computation (MPC) [
The Tweedie family of distributions was first introduced in 1984 by Tweedie, a statistician at the University of Liverpool, UK, and later named by Smyth et al. [
The Tweedie distribution is a special case of the exponential dispersion model (EDM) with a power parameter $p$, characterized by the following power relationship between the mean and variance of the distribution:

$\mathrm{Var}(Y) = \phi\,\mu^{p}$,

where $\mu = \mathrm{E}(Y)$ is the mean and $\phi > 0$ is the dispersion parameter. The power parameter $p$ determines which member of the family is obtained.
Tweedie EDMs and their power parameters:

Distribution   | Power parameter p
Normal         | p = 0
Poisson        | p = 1
Poisson-gamma  | 1 < p < 2
Gamma          | p = 2
For $1 < p < 2$, the Tweedie distribution is a compound Poisson-gamma distribution. Given that it is a compound distribution, a random variable $Y$ can be described as

$Y = \sum_{i=1}^{N} X_i$,

where $N \sim \mathrm{Poisson}(\lambda)$ is the number of claims and the $X_i \sim \mathrm{Gamma}(\alpha, \beta)$ are independent claim severities, independent of $N$. When $N = 0$, $Y = 0$, which gives the distribution its point mass at zero.
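The compound construction can be checked with a short simulation; the parameter values below (claim frequency, gamma shape and scale) are illustrative choices, not values from the paper.

```python
import numpy as np

# Simulate a compound Poisson-gamma (Tweedie, 1 < p < 2) variable:
# Y is the sum of N gamma-distributed claims, with N ~ Poisson(lam).
# lam, shape, scale are illustrative values, not taken from the paper.
rng = np.random.default_rng(42)
lam, shape, scale = 2.0, 3.0, 10.0

def sample_tweedie(size):
    n_claims = rng.poisson(lam, size=size)  # claim count per policy
    # total claim amount per policy: sum of k gamma severities
    return np.array([rng.gamma(shape, scale, k).sum() for k in n_claims])

y = sample_tweedie(50_000)
print(f"share of exact zeros: {(y == 0).mean():.3f}")  # point mass, ~exp(-lam)
print(f"sample mean: {y.mean():.1f} (theory lam*shape*scale = {lam*shape*scale:.1f})")
```

The simulated sample shows both hallmarks of the claim-amount distribution: a point mass at zero (policies with no claims) and a right-skewed continuous part for positive totals.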
The generalized linear model (GLM), first proposed by McCulloch [
Homomorphic encryption was first proposed by Rivest et al. [
This work is concerned with additive semi-homomorphic encryption; e.g., the Paillier encryption algorithm is a classical additive semi-homomorphic scheme and has been used in common federated learning algorithms. During the initialisation phase, the Paillier algorithm generates the key pair
With public key $(n, g)$ and secret key $(\lambda, \mu)$, the scheme supports the following operations, where $L(x) = (x-1)/n$:

Encryption: $c = \mathrm{Enc}(m) = g^{m} r^{n} \bmod n^{2}$, with $r$ a random value coprime to $n$.

Decryption: $m = \mathrm{Dec}(c) = L(c^{\lambda} \bmod n^{2}) \cdot \mu \bmod n$.

Homomorphic addition: $\mathrm{Dec}(\mathrm{Enc}(m_1) \cdot \mathrm{Enc}(m_2) \bmod n^{2}) = m_1 + m_2$.

Scalar addition: $\mathrm{Dec}(\mathrm{Enc}(m_1) \cdot g^{m_2} \bmod n^{2}) = m_1 + m_2$.

Scalar multiplication: $\mathrm{Dec}(\mathrm{Enc}(m_1)^{k} \bmod n^{2}) = k \cdot m_1$.
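These properties can be demonstrated with a minimal textbook Paillier implementation; the primes below are tiny and the code is purely illustrative, not a secure or production implementation.

```python
import random
from math import gcd

# Minimal textbook Paillier (illustrative only: tiny primes, no hardening).
def keygen(p=293, q=433):                        # small demo primes
    n = p * q
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lambda = lcm(p-1, q-1)
    g = n + 1                                     # standard simple choice of g
    L = lambda x: (x - 1) // n
    mu = pow(L(pow(g, lam, n * n)), -1, n)        # modular inverse (Python >= 3.8)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    while True:
        r = random.randrange(1, n)                # random r coprime to n
        if gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (((pow(c, lam, n * n) - 1) // n) * mu) % n

pk, sk = keygen()
c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
n2 = pk[0] ** 2
print(decrypt(pk, sk, (c1 * c2) % n2))  # homomorphic addition: 42
print(decrypt(pk, sk, pow(c1, 5, n2)))  # scalar multiplication: 75
```

Multiplying ciphertexts adds the underlying plaintexts, and exponentiating a ciphertext multiplies its plaintext by a scalar, which is exactly what the federated protocol needs to aggregate encrypted intermediate results.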
Analysis of the data shows that this modelling task fits vertical federated learning, for which a vertical federated learning system was created between the insurance company (referred to as Company A) and the data company (referred to as Company B), with the system architecture shown in
The training process for vertical federated learning generally consists of two parts. The first part is encrypted entity alignment: the data of Company A and Company B are stored in their respective systems and the original data are never exchanged. The system uses an encryption-based user ID alignment technique to ensure that Parties A and B can align their common users without exposing their respective original data; during entity alignment, the system does not expose the users who belong to only one company. The second part is the encrypted model training phase: once the shared entities have been identified, the parties use the data of these shared entities to collaboratively train a machine learning model.
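A minimal sketch of such encrypted ID alignment, in the style of Diffie-Hellman-based private set intersection (an illustrative stand-in, not the exact protocol used by FATE):

```python
import hashlib
import secrets

# Toy Diffie-Hellman-style private set intersection (PSI) sketch.
# Each party raises hashed IDs to its own secret exponent; because
# exponentiation commutes, doubly-encrypted IDs can be matched without
# revealing the raw IDs. Illustrative only, not the FATE protocol.
P = 2**521 - 1  # a Mersenne prime serving as the group modulus (toy choice)

def h(uid: str) -> int:
    return int.from_bytes(hashlib.sha256(uid.encode()).digest(), "big") % P

def intersect(ids_a, ids_b):
    ka = secrets.randbelow(P - 2) + 1   # Party A's secret exponent
    kb = secrets.randbelow(P - 2) + 1   # Party B's secret exponent
    # A encrypts its IDs and B re-encrypts them; symmetrically for B's set
    a_double = {pow(pow(h(u), ka, P), kb, P): u for u in ids_a}
    b_double = {pow(pow(h(u), kb, P), ka, P) for u in ids_b}
    return {u for enc, u in a_double.items() if enc in b_double}

print(sorted(intersect({"u1", "u2", "u3"}, {"u2", "u3", "u4"})))  # ['u2', 'u3']
```

Only the shared IDs are recovered; IDs held by a single party stay hidden behind the double exponentiation, which mirrors the guarantee the entity-alignment phase provides.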
The proposed model consists of two participants, A and B, and one collaborator, C, working together to train the machine learning model, with each participant having a sample size of n. The work consists of the following main components:
Participant A, with a certain number of specific samples, each with a corresponding feature value
A and B each have their own machine learning model training servers,
According to
A and B input each sample
For the calculation of the loss function of Tweedie generalized linear regression, according to
The servers S1, S2 compute the losses of A, B and use homomorphic encryption to obtain:
Server S2 receives the parameters from S1 and calculates the overall loss, then we have:
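As a centralized point of reference, the Tweedie negative log-likelihood (up to terms that do not depend on the mean) and its gradient under a log link can be sketched as follows; in the vertical setting the linear predictor would be the sum of each party's local part, and all names and data here are illustrative.

```python
import numpy as np

# Sketch of the Tweedie (1 < p < 2) loss and gradient under a log link,
# mu = exp(eta) with eta = X @ w. Terms of the log-likelihood that do not
# depend on mu are dropped. In the vertical federated setting, eta would be
# the sum of each party's local linear predictor; this centralized version
# is for illustration only.
def tweedie_loss(w, X, y, p=1.5):
    mu = np.exp(X @ w)
    return np.mean(y * mu ** (1 - p) / (p - 1) + mu ** (2 - p) / (2 - p))

def tweedie_grad(w, X, y, p=1.5):
    mu = np.exp(X @ w)
    # derivative w.r.t. eta is mu**(2-p) - y*mu**(1-p); chain rule through X
    return X.T @ (mu ** (2 - p) - y * mu ** (1 - p)) / len(y)

# toy gradient-descent fit on synthetic count-valued data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.poisson(np.exp(X @ np.array([0.3, -0.2, 0.1]))).astype(float)
w = np.zeros(3)
for _ in range(500):
    w -= 0.1 * tweedie_grad(w, X, y)
print(tweedie_loss(w, X, y) < tweedie_loss(np.zeros(3), X, y))  # True
```

Because the loss is convex in the linear predictor under the log link, plain gradient descent suffices here; the federated protocol computes the same gradient, but with the per-party contributions exchanged in encrypted form.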
Step   | Party A        | Party B        | Party C
Step 1 | Initializes …  | Initializes …  | Creates an encryption key pair; sends the public key to A and B
Step 2 | Computes …     | Computes …     | —
Step 3 | Initializes …  | Initializes …  | Decrypts …
Step 4 | Updates …      | Updates …      | —
Result | Convergence or non-convergence is judged based on
Assuming that the loss function values
Under homomorphic encryption, A and B jointly compute their respective
The model training procedure is summarized in four steps, as shown in
The training protocol shown in
Proof of protocol security: this work assumes that both parties are semi-honest. If one party is malicious and tricks the system by falsifying its input, e.g., if Party A submits only one non-zero input with one non-zero feature, it can determine the value of
Step   | Party A    | Party B    | Party C
Step 1 | —          | —          | Sends user ID i to A and B
Step 2 | Computes … | Computes … | Gets result …
In this section, we evaluate through experiments the convergence value of our solution for different values of the power parameter and the time overhead for different data sizes. We also experimentally compare the evaluation results of our solution with those of the standalone solution.
The experiments are executed in a LAN environment based on the FATE vertical federated learning framework (version 1.8), running on an AMD Ryzen 7 5800H CPU at 3.20 GHz with 8 cores and 16 threads and 32 GB of DDR4 RAM, in a 64-bit CentOS 7.3 environment. The Tweedie regression model was trained in Python using the NumPy library.
We evaluated the performance of the Tweedie regression federated learning model using two datasets from the financial insurance field.
The freMTPL2freq dataset is a French automobile thirdparty liability claims dataset, containing 677,991 samples of thirdparty liability insurance policies, each sample consisting of 10dimensional attribute features and one label. The attribute features include policy holder characteristics (age, gender, etc.), vehicle characteristics (make, model, etc.), and claimrelated information (time, location, etc.).
The CarData dataset comes from a publicly available set of insurance policy claims data on car insurance in de Jong et al. [
To verify the effectiveness of the proposed FLTRM (Tweedie Regression Federated Learning Model) method, experimental comparisons are conducted with three other methods.
The experimental settings for LocalATRM and LocalBTRM involve training the Tweedie regression model only on the local data of participant A and participant B, respectively. The purpose of this is to test the effectiveness of the Tweedie regression model under nonfederated settings and verify the effectiveness of federated learning. The NoFLTRM experimental setting involves training the model on the entire dataset after aggregating all the attribute features, which represents the traditional Tweedie regression method. The purpose of this is to compare its performance with the federated learning framework and evaluate the accuracy loss of the models trained under federated settings.
The freMTPL2freq dataset is partitioned into attribute features of 10 dimensions, which are split between participant A and participant B according to the ratios of 2:8, 3:7, 4:6, and 5:5. The label feature y is assigned to participant A, who serves as the active participant, while participant B serves as the collaborative participant. The FLTRM model will be trained using vertical federated learning with the joint participation of both participants A and B.
The experiments are conducted with L1 regularization and a penalty factor of
Feature partition ratio | Model     | MAE      | RMSE
2:8                     | LocalATRM | 176.1543 | 6991.6499
2:8                     | LocalBTRM | 171.3418 | 6990.5048
2:8                     | NoFLTRM   | 171.7904 | 6990.1939
2:8                     | FLTRM     | 174.5564 | 6991.4738
3:7                     | LocalATRM | 173.0040 | 6990.5731
3:7                     | LocalBTRM | 171.8653 | 6990.5259
3:7                     | NoFLTRM   | 171.7904 | 6990.1939
3:7                     | FLTRM     | 172.8734 | 6990.8932
4:6                     | LocalATRM | 172.9895 | 6990.5621
4:6                     | LocalBTRM | 172.1216 | 6990.5380
4:6                     | NoFLTRM   | 171.7904 | 6990.1939
4:6                     | FLTRM     | 172.1702 | 6990.5371
5:5                     | LocalATRM | 172.3342 | 6990.5543
5:5                     | LocalBTRM | 172.1376 | 6990.5422
5:5                     | NoFLTRM   | 171.7904 | 6990.1939
5:5                     | FLTRM     | 171.8976 | 6990.2972
MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) are two evaluation metrics for regression models where lower values indicate better performance.
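For reference, the two metrics have the standard definitions, sketched here on a tiny hypothetical example (the data values are made up):

```python
import numpy as np

# Standard definitions of the two metrics used in the tables above.
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# toy values for illustration only
y_true = np.array([0.0, 1.0, 3.0])
y_pred = np.array([0.5, 1.0, 2.0])
print(round(mae(y_true, y_pred), 4))   # 0.5
print(round(rmse(y_true, y_pred), 4))  # 0.6455
```

RMSE squares the residuals, so it penalizes the large errors typical of skewed claim data more heavily than MAE does, which is why the two metrics are reported together.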
From
In general, models trained on more data tend to perform better than models trained on less data. However, the contribution of participants’ models to evaluation results depends not only on the amount of data they have but also on many other factors such as data quality, model and hyperparameter selection, and how well their data represents the overall sample.
When the feature partition ratio was extremely unbalanced, FLTRM failed to learn effectively. This experiment also suggests that the difference in the number of features between the participating parties in federated learning should not be too large.
On the CarData dataset, we conducted experiments with two participating parties. The feature split ratio of the dataset was 4:3, which means that for each sample in the dataset, 4 out of 7 attributes were allocated to participating Party A as the collaborator, while the remaining 3 attributes and the label y were allocated to Party B as the active party. The FLTRM model was trained through vertical federated learning with the joint participation of Parties A and B. The experimental results are shown in
Model     | MAE      | RMSE
LocalATRM | 253.2908 | 1079.7570
LocalBTRM | 253.3144 | 1079.7959
NoFLTRM   | 241.0070 | 1062.6478
FLTRM     | 245.4078 | 1071.8824
Based on
In addition to evaluating the model using MAE and RMSE, the risk coefficient
Score | NoFLTRM Count | NoFLTRM Mean value | FLTRM Count | FLTRM Mean value
1     | 6886          | 89.2906            | 6554        | 99.2172
2     | 6710          | 107.3694           | 6553        | 103.6351
3     | 6824          | 124.6016           | 6555        | 106.0244
4     | 10336         | 126.0966           | 6551        | 128.5415
5     | 13475         | 125.1362           | 9830        | 140.8193
6     | 10066         | 132.7357           | 13107       | 149.3325
7     | 3381          | 200.1981           | 9830        | 138.6016
8     | 3398          | 201.7028           | 3277        | 173.6400
9     | 3395          | 184.9875           | 3277        | 155.1071
10    | 3385          | 240.1961           | 6553        | 253.4401
In this work, we propose a federated learning-based Tweedie regression algorithm for constructing a joint assessment model for multi-party auto insurance rate setting across data silos. The algorithm derives the log-likelihood of the vertical federated Tweedie regression model using an iterative method and constructs a gradient-update strategy for the parameters based on the loss function, introducing a homomorphic encryption algorithm to fuse and update the parameters of all parties and obtain the federated Tweedie regression model. Experiments on two datasets demonstrate that federated learning can train a model on the datasets of all parties while protecting data privacy. Furthermore, the model testing results show that the federated model performs better than the single-party trained models. On auto insurance datasets whose label features follow a Tweedie distribution, the proposed model achieves good results in setting auto insurance rates. Future work will investigate extending the scheme to the analysis of data with correlation structure and improving the accuracy and validity of the analysis by introducing random effects into the GLM.
I would like to express my heartfelt gratitude to all those who have contributed to the successful completion of this research work. First and foremost, I am deeply grateful to my supervisor, Professor Changgen Peng, whose guidance, support, and encouragement throughout the research process have been invaluable. Second, I would like to express my heartfelt gratitude to Professor Weijie Tan, whose expertise and insightful feedback have significantly improved the quality of this paper. I am also thankful to the members of my research committee, State Key Laboratory of Public Big Data, for their valuable suggestions and constructive criticism, which helped shape the direction of this study. I extend my appreciation to my colleagues and friends for their continuous support and for being a source of motivation during challenging times.
This research was funded by the National Natural Science Foundation of China (No. 62272124), the National Key Research and Development Program of China (No. 2022YFB2701401), the Guizhou Province Science and Technology Plan Project (Grant No. Qiankehe Platform Talent [2020]5017), the Research Project of Guizhou University for Talent Introduction (No. [2020]61), the Cultivation Project of Guizhou University (No. [2019]56), and the Open Fund of the Key Laboratory of Advanced Manufacturing Technology, Ministry of Education (GZUAMT2021KF [01]).
The authors confirm contribution to the paper as follows: study conception and design: Tao Yin, Changgen Peng; data collection: Tao Yin, Hanlin Tang; analysis and interpretation of results: Tao Yin, Weijie Tan; draft manuscript preparation: Dequan Xu. All authors reviewed the results and approved the final version of the manuscript.
The data and materials used in this study are available upon request. Researchers and interested parties can obtain access to the datasets and any supplementary materials by contacting the corresponding author at
The authors declare that they have no conflicts of interest to report regarding the present study.