Exploration on Car Insurance Data Using Supervised Learning

Abstract — The insurance sector by nature involves rigorous collection of data. Statistics on insurance business data are used to measure risk, and this can be achieved with big data analytics. Big data analytics is used to find the features of potential clients, allowing the insurance business to reach higher predictive accuracy. The key objective of this paper is to analyze and understand customers' needs and purchase plans, in order to find who will buy the car insurance service offered in a campaign. To this end, we apply different classification algorithms in R to large-scale insurance data to improve performance and predictive modeling.

We collected the data from Kaggle datasets: The Home of Data Science & Machine Learning. We use the confusion matrix, precision, recall, and F-measure to estimate the performance of each algorithm. The final result shows which algorithm outperforms the other classification algorithms, in terms of both accuracy and performance, at predicting who will buy the car insurance service.

Keywords — Supervised learning, R tool, big data, classification algorithms, sampling technique.

I. INTRODUCTION

A huge amount of data is generally referred to as Big Data.
It is enormous in size, comes in a diverse variety, and arrives at high velocity. This huge volume of information is useless unless the data is examined to uncover new correlations, customer insights, and other useful information that can help an organization take more informed business decisions. Big data is widely applied in sectors such as healthcare, insurance, finance, and many more. Big data in the insurance sector is one of the most promising applications. The traditional marketing system of insurance is an offline sales business: companies generally sell insurance policies by calling and visiting customers.

This fixed marketing system achieved good results in the past, but now many new private insurance companies have entered the marketplace, creating stronger competition. On the other hand, people's willingness to pay for insurance services has also increased. Therefore, understanding the needs and purchase plans of clients is extremely essential for insurance companies to raise sales volume. Big data technology supports the insurance companies' transformation. The lack of principle and innovation in traditional marketing, badly structured insurance data, and unclear customer purchasing characteristics lead to imbalanced data, which makes classifying users and recommending insurance products difficult. Decision-making tasks are hard with an imbalanced data distribution.
To solve this problem, we usually use resampling methods that construct balanced training datasets, which improves the performance of the predictive model. The main purpose of this paper is to identify potential customers with the help of big data technology. The paper not only provides a good strategy for identifying potential clients but also serves as a reference for classification problems. We propose supervised learning algorithms based on ensemble decision trees (Random Forest and XGBoost) to predict who will buy the car insurance service in the following campaign.

This paper is organized as follows. Section II introduces the current research status of machine learning; Section III puts forward the classification model and an intelligent recommendation algorithm based on the XGBoost algorithm for insurance business data, and analyzes its efficiency; Section IV presents the experiment and results; Section V analyzes the results; Section VI gives the conclusion and future work.

II. RELATED WORK

The classification problem for US bank insurance business data has an imbalanced data distribution. This means the ratio between the positive and negative classes is extremely unbalanced, so prediction models generated directly by supervised learning algorithms such as SVM or Logistic Regression are biased toward the larger class. For example, if the ratio between positive and negative classes is 100:1, such a model does not help with prediction. An imbalanced class distribution will affect the performance of any classifier.
Thus, some techniques should be applied to deal with this problem. One approach to handling an unbalanced class distribution is sampling [2], which rebalances the dataset. Sampling techniques are broadly classified into two types: under sampling and over sampling. Under sampling is applied to the major class to reduce it (e.g. Random Under Sampling), while over sampling adds samples to the minor class (e.g. Random Over Sampling). The drawback of ROS is redundancy in the dataset: the classifier may still fail to recognize the minor class well. To overcome this problem, SMOTE (Synthetic Minority Over-sampling Technique) is used. SMOTE creates additional samples that are close and similar to the near neighbors of minor-class samples, rebalancing the dataset with the help of K-Nearest Neighbors (KNN) [2]. Sampling methods can also be divided into non-heuristic and heuristic methods. Non-heuristic methods randomly remove samples from the majority class in order to reduce the degree of imbalance [10]; heuristic sampling instead distinguishes samples based on a nearest-neighbor algorithm [7].

Another difficulty in classification problems is data quality, in particular the existence of missing data. Frequent occurrence of missing data gives biased results. Since dataset attributes are often dependent on each other, identifying the correlation between attributes can be used to recover missing values. The approach of replacing missing values with probable values is called imputation [6].
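The random under-sampling, random over-sampling, and SMOTE ideas above can be sketched in a few lines. The paper's experiments use R; the following is an illustrative stdlib-Python sketch (the function names and the simplified 1-NN SMOTE variant are our own, not the paper's code):

```python
import random

def random_under_sample(majority, minority, seed=0):
    # Randomly discard majority-class rows until the classes are balanced.
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

def random_over_sample(majority, minority, seed=0):
    # Randomly duplicate minority-class rows until the classes are balanced.
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def smote_like(minority, n_new, seed=0):
    # SMOTE-style synthesis: interpolate between a minority row and its
    # nearest minority neighbour (1-NN by squared Euclidean distance).
    rng = random.Random(seed)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = min((m for m in minority if m is not a), key=lambda m: dist(a, m))
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

Note that `smote_like` generates points on the line segment between two real minority samples, which is why the synthetic rows stay "close and similar" to the near neighbors, unlike the exact duplicates produced by ROS.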
One of the challenges in big data is data quality. We need to ensure the quality of the data; otherwise it can mislead us into wrong predictions. One significant data-quality problem is missing data. Imputation is a method for handling missing data: it reconstructs the missing values with estimated ones.

The imputation approach has the advantage of handling missing data without the help of learning algorithms, and it also allows the researcher to select the imputation method suited to a particular circumstance [3]. There are many imputation methods for missing-value treatment; some widely used ones are case substitution, mean and mode imputation, and predictive models. In this paper we built a predictive model for missing-value treatment.

There is a variety of machine learning algorithms to crack both classification and regression problems. Machine learning is the practice of designing classifiers that can repeatedly learn and perform without being explicitly programmed. Machine learning algorithms are classified into three types: supervised learning, unsupervised learning, and reinforcement learning. In this paper, we propose supervised machine learning algorithms to build the model. Some supervised learning algorithms are: regression, decision tree, Random Forest, KNN, logistic regression, etc. [8]. A decision tree in machine learning can be used for both classification and regression.
In decision analysis, a decision tree can be used to visually and unambiguously represent decisions. The tree has two significant entities, known as decision nodes and leaves. The leaves are the verdicts, or final results, and the decision nodes are where the data is split. A classification tree is a type of decision tree where the outcome is a variable like 'fit' or 'unfit'; here the decision variable is categorical. One of the best ensemble methods is Random Forest.

It is used for both classification and regression [5]. A Random Forest is a collection of many decision trees, each grown to its full depth, and it has the advantage of automatic feature selection, among others [4]. Gradient Boosting instead seeks to consecutively decrease the error with each successive model, until one final model is produced.
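To make the sequential error-reduction idea concrete, here is a minimal gradient-boosting sketch with depth-1 trees (stumps) and squared loss on one-dimensional data. This is our own Python illustration of the general technique, not the paper's R code or XGBoost itself:

```python
def fit_stump(x, r):
    # Find the threshold split that minimises the squared error of the
    # residuals; returns (threshold, left_value, right_value).
    best = None
    for t in sorted(set(x)):
        left = [r[i] for i in range(len(x)) if x[i] <= t]
        right = [r[i] for i in range(len(x)) if x[i] > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1], best[2], best[3]

def boost(x, y, n_rounds=20, lr=0.5):
    # Start from the mean prediction, then repeatedly fit a stump to the
    # current residuals (the negative gradient of squared loss) and add a
    # shrunken copy of it to the ensemble's predictions.
    f = [sum(y) / len(y)] * len(x)
    for _ in range(n_rounds):
        residuals = [y[i] - f[i] for i in range(len(y))]
        t, lv, rv = fit_stump(x, residuals)
        f = [f[i] + lr * (lv if x[i] <= t else rv) for i in range(len(x))]
    return f
```

Each round corrects part of the previous ensemble's error, so the training error shrinks geometrically with the learning rate; XGBoost adds regularization and second-order information on top of this basic loop.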
The key aim of every machine learning algorithm is to construct the strongest predictive model while also accounting for computational efficiency. This is where the XGBoost algorithm comes into play. XGBoost (eXtreme Gradient Boosting) is a direct application of Gradient Boosting to decision trees.

It provides a further regularized model formalization to control over-fitting, which gives improved performance [8].

III. METHODOLOGY

Classification Model: The traditional sales approach for insurance products is an offline process and has the following disadvantages: (1) there is no customer evaluation system, so the characteristics and influence weights of potential customers are unknown; (2) the data accumulated in this way is usually seriously flawed, which indirectly harms the accuracy of the classification model [4]. For many classification models, the class distribution and correlated features affect the forecast results. Imbalanced data and dependent attributes of the insurance dataset cause serious deviations in the classification model's results. We can handle these kinds of problems with different sampling methods and supervised learning algorithms.
In this article, we use resampling approaches together with supervised learning algorithms on a car insurance dataset to build the best predictive model. The imbalanced-data classification problem is resolved with resampling, and we then build the model with supervised learning algorithms on the training dataset. Finally, the predictive model is validated on the test dataset, and the performance of the algorithms is evaluated using the confusion matrix. Precision, recall, and F-measure are also calculated as performance metrics. The taxonomy of the proposed classification model is given below:

Figure 1. Taxonomy of proposed methodology

A. Dataset: The key objective of this paper is to analyze and understand customers' needs and purchase plans, to find who will buy the car insurance service of the campaign. So, we apply different classification algorithms in R to large-scale insurance data to improve performance and predictive modeling. We collected the data from Kaggle datasets: The Home of Data Science & Machine Learning. This dataset was collected from a bank in the US.

In addition to common services, this bank also provides car insurance services. The bank arranges promotions, such as campaigns, every year to catch the attention of new customers. The bank has provided details about potential customers, along with bank staff's call durations for promoting the available car insurance decision. We have data on 4000 customers who were contacted during the last campaign, and the outcome of the campaign, i.e. whether the customer bought the insurance product or not, is known.

B.
Preprocessing: Data is usually collected for unspecified applications. Data quality is one of the major issues that needs to be addressed in the process of big data analytics. Problems that affect data quality include the following: 1. noise and outliers; 2. missing values; 3. duplicate data. Preprocessing is a method used to make data more appropriate for analytics. Data cleaning is the process of handling missing data. We used an analytical model as our imputation method, estimating the missing values from the non-missing data.

Here, we used the KNN algorithm to estimate the missing data: a missing entry is estimated with the help of its nearest neighbors' values. Data transformation is another preprocessing method, used to normalize data. Normalization is a process in which we transform a complex dataset into a simpler one. Here, we used Min-Max normalization, which scales the data between 0 and 1:

x' = (x - min(x)) / (max(x) - min(x))

where x is the vector that we are going to normalize, and min(x) and max(x) are the minimum and maximum values in x over its range.
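As a sketch of these two preprocessing steps, the following Python functions are our own simplified illustration (the paper works in R, and a 1-NN variant stands in for the KNN imputation): missing entries are filled from the nearest complete row, and Min-Max scaling maps a column to [0, 1].

```python
def nn_impute(rows):
    # Fill None entries using the nearest complete row, comparing rows by
    # squared Euclidean distance over the columns the incomplete row has.
    complete = [r for r in rows if None not in r]
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b) if x is not None)
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        nearest = min(complete, key=lambda c: dist(r, c))
        out.append([nearest[j] if v is None else v for j, v in enumerate(r)])
    return out

def min_max_scale(col):
    # Min-Max normalization: x' = (x - min) / (max - min), so values land in [0, 1].
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]
```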
Once the dataset is pre-processed, it is ready for data partitioning.

C. Data Partition: In this step, we split the data into separate train and test roles rather than working with all of the data at once. The training data contains the input data with the expected output. Roughly 60% of the original dataset constitutes the training set, and the other 40% is the testing set; the test data validates the core model and checks its accuracy. Here, we partitioned the original insurance dataset into train and test sets with a 60/40 split.

D. Supervised Learning Algorithms: Supervised learning is a machine learning technique that infers a function from training data without being explicitly programmed. Learning is said to be supervised when the desired outcome is already known. After partitioning, the next step is to build the model on the training sample set. First, the target variable is chosen: we selected car insurance as the target and took the other attributes in the dataset as predictors for the predictive model. We want a model that predicts who will buy the car insurance service during the campaign. In this problem, we need to separate the clients who buy car insurance from those who do not, based on the most significant key variables. In this paper, we use the Random Forest and eXtreme Gradient Boosting algorithms to build the model, and we evaluate which algorithm gives better performance.
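The 60/40 partition can be sketched as follows (an illustrative stdlib-Python version of the step; the paper performs it in R):

```python
import random

def train_test_split(rows, train_frac=0.6, seed=42):
    # Shuffle the row indices, then take the first ~60% as the training set
    # and the remaining ~40% as the test set.
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test
```

Fixing the seed makes the partition reproducible, so the same train/test split can be reused when comparing the two algorithms.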
Random Forest: Before we move to Random Forest, it is worth looking at decision trees. What is a decision tree? A decision tree can be used for both classification and regression. Unlike linear models, tree-based predictive models can give high accuracy, and decision trees are frequently used for classification problems.

A decision tree separates the clients based on the predictor variables and identifies the variables that create the most uniform sets of clients. In our case, the decision variable is categorical. Why Random Forest? Random Forest is one of the most frequently used predictive modeling and machine learning techniques. In a normal decision tree model, one decision tree is built to identify the potential client, but in the Random Forest algorithm, a number of decision trees are built during the process to identify the potential client. A vote from each of the decision trees is considered in deciding the final class of an object.

Model Description: Sampling is one of the preprocessing methods; it selects a subset of the original samples and is mainly used to balance the data for classification. In our model, we used an under-sampling approach to balance the data: it condenses the majority class to make its frequency closer to that of the infrequent class.
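The per-tree voting that decides the final class, as described above, can be sketched as follows (illustrative stdlib-Python; the per-tree predictions are assumed to be given rather than produced by a real forest):

```python
from collections import Counter

def forest_predict(per_tree_preds):
    # per_tree_preds[i] is the list of class labels predicted by tree i,
    # one label per observation. The final label for each observation is
    # the majority vote across all trees.
    n_obs = len(per_tree_preds[0])
    final = []
    for j in range(n_obs):
        votes = Counter(tree[j] for tree in per_tree_preds)
        final.append(votes.most_common(1)[0][0])
    return final
```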
The original insurance data is balanced with under sampling, and we then use this sample in the Random Forest algorithm to build the model, which randomly generates n trees to form an effective ensemble.

Extreme Gradient Boosting: Another classifier is eXtreme Gradient Boosting. XGBoost yields a highly predictive model and runs faster than comparable existing implementations. It can be used for regression, classification, and also ranking. One of the most interesting things about XGBoost is its regularized boosting, which helps to lessen over-fitting. Over-fitting is the phenomenon in which the learning model fits the given training data so tightly that it becomes inaccurate in predicting the outcomes of the test data.

Model Description: In our model, we first used over sampling to balance the classes. Sampling techniques can be used to obtain better forecast performance in the case of imbalanced classes using R and the caret package. Over sampling randomly duplicates samples from the class with few instances.

Here, we used over sampling on the train set to improve the performance of the model. The balanced samples are then passed to XGBoost as the training set to build a binary classification model for the insurance data. After this, the model is validated with the test set. This produced much better prediction performance compared to the Random Forest algorithm.

E.
Model Evaluation: Performance analysis of classification problems involves matrix analysis of the predicted results. In this paper, we used the following metrics to evaluate the performance of the classification algorithms: precision, recall, and F-measure. Precision is the fraction of predicted positive instances that are relevant; it is also called the positive predictive value.

Recall is the fraction of relevant instances that have been retrieved over the total number of relevant instances. The F1-measure is the weighted harmonic mean (the number of values divided by the sum of their reciprocals) of precision and recall, and represents the overall performance:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * (Precision * Recall) / (Precision + Recall)

where TP is true positive, FP is false positive, TN is true negative, and FN is false negative.

Table 1: Confusion Matrix

IV. EXPERIMENT AND RESULT

We used KNN for missing-data treatment, and after all preprocessing we built the predictive models with XGBoost and Random Forest for the business case.
The comparison table for the two models is given below.

Table 2: Performance comparison of XGBoost and Random Forest.

Algorithm        Precision   Recall   F1     Accuracy
Random Forest    0.81        0.80     0.76   0.76
XGBoost          0.86        0.86     0.86   0.86

The above result shows that the XGBoost algorithm outperformed Random Forest.
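As an illustration of how the metric columns follow from confusion-matrix counts (the counts below are hypothetical, and the sketch is in Python rather than the paper's R):

```python
def evaluate(tp, fp, fn, tn):
    # Precision, recall, F1, and accuracy from the four confusion-matrix cells.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

For example, with tp=8, fp=2, fn=2, tn=8, all four metrics come out to 0.8, matching the intuition that F1 equals precision and recall when the two are equal.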
V. ANALYSIS OF THE RESULT

Figure 1: Effect of missing values before imputation
Figure 2: Important features that impact the target variable, using the Random Forest algorithm
Figure 3: AUC curve for the Random Forest algorithm
Figure 4: AUC curve for the eXtreme Gradient Boosting algorithm
Figure 5: Important features that impact the target variable, using the Gradient Boosting algorithm
Figure 6: Overall performance analysis

VI. CONCLUSION

This paper analyzed the imbalanced distribution of insurance business data and summarized preprocessing algorithms for imbalanced datasets. We used two supervised learning algorithms in R that can be applied to large-scale imbalanced classification of insurance business data: the XGBoost algorithm and Random Forest. Here, the XGBoost algorithm outperformed the other decision-tree algorithm, Random Forest. Our future work includes combining the proposed algorithms with deep learning.
References:
1. E. Ramentol, Y. Caballero, R. Bello, and F. Herrera, "SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory," Knowl. Inf. Syst., vol. 33, no. 2, pp. 245-265, 2012.
2. M. Farajzadeh-Zanjani, R. Razavi-Far, and M. Saif, "Efficient Sampling Techniques for Ensemble Learning and Diagnosing Bearing Defects under Class Imbalanced Condition."
3. G. E. A. P. A. Batista and M. C. Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning."
4. W. Lin, Z. Wu, L. Lin, A. Wen, and J. Li, "An Ensemble Random Forest Algorithm for Insurance Big Data Analysis," 2017.
5. E. Goel and Er. Abhilasha, "Random Forest: A Review," 2017.
6. Conception of Data Preprocessing and Partitioning Procedure for Machine Learning. Available: http://www.academia.edu/9517738/conception_of_data_preprocessing_and_partitioning_procedure_for_machine_learning_algorithm
7. Down-Sampling Using Random Forests. Available: https://www.r-bloggers.com/down-sampling-using-random-forests/
8. Boosting in Machine Learning and the Implementation of XGBoost. Available: https://towardsdatascience.com/boosting-in-machine-learning-and-the-implementation-of-xgboost-in-python
9. T. Chen and T. He, "xgboost: eXtreme Gradient Boosting," January 4, 2017.
10. J. Laurikkala, "Improving Identification of Difficult Small Classes by Balancing Class Distribution," 2001.