ABSTRACT

Over the last decade, several banks have developed models to quantify credit risk. In addition to monitoring the credit portfolio, these models also help to decide on the acceptance of new contracts, assess customers’ profitability and define pricing strategy. The objective of this paper is to improve the approach to credit risk modeling, namely in scoring models for predicting default events. To this end, we propose a two-stage Ensemble Model that combines the interpretability of the Scorecard with the predictive power of the Artificial Neural Network. The results show that the AUC improves by 2.4% relative to the Scorecard and by 3.2% relative to the Artificial Neural Network.

1. INTRODUCTION

Over the last decade, several banks have developed models to quantify credit risk (Basel Committee on Banking Supervision, 1999). The objective of credit risk modeling is to estimate the expected loss (EL) associated with a credit portfolio. To do so, it is necessary to estimate the Probability of Default (PD), the Loss Given Default (LGD) and the Exposure at the time of Default (EAD). The portfolio’s expected loss is given by the product of these three components, that is, EL = PD × LGD × EAD (Basel Committee on Banking Supervision, 2004).

However, this work focuses only on PD models, which are typically based on scoring models. Credit scoring models are built using historical information on actual customers. For each customer, a set of attributes is recorded, together with whether the customer has failed to pay (defaulted). Specifically, the objective of credit scoring is to assign credit applicants to one of two classes, good customers (non-default) or bad customers (default); it therefore lies in the domain of classification problems (Anderson, 1978).

Currently, credit scoring models are used by about 97% of banks that approve credit card applications (Brill, 1998).
Using scoring models increases revenue by increasing volume, reducing the cost of credit analysis, enabling faster decisions, and monitoring credit risk over time (Brill, 1998). For these reasons, credit risk measurement has become increasingly important under the Basel II capital accord (Basel Committee on Banking Supervision, 2003; Gestel et al., 2005).

In the banking industry, credit scorecard development has been based mostly on logistic regression, because it reconciles predictive power with interpretability. Regulators require that banks be able to explain credit application decisions, so transparency is fundamental in these models (Dong, Lai, & Yen, 2010; Hand & Henley, 1997). In this paper we propose a two-stage ensemble model to reinforce the predictive capacity of a scorecard without compromising its transparency and interpretability.

2. LITERATURE SURVEY

In recent years, several attempts have been made to improve on the accuracy of Logistic Regression (Lessmann, Baesens, Seow, & Thomas, 2015). Louzada et al. (2016) reviewed 187 credit scoring papers and concluded that the most common goal of researchers is the proposal of new methods in credit scoring (51.3%), mainly using hybrid approaches (almost 20%), combined methods (almost 15%) and support vector machines along with neural networks (around 13%). The second most popular objective is the comparison of new methods with traditional techniques, where the most used techniques are Logistic Regression (23%) and neural networks (21%). One of these studies was carried out by West (2000), who compared five neural network models with traditional techniques. The results indicated that neural networks can improve credit scoring accuracy by 0.5% to 3%, and that logistic regression is a good alternative to neural networks. In turn, Gonçalves and Gouvêa (2007) obtained very similar results using Logistic Regression and neural network models.
However, the proposed new methods tend to require complex computing schemes and limit the interpretation of the results, which makes them difficult to implement (Liberati, Camillo, & Saporta, 2017).

Lessmann et al. (2015) state that the accuracy differences between traditional methods and machine learning result from the fully-automatic modeling approach. Consequently, certain advanced classifiers do not require human intervention to predict significantly more accurately than simpler alternatives. Abdou and Pointon (2011) carried out a comprehensive review of 214 papers involving credit scoring applications and concluded that there is no overall best statistical technique for building scoring models; a best technique for all circumstances does not yet exist. This result is in line with the No-Free-Lunch (NFL) theorems for supervised learning (Wolpert, 2002).

Marqués et al. (2012) evaluated the performance of seven individual prediction techniques when used as members of five different ensemble methods and concluded that the C4.5 decision tree constitutes the best solution for most ensemble methods, closely followed by the Multilayer Perceptron neural network and Logistic Regression, whereas the nearest neighbor and naive Bayes classifiers appear to be significantly the worst. Gestel et al. (2005) suggested a gradual approach in which one starts with a simple Logistic Regression and improves it using Support Vector Machines, combining good model readability with improved performance.

3. THEORETICAL FRAMEWORK

3.1. DATASET

To ensure that our results are replicable and comparable, we use the German Credit Data Set from the University of California at Irvine (UCI) Machine Learning Repository. The dataset can be found at http://archive.ics.uci.edu/ml/datasets.html. According to Louzada et al. (2016), almost 45% of the papers reviewed in their survey consider either the Australian or the German credit dataset.
The data set contains 1000 in-force credits, of which 700 are identified as non-defaulted and 300 as defaulted. The 20 input variables prepared by Prof. Hofmann are presented in Table 1.

The target variable is “status” and contains the classification of the loan in terms of default (Lichman, 2013).

The data set comes with a recommended cost matrix, which makes failing to predict a default five times worse than failing to predict a non-default. However, given this paper’s objectives, we chose not to use any cost matrix. Thus, failing to predict a default and failing to predict a non-default have the same cost.

3.2. TWO-STAGE ENSEMBLE MODEL

In this paper, we aim to improve the approach used in credit scoring models. To this end, we propose a Two-Stage Ensemble Model (2SEM) to reinforce the predictive capacity of a Scorecard without compromising its transparency and interpretability.

The concept behind an ensemble is to use several algorithms together to obtain better performance than that obtained by each of the algorithms individually (Rokach, 2010). In our approach, we first estimate a Scorecard (SC) model, and then an Artificial Neural Network (ANN) is estimated on the SC residual. We then combine the two models using a logistic regression (LR). In this way, we intend the ANN to account for the nonlinearity that the SC is unable to capture. The proposed architecture for the Ensemble Model is presented in Figure 1.

Figure 1 shows the set of inputs, the target variable, and the target and residual estimates. Each box stands for a specific algorithm (in this case, SC, ANN and LR) and each circle for a sum operator (where the upper sign applies to the upper variable and the lower sign to the lower variable). The components in Figure 1 are described in more detail in Table 2.

Lastly, to avoid overfitting, the dataset was split randomly into a training set (65%), a validation set (15%) and a test set (20%).
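A split with these proportions can be sketched as follows, sampling separately within each class so that the default rate is preserved in every partition. This is a minimal illustration in plain Python, not the procedure actually used in the paper: the function name `stratified_split`, the seed and the rounding rule are assumptions, and the labels are placeholders that only reproduce the 700/300 class balance of the data set.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.65, 0.15, 0.20), seed=0):
    """Return (train, validation, test) index lists, stratified on labels.

    Hypothetical helper: the last partition receives whatever remains after
    the first two are filled, so the three sizes always add up.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test


# Placeholder labels with the class balance of the German Credit data:
# 700 non-defaults (0) and 300 defaults (1).
labels = [0] * 700 + [1] * 300
train, valid, test = stratified_split(labels)

for name, part in (("train", train), ("validation", valid), ("test", test)):
    default_rate = sum(labels[i] for i in part) / len(part)
    print(name, len(part), round(default_rate, 3))
```

Each partition keeps the same share of defaults as the full data set, which is the property the stratification is meant to guarantee.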
In this process we used stratified sampling on the target variable to ensure that the event proportion is similar in all sets.

3.3. PERFORMANCE METRICS

Following the performance assessment approach of Hamdy & Hussein (2016), we rely on the confusion matrix and the Area Under the ROC Curve (AUC) to compare the predictive quality of the 2SEM, SC and ANN.

Confusion Matrix

The confusion matrix is a very widespread concept, and it allows a more detailed analysis of the right and wrong predictions. As may be seen in Figure 2.4, there are two possible predicted classes and two actual classes. The combination of these classes gives rise to four possible outcomes: True Positive (TP), False Negative (FN), False Positive (FP) and True Negative (TN).

These classifications have the following meaning:

• True Positive: observations that we predict as default and are actually default;
• False Positive: observations that we predict as default but are actually non-default (type I error);
• True Negative: observations that we predict as non-default and are actually non-default;
• False Negative: observations that we predict as non-default but are actually default (type II error).

To ease the interpretation of the matrix, several summary measures may be computed from these four counts. Among these measures, accuracy takes a central place. However, this metric must be used carefully, especially on unbalanced datasets (such as the one we are using). For example, in a dataset with a 5% event rate, a unary prediction of non-event would have an accuracy of 95%, better than a stochastic model that is correct 90% of the time on a dataset with a 50% event rate. Clearly, this metric is not robust for comparisons between models applied to datasets with different event rates. However, we may use it to compare models on the same dataset, which is precisely what we want to do.
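The accuracy caveat above can be checked numerically. The sketch below uses the standard textbook definitions of accuracy, misclassification rate, sensitivity and specificity; the function name `confusion_metrics` and all counts are illustrative assumptions, not figures from the paper.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary measures from the four confusion-matrix counts.

    Positive class = default. Standard definitions; a rate is reported as
    NaN when its denominator is zero.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical case 1: 1000 observations, 5% event rate, and a model that
# always predicts "non-event".
unary = confusion_metrics(tp=0, fn=50, fp=0, tn=950)

# Hypothetical case 2: 1000 observations, 50% event rate, and a model that
# is right 90% of the time in both classes.
balanced = confusion_metrics(tp=450, fn=50, fp=50, tn=450)

for name, metrics in (("unary", unary), ("balanced", balanced)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The unary model reaches the higher accuracy while detecting no defaults at all (sensitivity of zero), which is why accuracy is only compared here across models fitted to the same data set.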
Moreover, we will use the complementary metric, the Misclassification Rate.

AUC

Another measure for assessing predictive power is the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). The curve is created by plotting the true positive rate against the false positive rate at various cutoff points. The true positive rate is also known as sensitivity or probability of detection. The false positive rate is also known as the probability of false alarm and can be calculated as (1 − specificity). An AUC of 0.5 (random predictor) is used as a baseline to judge whether the model is useful or not (Provost & Fawcett, 2013).

Compared to the confusion matrix, this method has the advantage of not requiring the definition of a cut-off (the value above which the probability of default is high enough to consider the customer a bad one). Besides, it is also suited to unbalanced datasets (Hamdy & Hussein, 2016). However, the use of the ROC curve as the sole misclassification criterion has decreased significantly in the literature over the years. More recently, metrics based on the confusion matrix have become more common (Louzada et al., 2016).

4. RESULTS AND DISCUSSION

In this section we first present the estimation results for both the 2SEM and the baselines (SC and ANN). The results obtained are then analyzed and compared to select the most appropriate model.

4.1. SCORECARD

Prior to scorecard estimation, some input variables had to be binned. This process consisted of grouping the values of each input variable that had similar event behavior (with respect to the target variable). The cutoffs used maximized the Weight of Evidence (WOE), the metric on which a variable’s Information Value (IV) is based (Zeng, 2014). The binning outcome consisted of 20 new categorical input variables, which were then used in a stepwise selection algorithm. As a result, the following seven input variables were included in the scorecard: Age in years, Credit amount, Credit history, Duration in month, Purpose, Savings account/bonds and Status of existing CA.
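For readers unfamiliar with these quantities, the sketch below computes WOE and IV for one binned variable using their usual definitions. It does not reproduce the paper's binning or its cutoff search. The function name `woe_iv` and the bin counts are hypothetical, and the sign convention (goods over bads, so that a higher WOE means lower risk) is an assumption chosen to match a scorecard whose points rise as the event rate falls.

```python
import math


def woe_iv(bins):
    """Weight of Evidence per bin and Information Value of the variable.

    bins: list of (n_good, n_bad) counts, one pair per bin.
    WOE_i = ln(share of goods in bin i / share of bads in bin i)
    IV    = sum over bins of (share of goods - share of bads) * WOE_i
    Bins with a zero count would need smoothing, which is omitted here.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        share_good = good / total_good
        share_bad = bad / total_bad
        woe = math.log(share_good / share_bad)
        woes.append(woe)
        iv += (share_good - share_bad) * woe
    return woes, iv


# Hypothetical three-bin variable; totals match the 700/300 class balance.
bins = [(100, 100), (250, 120), (350, 80)]
woes, iv = woe_iv(bins)

for (good, bad), woe in zip(bins, woes):
    print(f"good={good:4d} bad={bad:4d} WOE={woe:+.3f}")
print(f"IV={iv:.3f}")
```

Bins where defaults are over-represented receive a negative WOE, and the IV summarizes how strongly the variable as a whole separates the two classes.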
The estimates can be seen in Table 4.

The score points in this scorecard increase as the event rate decreases. The estimation parametrization ensures that a score of 200 represents odds of 50 to 1 (that is, P(Non-default)/P(Default)=50). The neutral score in a variable is 16, and an increase of 20 score points corresponds to a doubling of the odds. The link between score points and probability of default is pictured in Figure 2.

4.2. ARTIFICIAL NEURAL NETWORKS

The neural network was designed with five layers: an input layer, three hidden layers and an output layer. The input layer has 20 variables, while each hidden layer includes three neurons with a Tanh activation function. Thus, we included 9 hidden neurons and estimated 208 weights. Figure 3 presents the Artificial Neural Network architecture.

The optimization process ended on the 10th iteration, achieving an average validation error of 0.496, as presented in Figure 3.

4.3. TWO-STAGE ENSEMBLE MODEL

The 2SEM consists of a logistic regression using the PD estimate from the SC (P_Scard) and the SC residual estimate from the ANN (P_ANN) as inputs. We expect P_Scard to account for the majority of the 2SEM’s predictive power, while P_ANN is supposed to correct P_Scard deviations (prediction failures). The coefficient estimates are presented in Table 5.

As may be seen, P_Scard is the main contributor to the 2SEM (the P_Scard standardized estimate is twice that of P_ANN), and both inputs are statistically significant.

4.4. DISCUSSION

In this section we compare the Scorecard, the Artificial Neural Network and the Two-Stage Ensemble Model according to confusion matrix metrics and AUC. First, however, Figure 5 presents the default rate distribution across scoring deciles. To obtain these distributions, the test dataset was sorted in ascending order by target prediction (in each model) and divided into 10 equally populated bins. Then the averages of Status (DefRate) and Status Prediction (AvgProb) were computed.
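The decile computation just described can be sketched as follows. This is an illustration under stated assumptions rather than the code behind Figure 5: the helper names `decile_table` and `is_monotonic` are hypothetical, and the 200 predictions and outcomes are simulated.

```python
import random


def decile_table(status, prediction, n_bins=10):
    """Sort by prediction (ascending), cut into equally populated bins and
    return one (bin, size, DefRate, AvgProb) tuple per bin."""
    pairs = sorted(zip(prediction, status))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        def_rate = sum(s for _, s in chunk) / len(chunk)   # average of Status
        avg_prob = sum(p for p, _ in chunk) / len(chunk)   # average prediction
        rows.append((b + 1, len(chunk), def_rate, avg_prob))
    return rows


def is_monotonic(rows):
    """True when DefRate never decreases from one bin to the next."""
    rates = [row[2] for row in rows]
    return all(a <= b for a, b in zip(rates, rates[1:]))


# Simulated test set of 200 cases (20% of 1000): predictions are uniform
# draws and each outcome is a Bernoulli draw with that probability.
rng = random.Random(0)
prediction = [rng.random() for _ in range(200)]
status = [1 if rng.random() < p else 0 for p in prediction]

rows = decile_table(status, prediction)
for bin_id, size, def_rate, avg_prob in rows:
    print(f"bin {bin_id:2d}  n={size}  DefRate={def_rate:.2f}  AvgProb={avg_prob:.2f}")
print("monotonic:", is_monotonic(rows))
```

With only 20 cases per bin, the observed default rate can fail to be monotonic even when the predictions are well calibrated, which is worth keeping in mind when reading decile plots from a small test set.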
Analyzing these plots, we find that none of the distributions is monotonic (which is usually a requirement in probability of default models); however, there is an improvement from the SC to the 2SEM.

We turn now to the fit statistics, presented in Table 6. The results indicate that the 2SEM has a better fit to the data according to all these statistics. Namely, the AUC improves by 2.4% (0.019 in absolute terms) relative to the Scorecard and by 3.2% (0.025 in absolute terms) relative to the Artificial Neural Network.

This result is reinforced by the ROC curves. Figure 6 presents the ROC curves for the training, validation and test datasets.

5. CONCLUSION

Credit scoring models attempt to measure the risk of a customer failing to pay back a loan based on the customer’s characteristics. In the banking industry, the most popular model is the scorecard, because it reconciles predictive power with interpretability. Regulators require that banks be able to explain credit application decisions, so transparency is fundamental in these models. In this paper we propose a new ensemble framework for credit scoring models to reinforce the predictive capacity of a scorecard without compromising its transparency and interpretability.

The two-stage ensemble model consists of a logistic regression using the PD estimate from the Scorecard and the Scorecard residual estimate (obtained through an Artificial Neural Network) as inputs. Thus, the Scorecard estimate (PD) accounts for the majority of the 2SEM’s predictive power, while the Artificial Neural Network helps to correct the Scorecard deviations (prediction failures). This ensemble framework may be seen as an estimation by layers, where modeling is done using more and more powerful methods from layer to layer. The advantage of this approach relates to the use of residuals as the target in the next layer.
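The layered estimation can be sketched on synthetic data as follows. This is a simplified illustration under several assumptions, not the models estimated in this paper. A plain logistic regression on raw inputs stands in for the Scorecard, and a single hidden layer of three tanh neurons stands in for the three-hidden-layer network. The data are simulated, and all function names, learning rates and step counts are hypothetical. NumPy is assumed to be available.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.5, steps=3000):
    """Logistic regression by batch gradient descent; returns a predictor."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y            # d(log-loss)/d(logit)
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return lambda Z: sigmoid(Z @ w + b)


def fit_tanh_net(X, r, hidden=3, lr=0.1, steps=3000, seed=0):
    """One-hidden-layer tanh network fitted to residuals r (squared error)."""
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0.0, 0.5, hidden), 0.0
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)
        err = H @ W2 + b2 - r
        grad_H = np.outer(err, W2) * (1.0 - H ** 2)   # backprop through tanh
        W2 -= lr * (H.T @ err) / len(r)
        b2 -= lr * err.mean()
        W1 -= lr * (X.T @ grad_H) / len(r)
        b1 -= lr * grad_H.mean(axis=0)
    return lambda Z: np.tanh(Z @ W1 + b1) @ W2 + b2


def auc(y, p):
    """AUC from the Mann-Whitney rank-sum statistic (assumes no tied scores)."""
    ranks = np.argsort(np.argsort(p)) + 1
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Simulated data: the interaction term cannot be captured by a linear model.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
true_pd = sigmoid(-1.0 + X[:, 0] + 1.5 * X[:, 0] * X[:, 1])
y = (rng.random(1000) < true_pd).astype(float)

stage1 = fit_logistic(X, y)                  # stand-in for the Scorecard
p_scard = stage1(X)
stage2 = fit_tanh_net(X, y - p_scard)        # network fitted on the stage-1 residual
p_ann = stage2(X)
inputs = np.column_stack([p_scard, p_ann])
ensemble = fit_logistic(inputs, y)           # logistic regression combining both
p_2sem = ensemble(inputs)

print("AUC stage 1:", round(float(auc(y, p_scard)), 3))
print("AUC 2SEM:   ", round(float(auc(y, p_2sem)), 3))
```

The AUC values printed here are computed on the same data used for fitting, so they only illustrate the mechanics; a fair comparison would use a held-out test set as in Section 4.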
As the largest share of the fit is obtained in the first layers, the majority of the model components are produced by the simplest algorithms, preserving the interpretability of most of the prediction.

Results indicate that the default rate distribution produced by the Scorecard is not monotonic (which is usually a requirement in probability of default models); however, there is an improvement when considering the 2SEM. Furthermore, the AUC improves by 2.4% (0.019 in absolute terms) relative to the Scorecard and by 3.2% (0.025 in absolute terms) relative to the Artificial Neural Network.

Finally, several improvements are still to be made. First, other algorithms and parametrizations may be tested to check whether the second-stage contribution can be improved; there is no hard evidence that the Artificial Neural Network used is the best fit. Second, a generalization of the ensemble architecture should be developed, turning the algorithm into an n-stage ensemble model. Finally, results should also be obtained for other datasets, to ensure that they are not due to chance.