Artificial intelligence (AI) has become increasingly prevalent in various industries, including the banking sector. One potential application of AI in the banking industry is fraud detection. Fraud is a significant problem for banks, as it can result in financial losses and damage to the bank’s reputation. Traditional methods of fraud detection, such as manual reviews and rule-based systems, can be time-consuming and may not be effective at identifying more sophisticated forms of fraud. AI platforms like papAI, on the other hand, can help banks detect fraud more efficiently and accurately.
The papAI platform can identify patterns and anomalies that may indicate fraudulent activity. This can include analyzing transactions for unusual spending patterns, detecting suspicious account activity, and identifying fake or altered documents. One way that banks can use the papAI platform for fraud detection is through predictive models. These models are trained on historical data to recognize patterns and behaviors associated with fraudulent activity. They can then be used to analyze new data and identify potential instances of fraud.
State of AI adoption in the banking industry
The maturity of artificial intelligence (AI) in the banking sector varies widely across different organizations. Some banks have fully embraced AI and are using it in a range of areas, while others are just starting to explore its potential. One of the main factors that determines the maturity of AI in a bank is the level of investment in the technology. Banks that have invested heavily in AI are likely to be more advanced in their use of the technology. These banks may have implemented AI-powered chatbots and virtual assistants to improve customer service, and may be using machine learning algorithms to analyze financial data and detect and prevent fraud. However, the “Accenture levels of AI maturity by industry ranking” shows that banks are far behind industry leaders in their AI maturity.
For banking institutions, the development and deployment of AI solutions is far less developed. According to Accenture, only 1% of financial institutions can be considered AI Achievers. More concerning is the reality that 75% of financial institutions are still in the early experimental stage of AI development. This is noteworthy, since Achievers, Builders and Innovators tend to devote more technology, time and talent to delivering on AI visions and to transforming their organizations.
Why should banks use papAI for fraud detection?
The papAI platform has the potential to transform the way that organizations detect and prevent fraudulent activity. Here are just a few of the benefits of using it for fraud detection:
- Weak signal detection: The papAI platform is designed to identify patterns or anomalies in data that may indicate fraudulent activity, even when the signals are weak or subtle. It can continuously monitor the data of individual bank customers or transactions, looking for any signs that match the patterns it has been trained on.
- Speed and efficiency: The papAI platform can analyze vast amounts of data in a fraction of the time it would take a human. This means that organizations can detect fraudulent activity faster and more efficiently, reducing the overall impact of fraud on the organization.
- Improved accuracy: The papAI platform can be trained on large datasets of past fraudulent activity, allowing it to identify patterns and anomalies that may not be immediately apparent to humans. This can lead to more accurate fraud detection and fewer false positives.
- Increased scalability: As the volume of data increases, it becomes increasingly difficult for humans to keep up with the analysis required to detect fraudulent activity. The papAI platform, on the other hand, can scale easily to meet the demands of a growing organization, making it an ideal solution for companies that are experiencing rapid growth.
- Reduced cost: The papAI platform can help to reduce the cost of fraud detection, as it reduces the need for manual labor and allows organizations to allocate resources more efficiently.
- Continuous monitoring: The papAI platform can continuously monitor for fraudulent activity, allowing organizations to catch it as it happens and take immediate action to prevent further losses.
With the papAI platform, we first resample our dataset so that it contains the same number of fraud and non-fraud cases. We then train machine learning models on it to improve the accuracy and reliability of fraud detection.
We will use the BankSim dataset to detect fraudulent transactions. This synthetically generated dataset consists of payments from various customers made at different times and with different amounts.
As we can see in the first rows below, the dataset has 9 feature columns and a target column. The columns are:
- Step: The day counted from the start of the simulation. There are 180 steps, so the simulation covers roughly 6 months.
- Customer: The customer’s ID
- zipCodeOrigin: The zip code of origin/source.
- Merchant: The merchant’s ID
- zipMerchant: The merchant’s zip code
- Age: Categorized age
0: <= 18, 1: 19-25, 2: 26-35, 3: 36-45, 4: 46-55, 5: 56-65, 6: > 65, U: Unknown
- Gender: Gender of the customer. E: Enterprise, F: Female, M: Male, U: Unknown
- Category: Category of the purchase. The categories are not all listed here; we will see them later in the analysis.
- Amount: Amount of the purchase
- Fraud: Target variable which shows whether the transaction is fraudulent (1) or benign (0)
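The coded Age and Gender values are easier to read once mapped back to their labels. The snippet below is a minimal sketch of such a lookup; the code tables come from the column descriptions above, while the `describe` helper, the lowercase keys and the sample row are assumptions made for illustration.

```python
# Code tables copied from the column descriptions above.
AGE_GROUPS = {
    "0": "<= 18", "1": "19-25", "2": "26-35", "3": "36-45",
    "4": "46-55", "5": "56-65", "6": "> 65", "U": "Unknown",
}
GENDERS = {"E": "Enterprise", "F": "Female", "M": "Male", "U": "Unknown"}


def describe(row):
    """Return a copy of row with age and gender codes replaced by labels."""
    decoded = dict(row)
    decoded["age"] = AGE_GROUPS.get(str(row["age"]), "Unknown")
    decoded["gender"] = GENDERS.get(row["gender"], "Unknown")
    return decoded


# Hypothetical row; the lowercase keys are an assumption about the file layout.
row = {"step": 0, "age": "5", "gender": "F", "category": "travel", "amount": 1304.93}
decoded_row = describe(row)
print(decoded_row)
```

Codes that are missing from the tables fall back to "Unknown" rather than raising an error, which is one possible choice for exploratory work.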
2 - Data analysis and preparation
a) Data analysis
Before launching machine learning to predict fraud cases, we start by analyzing the input data. The papAI platform is both an artificial intelligence tool and a BI tool. When a dataset is imported, the platform detects the data type of each column and automatically computes statistics that help you get familiar with the data. The statistics are first computed on a sample of 20,000 rows; if you want, you can then use Recompute whole dataset stats to compute them on the full dataset.
As you can see from the statistics table above, there is only one value in the zipCodeOrigin and zipMerchant columns.
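The single-value observation can also be checked outside the platform. The sketch below counts distinct values per column and reports the columns that hold only one; the `constant_columns` helper, the toy rows and the placeholder zip value "Z1" are assumptions for illustration, not data from the article.

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    distinct = {}
    for row in rows:
        for name, value in row.items():
            distinct.setdefault(name, set()).add(value)
    return sorted(name for name, values in distinct.items() if len(values) == 1)


# Toy rows (hypothetical values) using the article's column names.
rows = [
    {"zipCodeOrigin": "Z1", "zipMerchant": "Z1", "amount": 4.55, "fraud": 0},
    {"zipCodeOrigin": "Z1", "zipMerchant": "Z1", "amount": 39.68, "fraud": 0},
    {"zipCodeOrigin": "Z1", "zipMerchant": "Z1", "amount": 1304.93, "fraud": 1},
]

to_drop = constant_columns(rows)
# Keep only the columns that vary from one row to another.
features = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop)
```

A column with a single value cannot help a model separate fraud from non-fraud, which is why such columns are natural candidates for removal before training.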
To see details about one of the features, we click the arrow to the right of that feature.
In the picture above, we can see that the dataset contains 594,643 transactions: 98.79% are non-fraud cases and only 1.21% are fraud cases.
The dataset is heavily imbalanced, which can lead to a biased model: the model can learn to always predict non-fraud, the majority class, and fail to predict the fraud cases. In that situation, a high accuracy rate may simply reflect the fact that the model always predicts the majority class, which is not very useful.
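A short calculation makes this concrete. The sketch below uses the figures quoted above (594,643 transactions, 1.21% fraud) and evaluates a trivial "model" that labels every transaction as non-fraud; the class counts are approximations derived from the rounded percentage, which is an assumption.

```python
# Figures quoted in the article: 594,643 transactions, 1.21% of them fraud.
total = 594_643
fraud_rate = 0.0121

# Approximate class counts, rounded from the quoted percentage (an assumption).
n_fraud = round(total * fraud_rate)
n_legit = total - n_fraud

# A trivial "model" that labels every transaction as non-fraud.
true_negatives = n_legit   # every legitimate transaction is labelled correctly
false_negatives = n_fraud  # every fraud case is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2%}")    # accuracy = 98.79%
print(f"fraud recall = {recall:.2%}")  # fraud recall = 0.00%
```

The baseline reaches almost 99% accuracy while detecting no fraud at all, so accuracy alone is a poor yardstick for this dataset.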
b) Prepare Data
After analyzing our data, we prepare it for prediction. To do this, we split the dataset, taking 80% of the rows as the training dataset and 20% as the test dataset. One thing we must not forget is to keep the same proportion of fraud cases in both datasets. In papAI, we choose the Split rows operation, set the desired percentage for each of the two datasets we want to generate (Train, Test), and then apply the stratified method on the Fraud class.
3 - Training Data
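The idea behind a stratified split can be sketched in a few lines: split each class separately, then merge the parts. This is an illustration only, not papAI's implementation; the `stratified_split` function, its parameters and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping each class's share roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    train, test = [], []
    for group in by_class.values():
        group = group[:]  # copy so the caller's list is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


def fraud_share(split):
    """Fraction of rows in the split that are labelled as fraud."""
    return sum(r["fraud"] for r in split) / len(split)


# Toy data (hypothetical): 1,000 transactions, 2% labelled as fraud.
rows = [{"amount": float(i), "fraud": 1 if i % 50 == 0 else 0} for i in range(1000)]
train, test = stratified_split(rows, label_key="fraud")
print(len(train), len(test), fraud_share(train), fraud_share(test))
```

Because each class is split on its own, the rare fraud class cannot end up concentrated in only one of the two datasets, as could happen with a purely random split.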
a) Create a prototype
After preparing the training data and test data, we create our training prototype. First, we select the features to use for the training. We saw in the analysis that the zipCodeOrigin and zipMerchant columns each contain a single value, so they carry no information for the model and do not need to be included in the training.
Second, we choose the machine learning models to use. We choose all the models, in order to compare the results and choose the model with the best result for the prediction.
Third, we choose to balance our data and shuffle them.
Fourth, and this is a very important step when the target class is imbalanced, we choose a data augmentation method. SMOTE, for example, generates synthetic fraud cases from existing ones, in order to have the same number of fraud and non-fraud cases.
After running several models at the same time, we obtain a ranking of the results. The best result, shown in dark green, was obtained with Random Forest, and the lowest with the SVM models. We promote the best model and use it to make the prediction.
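The core idea of SMOTE is to create each synthetic minority sample on the line segment between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified, numeric-only illustration of that idea and not the implementation used by papAI; the `smote_sketch` function, its parameters and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy fraud samples as (amount, step) pairs; hypothetical values.
minority = [(1304.93, 12.0), (980.10, 40.0), (2150.00, 75.0), (640.25, 101.0)]
n_legit = 20  # hypothetical size of the majority class

# Generate enough synthetic fraud cases to match the majority class.
synthetic = smote_sketch(minority, n_new=n_legit - len(minority), k=3)
balanced_fraud = minority + synthetic
print(len(balanced_fraud), n_legit)
```

This version handles numeric features only. Categorical columns such as Category or Gender need a variant of the method that treats categories separately.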
4 - Analyse & Understand the model
To assess the quality of the trained model, the papAI platform integrates an interpretability and evaluation module for each experiment created. In our case, the Random Forest model reached 96.17% accuracy.
The confusion matrix shows a very large number of true negatives and true positives, which are the model’s correct predictions, and a very small percentage of false positives and false negatives, which are its incorrect predictions. This means that our classification model predicts well.
papAI offers two interpretability modules: a global interpretability, which is computed on part of the training dataset, and a local interpretability, which is computed on the test dataset.
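The four cells of a confusion matrix translate directly into the usual evaluation metrics. The sketch below shows the arithmetic; the counts are hypothetical and chosen for illustration only, so they are not the results of the experiment described here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,         # share of all predictions that are correct
        "precision": tp / (tp + fp),           # share of fraud alerts that are real fraud
        "recall": tp / (tp + fn),              # share of real fraud that is caught
        "false_positive_rate": fp / (fp + tn), # share of legitimate cases flagged
    }


# Hypothetical counts for illustration only; they are not the article's results.
metrics = classification_metrics(tp=90, fp=30, tn=870, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data, precision and recall for the fraud class are usually more informative than accuracy, since they are not dominated by the many correctly classified non-fraud cases.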
a) Global interpretability
We can see that the most important feature for determining the value of the target ‘fraud’ is ‘amount’, which represents the amount spent in each transaction. The model indicates that fraudsters spend large amounts of money, so the higher the amount, the more the model distrusts the transaction.
In the second row, we have the feature ‘category’, which represents the category of the purchase (transport, food, health, beauty, fashion, tech, home, and so on). This suggests that most fraudsters spend in similar categories.
In the third row, we have the feature ‘merchant’, which represents the merchant ID. This suggests that fraud is concentrated on particular merchants. Once the model has been trained on our dataset, it has learned rules for detecting fraud, and feature importance gives an overview of how the model reacts to the values in our dataset.
6 - Predict the test set
The result of the prediction is an output of the test data set with three columns added:
- ‘proba_0’ is the probability that the transaction is a non-fraud case.
- ‘proba_1’ is the probability that the transaction is a fraud case.
- ‘predicted_fraud’ is the value predicted by the model according to the probabilities ‘proba_0’ and ‘proba_1’.
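The relationship between the three columns can be sketched as follows. The article does not state the exact rule papAI applies, so this assumes the common convention of predicting fraud when ‘proba_1’ reaches a threshold of 0.5; the `predict_label` helper, the threshold and the sample rows are hypothetical.

```python
def predict_label(proba_1, threshold=0.5):
    """Return 1 (fraud) when the fraud probability reaches the threshold, else 0."""
    return 1 if proba_1 >= threshold else 0


# Hypothetical model outputs; proba_0 and proba_1 sum to 1 for each row.
rows = [
    {"proba_0": 0.97, "proba_1": 0.03},
    {"proba_0": 0.0549, "proba_1": 0.9451},
]

for row in rows:
    row["predicted_fraud"] = predict_label(row["proba_1"])

print([row["predicted_fraud"] for row in rows])
```

Lowering the threshold would flag more transactions as fraud, catching more real cases at the price of more false alerts, so the threshold is a business decision as much as a technical one.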
b) Local interpretability
Local interpretability consists of choosing a specific transaction and seeing how each feature contributes, positively or negatively, to classifying it as fraud or non-fraud. The picture below shows the local interpretability of a transaction that is actually fraudulent. Our model classified this transaction as fraud because it was made by a woman who spent $1304.93 in the travel category at a suspect merchant. The value 5, which represents the 56-65 age range, contributed negatively to the fraud prediction, but with little influence. In the end, papAI estimates a 94.51% probability that this transaction is a fraud.
In conclusion, the papAI platform has the potential to transform the way banks detect and prevent fraud. By analysing transactional and customer behaviour data with high local interpretability of results, banks can identify patterns and anomalies that may indicate fraudulent activity. While the use of AI for fraud detection in banks presents challenges, the benefits of increased accuracy and speed make it a promising approach for the future.