Credit-Card-Fraud-Detection

In this project, I leveraged AWS, Hive, Spark and Sqoop to identify fraudulent transactions in real time, mitigating financial risks and potential losses for the organization. Proactively identifying and addressing fraud also minimizes the impact on customers and preserves their trust in the business.


Data sources:

  • Past card transactions: card_transactions.csv
  • Member details: card_member table in AWS RDS
  • Credit score details: member_score table in AWS RDS
  • card_transactions streaming data: Kafka

Table info

  • card_member

    • card_id: This refers to the card number.
    • member_id: This is the 15-digit member ID of the cardholder.
    • member_joining_dt: This is the date and time when the member joined.
    • card_purchase_dt: This is the date on which the card was purchased.
    • country: This is the country in which the card was purchased.
    • city: This is the city in which the card was purchased.
  • card_transactions

    • card_id: This refers to the card number.
    • member_id: This is the 15-digit member ID of the cardholder.
    • amount: This is the amount that is swiped with respect to the card_id.
    • postcode: This is the ZIP code at which this card was swiped (marking the location of an event).
    • pos_id: This is the merchant’s POS terminal ID, using which the card was swiped.
    • transaction_dt: This is the date and time of the transaction.
    • status: This indicates whether the transaction was approved or not, recorded as a genuine/fraud value.
  • member_score

    • member_id: This is the 15-digit member ID of the cardholder.
    • score: This is the score assigned to a member defining their credit history, generated by upstream systems.
  • Kafka data is in JSON format:

  {
    "card_id": 348702330256514,
    "member_id": 000037495066290,
    "amount": 9084849,
    "pos_id": 614677375609919,
    "postcode": 33946,
    "transaction_dt": "11-02-2018 00:00:00"
  }

Problem statement

  • Bring the card_member and member_score data from the AWS RDS into a Hadoop platform.
  • Load the historical card transactions from csv file into a NoSQL database.
  • This data is then processed to populate the look-up table.
  • The lookup table should have the following fields:
    • Card id
    • Upper control limit (UCL): This parameter is an indicator of the transaction pattern associated with a particular customer. The UCL value needs to be calculated for each card_id over its last 10 transactions (a PySpark sketch follows this list). UCL = Moving average + 3 × (Standard deviation)
    • Postcode of the last transaction
    • Transaction date of the last transaction
    • The credit score of the member
  • Now, the data from several POS systems will flow into the architecture through a queuing system such as Kafka.
  • The POS data from Kafka will be consumed by the streaming data processing framework to identify the authenticity of the transactions.
  • Once the POS data from Kafka is entered into the stream processing layer, it is then assessed based on some parameters defined by the rules. The values for these parameters are fetched from the look-up table.
  • The rules are:
    • Transaction amount < UCL of that particular card_id
    • Credit score of the card holder > 200
    • The speed of travel implied by the previous transaction's zipcode, the current transaction's zipcode, and the time between the two transactions is within human limits.
  • The transaction is allowed to complete, and is classified as GENUINE, only when all of these rules are satisfied.
  • If any rule fails, the transaction should be classified as FRAUD.
  • Once a transaction is classified as GENUINE, the postcode and transaction date for that card ID in the look-up table need to be updated with the current transaction's values, so they always reflect the last genuine transaction. These fields should only be updated if the transaction is classified as GENUINE.
  • The card_transactions table also needs to be updated with all the details along with the classification of the transactions.
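As a concrete illustration of the UCL rule above, here is a minimal PySpark sketch. It assumes the historical transactions are available as a Hive/Spark table named card_transactions with card_id, amount, and transaction_dt columns; the exact query used in this project lives in Hive_commands.sql.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ucl_sketch").getOrCreate()

# Historical transactions (card_id, amount, transaction_dt, ...).
txns = spark.table("card_transactions")

# Rank each card's transactions, most recent first.
w_recent = Window.partitionBy("card_id").orderBy(F.col("transaction_dt").desc())
last10 = (txns
          .withColumn("rn", F.row_number().over(w_recent))
          .filter(F.col("rn") <= 10))

# UCL = moving average + 3 * standard deviation over the last 10 transactions.
ucl = (last10
       .groupBy("card_id")
       .agg((F.avg("amount") + 3 * F.stddev("amount")).alias("ucl")))
ucl.show()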

Steps for execution

Create an AWS EMR cluster with the following services:

  1. HBase
  2. Hive
  3. Spark
  4. Hue
  5. Sqoop

Install happybase and pandas:

sudo su
sudo yum update
yum install gcc
sudo yum install python3-devel
pip install happybase
pip install pandas

To enable the Sqoop connection to RDS:

sudo su
wget https://de-mysql-connector.s3.amazonaws.com/mysql-connector-java-8.0.25.tar.gz
tar -xvf mysql-connector-java-8.0.25.tar.gz
cd mysql-connector-java-8.0.25/
sudo cp mysql-connector-java-8.0.25.jar /usr/lib/sqoop/lib/

Upload all relevant files to EMR using scp.

Batch layer tasks

Create the card_transactions table in HBase and load data into it from the CSV file using happybase.

create 'card_transactions', 'cf1'

Run copy_to_hbase.py from the same directory as your CSV file to insert the data into the HBase table in batches; a sketch of what such a script does is shown below.
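The following is a minimal happybase sketch of this batch load, not the exact contents of copy_to_hbase.py. It assumes the HBase Thrift server is reachable on localhost, the CSV has a header row, and the row key is built from card_id and transaction_dt; adjust these assumptions to match the real script.

import csv
import happybase

connection = happybase.Connection('localhost')  # assumes the HBase Thrift server on the default port
table = connection.table('card_transactions')

with open('card_transactions.csv') as f, table.batch(batch_size=1000) as batch:
    for row in csv.DictReader(f):
        # Row key built from card_id and transaction_dt keeps a card's transactions together.
        rowkey = f"{row['card_id']}_{row['transaction_dt']}".encode()
        batch.put(rowkey, {
            b'cf1:member_id': row['member_id'].encode(),
            b'cf1:amount': row['amount'].encode(),
            b'cf1:postcode': row['postcode'].encode(),
            b'cf1:pos_id': row['pos_id'].encode(),
            b'cf1:transaction_dt': row['transaction_dt'].encode(),
            b'cf1:status': row['status'].encode(),
        })

connection.close()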

Import the card_member and member_score tables into Hive using Sqoop:

sqoop import \
--connect jdbc:mysql://upgradawsrds1.cyaielc9bmnf.us-east-1.rds.amazonaws.com/cred_financials_data \
--table card_member \
--username upgraduser --password upgraduser \
--hive-import \
--hive-table card_member \
--create-hive-table \
--fields-terminated-by ',' \
-m 1

sqoop import \
--connect jdbc:mysql://upgradawsrds1.cyaielc9bmnf.us-east-1.rds.amazonaws.com/cred_financials_data \
--table member_score \
--username upgraduser --password upgraduser \
--hive-import \
--hive-table member_score \
--create-hive-table \
--fields-terminated-by ',' \
-m 1

Creating the lookup table

Copy the CSV file to HDFS:

hadoop fs -put /home/hadoop/card_transactions.csv /user/hadoop/

We'll create the lookup table, which stores the following details for each unique card_id:

  • Card id
  • Upper control limit (UCL)
  • Postcode of the last transaction
  • Transaction date of the last transaction
  • The credit score of the member

We'll create this table using the member_score and card_transactions tables in Hive.

Run the commands in Hive_commands.sql in the Hive shell to create the lookup table using Hive-HBase integration. The resulting lookup table in HBase is hbase_lookup_table; a sketch of how the streaming code can read from it follows.
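As a reference for the streaming layer, here is a minimal happybase sketch of a single-row lookup against hbase_lookup_table. The column family cf1 and the column names are assumptions; the actual schema is whatever Hive_commands.sql defines.

import happybase

connection = happybase.Connection('localhost')
lookup = connection.table('hbase_lookup_table')

def get_card_params(card_id):
    # Fetch the UCL, credit score, and last transaction details for one card.
    row = lookup.row(str(card_id).encode())
    return {
        'ucl': float(row[b'cf1:ucl']),
        'score': int(row[b'cf1:score']),
        'last_postcode': row[b'cf1:postcode'].decode(),
        'last_transaction_dt': row[b'cf1:transaction_dt'].decode(),
    }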

Streaming layer tasks

Ingesting new records from Kafka

Now that we have the card_transactions table and the hbase_lookup_table in HBase:

  • We can read the incoming records from Kafka, apply the set of rules to each record, and decide the ‘status’ of the transaction as 'GENUINE' or 'FRAUD'.
  • Then we update the postcode and last transaction date for the card_id for GENUINE transactions.
  • Finally, we write all the transactions with their status to the card_transactions HBase table.
  • Also, write the stream to the console. A condensed sketch of this flow is shown below.
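The sketch below condenses this flow into a single file. It is illustrative, not the exact contents of driver.py: the Kafka broker and topic name are placeholders, get_card_params is the hypothetical lookup helper sketched earlier, and distance_in_km is the ZIP-distance helper sketched after the spark-submit command.

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType

spark = SparkSession.builder.appName("fraud_stream_sketch").getOrCreate()

schema = StructType([
    StructField("card_id", LongType()),
    StructField("member_id", LongType()),
    StructField("amount", LongType()),
    StructField("pos_id", LongType()),
    StructField("postcode", IntegerType()),
    StructField("transaction_dt", StringType()),
])

# Read POS records from Kafka; the broker and topic are placeholders.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "transactions-topic")
       .load())

txns = (raw
        .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
        .select("t.*"))

def classify(txn):
    # Apply the three rules against the values stored in hbase_lookup_table.
    params = get_card_params(txn["card_id"])          # assumed helper, see the lookup sketch above
    fmt = "%d-%m-%Y %H:%M:%S"
    hours = abs((datetime.strptime(txn["transaction_dt"], fmt)
                 - datetime.strptime(params["last_transaction_dt"], fmt)).total_seconds()) / 3600.0
    speed = distance_in_km(params["last_postcode"], txn["postcode"]) / max(hours, 1e-6)
    genuine = (txn["amount"] < params["ucl"]
               and params["score"] > 200
               and speed <= 900)                      # ~airliner speed, an assumed "human limit"
    return "GENUINE" if genuine else "FRAUD"

def process_batch(batch_df, batch_id):
    # foreachBatch yields a normal DataFrame per micro-batch, so we can call HBase directly.
    for row in batch_df.collect():
        txn = row.asDict()
        status = classify(txn)
        # ... write txn + status to the card_transactions HBase table, and if GENUINE,
        # update postcode / transaction_dt for this card_id in hbase_lookup_table.

query = (txns.writeStream
         .foreachBatch(process_batch)
         .start())
query.awaitTermination()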

Copy the python folder to EMR with the same file structure. Then run:

cd python/src
zip src.zip __init__.py rules/* db/*
export SPARK_KAFKA_VERSION=0.10

Now, run the driver.py file using the spark-submit command:

spark-submit --py-files src.zip --files uszipsv.csv --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 driver.py
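The uszipsv.csv file passed with --files supplies the ZIP-code coordinates needed by the speed-of-travel rule. Below is a minimal sketch of a distance helper like the distance_in_km used in the streaming sketch; the column layout of uszipsv.csv (ZIP, latitude, longitude) is an assumption and may differ from the actual file.

import csv
import math

def load_zip_coords(path="uszipsv.csv"):
    # Assumed layout: ZIP code, latitude, longitude in the first three columns.
    coords = {}
    with open(path) as f:
        for row in csv.reader(f):
            try:
                coords[row[0]] = (float(row[1]), float(row[2]))
            except (ValueError, IndexError):
                continue  # skip header or malformed rows
    return coords

ZIP_COORDS = load_zip_coords()

def distance_in_km(zip_a, zip_b):
    # Great-circle (haversine) distance between the two ZIP codes' coordinates.
    lat1, lon1 = ZIP_COORDS[str(zip_a)]
    lat2, lon2 = ZIP_COORDS[str(zip_b)]
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))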
