Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card Fraud Demo

Apache Spark Intro and Credit Card Fraud Detection Demo

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
About Carolyn Duby
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby Github: carolynduby
• Hortonworks
• Innovation through data
• Enterprise ready, 100% open source, modern data platforms
• Engineering, Technical Support, Professional Services, Training

Apache SPARK
• Distributed processing efficiently crunches large data sets
• Optimized
• Horizontally scalable with multi tenancy
• Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.

SPARK Libraries
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
• Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
• Machine learning for training models and predicting
• GraphX - http://spark.apache.org/graphx/
• Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
• Complex event processing and data ingest

Architecture
[Diagram labels: Spark Driver (Zeppelin); Spark Application Master in a YARN container; three Spark Executors, each in its own YARN container and each running two Tasks]

Getting Started
• Use a distribution
• Curated set of compatible open source projects
• Sandbox - single node cluster in VM or Azure
• https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
• http://community.hortonworks.com
• On premises
• Use Apache Ambari to manage on-premises physical hardware
• Cloud
• Automated provisioning with Cloudbreak (https://github.com/sequenceiq/cloudbreak)
• AWS, Azure, Google Cloud

Working with Spark
• Spark shell
• R Studio
• Notebook
• Zeppelin
• Jupyter
• Data Science Platform
• RapidMiner

Cleaning - Reading CSV files into DataFrame
Expecting numbers but inference created string columns
Suspect issue with data…
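The symptom above (numbers expected, strings inferred) usually means at least one value in the column does not parse as a number. The following is a minimal pure-Python sketch of that idea, not Spark's actual inference code. The helper names `infer_column_type` and `infer_schema` and the sample data are hypothetical, and the repeated header row is only one assumed cause of the problem.

```python
import csv
import io


def infer_column_type(values):
    """Return 'int', 'double', or 'string' for a list of raw CSV strings."""
    def parses(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if parses(int):
        return "int"
    if parses(float):
        return "double"
    # A single non-numeric value forces the whole column to string.
    return "string"


def infer_schema(text):
    """Infer a type per column, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))
    return {name: infer_column_type(col) for name, col in zip(header, columns)}


# Hypothetical sample: the header row is repeated in the middle of the data.
dirty = "amount,count\n10.5,3\namount,count\n7.25,1\n"
clean = "amount,count\n10.5,3\n7.25,1\n"

dirty_schema = infer_schema(dirty)
clean_schema = infer_schema(clean)
print(dirty_schema)
print(clean_schema)
```

In this sketch, removing the stray row is enough for the numeric types to be inferred, which is why the cleaning steps come before any analysis.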

Spark is fast but lazy
• Transformations
• Specify which data to read
• Modify data
• Actions
• Show data
• Write data
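The split between transformations and actions can be illustrated without a cluster. This is a toy sketch in plain Python, assuming a hypothetical `LazyDataset` class that is not part of the Spark API: transformations only record a plan, and nothing runs until an action is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source        # any iterable of rows
        self._steps = tuple(steps)   # recorded transformations, not yet run

    # Transformations: return a new plan and do no work.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Action: run the recorded plan and return the results.
    def collect(self):
        out = []
        for row in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []


def double(x):
    calls.append(x)  # record that the function really executed
    return x * 2


plan = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only a plan has been built
result = plan.collect()           # the action triggers the work
print(calls_before_action, result)
```

Deferring the work this way is what lets an engine look at the whole plan before running it, for example to skip data that a later filter would discard.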

Cleaning - Filtering
Header and case data on same CSV line
Filter DataFrame with expressions
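As a rough illustration of filtering with an expression, the sketch below keeps only rows whose `amount` field parses as a number and then casts the survivors. It uses plain Python lists of dicts rather than a DataFrame; the column names, sample rows, and the `is_number` helper are assumptions made for the example.

```python
def is_number(text):
    """True if the string can be read as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical rows as they might look after reading everything as strings.
rows = [
    {"id": "1", "amount": "10.50"},
    {"id": "id", "amount": "amount"},  # stray header text mixed into the data
    {"id": "2", "amount": "7.25"},
]

# Filter with a predicate expression, then cast to the expected types.
clean = [r for r in rows if is_number(r["amount"])]
typed = [{"id": int(r["id"]), "amount": float(r["amount"])} for r in clean]
print(typed)
```

With a Spark DataFrame the same step would typically be a `filter` call with a column expression, followed by a cast.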

Cleaning – SQL operations

Write to File
Table for SQL
Save clean data as ORC

Create Training and Test Data
Create
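A train/test split can be sketched in a few lines. The version below is plain Python under assumed parameters (an 80/20 split and a fixed seed), with a hypothetical `random_split` helper; in Spark, `DataFrame.randomSplit` plays a similar role.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test independently, reproducibly."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


# Hypothetical data: 1000 transaction ids.
transactions = list(range(1000))
train, test = random_split(transactions)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so a model trained on one run is evaluated against the same held-out rows on the next.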

Train Model

Journey to Fraud Detection
Proactively identify potential fraudulent transactions to protect the customer and improve customer experience
• Proactively monitor every credit card transaction using machine learning to catch potential fraud
• Customer Service Analyst reviews flagged transactions in real time via a next generation application running on the connected platform
• HDF controls real time flow of data in and out of the connected platform to the various source and destination points
[Diagram labels: Innovate; Renovate; Years of Customer Transaction Data; Real time ingest of transactions; Complete Customer Profile; Purchase Behavior Insight; Fraud Detection; Immediate Customer Feedback; Improved Experience / Reduced Cost]

Credit Card Fraud
• Requirement: Detect fraudulent transactions.
• Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.
• Design
– Distance: How far can one travel over a period of time before it is fraudulent?
– Category: How can we detect a purchase that a customer wouldn’t likely make?
– Frequency: How can we detect purchasing patterns that do not resemble the card holder?
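The Distance question can be made concrete by computing the speed implied by two consecutive transactions. The sketch below is one possible approach, not the demo's implementation: it uses the haversine great-circle formula, and the 900 km/h threshold, function names, and sample coordinates are all assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumed threshold, roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(prev, curr):
    """prev and curr are (lat, lon, epoch_seconds) tuples."""
    dist = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = (curr[2] - prev[2]) / 3600.0
    if hours <= 0:
        # Same instant: only suspicious if the locations differ.
        return float("inf") if dist > 0 else 0.0
    return dist / hours


def is_suspicious(prev, curr, max_kmh=MAX_PLAUSIBLE_KMH):
    return implied_speed_kmh(prev, curr) > max_kmh


# Hypothetical transactions: (latitude, longitude, seconds since first swipe).
boston = (42.3601, -71.0589, 0)
new_york_1h = (40.7128, -74.0060, 3600)
los_angeles_1h = (34.0522, -118.2437, 3600)

boston_to_nyc_km = haversine_km(boston[0], boston[1], new_york_1h[0], new_york_1h[1])
print(round(boston_to_nyc_km), is_suspicious(boston, new_york_1h),
      is_suspicious(boston, los_angeles_1h))
```

A check like this is cheap enough to run per transaction, which matters given the under-2-seconds requirement.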

Outlier Detection: identify abnormal patterns
Example: identify anomalies
Features:
- Time frequency
- Amount in Category
- Distance
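One simple way to flag an abnormal value for a feature such as amount in category is a z-score against the card holder's history. This is a swapped-in textbook technique for illustration, not necessarily the model used in the demo; the threshold of 3, the function names, and the sample history are assumptions.

```python
import statistics


def zscore(value, history):
    """Number of standard deviations between value and the history mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # No variation in history: any different value is maximally unusual.
        return 0.0 if value == mean else float("inf")
    return (value - mean) / stdev


def is_outlier(value, history, threshold=3.0):
    return abs(zscore(value, history)) > threshold


# Hypothetical past grocery purchases for one card holder.
grocery_history = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8]

typical = is_outlier(50.0, grocery_history)
unusual = is_outlier(900.0, grocery_history)
print(typical, unusual)
```

The same pattern applies to the other listed features, for example transactions per hour for time frequency or the implied travel distance between purchases.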

Credit Fraud Detection Application Architecture
Check out the Demo and Blog!
[Diagram labels: DATA SOURCES; DATA IN MOTION; DATA AT REST; Actionable Intelligence fueled by Adaptive Machine Learning; Customer; Customer Service Analyst]

Architecture
[Diagram labels: Behavior Modeling (Model 1, Model 2, Model 3); Fraud Detection; Transaction History; MoveTransactions; Time; Train; Predict]

Credit Fraud Analyst Inbox

Hortonworks Data Flow
