Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card Fraud Demo
2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
About Carolyn Duby
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby Github: carolynduby
• Hortonworks
• Innovation through data
• Enterprise ready, 100% open source, modern data platforms
• Engineering, Technical Support, Professional Services, Training
3.
Apache Spark
• Distributed processing efficiently crunches large data sets
• Optimized
• Horizontally scalable with multi tenancy
• Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
4.
Spark Libraries
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
• Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
• Machine learning for training models and predicting
• GraphX - http://spark.apache.org/graphx/
• Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
• Complex event processing and data ingest
5.
Architecture
[Diagram: Spark Driver (Zeppelin); Spark Application Master in a YARN container; three Spark Executors, each in its own YARN container and each running two Tasks]
6.
Getting Started
• Use a distribution
• Curated set of compatible open source projects
• Sandbox - single node cluster in VM or Azure
• https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
• http://community.hortonworks.com
• On premise
• Use Apache Ambari to manage on premise physical hardware
• Cloud
• Automated provisioning with Cloudbreak (https://github.com/sequenceiq/cloudbreak)
• AWS, Azure, Google Cloud
7.
Working with Spark
• Spark shell
• RStudio
• Notebook
• Zeppelin
• Jupyter
• Data Science Platform
• RapidMiner
8.
Cleaning - Reading CSV files into DataFrame
Expecting numbers but inference created string columns
Suspect issue with data…
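The symptom on this slide can be reproduced outside Spark. The following is a minimal plain-Python sketch, not the Spark CSV reader: the `infer_column_type` helper and the sample rows are hypothetical. It illustrates why a single non-numeric value, such as a header line mixed into the data, is enough for an inference pass to fall back to string columns.

```python
import csv
import io


def infer_column_type(values):
    """Return 'double' if every value parses as a float, else 'string'."""
    for value in values:
        try:
            float(value)
        except ValueError:
            return "string"
    return "double"


# Hypothetical sample: the third line repeats the header inside the data.
raw = "amount,distance\n12.50,3.1\namount,distance\n99.99,120.4\n"
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

# Inspect each column across all data rows.
inferred = {
    name: infer_column_type(col) for name, col in zip(header, zip(*data))
}

# Drop the rows that merely repeat the header, then infer again.
clean = [row for row in data if row != header]
inferred_clean = {
    name: infer_column_type(col) for name, col in zip(header, zip(*clean))
}

print(inferred)
print(inferred_clean)
```

With the stray header row present, both columns are inferred as strings; once it is filtered out, both are inferred as numeric.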
9.
Spark is fast but lazy
• Transformations
• Specify which data to read
• Modify data
• Actions
• Show data
• Write data
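The split between transformations and actions can be illustrated with ordinary Python generators. This is only an analogy, assuming nothing about Spark's internals: `read_rows`, `double`, and the `log` list are hypothetical names. Building the pipeline does no work; consuming it does.

```python
log = []


def read_rows(rows):
    """Lazy 'transformation': yields rows one at a time and records each read."""
    for row in rows:
        log.append(f"read {row}")
        yield row


def double(rows):
    """Lazy 'transformation': modifies each row only when it is requested."""
    for row in rows:
        yield row * 2


# Building the pipeline does no work yet, much like chaining transformations.
pipeline = double(read_rows([1, 2, 3]))
work_before_action = list(log)

# Consuming the pipeline plays the role of the 'action' that triggers the reads.
result = list(pipeline)

print(work_before_action)
print(result)
print(log)
```

Nothing is read until `list(pipeline)` runs, which mirrors how Spark defers work until an action such as showing or writing data is requested.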
10.
Cleaning - Filtering
Header and case data on same CSV line
Filter DataFrame with expressions
11.
Cleaning – SQL operations
12.
Write to File
Table for SQL
Save clean data as ORC
13.
Create Training and Test Data
Create
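The code from this slide is not part of the transcript. As a rough illustration of the idea only, here is a plain-Python sketch of a seeded random split; the `random_split` helper, the 80/20 ratio, and the sample data are assumptions. Spark DataFrames provide `randomSplit` for the same purpose.

```python
import random


def random_split(rows, train_fraction, seed):
    """Shuffle a copy of the rows with a fixed seed and cut it in two."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 100 transaction ids.
transactions = list(range(100))
train, test = random_split(transactions, 0.8, seed=42)

print(len(train), len(test))
```

Fixing the seed keeps the split repeatable, so a model can be retrained and evaluated against the same held-out rows.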
15.
Journey to Fraud Detection
Proactively identify potential fraudulent transactions to protect the customer and improve customer experience
• Proactively monitor every credit card transaction using machine learning to catch potential fraud
• Customer Service Analyst reviews flagged transactions in real time via a next generation application running on the connected platform
• HDF controls real time flow of data in and out of the connected platform to the various source and destination points
[Diagram labels: Innovate; Renovate; Years of Customer Transaction Data; Real time ingest of transactions; Complete Customer Profile; Purchase Behavior Insight; Fraud Detection; Immediate Customer Feedback; Improved Experience / Reduced Cost]
16.
Credit Card Fraud
Requirement: Detect fraudulent transactions.
Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.
Design
– Distance: How far can one travel over a period of time before it is fraudulent?
– Category: How can we detect a purchase that a customer wouldn’t likely make?
– Frequency: How can we detect purchasing patterns that do not resemble the card holder?
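The distance question can be made concrete as an implied travel speed between two consecutive transactions on the same card. The sketch below is an assumption about how such a feature might be computed, not the demo's actual code: the haversine formula is standard, but `implied_speed_kmh`, the tuple layout, and the 900 km/h threshold are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 900.0  # assumption: roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(prev, curr):
    """Speed needed to get from prev to curr; each is (lat, lon, epoch_seconds)."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = (curr[2] - prev[2]) / 3600.0
    if hours <= 0:
        # Same instant: any non-zero distance is physically impossible.
        return math.inf if distance > 0 else 0.0
    return distance / hours


boston = (42.3601, -71.0589)
los_angeles = (34.0522, -118.2437)

# Hypothetical pair of swipes one hour apart on opposite coasts.
speed = implied_speed_kmh((*boston, 0), (*los_angeles, 3600))
is_suspicious = speed > MAX_PLAUSIBLE_SPEED_KMH

print(round(speed), is_suspicious)
```

A single threshold like this is deliberately crude; card-not-present purchases, for example, would need different handling.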
17.
Outlier Detection: identify abnormal patterns
Example: identify anomalies
Features:
- Time frequency
- Amount in Category
- Distance
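One simple way to score a feature such as amount in category is a z-score against the cardholder's own history. This is a sketch of a generic technique, not the model used in the demo; the `abs_z_score` helper, the sample amounts, and the threshold of 3 are assumptions.

```python
import statistics

Z_THRESHOLD = 3.0  # assumption: a common rule-of-thumb cutoff


def abs_z_score(value, history):
    """Absolute number of standard deviations between value and the history mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # All historical values identical: any deviation is maximally unusual.
        return 0.0 if value == mean else float("inf")
    return abs(value - mean) / stdev


# Hypothetical past grocery purchases for one cardholder.
grocery_history = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0, 24.0, 21.0]

typical = abs_z_score(26.0, grocery_history)
unusual = abs_z_score(500.0, grocery_history)

print(typical > Z_THRESHOLD, unusual > Z_THRESHOLD)
```

The new transaction is scored against history that does not include it, so one extreme value cannot inflate the standard deviation it is being compared with.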
18.
Credit Fraud Detection Application Architecture
[Diagram labels: Data Sources; Data in Motion; Data at Rest; Actionable Intelligence fueled by Adaptive Machine Learning; Customer; Customer Service Analyst]
Check out the Demo and Blog!
19.
Architecture
[Diagram labels: Behavior Modeling (Model 1, Model 2, Model 3); Fraud Detection; Transaction History; MoveTransactions; Time; Train; Predict]
20.
Credit Fraud Analyst Inbox
21.
Hortonworks Data Flow
22.
Hortonworks Data Flow
23.
Hortonworks Data Flow
Editor's notes
Customer transactions are reviewed in real-time by the connected data platform using machine learning
Before we can detect an outlier, we have to define it. The most intuitive definition I’ve seen is from a 1980 book Identification of Outliers (Hawkins):
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”
Speaker: It’s worth repeating that all of these topics are deep enough to have entire university library shelves filled with books on them. We’re just skimming the surface.
Anomaly detection is related to clustering, but almost its inverse.
If points that are similar to each other cluster together (representing “normal” behavior or patterns), we want to find the points that are NOT in any cluster.
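As a rough sketch of "points that are not in any cluster", the following flags points with too few neighbours within a fixed radius. It is a simple neighbour-count rule in the spirit of density-based clustering, not a full clustering algorithm; `find_isolated_points`, the radius, and the sample coordinates are all hypothetical.

```python
import math


def find_isolated_points(points, radius, min_neighbors):
    """Return the points that have fewer than min_neighbors within radius."""
    isolated = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            isolated.append(p)
    return isolated


# Two tight groups of "normal" behaviour plus one point far from both.
points = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
    (9.0, 0.5),
]

isolated = find_isolated_points(points, radius=0.5, min_neighbors=2)
print(isolated)
```

The pairwise comparison is quadratic in the number of points, which is fine for a toy example but would need a spatial index or a distributed approach at transaction scale.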
Modern Data Applications, be they custom or off the shelf, are fueled by Data in Motion and Data at Rest and convert yesterday’s impossible challenges into today’s new products, cures, and life saving innovations.
Cyber security leaders are building powerful apps to detect threats to digital information. Leading pharma, automotive, electronics and packaged goods companies are building their factories of the future that use actionable intelligence to improve manufacturing yields. And age-old industries like automotive, agriculture and retail are taking connected data platforms on the road, through the field, or to the cash register to do things that have never before been possible.
[NEXT SLIDE]