Apache Spark Intro and Credit Card Fraud Detection Demo
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
About Carolyn Duby
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby, GitHub: carolynduby
• Hortonworks
• Innovation through data
• Enterprise ready, 100% open source, modern data platforms
• Engineering, Technical Support, Professional Services, Training
Apache Spark
• Distributed processing efficiently crunches large data sets
• Optimized
• Horizontally scalable with multi tenancy
• Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
Spark Libraries
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
  • Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
  • Machine learning for training models and predicting
• GraphX - http://spark.apache.org/graphx/
  • Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
  • Complex event processing and data ingest
Architecture
[Diagram labels: Spark Driver; Zeppelin; Spark Application Master (YARN container); three Spark Executors, each in a YARN container running two Tasks]
Getting Started
• Use a distribution
  • Curated set of compatible open source projects
• Sandbox - single node cluster in VM or Azure
  • https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
  • http://community.hortonworks.com
• On premise
  • Use Apache Ambari to manage on premise physical hardware
• Cloud
  • Automated provisioning with Cloudbreak (https://github.com/sequenceiq/cloudbreak)
  • AWS, Azure, Google Cloud
Working with Spark
• Spark shell
• RStudio
• Notebook
  • Zeppelin
  • Jupyter
• Data Science Platform
  • RapidMiner
Cleaning - Reading CSV files into DataFrame
Expecting numbers, but schema inference created string columns
Suspect an issue with the data...
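A numeric column usually ends up inferred as a string when at least one of its values cannot be parsed as a number. The sketch below illustrates that effect in plain Python. It is not Spark's inference logic, and the sample data and the `infer_column_type` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample: a second header row is embedded in the data,
# so the "amount" column contains one non-numeric value.
SAMPLE = """account,amount
1001,25.50
account,amount
1002,310.00
"""

def infer_column_type(values):
    """Return 'double' if every value parses as a float, else 'string'."""
    for value in values:
        try:
            float(value)
        except ValueError:
            return "string"
    return "double"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
amounts = [row["amount"] for row in rows]
inferred = infer_column_type(amounts)
print(inferred)
```

A single stray value is enough to change the inferred type of the whole column, which is why an unexpected string column is a useful hint that the file needs cleaning.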
Spark is fast but lazy
• Transformations
  • Specify which data to read
  • Modify data
• Actions
  • Show data
  • Write data
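Laziness here means that transformations only describe the work, and nothing runs until an action asks for a result. Python generators behave in a loosely similar way, so they make a rough analogy. The sketch below is an illustration of the idea only, not the Spark API, and all names in it are hypothetical.

```python
# Analogy only: Python generators are lazy in a way that loosely
# resembles Spark transformations; this is not the Spark API.
log = []

def read_records():
    """'Transformation': describes which data to read."""
    for value in [5, 12, 7, 30]:
        log.append(f"read {value}")
        yield value

def keep_large(records, threshold):
    """'Transformation': describes how to filter the data."""
    for value in records:
        if value > threshold:
            yield value

pipeline = keep_large(read_records(), threshold=6)
work_before_action = len(log)   # nothing has been read yet

result = list(pipeline)         # 'Action': forces the pipeline to run
work_after_action = len(log)
print(work_before_action, work_after_action, result)
```

Building `pipeline` records no reads at all; the reads happen only when `list(...)` consumes it.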
Cleaning - Filtering
Header and case data on the same CSV line
Filter DataFrame with expressions
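The filtering step can be pictured as keeping only the rows that satisfy a predicate expression and then casting the surviving fields. The slide's actual filter expressions are not in the transcript, so the rows, field names, and `is_number` helper below are assumptions, written in plain Python rather than with the DataFrame API.

```python
# Hypothetical rows as they might look after a CSV read where header
# text was mixed into the data; every field arrives as a string.
rows = [
    {"account": "1001", "amount": "25.50"},
    {"account": "account", "amount": "amount"},   # stray header text
    {"account": "1002", "amount": "310.00"},
]

def is_number(text):
    """True when the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# Filter expression: keep only rows whose fields are numeric, then cast.
clean = [
    {"account": int(r["account"]), "amount": float(r["amount"])}
    for r in rows
    if is_number(r["account"]) and is_number(r["amount"])
]
print(clean)
```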
Cleaning – SQL operations
Write to File
Table for SQL
Save clean data as ORC
Create Training and Test Data
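The transcript does not show how the split was made. In Spark a seeded random split of the DataFrame (for example with `DataFrame.randomSplit`) is a common choice, and the plain-Python sketch below shows the same idea. The 80/20 ratio, the seed, and the `split_train_test` helper are assumptions for illustration.

```python
import random

def split_train_test(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into (train, test).

    The fraction and seed are illustrative defaults, not values from
    the slides.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

transactions = list(range(100))   # stand-in for labeled transactions
train, test = split_train_test(transactions)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so a model trained twice sees the same training rows.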
Train Model
Journey to Fraud Detection
Proactively identify potential fraudulent transactions to protect the customer and improve customer experience
• Proactively monitor every credit card transaction using machine learning to catch potential fraud
• Customer Service Analyst reviews flagged transactions in real time via a next generation application running on the connected platform
• HDF controls real time flow of data in and out of the connected platform to the various source and destination points
[Diagram labels: Innovate; Renovate; Years of Customer Transaction Data; Purchase Behavior Insight; Complete Customer Profile; Real time ingest of transactions; Fraud Detection; Immediate Customer Feedback; Improved Experience / Reduced Cost]
Credit Card Fraud
• Requirement: Detect fraudulent transactions.
• Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.
• Design
– Distance: How far can one travel over a period of time before it is fraudulent?
– Category: How can we detect a purchase that a customer wouldn’t likely make?
– Frequency: How can we detect purchasing patterns that do not resemble the card holder?
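The distance question can be turned into a feature by computing the speed a card holder would need to get from the previous purchase to the current one. The sketch below assumes each transaction carries a latitude, longitude, and Unix timestamp, and uses a hypothetical plausibility threshold; neither the field layout nor the threshold comes from the slides.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0   # assumed threshold, roughly airliner speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def implied_speed_kmh(prev, curr):
    """Speed needed to travel between two purchases.

    Each argument is a (latitude, longitude, unix_seconds) tuple.
    """
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = (curr[2] - prev[2]) / 3600.0
    if hours <= 0:
        return float("inf") if distance > 0 else 0.0
    return distance / hours

# Hypothetical purchases: Boston, then Los Angeles one hour later.
boston = (42.36, -71.06, 0)
los_angeles = (34.05, -118.24, 3600)
speed = implied_speed_kmh(boston, los_angeles)
is_suspicious = speed > MAX_PLAUSIBLE_KMH
print(round(speed), is_suspicious)
```

A real system would likely treat the speed as one input to a model rather than as a hard rule, since card-not-present purchases have no meaningful location.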
Outlier Detection: identify abnormal patterns
Example: identify anomalies
Features:
- Time frequency
- Amount in Category
- Distance
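The slides do not say which outlier method the demo uses. As one simple possibility, a new value of a feature such as amount in category can be scored against the card holder's history with a z-score. The threshold, the sample amounts, and the `zscore_outliers` helper below are assumptions.

```python
import statistics

def zscore_outliers(history, new_values, threshold=3.0):
    """Return the new values that lie more than `threshold` sample
    standard deviations from the mean of the history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # All historical values are identical: anything different stands out.
        return [v for v in new_values if v != mean]
    return [v for v in new_values if abs(v - mean) / stdev > threshold]

# Hypothetical past grocery purchases for one card holder.
grocery_history = [20, 25, 22, 19, 24, 21, 23, 26]
flagged = zscore_outliers(grocery_history, [24.0, 400.0])
print(flagged)
```

Scoring new values against the history, rather than including them in the mean and standard deviation, keeps a single extreme value from hiding itself by inflating the spread.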
Credit Fraud Detection Application Architecture
[Diagram labels: Data Sources; Data in Motion; Data at Rest; Actionable Intelligence fueled by Adaptive Machine Learning; Customer Service Analyst; Customer]
Check out the Demo and Blog!
Architecture
[Diagram labels: Behavior Modeling (Model 1, Model 2, Model 3); Fraud Detection; Transaction History; MoveTransactions; Time; Train; Predict]
Credit Fraud Analyst Inbox
Hortonworks Data Flow


Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card Fraud Demo


Editor's notes

  1. Customer transactions are reviewed in real time by the connected data platform using machine learning.
  2. Before we can detect an outlier, we have to define it. The most intuitive definition I’ve seen is from the 1980 book Identification of Outliers (Hawkins): “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” Speaker: It’s worth repeating that all of these topics are deep enough to have entire university library shelves filled with books on them. We’re just skimming the surface. Anomaly detection is related to clustering, but almost its inverse. If points that are similar to each other cluster together (representing “normal” behavior or patterns), we want to find the points that are NOT in any cluster.
  3. Modern Data Applications, be they custom or off the shelf, are fueled by Data in Motion and Data at Rest and convert yesterday’s impossible challenges into today’s new products, cures, and life-saving innovations. Cyber security leaders are building powerful apps to detect threats to digital information. Leading pharma, automotive, electronics and packaged goods companies are building their factories of the future that use actionable intelligence to improve manufacturing yields. And age-old industries like automotive, agriculture and retail are taking connected data platforms on the road, through the field, or to the cash register to do things that have never before been possible.
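Note 2 describes anomaly detection as finding the points that are not in any cluster. One simple way to sketch that idea is to count how many neighbors each point has within a radius and flag the points with too few. This loosely resembles how density-based clustering treats noise, but it is a simplification and not the demo's method. The radius, neighbor count, sample points, and function name are assumptions.

```python
import math

def points_outside_clusters(points, radius, min_neighbors):
    """Return points with fewer than `min_neighbors` other points
    within `radius` (Euclidean distance)."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

# Hypothetical 2-D feature vectors: two tight groups and one stray point.
points = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.2),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0),
]
anomalies = points_outside_clusters(points, radius=0.5, min_neighbors=2)
print(anomalies)
```

The pairwise comparison is quadratic in the number of points, so this form only suits small samples; it is meant to show the definition, not to scale.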