2. MAKING BANK PREDICTIVE AND REAL TIME
BY ANURAG SHRIVASTAVA
2014 HADOOP SUMMIT, APRIL 2-3, 2014, AMSTERDAM
Hadoop: Experimentation to Production
3. What is Predictive & Real-Time?
Predictive
The ability to predict customer behaviour from data and to generate appropriate actions
Real-Time
The ability to take appropriate actions in near real time, based on the events generated during customer interactions
4. Data Platform: Challenges
Data Silos
160 Oracle instances, 70 TB
30 data warehouses
Batch Processing
120,000 ETL jobs (OWB, ODI)
Limited Analytics
Segmentation for campaigns using SAS
5. Data Platform: Initial Thoughts
• Less effort required in database maintenance
• 20 to 60 times query performance improvement
• Hardware maintenance is handled by the vendor
• Powerful data transformation and analytics capabilities with the help of accelerators
• Fits well in a large data centre
• Expensive data storage: EUR 15K to 70K per terabyte
• Proprietary technology and hardware
• Explosive growth in data storage requirements, driven by logs from online channels, means more investment is needed in Netezza
All the trademarks and copyrights are acknowledged by the author.
6. Data Platform: Target Architecture
Enterprise Data Overlay (Datastage,…)
BI Tools and Applications
Enterprise Systems / External Systems
Data Marts (Netezza)
Enterprise Data Warehouse (Hadoop)
Predictive Analytics Lab (Hadoop)
7. Challenges with Hadoop in a Large Bank
New technology: first-mover disadvantage
Experienced people are hard to find
Attention and hype from CXOs (read: pressure to deliver)
Do we really have a big data problem?
No clear leader in the vendor space
Open-source and Java-focused community
8. IT Challenges in a Large Bank
High-end servers
Virtualization
Storage area networks
Shared services: build server, monitoring & back-up, etc.
10. Play Area: Big Data
Goals
Quickly learn about Hadoop capabilities
Create interest and awareness in the organization
What we did
Set up a small Hadoop cluster with old, unused HP blades in a test area
Got started quickly with a distribution recommended by a consulting company
One-time load of old data; no ETL, no scheduling, etc.
A small team experimenting with the data using R (a sketch of such a first job follows below)
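The slides mention R for exploration; as an illustration (in Python, via Hadoop Streaming), below is the kind of first job a play-area cluster typically runs: counting events per customer in the one-time data load. The input layout (tab-separated, customer id in the first column) and all file names are hypothetical.

    #!/usr/bin/env python
    # mapper.py -- emits <customer_id, 1> for each input record.
    # Assumes tab-separated lines with the customer id in column 1
    # (a hypothetical layout, for illustration only).
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(fields[0] + "\t1")

    #!/usr/bin/env python
    # reducer.py -- sums the 1s per key; Hadoop Streaming hands the
    # reducer its input sorted by key, so a running total suffices.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))

Such a pair is submitted with the streaming jar that ships with the distribution, along the lines of: hadoop jar hadoop-streaming.jar -input /data/raw -output /data/counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar path varies per distribution).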
11. Predictive Analytics Lab
Goals
Capability to build predictive models for business cases
A secure environment designed for the data scientists who build predictive models
What we did
Hadoop cluster with brand-new hardware
Managed to install it in a data centre
Secure and monitored
Based upon HDP 2.0
12. Implementation Challenges (1/3)
Securing Hadoop
Strong perimeter security; access restricted to a limited set of users
Multi-factor authentication
Stepping stone to the Hadoop cluster using Citrix
Enterprise repository for deployment, with jars pre-screened for Trojans and malware (see the sketch after this list)
Monitoring for various events
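A minimal sketch of what the jar pre-screening can look like at deployment time: only jars whose SHA-256 digest appears on an allowlist written by the screening step are let through. The file names and allowlist format here are hypothetical.

    # check_jars.py -- refuse to deploy any jar whose SHA-256 digest
    # is not on the allowlist produced by the screening step
    # (one hex digest per line; all file names are hypothetical).
    import hashlib
    import pathlib
    import sys

    def sha256(path):
        return hashlib.sha256(path.read_bytes()).hexdigest()

    allowlist = set(pathlib.Path("screened-jars.sha256").read_text().split())
    unscreened = [p.name for p in pathlib.Path("deploy").glob("*.jar")
                  if sha256(p) not in allowlist]
    if unscreened:
        sys.exit("unscreened jars, refusing to deploy: %s" % unscreened)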
13. Implementation Challenges (2/3)
Hardware
Data centre not ready for cheap/commodity hardware
Automated deployment only possible on VMs, so no install possible on bare metal
Compromise between costs and DC standards
Automated Provisioning of Hadoop
Ansible for automated provisioning of 18 nodes
Ambari for monitoring the cluster
Automating the provisioning is easy, and highly recommended (see the sketch below)
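A minimal sketch of driving such a provisioning run from Python; ansible-playbook and its -i flag are the standard Ansible CLI, while the inventory and playbook names below are hypothetical.

    # provision.py -- kick off an Ansible run over the cluster nodes.
    # "hadoop-nodes.ini" (18 hosts) and "hdp-cluster.yml" are
    # hypothetical names for the inventory and playbook.
    import subprocess

    def provision(inventory="hadoop-nodes.ini", playbook="hdp-cluster.yml"):
        # -i selects the inventory; check=True fails loudly on any error
        subprocess.run(["ansible-playbook", "-i", inventory, playbook],
                       check=True)

    if __name__ == "__main__":
        provision()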
14. Implementation Challenges (3/3)
Rapid Pace of Innovation
The Hadoop community is very active on the innovation front
As we built our Hadoop cluster, new names such as Spark, Accumulo and Falcon popped up
Infra processes are waterfall-based, forcing a pause every time a new tool pops up
A number of distributions to choose from
15. Predictive Analytics Lab
(Architecture diagram of the lab, reduced to its labels:)
Stepping stone (Citrix) in front of the cluster
18 x Hadoop nodes on a dedicated VLAN
GIT, libraries, build tools
Monitoring services
Data files delivered in batches
Shared services: SMTP relay, internet via corporate infrastructure
Firewall rules guard the perimeter; security of the Hadoop cluster
16. Predictive Analytics Lab
Team and Process
Scrum with 3-week sprints
Data scientists and Hadoop engineers in the team
Every sprint demonstrates working software to the stakeholders
Lab Environment
Hortonworks HDP 2.0
Hive
Ambari and Ansible
RStudio
Hue, HCatalog
18. Production System
Goals
Meet the diverse information needs of the business
Deploy predictive models to production
Cut data storage costs without compromising reliability and availability
What we need
Fine-grained security
ETL and workflow tools
Automated deployment of predictive models
Disaster recovery
19. Real-Time
Hadoop is a batch processing system, not designed for real-time analytics
A predictive model that has to perform in near real time requires a deployment platform different from Hadoop
Real time means near real-time, or micro-batches (see the sketch after this list)
Candidate tools for evaluation: Storm, Spark and InfoSphere Streams
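As one illustration of the micro-batch style, a minimal Spark Streaming sketch (Spark is one of the candidate tools named above): it counts incoming events in 5-second batches. The host and port of the event feed are hypothetical.

    # stream_count.py -- near real-time in micro-batches with Spark
    # Streaming: count the events that arrive in each 5-second batch.
    # "eventhost":9999 is a hypothetical socket feed, e.g. one card
    # transaction per line.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="NearRealTimeEvents")
    ssc = StreamingContext(sc, 5)          # 5-second micro-batches

    events = ssc.socketTextStream("eventhost", 9999)
    events.count().pprint()                # events per micro-batch

    ssc.start()
    ssc.awaitTermination()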
20. Business Cases
Improve the segmentation for marketing
Personal spending forecast
Predict mortgage defaulters (a minimal sketch follows below)
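To make the last case concrete, a toy defaulter-prediction sketch, here with scikit-learn rather than the SAS tooling mentioned earlier; the three features and the CSV layout are hypothetical.

    # defaulters.py -- toy mortgage-defaulter model: logistic
    # regression on three hypothetical features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # columns: income, loan_to_value, missed_payments, defaulted (0/1)
    data = np.loadtxt("mortgages.csv", delimiter=",", skiprows=1)
    X, y = data[:, :3], data[:, 3]

    model = LogisticRegression().fit(X, y)
    # probability of default for a new applicant (made-up values)
    print(model.predict_proba([[42000.0, 0.9, 2]])[0, 1])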
21. Hadoop Benefits
Data Hub or Enterprise Memory
  Schema on read (see the sketch after this list)
  Cheap but reliable storage
  Fault tolerant
  Lower cost of hardware and licenses
Data Driven Applications
  Run complex queries and predictive analytics models
  Build predictive models
  Increase revenue and lower risk
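A minimal sketch of what schema on read means in practice: the raw files already sit in HDFS, and a Hive external table merely attaches a schema at query time, with no data copied or transformed up front. The path and columns below are hypothetical; 'hive -e' is the standard way to run a HiveQL string from the command line.

    # schema_on_read.py -- attach a Hive schema to raw files in HDFS.
    import subprocess

    DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
      ts          STRING,
      customer_id STRING,
      action      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/weblogs';
    """

    subprocess.run(["hive", "-e", DDL], check=True)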
22. Lessons learned
Hadoop is ready for early adopters; it can cut your costs and accelerate predictive analytics
Hadoop is not the complete solution for building a real-time and predictive platform
Business-case-driven experimentation has a greater chance of acceptance than pure technical exploration in a large enterprise
External expertise and a close link with the community are valuable
Speaker Notes

My name is Anurag Shrivastava. I lead an engineering team that builds data platform and customer intelligence solutions. I work for ING Bank, a very large retail bank in the Netherlands. (Note to self: stand still and talk slowly.)
Every day we get tens of emails containing one promotion or another. We throw most of them away. However, if you get an email with a special cake price two days before your wife's birthday, you are likely to be happy. Most marketing goes to waste if you do not know what the customer wants; the right offer at the wrong moment rarely works. Suppose somebody tries to use your credit card in Australia while you are watching my presentation: how soon will your credit card company know that and inform you?
Data silos have come up over time because of the specific needs of each value chain. This causes data duplication and several point-to-point interfaces. Batch processing is based upon processing files at intervals. Due to the multiple data silos, we have to process the same file several times, leading to complex ETL routines and a short window for fault recovery. At this moment, analytics is limited to structured data for marketing purposes: data is analysed at rest, and a model is built and deployed for the campaigns. We process around 1,000 batch files daily.
Netezza seemed a great idea because it offered consolidation and lower maintenance overhead. It was also much faster than our Oracle-based DWH. However, this was before Hadoop shot into prominence; the decision was taken in 2011, when we were not familiar with Hadoop. The cost of Netezza is high, and we also had to redo our ETL.
Explain how this stack has been built starting from bottom to top.
First-mover disadvantage, in contrast with Oracle, which is well known. BI departments are SQL-focused; stacks do not change a lot over a long period of time. Big data companies are very small: when our CIO/CFO visited these companies in Silicon Valley, they were surprised by their small size.
Data centres have been designed for high-end servers, while Hadoop works on cheap servers (risk of fire, risk of shutting down the entire network). Virtualization is used heavily, so the concept of data locality is foreign here. Large IT organizations share a lot: monitoring, build and backup are shared, so a new system has to be compatible with them, or it becomes special and the cost of service goes up. An IT infra engineer sees Hadoop as an elephant in a zoo full of tigers.
Start small, learn and move on.
Once you combine data from many sources, its sensitivity increases. Fine-grained security on Hadoop is still not ready.
We settled for the hardware and software recommended by our data centre, though we could have saved more (and spent more time) with cheaper hardware. We used HP SL4540 servers: approx. EUR 16,000 per node for a 16-core CPU and approx. 30 TB of storage per node. You do not want to install 18 nodes manually.
Every new tool means following a cumbersome change process. We chose Hortonworks because of their clean open-source approach. Innovation can be tempting, but you cannot implement every new tool.
Explain the purpose of each block.
So we can build predictive models, but how to deploy them has to be figured out. Real time: yet to be done.