2. MAKING BANK PREDICTIVE AND REAL TIME
BY ANURAG SHRIVASTAVA
2014 HADOOP SUMMIT, APRIL 2-3, 2014, AMSTERDAM
Hadoop: Experimentation to Production
3. What is Predictive & Real-Time?
Predictive
The ability to predict customer behaviour from data and to generate appropriate actions
Real-Time
The ability to take appropriate actions in near real time, based on the events generated during customer interactions
4. Data Platform: Challenges
Data Silos
160 Oracle instances, 70 TB
30 data warehouses
Batch Processing
120,000 ETL jobs (OWB, ODI)
Limited Analytics
Segmentation for campaigns using SAS
5. Data Platform: Initial Thoughts
• Less effort required in database maintenance
• 20 to 60 times query performance improvement
• Hardware maintenance is handled by the vendor
• Powerful data transformation and analytics capabilities with the help of accelerators
• Fits well in a large data centre
• Expensive data storage: EUR 15K to 70K per terabyte
• Proprietary technology and hardware
• Explosive growth in data storage requirements, driven by logs from online channels, means more investment is needed in Netezza
All the trademarks and copyrights are acknowledged by the author.
6. Data Platform: Target Architecture
Enterprise Data Overlay (Datastage,…)
BI Tools and Applications
Enterprise Systems / External Systems
Data Marts (Netezza)
Enterprise Data Warehouse (Hadoop)
Predictive Analytics Lab (Hadoop)
7. Challenges with Hadoop in a Large Bank
New technology: first-mover disadvantage
Experienced people are hard to find
Attention and hype from CXOs (read: pressure to deliver)
Do we really have a big data problem?
No clear leader in the vendor space
Open-source and Java-focused community
8. IT Challenges in a Large Bank
High-end servers
Virtualization
Storage area networks
Shared services: build server, monitoring & back-up, etc.
10. Play Area: Big Data
Goals
Quickly learn about Hadoop capabilities
Create interest and awareness in the organization
What we did
Set up a small Hadoop cluster with old, unused HP blades in a test area
Got started quickly with a distribution recommended by a consulting company
One-time load of old data; no ETL, no scheduling, etc.
A small team experimenting with the data using R (a sketch of such a first job follows below)
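The slides mention R for exploration; as an illustration (in Python, via Hadoop Streaming), below is the kind of first job a play-area cluster typically runs: counting events per customer in the one-time data load. The input layout (tab-separated, customer id in the first column) and all file names are hypothetical.

    #!/usr/bin/env python
    # mapper.py -- emits <customer_id, 1> for each input record.
    # Assumes tab-separated lines with the customer id in column 1
    # (a hypothetical layout, for illustration only).
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(fields[0] + "\t1")

    #!/usr/bin/env python
    # reducer.py -- sums the 1s per key; Hadoop Streaming hands the
    # reducer its input sorted by key, so a running total suffices.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))

Such a pair is submitted with the streaming jar that ships with the distribution, along the lines of: hadoop jar hadoop-streaming.jar -input /data/raw -output /data/counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar path varies per distribution).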
11. Predictive Analytics Lab
Goals
Capability to build predictive models for business cases
A secure environment designed for the data scientists who build predictive models
What we did
Hadoop cluster with brand-new hardware
Managed to install it in a data centre
Secure and monitored
Based upon HDP 2.0
12. Implementation Challenges (1/3)
Securing Hadoop
Strong perimeter security; access restricted to a limited set of users
Multi-factor authentication
Stepping stone to the Hadoop cluster using Citrix
Enterprise repository for deployment, with jars pre-screened for Trojans and malware (see the sketch after this list)
Monitoring for various events
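A minimal sketch of what the jar pre-screening can look like at deployment time: only jars whose SHA-256 digest appears on an allowlist written by the screening step are let through. The file names and allowlist format here are hypothetical.

    # check_jars.py -- refuse to deploy any jar whose SHA-256 digest
    # is not on the allowlist produced by the screening step
    # (one hex digest per line; all file names are hypothetical).
    import hashlib
    import pathlib
    import sys

    def sha256(path):
        return hashlib.sha256(path.read_bytes()).hexdigest()

    allowlist = set(pathlib.Path("screened-jars.sha256").read_text().split())
    unscreened = [p.name for p in pathlib.Path("deploy").glob("*.jar")
                  if sha256(p) not in allowlist]
    if unscreened:
        sys.exit("unscreened jars, refusing to deploy: %s" % unscreened)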
13. Implementation Challenges (2/3)
Hardware
Data centre not ready for cheap/commodity hardware
Automated deployment only possible on VMs, so no install possible on bare metal
Compromise between costs and DC standards
Automated Provisioning of Hadoop
Ansible for automated provisioning of 18 nodes
Ambari for monitoring the cluster
Automating the provisioning is easy, and highly recommended (see the sketch below)
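A minimal sketch of driving such a provisioning run from Python; ansible-playbook and its -i flag are the standard Ansible CLI, while the inventory and playbook names below are hypothetical.

    # provision.py -- kick off an Ansible run over the cluster nodes.
    # "hadoop-nodes.ini" (18 hosts) and "hdp-cluster.yml" are
    # hypothetical names for the inventory and playbook.
    import subprocess

    def provision(inventory="hadoop-nodes.ini", playbook="hdp-cluster.yml"):
        # -i selects the inventory; check=True fails loudly on any error
        subprocess.run(["ansible-playbook", "-i", inventory, playbook],
                       check=True)

    if __name__ == "__main__":
        provision()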
14. Implementation Challenges (3/3)
Rapid Pace of Innovation
The Hadoop community is very active on the innovation front
As we built our Hadoop cluster, new names such as Spark, Accumulo and Falcon popped up
Infra processes are waterfall-based, forcing a pause every time a new tool pops up
A number of distributions to choose from
15. Predictive Analytics Lab
(Architecture diagram of the lab, reduced to its labels:)
Stepping stone (Citrix) in front of the cluster
18 x Hadoop nodes on a dedicated VLAN
GIT, libraries, build tools
Monitoring services
Data files delivered in batches
Shared services: SMTP relay, internet via corporate infrastructure
Firewall rules guard the perimeter; security of the Hadoop cluster
16. Predictive Analytics Lab
Team and Process
Scrum with 3-week sprints
Data scientists and Hadoop engineers in the team
Every sprint demonstrates working software to the stakeholders
Lab Environment
Hortonworks HDP 2.0
Hive
Ambari and Ansible
RStudio
Hue, HCatalog
18. Production System
Goals
Meet the diverse information needs of the business
Deploy predictive models to production
Cut data storage costs without compromising reliability and availability
What we need
Fine-grained security
ETL and workflow tools
Automated deployment of predictive models
Disaster recovery
19. Real-Time
Hadoop is a batch processing system, not designed for real-time analytics
A predictive model that has to perform in near real time requires a deployment platform different from Hadoop
Real time means near real-time, or micro-batches (see the sketch after this list)
Candidate tools for evaluation: Storm, Spark and InfoSphere Streams
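As one illustration of the micro-batch style, a minimal Spark Streaming sketch (Spark is one of the candidate tools named above): it counts incoming events in 5-second batches. The host and port of the event feed are hypothetical.

    # stream_count.py -- near real-time in micro-batches with Spark
    # Streaming: count the events that arrive in each 5-second batch.
    # "eventhost":9999 is a hypothetical socket feed, e.g. one card
    # transaction per line.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="NearRealTimeEvents")
    ssc = StreamingContext(sc, 5)          # 5-second micro-batches

    events = ssc.socketTextStream("eventhost", 9999)
    events.count().pprint()                # events per micro-batch

    ssc.start()
    ssc.awaitTermination()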
20. Business Cases
Improve the segmentation for marketing
Personal spending forecast
Predict mortgage defaulters (a minimal sketch follows below)
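To make the last case concrete, a toy defaulter-prediction sketch, here with scikit-learn rather than the SAS tooling mentioned earlier; the three features and the CSV layout are hypothetical.

    # defaulters.py -- toy mortgage-defaulter model: logistic
    # regression on three hypothetical features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # columns: income, loan_to_value, missed_payments, defaulted (0/1)
    data = np.loadtxt("mortgages.csv", delimiter=",", skiprows=1)
    X, y = data[:, :3], data[:, 3]

    model = LogisticRegression().fit(X, y)
    # probability of default for a new applicant (made-up values)
    print(model.predict_proba([[42000.0, 0.9, 2]])[0, 1])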
21. Hadoop Benefits
Data Hub or Enterprise Memory
  Schema on read (see the sketch after this list)
  Cheap but reliable storage
  Fault tolerant
  Lower cost of hardware and licenses
Data Driven Applications
  Run complex queries and predictive analytics models
  Build predictive models
  Increase revenue and lower risk
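A minimal sketch of what schema on read means in practice: the raw files already sit in HDFS, and a Hive external table merely attaches a schema at query time, with no data copied or transformed up front. The path and columns below are hypothetical; 'hive -e' is the standard way to run a HiveQL string from the command line.

    # schema_on_read.py -- attach a Hive schema to raw files in HDFS.
    import subprocess

    DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
      ts          STRING,
      customer_id STRING,
      action      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/weblogs';
    """

    subprocess.run(["hive", "-e", DDL], check=True)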
22. Lessons learned
Hadoop is ready for early adopters; it can cut your costs and accelerate predictive analytics
Hadoop is not the complete solution for building a real-time and predictive platform
Business-case-driven experimentation has a greater chance of acceptance than pure technical exploration in a large enterprise
External expertise and a close link with the community are valuable
Speaker Notes

My name is Anurag Shrivastava. I lead an engineering team that builds data platform and customer intelligence solutions. I work for ING Bank, a very large retail bank in the Netherlands. (Note to self: stand still and talk slowly.)
Every day we get tens of emails containing one promotion or another. We throw most of them away. However, if you get an email with a special cake price two days before your wife's birthday, you are likely to be happy. Most marketing goes to waste if you do not know what the customer wants; the right offer at the wrong moment rarely works. Suppose somebody tries to use your credit card in Australia while you are watching my presentation: how soon will your credit card company know that and inform you?
Data silos have come up over time because of the specific needs of each value chain. This causes data duplication and several point-to-point interfaces. Batch processing is based upon processing files at intervals. Due to the multiple data silos, we have to process the same file several times, leading to complex ETL routines and a short window for fault recovery. At this moment, analytics is limited to structured data for marketing purposes: data is analysed at rest, and a model is built and deployed for the campaigns. We process around 1,000 batch files daily.
Netezza seemed a great idea because it offered consolidation and lower maintenance overhead. It was also much faster than our Oracle-based DWH. However, this was before Hadoop shot into prominence; the decision was taken in 2011, when we were not familiar with Hadoop. The cost of Netezza is high, and we also had to redo our ETL.
Explain how this stack has been built starting from bottom to top.
First-mover disadvantage, in contrast with Oracle, which is well known. BI departments are SQL-focused; stacks do not change a lot over a long period of time. Big data companies are very small: when our CIO/CFO visited these companies in Silicon Valley, they were surprised by their small size.
Data centres have been designed for high-end servers, while Hadoop works on cheap servers (risk of fire, risk of shutting down the entire network). Virtualization is used heavily, so the concept of data locality is foreign here. Large IT organizations share a lot: monitoring, build and backup are shared, so a new system has to be compatible with them, or it becomes special and the cost of service goes up. An IT infra engineer sees Hadoop as an elephant in a zoo full of tigers.
Start small, learn and move on.
Once you combine data from many sources, its sensitivity increases. Fine-grained security on Hadoop is still not ready.
We settled for the hardware and software recommended by our data centre, though we could have saved more (and spent more time) with cheaper hardware. We used HP SL4540 servers: approx. EUR 16,000 per node for a 16-core CPU and approx. 30 TB of storage per node. You do not want to install 18 nodes manually.
Every new tool means following a cumbersome change process. We chose Hortonworks because of their clean open-source approach. Innovation can be tempting, but you cannot implement every new tool.
Explain the purpose of each block.
So we can build predictive models, but how to deploy them has to be figured out. Real time: yet to be done.