SlideShare une entreprise Scribd logo
1  sur  24
Making Bank Predictive and Real-Time
Anurag Shrivastava
ING Bank
MAKING BANK
PREDICTIVE AND REAL
TIME
BY ANURAG SHRIVASTAVA
2014 HADOOP SUMMIT APRIL 2-3, 2014 AMSTERDAM
Hadoop: Experimentation to Production
2
What is predictive & real-time?
 Predictive
 Ability to predict the behavior of customer based
upon data and generation of appropriate actions
 Real-Time
 Ability to take appropriate actions near real-time
based upon the events generated during
customer interactions
3
Data Platform: Challenges
Data Silos
160 Oracle Instances, 70 TB
30 Data Warehouses
Batch Processing
120000 ETL Jobs, OWB, ODI
Limited Analytics
Segmentation for Campaigns
Using SAS
4
Data Platform: Initial Thoughts
• Less effort required in database maintenance
• 20 to 60 Times Query Performance Improvement
• Hardware Maintenance is handled by the vendor
• Powerful data transformation and analytics
capabilities with the help of accelerators
• Fits well in a large data center
• Expensive data storage EUR 15K to 70 K per
terabyte
• Proprietary technology and hardware
• Explosive growth in data storage requirements due
to logs from online channels mean more investment
is needed in Netezza
All the trademarks and copyrights are acknowledged by the author.
5
Data Platform Target Architecture
Enterprise Data Overlay (Datastage,…)
BI Tools and Applications
Enterprise SystemsExternal Systems
Data Marts (Netezza)
Enterprise Data Warehouse (Hadoop)
Predictive Analytics Lab
(Hadoop)
6
Challenges with Hadoop in a Large
Bank
 New technology – first mover disadvantage
 Experienced people are hard to find
 Attention and hype from CXO (read pressure
to deliver)
 Do we really have a big data problem?
 No clear leader in the vendor space
 Open source and Java focused community
7
IT Challenges in a Large Bank
High End Servers Virtualization
Storage Area Networks
Shared Services
Build Server, Monitoring &
Back-up etc.
8
Our Journey
Play Area Big Data Predictive Analytics
Lab
Production
System
9
Play Area Big Data
 Goals
 Quickly learn about Hadoop Capabilities
 Create interest and awareness in the organization
 What we did?
 Setup a small Hadoop cluster with old unused HP
blades in a test area
 Get started quickly with a distribution recommended
by a consulting company
 One time old data load, no ETL, no scheduling etc.
 Small team experimenting data using R
10
Predictive Analytics Lab
 Goals
 Capability to build predictive models for business
cases
 Secure environment designed for data scientists
who can build predictive models
 What we did?
 Hadoop cluster with brand new Hardware
 Managed to install it in a data centre
 Secure and monitored
 Based upon a HDP 2.0
11
Implementation Challenges (1/3)
 Securing Hadoop
 Strong perimeter security to a limited set of users
 Multi-factor authentication
 Stepping stone to Hadoop cluster using Citrix
 Enterprise repository for deployment with pre-
screened jars for Trojans and Malware
 Monitoring for various events
12
Implementation Challenges (2/3)
 Hardware
 Data Centre not ready for cheap/commodity hardware
 Automated deployment only possible on VMs so no
install possible on bare metal
 Compromise between costs and DC standards
 Automated Provisioning of Hadoop
 Ansible for automated provisioning of 18 nodes
 Ambari for monitoring the cluster
 It is easy to automate the provision and highly
recommended
13
Implementation Challenges (3/3)
 Rapid Pace of Innovation
 Hadoop community is very active on innovation
front
 As we built our Hadoop cluster, new names such
as Spark, Accumulo and Falcon pop-up
 Infra processes are waterfall based forcing a
pause every time a new tool pops-up
 Number of distributions to choose from
14
Predictive Analytics Lab
Stepping Stone (Citrix)
18 x Hadoop Nodes
GIT, Libraries, Build
Tools
Monitoring Services
Data Files in Batches
Dedicated VLAN Shared ServicesShared Services
SMTP Relay
Internet via
Corporate
Infrastructure
Firewall Rules
Guard the Perimeter
Security
Of Hadoop Cluster
18 x Hadoop Nodes
15
Predictive Analytics Lab
 Scrum – 3 week long
sprint
 Data
Scientists, Hadoop
Engineers in the team
 Every sprint
demonstrates working
software to the
stakeholders
 Hortonworks HDP
2.0
 Hive
 Ambari and Ansible
 R Studio
 Hue, HCatalog
Team and Process Lab Environment
16
Our Journey
Production
System
Play Area Big Data Predictive Analytics
Lab
17
Production System
 Goals
 Meet diverse informations needs of business
 Deploy predictive model for production
 Cut-down data storage costs without
compromising the reliability and availability
 What we need?
 Fine grained security
 ETL and workflow tools
 Automated Deployment of Predictive Models
 Disaster recovery
18
Real-time
 Hadoop is a batch processing system not
designed for real time analytics
 A predictive model that has to perform near
real-time would require a deployment platform
different from Hadoop
 Real time means near real-time or micro-
batches
 Candidate tools for evaluation: Storm, Spark
and Infosphere Streams
19
Business Cases
Improve the segmentation
for marketing
Personal spending forecasPredict mortgage defaulters
20
Hadoop Benefits
 Data Hub or Enterprise Memory
 Schema on read
 Cheap but reliable storage
 Fault Tolerant
 Lower cost of hardware and licenses
 Data Driven Applications
 Run complex queries and predictive analytics
models
 Build predictive models
 Increase revenue and lower risk
21
Lessons learned
 Hadoop is ready for early adopters; it can save
you costs and accelerate predictive analytics
 Hadoop is not the complete solution to build a
real-time and predictive platform
 Business case driven experimentation has
greater chance of acceptance than pure
technical exploration in a large enterprise
 External expertise and close link with the
community is valuable
22
Contact Details
anurag.shrivastava@ing.nl or shri2201@gmail.com
Twitter: @shri2201
Questions23
Making Bank Predictive and Real-Time

Contenu connexe

Tendances

Ibm big data ibm marriage of hadoop and data warehousing
Ibm big dataibm marriage of hadoop and data warehousingIbm big dataibm marriage of hadoop and data warehousing
Ibm big data ibm marriage of hadoop and data warehousing DataWorks Summit
 
Georgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft AzureGeorgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft AzureMicrosoft
 
Open Source in the Energy Industry - Creating a New Operational Model for Dat...
Open Source in the Energy Industry - Creating a New Operational Model for Dat...Open Source in the Energy Industry - Creating a New Operational Model for Dat...
Open Source in the Energy Industry - Creating a New Operational Model for Dat...DataWorks Summit
 
Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes John Archer
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresDATAVERSITY
 
Beyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AIBeyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AIDataWorks Summit
 
Benefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleBenefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleHortonworks
 
Better Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraBetter Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraCloudera, Inc.
 
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...DataWorks Summit
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteMark van Rijmenam
 
Pouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryPouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryDataWorks Summit
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data ArchitectureSplunk
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarioskcmallu
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 

Tendances (20)

Ibm big data ibm marriage of hadoop and data warehousing
Ibm big dataibm marriage of hadoop and data warehousingIbm big dataibm marriage of hadoop and data warehousing
Ibm big data ibm marriage of hadoop and data warehousing
 
Georgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft AzureGeorgia Azure Event - Scalable cloud games using Microsoft Azure
Georgia Azure Event - Scalable cloud games using Microsoft Azure
 
Open Source in the Energy Industry - Creating a New Operational Model for Dat...
Open Source in the Energy Industry - Creating a New Operational Model for Dat...Open Source in the Energy Industry - Creating a New Operational Model for Dat...
Open Source in the Energy Industry - Creating a New Operational Model for Dat...
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data Stores
 
Beyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AIBeyond Big Data: Data Science and AI
Beyond Big Data: Data Science and AI
 
Benefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at ScaleBenefits of Transferring Real-Time Data to Hadoop at Scale
Benefits of Transferring Real-Time Data to Hadoop at Scale
 
Rob Bearden Keynote Hadoop Summit San Jose
Rob Bearden Keynote Hadoop Summit San JoseRob Bearden Keynote Hadoop Summit San Jose
Rob Bearden Keynote Hadoop Summit San Jose
 
Smart data for a predictive bank
Smart data for a predictive bankSmart data for a predictive bank
Smart data for a predictive bank
 
Better Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraBetter Together: The New Data Management Orchestra
Better Together: The New Data Management Orchestra
 
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr...
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
 
Pouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy IndustryPouring the Foundation: Data Management in the Energy Industry
Pouring the Foundation: Data Management in the Energy Industry
 
Big Data Telecom
Big Data TelecomBig Data Telecom
Big Data Telecom
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 

Similaire à Making Bank Predictive and Real-Time

Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013Amazon Web Services
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data AnalyticsAttunity
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopDataWorks Summit
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
SendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data WarehousingSendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data WarehousingAmazon Web Services
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Data Con LA
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanJim Kaskade
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBigDataExpo
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 

Similaire à Making Bank Predictive and Real-Time (20)

Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
Trusted Analytics as a Service (BDT209) | AWS re:Invent 2013
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
SendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data WarehousingSendGrid Improves Email Delivery with Hybrid Data Warehousing
SendGrid Improves Email Delivery with Hybrid Data Warehousing
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps Ironfan
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use CasesBig Data Expo 2015 - Hortonworks Common Hadoop Use Cases
Big Data Expo 2015 - Hortonworks Common Hadoop Use Cases
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Dernier (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Making Bank Predictive and Real-Time

  • 1. Making Bank Predictive and Real-Time Anurag Shrivastava ING Bank
  • 2. MAKING BANK PREDICTIVE AND REAL TIME BY ANURAG SHRIVASTAVA 2014 HADOOP SUMMIT APRIL 2-3, 2014 AMSTERDAM Hadoop: Experimentation to Production 2
  • 3. What is predictive & real-time?  Predictive  Ability to predict the behavior of customer based upon data and generation of appropriate actions  Real-Time  Ability to take appropriate actions near real-time based upon the events generated during customer interactions 3
  • 4. Data Platform: Challenges Data Silos 160 Oracle Instances, 70 TB 30 Data Warehouses Batch Processing 120000 ETL Jobs, OWB, ODI Limited Analytics Segmentation for Campaigns Using SAS 4
  • 5. Data Platform: Initial Thoughts • Less effort required in database maintenance • 20 to 60 Times Query Performance Improvement • Hardware Maintenance is handled by the vendor • Powerful data transformation and analytics capabilities with the help of accelerators • Fits well in a large data center • Expensive data storage EUR 15K to 70 K per terabyte • Proprietary technology and hardware • Explosive growth in data storage requirements due to logs from online channels mean more investment is needed in Netezza All the trademarks and copyrights are acknowledged by the author. 5
  • 6. Data Platform Target Architecture Enterprise Data Overlay (Datastage,…) BI Tools and Applications Enterprise SystemsExternal Systems Data Marts (Netezza) Enterprise Data Warehouse (Hadoop) Predictive Analytics Lab (Hadoop) 6
  • 7. Challenges with Hadoop in a Large Bank  New technology – first mover disadvantage  Experienced people are hard to find  Attention and hype from CXO (read pressure to deliver)  Do we really have a big data problem?  No clear leader in the vendor space  Open source and Java focused community 7
  • 8. IT Challenges in a Large Bank High End Servers Virtualization Storage Area Networks Shared Services Build Server, Monitoring & Back-up etc. 8
  • 9. Our Journey Play Area Big Data Predictive Analytics Lab Production System 9
  • 10. Play Area Big Data  Goals  Quickly learn about Hadoop Capabilities  Create interest and awareness in the organization  What we did?  Setup a small Hadoop cluster with old unused HP blades in a test area  Get started quickly with a distribution recommended by a consulting company  One time old data load, no ETL, no scheduling etc.  Small team experimenting data using R 10
  • 11. Predictive Analytics Lab  Goals  Capability to build predictive models for business cases  Secure environment designed for data scientists who can build predictive models  What we did?  Hadoop cluster with brand new Hardware  Managed to install it in a data centre  Secure and monitored  Based upon a HDP 2.0 11
  • 12. Implementation Challenges (1/3)  Securing Hadoop  Strong perimeter security to a limited set of users  Multi-factor authentication  Stepping stone to Hadoop cluster using Citrix  Enterprise repository for deployment with pre- screened jars for Trojans and Malware  Monitoring for various events 12
  • 13. Implementation Challenges (2/3)  Hardware  Data Centre not ready for cheap/commodity hardware  Automated deployment only possible on VMs so no install possible on bare metal  Compromise between costs and DC standards  Automated Provisioning of Hadoop  Ansible for automated provisioning of 18 nodes  Ambari for monitoring the cluster  It is easy to automate the provision and highly recommended 13
  • 14. Implementation Challenges (3/3)  Rapid Pace of Innovation  Hadoop community is very active on innovation front  As we built our Hadoop cluster, new names such as Spark, Accumulo and Falcon pop-up  Infra processes are waterfall based forcing a pause every time a new tool pops-up  Number of distributions to choose from 14
  • 15. Predictive Analytics Lab Stepping Stone (Citrix) 18 x Hadoop Nodes GIT, Libraries, Build Tools Monitoring Services Data Files in Batches Dedicated VLAN Shared ServicesShared Services SMTP Relay Internet via Corporate Infrastructure Firewall Rules Guard the Perimeter Security Of Hadoop Cluster 18 x Hadoop Nodes 15
  • 16. Predictive Analytics Lab  Scrum – 3 week long sprint  Data Scientists, Hadoop Engineers in the team  Every sprint demonstrates working software to the stakeholders  Hortonworks HDP 2.0  Hive  Ambari and Ansible  R Studio  Hue, HCatalog Team and Process Lab Environment 16
  • 17. Our Journey Production System Play Area Big Data Predictive Analytics Lab 17
  • 18. Production System  Goals  Meet diverse informations needs of business  Deploy predictive model for production  Cut-down data storage costs without compromising the reliability and availability  What we need?  Fine grained security  ETL and workflow tools  Automated Deployment of Predictive Models  Disaster recovery 18
  • 19. Real-time  Hadoop is a batch processing system not designed for real time analytics  A predictive model that has to perform near real-time would require a deployment platform different from Hadoop  Real time means near real-time or micro- batches  Candidate tools for evaluation: Storm, Spark and Infosphere Streams 19
  • 20. Business Cases Improve the segmentation for marketing Personal spending forecasPredict mortgage defaulters 20
  • 21. Hadoop Benefits  Data Hub or Enterprise Memory  Schema on read  Cheap but reliable storage  Fault Tolerant  Lower cost of hardware and licenses  Data Driven Applications  Run complex queries and predictive analytics models  Build predictive models  Increase revenue and lower risk 21
  • 22. Lessons learned  Hadoop is ready for early adopters; it can save you costs and accelerate predictive analytics  Hadoop is not the complete solution to build a real-time and predictive platform  Business case driven experimentation has greater chance of acceptance than pure technical exploration in a large enterprise  External expertise and close link with the community is valuable 22
  • 23. Contact Details anurag.shrivastava@ing.nl or shri2201@gmail.com Twitter: @shri2201 Questions23

Notes de l'éditeur

  1. My name is Anurag Shrivastava.I lead an engineering team that builds data platform and customer intelligence solutions.I work for ING bank which is a very large retail bank in the Netherlands.Stand still and talk slowly
  2. Everyday we get tens of email containing one promotion or another.We throw away most of them. However, if you get an email for special cake price two days before wife’s birthday, you are likely to be happy. Most of marketing can go waste if you do not know what you want. Right offer at wrong moment rarely works.Somebody tries to use your credit card in Australia when you are watching my presentation. How soon your credit card company know that and inform you?
  3. Data silos have come up over a period of time because of the specific needs of a value chain. This causes data duplication and several point to point interfaces.Batch processing is based upon the processing the files at intervals. Due to multiple data silos, we have to process the same file several time leading to complex ETL routines and short window for fault recovery.At this moment, analytics is limited to analytics on structured data for marketing purposes. Data is analyzed at rest. Model is built and deployed for the campaigns.We process around 1000 batch files daily.
  4. Netezza seemed to a great idea because it offered consolidation and lower maintenance overhead. It also much faster than our Oracle based DWH. However this was before Hadoop shot into prominence. This decision was taken in 2011 when we were not familiar with Hadoop. The cost of Netezza is high. We also had to redo our ETL.
  5. Explain how this stack has been built starting from bottom to top.
  6. First mover disadvantage in contrast with Oracle which is well known.BI departments are SQL focused. Stacks do not change a lot over a long period of time.Big Data companies are very small. When our CIO/CFO visited these companies in the Silicon Valley, they were surprised by their small size.
  7. Data centres have designed for High End server/ Hadoop works on cheap serversRisk of fire/Risk ot shutting down the entire networkVirtualization is used heavily. Concept of data locality are foreign here.Large IT organizations share a lot: Monitoring, Build and Backup is shared so a new system has to be compatible with it or it becomes special and cost of service goes up.A IT Infra engineer sees Hadoop as an elephant in a zoo full of tigers.
  8. Start small, learn and move on.
  9. Once you combine data from many sources, its sensitivity increasesFine grained security on Hadoop is still not ready.
  10. We settled for hardware and software recommended by our data centre though we could have saved more ( and spent more time) with cheaper hardware.We used HP SL4540. One node EUR 16000 for 16 core CPU Approx 30 TB per node.You do not want to install 18 nodes manually.
  11. Every new tool means following a cumbersome change process.We choose Hortonworks because of their clean open source approach.Innovation can be tempting but you can not implement every new tool.
  12. Explain the purpose of each block.
  13. So we can build predictive model but how to deploy them has to be figured out.Real time – yet to be done.