Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
11. Top challenges for the big data warehouse
What challenges does your company face when managing your big data flows?
➢ Ensuring the quality of the data (accuracy, completeness, consistency): 68%
➢ Complying with security and data privacy policies: 60%
➢ Keeping data flow pipelines operating effectively: 52%
➢ Building pipelines for getting data into the data store: 47%
➢ Upgrading big data infrastructure components (Kafka, Hadoop, etc.): 40%
➢ Adapting pipelines to meet new requirements: 32%
➢ We have no challenges: 1%
12. What’s the impact?
Does ‘bad data’ occasionally get into your data stores? Yes: 87%, No: 13%
Do you believe there is any ‘bad data’ in your data stores currently? Yes: 74%, No: 26%
In response, 53% change data flow pipelines at least several times a month.
13. New standards for data warehousing
(Diagram: data flows from Data Sources through Data Stores to Data Consumers, via ETL in the past and via Ingest/Analyze today.)
Past (ETL):
➢ Fixed-schema ETL for data warehouses
➢ Source data: structured, rigid transaction data
Emerging (Ingest):
➢ Explosion of data stores and fluid infrastructure
➢ Source data: predominantly multi-structured interaction data
➢ Data drift: structure, semantics, infrastructure
14. Solving Data Drift: Delayed and False Insights
(Diagram: custom-code, fixed-schema pipelines carry data from Data Sources to Data Stores and on to Data Consumers’ tools and applications; data drift produces poor data trust and quality, and delayed and false insights.)
15. Solving Data Drift: Trusted and Timely Insights
(Diagram: intent-driven, drift-handling pipelines carry data from Data Sources to Data Stores and on to Data Consumers’ tools and applications; data KPIs provide trusted, high-quality data and timely insights despite data drift.)
16. Think of dataflows as cyclical processes
➢ Build: development processes are far more complex and drawn out than they need to be.
➢ Execute: the economics of data have changed, giving way to a choice of execution and deployment options.
➢ Operate: architectures are constantly changing and have more stringent SLAs.
17. Build
➢ Not all developers are created equal.
➢ Integrations are abundant and unnecessarily rigid.
➢ Build-to-deploy takes far longer than necessary.
18. Execute
➢ Multiple deployment options exist, yet constraints limit making use of them.
➢ Mixed workloads are the norm: pipelines must handle both batch and streaming (see the sketch after this list).
➢ Scalability is a must, both today and into the future.
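As a rough Python sketch of the mixed-workload point (illustrative only, not StreamSets code): one processing function can serve both a finite batch source and an unbounded streaming source. The source, function, and field names here are invented for the example.

import json
import time
from typing import Iterable, Iterator

def sanitize(record: dict) -> dict:
    # Normalize keys the same way whether the record arrived in batch or stream.
    return {k.strip().lower(): v for k, v in record.items()}

def process(records: Iterable[dict]) -> Iterator[dict]:
    # The pipeline body is source-agnostic.
    for record in records:
        yield sanitize(record)

def batch_source(path: str) -> Iterator[dict]:
    # Finite source: a newline-delimited JSON file.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def stream_source(poll) -> Iterator[dict]:
    # Unbounded source: poll() returns the next record, or None when idle.
    while True:
        record = poll()
        if record is None:
            time.sleep(0.1)  # back off while the stream is quiet
            continue
        yield record

# The same pipeline serves both modes:
#   for out in process(batch_source("events.jsonl")): ...
#   for out in process(stream_source(consumer_poll)): ...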
19. Operate
➢ Increasingly, the business expects SLAs on the quality and timeliness of data.
➢ Architectures are constantly evolving, with new versions or new projects regularly being added.
➢ Data, and its structure, will inevitably change, causing widespread impact.
20. StreamSets Data Operations Platform
The platform spans the DEVELOP and OPERATE phases, with proactive (EVOLVE) and reactive (REMEDIATE) capabilities:
➢ EFFICIENCY: intent-driven flows, batch and streaming ingest, in-stream sanitization
➢ AGILITY: flexible deployment, exception handling, seamless evolution
➢ MAP: dataflow lineage, live data architecture
➢ MEASURE: any path, any time
➢ CONTROL: drift handling, stage and flow metrics, lineage and impact analysis
➢ MASTER: availability and accuracy, proactive remediation
Components: StreamSets Data Collector (standalone, cluster, cloud, and edge) and Dataflow Performance Manager.
22. StreamSets & MapR enable real-time streams
(Diagram: change data capture from multiple operational databases, covering both static insert-only data and frequently updated data, feeds event streaming with MapR-ES; transformations and stream processing drive real-time business intelligence, with data exploration using Drill.)
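To give a feel for the consuming end of such a pipeline, here is a hedged Python sketch that reads change-data-capture events from a MapR-ES topic via its Kafka-compatible API. The stream path /apps/cdc:orders, the event field names (op, data), and the use of a kafka-python-style client are assumptions for illustration, not details from the deck; a real MapR-ES deployment would use MapR’s Kafka client libraries.

import json
from kafka import KafkaConsumer  # kafka-python; MapR-ES is Kafka API compatible

# Hypothetical MapR-ES stream path and topic name.
consumer = KafkaConsumer(
    "/apps/cdc:orders",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Assumed CDC envelope: an operation type plus the changed row.
    op = event.get("op", "insert")
    row = event.get("data", {})
    if op in ("insert", "update"):
        print("upsert into analytics store:", row)
    elif op == "delete":
        print("delete from analytics store:", row)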
Now, let’s look at how these events are generally analyzed today. Many customers use batch-oriented analysis for critical business decisions.
History is repeating itself
Past: 70% of data warehouse projects used to fail. Fixed-schema ETL technology came along and automated what was previously a manual and brittle task.
Future: an explosion of big data apps, tools and techniques, tied to specific data stores that are fluid and multiplying. The inherent schema-centricity of legacy ETL tools prevents them from being used for extracting and loading (now called ingest) semi-structured data, so organizations have resorted to manually coded data ingest pipelines. Manually coded pipelines are unsustainable, but more importantly they fail due to drift.
You have been hearing about the business impact of big data applications for half a decade now. The commonality is in the source of data: while the previous decades of applications focused on transaction data, emerging use cases focus on event and interaction data. These sources are not just databases and apps, but logs, devices, and device data. Big data sources (e.g. systems, sensors) suffer from data drift: the unending, unpredictable and unannounced mutation of data caused by the operations, maintenance and modernization of data sources.
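To make data drift concrete, here is a minimal Python sketch of drift detection under an assumed baseline schema: each incoming record is compared against the expected fields and types, and any mutation is flagged instead of passing silently. The field names are invented for the example.

# Baseline schema: expected field names mapped to expected types.
baseline = {"user_id": int, "event": str, "ts": str}

def detect_drift(record: dict, schema: dict) -> list:
    findings = []
    for field, expected in schema.items():
        if field not in record:
            findings.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            findings.append(f"type change: {field} is {type(record[field]).__name__}")
    for field in record.keys() - schema.keys():
        findings.append(f"new field: {field}")
    return findings

print(detect_drift({"user_id": "42", "event": "login", "geo": "US"}, baseline))
# ['type change: user_id is str', 'missing field: ts', 'new field: geo']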
Today, data is delivered to data stores by writing low-level code against transport mechanisms such as Sqoop, Flume and Kafka. These hand-coded flows create big problems in data stores (a drift-tolerant alternative is sketched after this list):
Brittle: Data flows break frequently because low-level code can’t adapt when structure changes.
Opaque: These problems manifest themselves as surprises because there is no visibility into the health of data flows or the data being delivered.
Ad hoc: Data integrity corrodes as the meaning of the data changes without detection and new or changed fields do not get properly processed.
…with serious business implications
Poor business decisions get made based on incomplete, inaccurate or late data
Trust in the data is lost as these errors are discovered post hoc
Productivity and agility are sacrificed as data engineers and scientists spend all of their time fixing pipelines, doing janitorial work and forensics.
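As a minimal sketch of the drift-tolerant alternative mentioned above, assuming newline-delimited JSON input and invented field names: fields are mapped by name with defaults rather than by fixed position, and records that cannot be normalized are routed to an error stream for remediation instead of breaking the flow or corrupting the store.

import json

def normalize(record: dict) -> dict:
    # Map by field name with defaults, so a new or reordered field
    # does not break the pipeline the way positional parsing would.
    return {
        "user_id": int(record["user_id"]),
        "event": str(record.get("event", "unknown")),
        "ts": record.get("ts"),
    }

def ingest(lines, store, errors):
    for line in lines:
        try:
            store.append(normalize(json.loads(line)))
        except (ValueError, KeyError) as exc:
            # Route bad records aside instead of crashing the whole flow.
            errors.append({"raw": line, "reason": str(exc)})

good, bad = [], []
ingest(['{"user_id": "7", "event": "login"}', "not json"], good, bad)
print(len(good), len(bad))  # 1 1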
Key point: With the right approach, ingest can happen far more effectively and efficiently than before
Sub point 1: Not everyone is a developer
On one hand we’re extremely lucky: we’re in a market with a seemingly endless number of choices for solving our various data problems. The tricky part is that many of them are quite technical, requiring new skills or hard-to-find resources (i.e., personnel) to make use of them. While many people thrive as hardcore developers, many others do not, often simply because simplified tooling lets them complete a project faster. The point is that you should not be prevented from taking advantage of new technologies because you lack the skills, and adoption doesn’t need to take as long as it typically does.
Sub point 2: Integrations are abundant and unnecessarily rigid
Capturing all existing customer data (ERP, SFDC, etc.) into the data lake and mapping out a picture of the customer to feed other use cases. This feeds lead information to SFDC so sales reps can understand where the opportunities are.
Supply chain analytics is another use case with Cisco: they subcontract work out and use this information to manage the quality of the supply chain process. Fixing issues early in the manufacturing process from subcontractors lets them get in front of problems for the customer. This has saved millions of dollars in supply chain efficiency.
Global bank fraud costs $200B annually.
Zions Bank Fights Fraud, Gains Insights and Cuts Data Storage Costs with MapR
The Business
Zions Bank, based in Salt Lake City, Utah, is a subsidiary of Zions Bancorporation that operates more than 500 offices and 600 ATMs in 10 Western U.S. states. As a full-service bank, Zions offers commercial, installment and mortgage loans; trust services; foreign banking services; electronic and online banking services; automatic deposit and nationwide banking and transfer services; as well as checking and savings programs.
Challenge
“Being a financial institution, we have a bull’s-eye painted on our backs,” says Michael Fowkes, Zions Bank SVP Fraud Operations and Security Analytics. “Crooks want to steal money, and banks are often a target, so fraud protection is critical to our business. If fraud gets out of control, it eats into our profitability.”
The Zions Bank Fraud Operations and Security Analytics team maintains data stores, builds statistical models to detect fraud, and then uses these models to data mine and evaluate suspicious activity.
Zions has been refining their solution over the past 8-9 years. Fowkes explains that about eight years ago they found that when they loaded in large volumes of data, reporting performance degraded significantly.
“We always kept our eye out for new data stores. When it came time to refresh our data stores, we decided to go to Hadoop,” says Fowkes.
MapR Solution
Zions Bank chose MapR for its security features, NFS mountable file system, high availability, ease of management and its superior performance capabilities, which allow for a more efficient use of hardware and a better ROI.
The bank relies on MapR for a critical part of their security architecture. MapR helps Zions predict phishing behavior and payments fraud in real time and minimize their impact. With MapR, Zions can run more detailed analytics and forensics.
Benefits
The bank has seen multiple benefits from their MapR solution:
Cuts storage costs in half
Zions is seeing significant benefits from a storage perspective. With their other data stores, they had to hold on to source data sets so they would still have the original data. MapR eliminates the need to maintain those separate copies.
“When we cut over to MapR, we cut our expenses in half from a data storage perspective,” says Fowkes.
Cost effective to scale
Since MapR scales linearly, capacity planning is much easier. “We know that growth won’t be incredibly expensive like with distributed database platforms which charge per terabyte of storage. This can get quite expensive,” says Fowkes. “The others cost a lot more to scale. MapR allows us to scale at a reasonable price.”
Increases accuracy, speed and insights
Fowkes explains that before, when you created a statistical model, you had to use sample data. “MapR allows you to wrangle large amounts of data,” he says.
“You can use all of your data and create a more accurate model. This is also used in forensics so we have one place to research what happened.”
Two years of data add up to about 1.2 petabytes. Wrangling this amount of data used to be daunting. “In the past, it could take a full day. Now we can do a data query of two years of data in 30 minutes,” he says.
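For a sense of what such a query looks like, here is a hedged sketch using Drill’s REST endpoint (query.json). The host, table path, and column names are hypothetical, and the sketch assumes the Python requests library; nothing here is taken from Zions’ actual environment.

import requests

DRILL_URL = "http://drillbit.example.com:8047/query.json"  # hypothetical host

sql = """
SELECT account_id, COUNT(*) AS txn_count
FROM dfs.`/data/transactions`          -- hypothetical table path
WHERE txn_date >= '2014-01-01'         -- roughly two years of history
GROUP BY account_id
HAVING COUNT(*) > 1000
"""

# Drill's REST API accepts a JSON body with the query type and SQL text.
resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)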
Multiple uses for data stores
Centralizing data stores serves multiple uses, from data security to fraud detection to risk management to customer marketing. “We initially got into centralizing all of our data from an information security perspective. We then saw that we could use this same environment to help with fraud detection,” he says.
“Now that we have this data we know we can do more with it. Right now we’re working on a business project on the marketing side, completely outside of fraud and info security. It’s the same data to look at on the business side for customer analytics,” he says. “And our risk group leverages data that’s used in the system too. Having a more granular view of data, you get additional insights.”
Summary
MapR is enabling Zions Bank to improve its security infrastructure while reducing costs. They’ve been able to cut storage costs in half, scale their solution cost-effectively, make more efficient use of hardware, make statistical models more accurate, increase the performance and speed of high-volume data queries, generate deeper insights, and leverage their data stores across several aspects of the business.