Apache Hadoop, an open-source platform, is gaining adoption among organizations trying to draw insight from the big data they generate. Hadoop, together with a handful of complementary open-source tools, promises to make gigantic, diverse datasets easily and economically available for rapid analysis. A burgeoning partner ecosystem is also essential to helping organizations turn big data into business value.
9. CLOUDERA: THE STANDARD FOR
APACHE HADOOP IN THE ENTERPRISE
OMER TRAJMAN, VP CUSTOMER SOLUTIONS
10. “YOU CAN’T SOLVE 21ST CENTURY PROBLEMS WITH 20TH CENTURY TECHNOLOGIES”
11. HOSPITALS NEED MORE COMPREHENSIVE PATIENT INFORMATION
BANKS MUST DETECT FRAUD FASTER
BROADCAST NETWORKS WANT TO DELIVER CUSTOMIZED CONTENT BY HOUSEHOLD
AIRLINES WANT TO UPDATE FLIGHT PRICES IN REAL TIME
POWER COMPANIES WANT TO SAVE CUSTOMERS MONEY BY ANALYZING USAGE DATA
OIL COMPANIES WANT TO PREDICT THE LOCATION OF DEPOSITS MORE ACCURATELY
RETAILERS WANT TO CREATE MORE TARGETED OFFERS TO CUSTOMERS
PARTICLE PHYSICISTS WANT REAL-TIME DATA FROM THE HADRON COLLIDER
12. SCIENTIFIC APPROACH
TO DATA REQUIRES…
STORAGE FORMATS
FLEXIBILITY
EXTENSIBILITY
COMPACT STORAGE
FAST LOAD/STORE
WIDELY SUPPORTED
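The storage-format requirements above (compact, fast to load and store, widely supported) can be illustrated with a minimal sketch. The record layout and sizes below are assumptions for illustration only; schema-driven formats like Avro, which the speaker notes mention, apply the same idea of describing each record once so the payload stays compact:

```python
import json
import struct

# Hypothetical (id, reading) records -- invented data for illustration.
records = [(i, float(i) * 1.5) for i in range(1000)]

# Text encoding: widely supported and human-readable, but verbose.
as_json = json.dumps([{"id": i, "reading": r} for i, r in records]).encode()

# Binary encoding against a fixed schema: one int32 + one float64 per record.
# The schema lives in the code (or, in Avro's case, alongside the file),
# so field names are not repeated in every record.
as_binary = b"".join(struct.pack("<id", i, r) for i, r in records)

print(len(as_json), len(as_binary))  # the binary form is several times smaller
```

The trade-off sketched here is exactly why compact, schema-aware formats matter at Hadoop scale: repeating field names per record multiplies storage and I/O costs across billions of records.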
13. SIX CHARACTERISTICS OF ENTERPRISE-GRADE HADOOP
1 HIGH AVAILABILITY: THERE’S NO DOWNTIME. YOUR DATA IS ALWAYS AVAILABLE FOR DECISIONS
2 GRANULAR SECURITY: PROCESS AND CONTROL SENSITIVE DATA WITH CONFIDENCE
3 ROBUST MANAGEMENT: ACHIEVE OPTIMAL PERFORMANCE VIA CENTRALIZED ADMINISTRATION
4 SCALABLE AND EXTENSIBLE: ADAPTS TO YOUR WORKLOAD AND GROWS WITH THE BUSINESS
5 CERTIFIED AND COMPATIBLE: EXTEND AND LEVERAGE EXISTING INFRASTRUCTURE INVESTMENTS
6 GLOBAL SUPPORT AND SERVICES: ACHIEVE SLAs AND ADHERE TO EXISTING IT POLICIES
14. HADOOP PROVIDES A DATA HUB FOR ALL BIG DATA WORKLOADS
• Brings storage and computation together in a single system
• Works with every type of data in its native format
• Changes the economics of data management
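The "computation next to storage" model above is the MapReduce pattern that Hadoop popularized: a map step emits (key, value) pairs from each input split, a shuffle groups pairs by key, and a reduce step aggregates each group. A toy single-process sketch (function names are illustrative, not Hadoop APIs):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big insight", "data hub"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(s) for s in splits)))
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'hub': 1}
```

On a real cluster, each map task runs on the node that already holds its input block, which is what removes the network bottleneck the slide alludes to.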
15. APACHE HADOOP CO-EXISTS WITH EDW, ETL & BI TOOLS
[Architecture diagram: Cloudera Enterprise (Cloudera Manager, Cloudera Support) and Cloudera’s Distribution Including Apache Hadoop (CDH, with Cloudera Manager Free Edition) sit alongside the Enterprise Data Warehouse, IDEs, BI / Analytics tools, reporting applications, enterprise web applications and operational rules engines. Data sources: logs, files, web data, relational databases. Users: operators, engineers, analysts, business users, customers. Services: Cloudera University, Consulting Services, Cloudera Services.]
16. CLOUDERA’S PARTNER ECOSYSTEM: WIDEST INTEGRATION
All the industry leaders develop on CDH.
CDH4: Big Data storage, processing and analytics platform based on Apache Hadoop – 100% open source, spanning storage, computation, access and integration.
Partner categories: BI / Analytics, Data Integration, Database, OS / Cloud / Sys Mgmt, Hardware.
19. Why Hadoop, Why Cloudera, Why Now?
Agenda
✛ RH overview
✛ What is our need
✛ Why our system/data is complicated
✛ How Hadoop meets our needs
20. McKesson Corporation
✛ Largest healthcare company in the world
$103+ billion in revenues; Fortune 15; S&P 500
Est. 1833
Headquarters: San Francisco
✛ Business
Distribution Solutions
Technology Solutions
✛ Extensive resource base
32,000+ employees solely dedicated to healthcare
✛ Comprehensive array of solutions
Significant value through a single relationship
✛ Broadest customer base in healthcare
Experienced partners in improving healthcare
21. Overview of Financial Solutions
200,000 Physicians
2,000 Hospitals
1,900 Payers / Health Plans
Provider-to-Payer Interactions
Total Interactions: 2.4 Billion/Year
22. Business Challenges
✛ Help customers save money
✛ Small reductions to time in AR: big savings, better cash flow
✛ Meet regulatory challenges
> Must store 7 years of transactional data
23. What Big Data Means to RelayHealth
Every single day:
+ millions of transactions generated
+ thousands of files received
+ 150GB+ log data collected
…to be stored for 7 years
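Back-of-the-envelope arithmetic for the log volumes above: 150 GB per day retained for 7 years. The 3x replication factor below is HDFS's default, an assumption on our part rather than something the slide states:

```python
# Inputs from the slide: 150 GB/day of log data, kept for 7 years.
daily_gb = 150
days = 7 * 365

# HDFS replicates each block 3 times by default (assumed here),
# so raw cluster capacity is triple the logical data size.
replication = 3

logical_tb = daily_gb * days / 1024
raw_tb = logical_tb * replication
print(round(logical_tb), round(raw_tb))  # ~374 TB logical, ~1123 TB raw
```

Hundreds of terabytes of raw capacity is exactly the regime where commodity-hardware scale-out beats a SAN on cost, which motivates the next slide.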
24. Why RelayHealth Considered Hadoop
✛ Business requirement around data storage & retrieval
✛ Looked at traditional solutions
> Database: $$$; not easy to index files
> File System: untenable when searching
> Hybrid (File System + Solr): not scalable
25. Achieving Operational Efficiency with Hadoop & Cloudera
✛ Why Hadoop?
> Store billions of files across machines
> Mine data in files using M/R
> Aggregate log data & search through it using unique customer identifying information
> Store data in its highest fidelity state
✛ Why Cloudera?
> Core Apache Hadoop leveraging OSS community
> Integration with other open source solutions: HBase, Solr, Camel
> Committer-level knowledge of code & how it works
> World-class support
> Cloudera Manager
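The log aggregation RelayHealth describes, grouping raw log lines by a customer identifier so later searches only touch one group, reduces to a map-and-group step. The sample lines and field layout below are invented for illustration; nothing here is RelayHealth's actual log format:

```python
from collections import defaultdict

# Hypothetical log lines: "<timestamp> <customer-id> <message>".
logs = [
    "2012-08-01T10:00:01 cust-42 claim submitted",
    "2012-08-01T10:00:05 cust-07 claim submitted",
    "2012-08-01T10:02:11 cust-42 claim rejected",
]

# Group each raw line under its customer identifier, preserving the
# original line so the data stays at full fidelity.
by_customer = defaultdict(list)
for line in logs:
    timestamp, customer_id, _ = line.split(" ", 2)
    by_customer[customer_id].append(line)

print(sorted(by_customer))          # ['cust-07', 'cust-42']
print(len(by_customer["cust-42"]))  # 2
```

At 150 GB/day this grouping would run as a distributed job rather than a loop, but the key choice is the same: the identifier becomes the map output key, and everything about one customer lands in one reducer.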
26. Changing Perception
✛ Simple archive vs. a way to share data across the organization
✛ Building the ability to collect data flowing through our system at all points needed
✛ Integrating CDH into the rest of the enterprise
> Storing data in its highest fidelity state
> Moving away from traditional warehousing systems
> Ability to distill data in the cluster for mining in other systems – CDH connectors
27. Summary
✛ Challenge:
> Shorten healthcare providers’ payment cycles via streamlined message processing
> RDBMS can’t keep up with growing data volumes + data storage mandates for regulatory compliance
✛ Solution:
> Hadoop: scalable, flexible data processing & analysis on multi-structured data
> Cloudera Enterprise: adding expertise, support & management tools to open source Hadoop
29. REGISTER NOW FOR THE REMAINING ‘POWER OF HADOOP’ WEBINARS:
THANK YOU!
WHAT THE HADOOP: WHY YOUR BUSINESS CAN’T AFFORD TO IGNORE THE POWER OF HADOOP
GIGAOM PRO AND CLOUDERA – WEDNESDAY, AUGUST 29, 10AM PST
THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD
451 RESEARCH AND CLOUDERA – THURSDAY, SEPTEMBER 26, 10AM PST
Editor’s notes
http://www.flickr.com/photos/ychi2010/6769591849/sizes/m/in/photostream/
For decades, companies have been making decisions based on transactional data stored in relational databases. Beyond that data lies a potential treasure trove of non-traditional, less structured data that can be mined for useful insight. Decreases in the cost of storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago. As a result, more and more companies are looking to include non-traditional yet potentially valuable data alongside their traditional enterprise data in the analysis process.
FALLBACK
Data science involves looking at data differently. Rather than creating a uniform schema (rows and columns), tools like Hadoop give data scientists the flexibility to store data in a format that fits the question they're trying to answer. This requires an underlying system that's flexible: one that can store and process any type of data, starting with its original raw format and allowing scientists to transform it and apply a schema to suit the particular problem. Data scientists use tools and technologies that can read and write data in compact storage, are fast to read and write, and can be accessed from a wide variety of languages. We use libraries such as Avro, which gives us the flexibility to structure and process data.
Standard pitch from the CDH4 launch… When we talk about bringing Hadoop to the enterprise, there are six essential characteristics or areas that we focus on. High Availability – most customers want to use Hadoop to power mission-critical applications and workflows, so the system must run with maximum uptime to keep all data and processes available to the business. Granular Security – enterprises require the ability to secure sensitive data types as well as control who has access to system resources and when; Cloudera works with the open source community to build these capabilities into the platform and provides simple configuration and enforcement through our management application. Robust Management – Hadoop is a distributed system with many moving parts, and centralized management is critical for a successful implementation. Scalable and Extensible – one of the great things about Hadoop is its massive scalability, and we want to make it easy for you to take advantage of this by integrating your applications with the platform. Certified and Compatible – enterprises have invested significant amounts of time and money in their existing infrastructure (data warehouses, BI applications, etc.), and we want to make sure that Hadoop integrates seamlessly with those technologies. Global Support and Services – as Hadoop becomes a critical component of the data management infrastructure, we want to empower our customers to meet stringent service level agreements and build out their own Hadoop workforce.
Hadoop is an open-source framework for running applications on large clusters of commodity hardware. As a result, it delivers enormous processing power and the ability to handle virtually limitless concurrent tasks and jobs, making it a remarkably low-cost complement to traditional enterprise data infrastructure. Organizations use Hadoop in five ways: 1) a staging area for the data warehouse and analytics store, 2) initial discovery and analysis, 3) storage and analysis of unstructured/semi-structured content, 4) making total data available for analysis, and 5) low-cost storage of large data volumes. With traditional database and data analytics tools, information is stored in neat rows and columns, and there are limits to how much data you can juggle and how quickly. The Hadoop Distributed File System provides an environment for massively parallel processing against large amounts of data. Hadoop changes the dynamics of large-scale computing: you can distribute raw data across a vast cluster of low-cost machines and process that data in the same place you store it. The result is that you can store all your data and analyze it as needed. This is a paradigm shift – merging the power of analytics with the power of Hadoop data storage and processing to get better answers faster. It will significantly improve an organization’s ability to assimilate vast data assets and give it the compute and analytical power to tackle problems and opportunities it never thought possible. As businesses become more analytical to gain competitive advantage and comply with new regulations, enterprise data warehouses are pushed to answer more ad-hoc questions from more people analyzing vastly larger volumes of data, often in real time. Hadoop and next-gen analytic platforms are fundamental building blocks of the architecture needed to compete effectively in a data-driven world. Hadoop is the next wave of strategic enterprise information management.
THE ‘BIG DATA’ SHIFT: “Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That’s different from the ‘single source of truth’ approach to standard BI and data warehousing.” — PwC 2010 Technology Forecast
BRINGS STORAGE AND COMPUTATION TOGETHER IN A SINGLE SYSTEM: process & analyze data in place; remove network bottlenecks; eliminate data migrations.
WORKS WITH EVERY TYPE OF DATA, IN ITS NATIVE FORMAT: no need to fit a single schema; nothing lost through ETL; look at all your data for a comprehensive view.
CHANGES THE ECONOMICS OF DATA MANAGEMENT: OSS + commodity hardware; keep everything online; supercomputing for everyone.
Hadoop is not a single entity. It is a rich, complex, and evolving ecosystem of multiple open source products from Apache. In addition, the ecosystem expands almost daily as more open source and vendor products support or extend Hadoop products and technical approaches. We are a platform company: within our partner ecosystem you get everything you need to leverage big data. Hadoop is now a first-class citizen in the enterprise IT department. With so many key IT vendors “attaching to Hadoop” via the Cloudera Connect program, the penetration of Hadoop-related technologies into the heart of the enterprise analytics environment is accelerated. Coordinating your traditional and Big Data processes takes a vendor that understands both the legacy and modern approaches to data processing. Cloudera is differentiated by its combination of platform + methodology + ecosystem (methodology = data computing).
The possibilities of big data continue to evolve rapidly, driven by innovation in the underlying technologies, platforms, and analytic capabilities for handling data, as well as the evolution of behavior among its users as more and more individuals live and work digital lives. To evolve into an organization that is “data-driven” and competes on data, the business must make better use of data as it moves through daily operations, which demands a radical rethinking of traditional data warehousing and transaction processing. Hadoop leverages several resources that have been outside the information architectures we have today: it brings in new programming languages, new skills, and new data, and is being deployed as a new platform. Think of how it can be used to extend and supplement how we leverage information; the pieces are synergistic if we put them together right. What is possible now that so many of the constraints are removed?
Business Challenges: We need to use all the data we collect to help our customers. Small reductions to time in AR lead to big savings and better cash flow. Relay has an existing suite of Analytics products, but we always want to do more, which means keeping data at much higher fidelity. Regulatory challenges: we need to store these transactions to meet regulatory compliance.
Storage of transaction data: millions of transactions per day; thousands of files coming in, as well as data flowing through web service and direct connection requests. Storage of log data: an average of over 150 GB of log data collected per day. This data is used for troubleshooting customer issues and may be used 30 to 60 days after it is collected.
A project was put in place to meet the business requirement around storage and retrieval of data. We looked at traditional solutions. Database – too costly, and would not allow for easy indexing of files. File system – using enterprise standards (lots of CPUs and SAN), proved to be untenable when searching. Hybrid – file system + Solr; we did not investigate this very thoroughly, as there were issues around working with that volume of data.