IBM Leads Way with Hadoop and Spark, 2015 May 15
- 1. © 2015 IBM Corporation
IBM Leads the Way with Hadoop and Spark
The Keys to Getting Value out of Big Data
- 2. © 2015 IBM Corporation
IBM’s Framework for Getting Value out of Big Data
All agree on Big Data’s potential, but wide divergence on how to exploit it
Pioneers who have started to harness Big Data have benefited greatly
We see Big Data adoption as a continual process – maturity levels
IBM’s approach enables faster adoption of Big Data technologies
Open source innovation (Hadoop, Spark)
Standards-based technologies (ODP, SQL, R)
Familiar interfaces and integration with established tools (IBM innovations)
Advanced analytics (IBM innovations)
IBM’s commitment to continued innovation
- 3. © 2015 IBM Corporation
Hadoop and Spark Offer Significant Business Benefits
Operations | Data Warehousing | Line of Business and Analytics | New Business Imperatives
Axes: Big Data Maturity (Low to High), Value (Low to High)
Data-Informed Decision Making
• Full dataset analysis (no more sampling)
• Extract value from non-relational data
• 360° view of all enterprise data
• Exploratory analysis and discovery
Warehouse Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive and staging
Lower the Cost of Storage
Business Transformation
• Create new business models
• Risk-aware decision making
• Fight fraud and counter threats
• Optimize operations
• Attract, grow, retain customers
- 4. © 2015 IBM Corporation
IBM Investing in Four Catalysts for Big Data Adoption
• Open Source Innovation
• Technical Standards
• Familiar Interfaces & Integration with Established Tools
• New Analytics Capabilities
- 5. © 2015 IBM Corporation
Hadoop Advantages
Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users
Enterprise Platform
• Reliability
• Resiliency
• Security
Wide Range of Data Formats
• Files
• Semi-structured
• Databases
- 6. © 2015 IBM Corporation
Hadoop MapReduce Challenges
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
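The point that application tasks write to disk with each cycle can be illustrated with a toy, single-process word count in plain Python. This is not Hadoop's API: `map_phase`, `reduce_phase`, and the JSON spill files are hypothetical names chosen for illustration. The sketch only shows each phase handing its output to the next through files rather than through memory.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_phase(lines, out_path):
    """Emit (word, 1) pairs and spill them to disk, as a map task would."""
    pairs = [(word.lower(), 1) for line in lines for word in line.split()]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(pairs, f)


def reduce_phase(in_path, out_path):
    """Read the spilled pairs back from disk, group by word, and sum."""
    with open(in_path, encoding="utf-8") as f:
        pairs = json.load(f)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(counts, f)


def word_count(lines):
    """Run one map/reduce cycle; every hand-off goes through a file."""
    with tempfile.TemporaryDirectory() as tmp:
        mapped = os.path.join(tmp, "map_output.json")
        reduced = os.path.join(tmp, "reduce_output.json")
        map_phase(lines, mapped)
        reduce_phase(mapped, reduced)
        with open(reduced, encoding="utf-8") as f:
            return json.load(f)


if __name__ == "__main__":
    print(word_count(["Hadoop and Spark", "Spark on Hadoop"]))
```

A job that chains several such cycles repeats the write and read for every cycle, which is the cost the slide attributes to the lack of an in-memory framework.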
- 7. © 2015 IBM Corporation
Spark Advantages
Ease of Development
• Easier APIs
• Python, Scala, Java
In-Memory Performance
• Resilient Distributed Datasets
• Unify processing
Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch
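The in-memory idea behind Resilient Distributed Datasets can be sketched with a small plain-Python stand-in. `ToyRDD` is a hypothetical class written for this illustration, not the PySpark API; it only mimics two behaviours: transformations are recorded lazily, and a cached dataset is computed once and then reused from memory by later steps.

```python
import functools


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable returning an iterable
        self._cache_enabled = False
        self._cached = None            # materialized records after first use

    def map(self, fn):
        # Lazy: nothing runs until an action such as collect() is called.
        return ToyRDD(lambda: (fn(x) for x in self._records()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._records() if pred(x)))

    def cache(self):
        self._cache_enabled = True
        return self

    def _records(self):
        if self._cache_enabled:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        return self._compute()

    def collect(self):
        return list(self._records())

    def reduce(self, fn):
        return functools.reduce(fn, self._records())


def demo():
    reads = []

    def load():
        reads.append(1)                # count each pass over the source data
        return iter(range(1, 11))

    squares = ToyRDD(load).map(lambda x: x * x).cache()
    evens = squares.filter(lambda x: x % 2 == 0).collect()
    total = squares.reduce(lambda a, b: a + b)
    return evens, total, len(reads)


if __name__ == "__main__":
    print(demo())
```

In `demo`, two different computations reuse the cached `squares` dataset, so the source is read only once. Removing the `cache()` call makes the second computation read the source again.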
- 8. © 2015 IBM Corporation
Spark Libraries
Apache Spark: Spark SQL, Spark Streaming, GraphX, MLlib, SparkR
- 9. © 2015 IBM Corporation
Spark on Hadoop
Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
Resource management: Apache Hadoop-YARN
Storage management: Apache Hadoop-HDFS
Nodes: Slave node 1, Slave node 2, … Slave node n
- 10. © 2015 IBM Corporation
Spark on Mesos
Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
Resource management: Apache Mesos
Storage management: Apache Hadoop-HDFS
Nodes: Slave node 1, Slave node 2, … Slave node n
- 11. © 2015 IBM Corporation
Spark as a Service
Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
Resource management: Apache Hadoop-YARN
Storage management: Amazon S3
Nodes: Amazon EC2 node 1, Amazon EC2 node 2, … Amazon EC2 node n
- 12. © 2015 IBM Corporation
Spark on the Amazon Cloud
Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
Resource management: Apache Hadoop-YARN
Storage management: Amazon S3
Nodes: Amazon EC2 node 1, Amazon EC2 node 2, … Amazon EC2 node n
- 13. © 2015 IBM Corporation
Spark Running in Standalone Mode
Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
Resource management and storage management: Single node, with local storage
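The deployment options on the preceding slides differ mainly in which resource manager and storage layer sit under Spark. The sketch below shows how that choice typically surfaces as the `--master` argument to `spark-submit`. The master URL forms (`yarn`, `mesos://`, `spark://`, `local[*]`) come from the Apache Spark documentation, but the exact strings depend on the Spark version (for example, Spark 1.x releases used `yarn-client` and `yarn-cluster`). The host names, ports, storage URIs, and the `DEPLOYMENTS` table itself are placeholders assumed for illustration, not values from the slides.

```python
# Each entry pairs a Spark master URL with an example storage URI.
# Host names, ports, bucket names, and paths are placeholders.
DEPLOYMENTS = {
    "hadoop-yarn": {
        "master": "yarn",
        "storage": "hdfs://namenode.example.com:8020/data",
    },
    "mesos": {
        "master": "mesos://mesos-master.example.com:5050",
        "storage": "hdfs://namenode.example.com:8020/data",
    },
    "amazon-cloud": {
        "master": "yarn",
        "storage": "s3a://example-bucket/data",
    },
    "single-node": {
        "master": "local[*]",          # run on one machine, using all cores
        "storage": "file:///tmp/data",
    },
}


def spark_submit_command(deployment, app_path):
    """Build the argument list for spark-submit for a named deployment."""
    settings = DEPLOYMENTS[deployment]
    return [
        "spark-submit",
        "--master", settings["master"],
        app_path,
        settings["storage"],           # passed to the application as its input path
    ]


if __name__ == "__main__":
    for name in DEPLOYMENTS:
        print(" ".join(spark_submit_command(name, "app.py")))
```

The application code stays the same across the four cases; only the master URL and the storage URI change.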
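- 14. © 2015 IBM Corporation
Spark Resilient Distributed Datasets
Spark RDD: in-memory distribution of RDD1, RDD2, and RDD3 as partitions across the slave nodes
HDFS: on-disk distribution of file blocks across the same nodes
Slave node 1: RDD partitions partition3, partition1, partition2; HDFS blocks c3, d2, a2, b1
Slave node 2: RDD partitions partition1, partition3; HDFS blocks c2, d1, a1, b2
Slave node 3: RDD partitions partition2, partition2, partition1; HDFS blocks c1, d2, a3, b3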
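How a dataset might be split into partitions and spread over nodes can be sketched in a few lines of plain Python. The functions below are hypothetical helpers for illustration only. The hash partitioning and round-robin placement are assumptions chosen for simplicity, and Spark's own scheduler also takes data locality into account.

```python
import zlib


def partition_of(key, num_partitions):
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_partitions(records, num_partitions):
    """Split (key, value) records into num_partitions lists by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


def place_on_nodes(num_partitions, nodes):
    """Assign partition indexes to node names in round-robin order."""
    placement = {node: [] for node in nodes}
    for index in range(num_partitions):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


if __name__ == "__main__":
    records = [("a1", 10), ("b2", 20), ("c3", 30), ("d1", 40)]
    nodes = ["Slave node 1", "Slave node 2", "Slave node 3"]
    print(build_partitions(records, 3))
    print(place_on_nodes(3, nodes))
```

Because each partition is an independent list, a node can process its partitions without coordinating with the others until results are combined.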
- 15. © 2015 IBM Corporation
The Combination: The Flexibility of Spark on a Stable Hadoop Platform
In-Memory Performance
Ease of Development
Combine Workflows
Unlimited Scale
Enterprise Platform
Wide Range of Data Formats
- 16. © 2015 IBM Corporation
IBM Open Platform with Apache Hadoop
100% open source code
Commitment to currency: “days, not months”
Includes Spark
Free for production use
Decoupled Apache Hadoop from IBM analytics and data science technologies
Production support offering available
Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
IBM Open Platform with Apache Hadoop
- 17. © 2015 IBM Corporation
IBM is Committed to Open Source
Open source technologies are the base for IBM software and solutions
IBM’s long history of deep open source commitment
Apache Software Foundation: Founding member in 1999
Cloud Foundry: #1 contributor; Basis for Bluemix
OpenStack: #4 contributor; Basis for IBM’s IaaS
Linux: #3 contributor; IBM first enterprise backer of Linux
Hadoop/Spark: Extensive investment in open source contribution; Integration with Analytics software
Infrastructure
Systems
Application
- 18. © 2015 IBM Corporation
Goal of the Apache Software Foundation: Let 1000 Flowers Bloom!
• 249 Top Level Projects, 40 Incubating
• 2 Million+ Code Commits
• IBM co-founded the ASF in 1999 and is a Gold Sponsor
• The “Apache Way” is about fostering open innovation
• Not a standards organization
- 19. © 2015 IBM Corporation
Apache Hadoop Ecosystem: Rapid Innovation, Few Standards
Distributions include different projects at different version levels
“This proliferation of baskets [Hadoop distributions with different project versions] creates significant drag when it comes to building reliable applications ... makes it harder for customers to assess which basket of Hadoop that they need and harder for application developers to create solutions that work broadly.”
- Raymie Stata, CEO, Altiscale
Even though the project versions match, there are interface differences
“Setting a baseline of Hive 13 so we get access to some new syntax. Try it on one, it works great... Try it on another that says it also has Hive 13, and we get ‘syntax error’ …”
- Craig Rubendall, VP, SAS
“If the industry is truly committed to developing big data technologies and solutions …, it will require an ecosystem of providers … to create a consistent framework around which everyone can develop.”
- Siki Giunta, SVP, Verizon
The Hadoop ecosystem is evolving at a faster pace than is comfortable
“My personal speculation is that it comes from some who have been evaluating for a while seeing change occur so rapidly that they are dropping back for another look.”
- Merv Adrian, VP, Gartner
- 20. © 2015 IBM Corporation
• Certify a standard “ODP Core” set of open source Hadoop family projects with specific versions and patch levels
• Develop tools and methods to help solution providers test applications against the ODP Core
• Contribute changes and fixes in the ODP Core Hadoop family projects to the ASF using the ASF processes
http://opendataplatform.org/
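Certifying projects at specific versions and patch levels amounts to comparing a distribution's component versions against a fixed baseline. A minimal sketch of such a check follows. The function name, the project keys, and every version number are made-up placeholders for illustration. They are not the actual ODP Core contents, and this is not an ODP tool.

```python
# Placeholder baseline: project names follow the slides, versions are invented.
HYPOTHETICAL_CORE = {
    "hdfs": "1.0.0",
    "yarn": "1.0.0",
    "mapreduce": "1.0.0",
    "ambari": "1.0.0",
}


def check_against_core(core, distribution):
    """Return a list of mismatches; an empty list means the versions match."""
    problems = []
    for project, required in sorted(core.items()):
        actual = distribution.get(project)
        if actual is None:
            problems.append(f"{project}: missing (core requires {required})")
        elif actual != required:
            problems.append(f"{project}: found {actual}, core requires {required}")
    return problems


if __name__ == "__main__":
    candidate = {"hdfs": "1.0.0", "yarn": "1.0.1", "mapreduce": "1.0.0"}
    for line in check_against_core(HYPOTHETICAL_CORE, candidate):
        print(line)
```

A version match is only a first step. As the SAS quote on the previous slide notes, two distributions reporting the same project version can still behave differently, which is why the ODP goals also include tools for testing applications against the core.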
- 21. © 2015 IBM Corporation
Open Data Platform Initiative
Representation across the Hadoop ecosystem…
• Hadoop distribution vendors
• Software application providers
• System integrators/consultants
• Hardware vendors
• Customers
… who all believe in the need for a community-based effort to standardize Hadoop, which will lead to improved adoption
- 22. © 2015 IBM Corporation
IBM Open Platform with Apache Hadoop adopts ODP Core
BigInsights will include ODP certified Apache packages
ODP will initially target core packages of a Hadoop distribution
Packages will expand over time
First certification set expected in summer 2015
Our goal for BigInsights on ODP
Better compatibility and less testing against ecosystem software
Enable IBM Hadoop capabilities to run on other ODP-certified Hadoop distributions
Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
ODP: candidate set of certified ODP modules, expected summer 2015
IBM Open Platform with Apache Hadoop
- 23. © 2015 IBM Corporation
Goal of the ODP: Enable Innovation to Flourish on a Common Platform
• Complements the Apache Software Foundation’s governance model
• ODP efforts focus on integration, testing, and certifying a standard core of Apache Hadoop ecosystem projects
• Fixes for issues found in ODP testing will be contributed to the ASF projects in line with ASF processes
• The ODP will not override or replace any aspect of ASF governance
- 24. © 2015 IBM Corporation
IBM BigInsights for Apache Hadoop
IBM BigInsights Analyst: Big SQL, BigSheets
IBM BigInsights Data Scientist: Text Analytics, Machine Learning with Big R, Big R, Big SQL, BigSheets
IBM BigInsights Enterprise Management: POSIX Distributed File System; Multi-workload, Multi-tenant scheduling
IBM Open Platform with Apache Hadoop
- 25. © 2015 IBM Corporation
IBM BigInsights for Apache Hadoop
IBM System z, IBM Power, Intel Servers, On Cloud
Your choice of infrastructure and deployment model
- 26. © 2015 IBM Corporation
IBM Analytic Platform Capabilities
IBM Software Integrates and Extends Hadoop and Spark
Data Warehousing: PureData for Analytics, Operational Analytics
Entity Extraction and Matching: Big Match
Security and Compliance: Optim, Guardium Audit and Encryption
Data Integration and Governance: Information Server
Enterprise Search: Watson Explorer
Real-time Analytics: Streams
Predictive Modeling and Descriptive Statistics: SPSS, Big R and Scalable Algorithms
Analysis, Reporting, and Exploration: Watson Analytics, Cognos, BigSheets
Fast, ANSI SQL 2011, and Secure SQL: Big SQL
Enterprise File System: GPFS-FPO
Cluster Resource and Workload Management: Platform Symphony
Large Scale Text Extraction: Big Text
IBM Open Platform with Apache Hadoop
- 27. © 2015 IBM Corporation
IBM Leads the Market and Analysts Agree
“IBM’s all-in bet on Apache Hadoop clearly has had the biggest impact among developers we polled”
- Evans Big Data Survey
Leading Hadoop Distribution
Leading Streaming Analytics Solution
- 28. © 2015 IBM Corporation
IBM’s Investment in the Big Data Community
Over 250,000 benefit from free Big Data skills training
http://bigdatauniversity.com
- 29. © 2015 IBM Corporation
Spark Technology Center
Focal point for IBM investment in Spark
Code contributions to Apache Spark project
Build industry solutions using Spark
Evangelize Spark technology inside/outside IBM
Agile engagement across IBM divisions
Systems: contribute enhancements to Spark core, and optimized infrastructure (hardware/software) for Spark
Analytics: IBM Analytics software will exploit Spark processing
Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improve systems that execute Spark) the Spark stack
Goal: To be the #1 contributor and adopter in the Spark ecosystem
- 30. © 2015 IBM Corporation
The IBM Difference
IBM delivers the foundation for Big Data – now and in the future
Embraces open source
Establishes standards
Integrates with familiar interfaces and established systems
Delivers advanced analytic capabilities
Enables you to benefit from broader data and analytics capabilities
Data Integration and Governance
Predictive and Real-time Analytics
Provides expertise to help you on your journey
6,000 partners
Analytics services and solution centers