IBM Leads the Way with Hadoop and Spark (May 15, 2015)

© 2015 IBM Corporation
IBM Leads the Way with Hadoop and Spark
The Keys to Getting Value out of Big Data

IBM’s Framework for Getting Value out of Big Data
• All agree on Big Data’s potential, but views diverge widely on how to exploit it
• Pioneers who have started to harness Big Data have benefited greatly
• We see Big Data adoption as a continual process across maturity levels
• IBM’s approach enables faster adoption of Big Data technologies
  • Open source innovation (Hadoop, Spark)
  • Standards-based technologies (ODP, SQL, R)
  • Familiar interfaces and integration with established tools (IBM innovations)
  • Advanced analytics (IBM innovations)
• IBM’s commitment to continued innovation

Hadoop and Spark Offer Significant Business Benefits
[Figure: value (low to high) plotted against Big Data maturity (low to high), spanning Operations, Data Warehousing, Line of Business and Analytics, and New Business Imperatives]
Lower the Cost of Storage
Warehouse Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive and staging
Data-Informed Decision Making
• Full dataset analysis (no more sampling)
• Extract value from non-relational data
• 360° view of all enterprise data
• Exploratory analysis and discovery
Business Transformation
• Create new business models
• Risk-aware decision making
• Fight fraud and counter threats
• Optimize operations
• Attract, grow, retain customers

IBM Investing in Four Catalysts for Big Data Adoption
• Open Source Innovation
• Technical Standards
• Familiar Interfaces & Integration with Established Tools
• New Analytics Capabilities

Hadoop Advantages
Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users
Enterprise Platform
• Reliability
• Resiliency
• Security
Wide Range of Data Formats
• Files
• Semi-structured
• Databases

Hadoop MapReduce Challenges
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
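
The rigid processing model listed above can be pictured as three fixed phases: map, shuffle, and reduce. What follows is a minimal plain-Python sketch of that model for a word count. It does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample lines are illustrative assumptions. A real Hadoop job would also write intermediate data to disk between phases, which this in-process toy does not attempt to reproduce.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    # Every job has to be expressed as exactly this map -> shuffle -> reduce chain.
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, for illustration only.
sample_lines = ["big data big value", "spark on hadoop", "big hadoop"]
counts = word_count(sample_lines)
```

Anything that does not fit one map step followed by one reduce step, such as an iterative algorithm, has to be written as a chain of such jobs, which is the limitation the slide points at.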

Spark Advantages
Ease of Development
• Easier APIs
• Python, Scala, Java
• Resilient Distributed Datasets
• Unify processing
Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch

Spark Libraries
Apache Spark
• Spark SQL
• Spark Streaming
• GraphX
• MLlib
• SparkR

Spark on Hadoop
• Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
• Resource management: Apache Hadoop-YARN
• Storage management: Apache Hadoop-HDFS
• Nodes: Slave node 1, Slave node 2, … Slave node n

Spark on Mesos
• Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
• Resource management: Apache Mesos
• Storage management: Apache Hadoop-HDFS
• Nodes: Slave node 1, Slave node 2, … Slave node n

Spark as a Service
• Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
• Resource management: Apache Hadoop-YARN
• Storage management: Amazon S3
• Nodes: Amazon EC2 node 1, Amazon EC2 node 2, … Amazon EC2 node n

Spark on the Amazon Cloud
• Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
• Resource management: Apache Hadoop-YARN
• Storage management: Amazon S3
• Nodes: Amazon EC2 node 1, Amazon EC2 node 2, … Amazon EC2 node n

Spark Running in Standalone Mode
• Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
• Resource management and storage management: Single node, with local storage

Spark Resilient Distributed Datasets
[Figure: three slave nodes. On disk, HDFS spreads file blocks across the nodes (node 1: c3, d2, a2, b1; node 2: c2, d1, a1, b2; node 3: c1, d2, a3, b3). In memory, the partitions of RDD1, RDD2, and RDD3 are spread across the same nodes.]
• Spark RDD: In-memory distribution
• HDFS: On-disk distribution
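
To make the idea of partitions held in memory concrete, here is a toy plain-Python model, not the Spark API. The class `ToyRDD` and its methods are hypothetical names invented for this sketch. It assumes each dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt from that lineage. Unlike Spark, this toy evaluates `map` eagerly and runs in a single process.

```python
class ToyRDD:
    """Toy stand-in for an RDD: in-memory partitions plus the lineage to rebuild them."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # dataset this one was derived from, if any
        self.func = func              # per-element function applied to the parent

    @classmethod
    def from_list(cls, data, num_partitions):
        # Deal elements round-robin into partitions, like blocks spread over nodes.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, func):
        # Eager here for simplicity; the new dataset keeps a link to its parent.
        parts = [[func(x) for x in self.get_partition(i)]
                 for i in range(len(self.partitions))]
        return ToyRDD(parts, parent=self, func=func)

    def lose_partition(self, index):
        # Simulate a node failure that drops one in-memory partition.
        self.partitions[index] = None

    def get_partition(self, index):
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; no lineage to rebuild from")
            # Rebuild only the missing partition by replaying the lineage.
            rebuilt = [self.func(x) for x in self.parent.get_partition(index)]
            self.partitions[index] = rebuilt
        return self.partitions[index]

    def collect(self):
        return [x for i in range(len(self.partitions))
                for x in self.get_partition(i)]


rdd1 = ToyRDD.from_list(list(range(1, 10)), num_partitions=3)
rdd2 = rdd1.map(lambda x: x * 10)
rdd3 = rdd2.map(lambda x: x + 1)

# Drop the same partition from both derived datasets, then recover it.
rdd2.lose_partition(1)
rdd3.lose_partition(1)
recovered = rdd3.get_partition(1)
```

In this sketch only partition 1 is recomputed, walking back through `rdd2` to `rdd1`; the other partitions stay in memory untouched.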

The Combination: The Flexibility of Spark on a Stable Hadoop Platform
• Spark: Ease of Development, Combine Workflows
• Hadoop: Unlimited Scale, Enterprise Platform, Wide Range of Data Formats

IBM Open Platform with Apache Hadoop
• 100% open source code
• Commitment to currency: “days, not months”
• Includes Spark
• Free for production use
• Decoupled Apache Hadoop from IBM analytics and data science technologies
• Production support offering available
Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene

IBM is Committed to Open Source
• Open source technologies are the base for IBM software and solutions
• IBM’s long history of deep open source commitment
  • Apache Software Foundation: Founding member in 1999
  • Cloud Foundry: #1 contributor; basis for Bluemix
  • OpenStack: #4 contributor; basis for IBM’s IaaS
  • Linux: #3 contributor; IBM first enterprise backer of Linux
  • Hadoop/Spark: Extensive investment in open source contribution; integration with analytics software
[Figure labels: Infrastructure, Systems, Application]

Goal of the Apache Software Foundation: Let 1000 Flowers Bloom!
• 249 Top Level Projects, 40 Incubating
• 2 Million+ Code Commits
• IBM co-founded the ASF in 1999 and is a Gold Sponsor
• The “Apache Way” is about fostering open innovation
• Not a standards organization

Apache Hadoop Ecosystem: Rapid Innovation, Few Standards
• Distributions include different projects at different version levels
“This proliferation of baskets [Hadoop distributions with different project versions] creates significant drag when it comes to building reliable applications ... makes it harder for customers to assess which basket of Hadoop that they need and harder for application developers to create solutions that work broadly.”
– Raymie Stata, CEO, Altiscale
• Even when the project versions match, there are interface differences
“Setting a baseline of Hive 13 so we get access to some new syntax. Try it on one, it works great... Try it on another that says it also has Hive 13, and we get ‘syntax error’ …”
– Craig Rubendall, VP, SAS
“If the industry is truly committed to developing big data technologies and solutions …, it will require an ecosystem of providers … to create a consistent framework around which everyone can develop.”
– Siki Giunta, SVP, Verizon
• The Hadoop ecosystem is evolving at a faster pace than is comfortable
“My personal speculation is that it comes from some who have been evaluating for a while seeing change occur so rapidly that they are dropping back for another look.”
– Merv Adrian, VP, Gartner

• Certify a standard “ODP Core” set of open source Hadoop family projects with specific versions and patch levels
• Develop tools and methods to help solution providers test applications against the ODP Core
• Contribute changes and fixes in the ODP Core Hadoop family projects to the ASF using the ASF processes
http://opendataplatform.org/
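
One way to picture what testing against a fixed core with specific versions and patch levels could involve is a simple manifest comparison. The sketch below is plain Python and is not an ODP tool; the function `check_against_core`, the component selection, and every version number are hypothetical placeholders chosen for illustration.

```python
def check_against_core(core, distribution):
    """Return a list of mismatches between a distribution and a core baseline."""
    problems = []
    for component, required in core.items():
        shipped = distribution.get(component)
        if shipped is None:
            problems.append(f"{component}: missing (core requires {required})")
        elif shipped != required:
            problems.append(f"{component}: ships {shipped}, core requires {required}")
    return problems


# Hypothetical baseline and distributions; these versions are NOT real ODP data.
core_baseline = {"hdfs": "1.0.0", "yarn": "1.0.0", "mapreduce": "1.0.0"}
distro_a = {"hdfs": "1.0.0", "yarn": "1.0.0", "mapreduce": "1.0.0", "spark": "9.9.9"}
distro_b = {"hdfs": "1.0.0", "yarn": "1.0.1"}

report_a = check_against_core(core_baseline, distro_a)
report_b = check_against_core(core_baseline, distro_b)
```

Here `distro_a` passes because extra components outside the core are ignored, while `distro_b` is flagged for a patch-level difference and a missing component. A matching version string alone would not catch the interface differences described in the quotes earlier, which is why the goals above also call for test tooling.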

Open Data Platform Initiative
Representation across the Hadoop ecosystem…
• Hadoop distribution vendors
• Software application providers
• System integrators/consultants
• Hardware vendors
• Customers
… who all believe in the need for a community-based effort to standardize Hadoop, which will lead to improved adoption

IBM Open Platform with Apache Hadoop adopts ODP Core
• BigInsights will include ODP-certified Apache packages
  • ODP will initially target core packages of a Hadoop distribution
  • Packages will expand over time
  • First certification set expected this summer
• Our goal for BigInsights on ODP
  • Better compatibility and less testing against ecosystem software
  • Enable IBM Hadoop capabilities to run on other ODP-certified Hadoop distributions
Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
[Figure: a subset of these components is marked ODP*]
* Candidate set of certified ODP modules – expected summer 2015

Goal of the ODP: Enable Innovation to Flourish on a Common Platform
• Complements the Apache Software Foundation’s governance model
• ODP efforts focus on integration, testing, and certifying a standard core of Apache Hadoop ecosystem projects
• Fixes for issues found in ODP testing will be contributed to the ASF projects in line with ASF processes
• The ODP will not override or replace any aspect of ASF governance

IBM BigInsights for Apache Hadoop
• IBM BigInsights Analyst: Big SQL, BigSheets
• IBM BigInsights Data Scientist: Big SQL, BigSheets, Big R, Machine Learning with Big R, Text Analytics
• IBM BigInsights Enterprise Management: POSIX Distributed File System; Multi-workload, Multi-tenant scheduling

IBM BigInsights for Apache Hadoop
Your choice of infrastructure and deployment model
• IBM System z
• IBM Power
• Intel Servers
• On Cloud

IBM Analytic Platform Capabilities
IBM Software Integrates and Extends Hadoop and Spark
• Data Warehousing: PureData for Analytics, Operational Analytics
• Entity Extraction and Matching: Big Match
• Security and Compliance: Optim, Guardium Audit and Encryption
• Data Integration and Governance: Information Server
• Enterprise Search: Watson Explorer
• Real-time Analytics: Streams
• Predictive Modeling and Descriptive Statistics: SPSS, Big R and Scalable Algorithms
• Analysis, Reporting, and Exploration: Watson Analytics, Cognos, BigSheets
• Fast, ANSI SQL 2011, and Secure SQL: Big SQL
• Enterprise File System: GPFS-FPO
• Cluster Resource and Workload Management: Platform Symphony
• Large Scale Text Extraction: Big Text

IBM Leads the Market and Analysts Agree
“IBM’s all-in bet on Apache Hadoop clearly has had the biggest impact among developers we polled”
– Evans Big Data Survey
[Figures: Leading Hadoop Distribution; Leading Streaming Analytics Solution]

IBM’s Investment in the Big Data Community
Over 250,000 benefit from free Big Data skills training
http://bigdatauniversity.com

Spark Technology Center
• Focal point for IBM investment in Spark
  • Code contributions to Apache Spark project
  • Build industry solutions using Spark
  • Evangelize Spark technology inside/outside IBM
• Agile engagement across IBM divisions
  • Systems: contribute enhancements to Spark core, and optimized infrastructure (hardware/software) for Spark
  • Analytics: IBM Analytics software will exploit Spark processing
  • Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improve systems that execute Spark) the Spark stack
Goal: To be the #1 contributor and adopter in the Spark ecosystem

The IBM Difference
• IBM delivers the foundation for Big Data – now and in the future
  • Embraces open source
  • Establishes standards
  • Integrates with familiar interfaces and established systems
  • Delivers advanced analytic capabilities
• Enables you to benefit from broader data and analytics capabilities
  • Data Integration and Governance
  • Predictive and Real-time Analytics
• Provides expertise to help you on your journey
  • 6,000 partners
  • Analytics services and solution centers
