© Hortonworks Inc. 2013
Apache Hadoop for Big Science
History, Use cases & Futures
Eric Baldeschwieler, “Eric14”
Hortonworks CTO
@jeric14
Agenda
• What is Apache Hadoop
• Project motivation & history
• Use cases
• Futures and observations
What is Apache Hadoop?
Traditional data systems vs. Hadoop

Traditional data systems:
–Limited scaling options
–Expensive at scale
–Complex components
–Proprietary software
–Reliability in hardware
–Optimized for latency, IOPS

Hadoop cluster:
–Low-cost scale-out
–Commodity components
–Open source software
–Reliability in software
–Optimized for throughput

When your data infrastructure does not scale … Hadoop
Apache Hadoop: Big Data Platform
Open source data management with scale-out storage & distributed processing

Storage: HDFS
• Distributed across a cluster
• Natively redundant, self-healing
• Very high bandwidth

Processing: MapReduce
• Splits a job into small tasks and
moves compute “near” the data
• Self-Healing
• Simple programming model
Key Characteristics
• Scalable
– Efficiently store and process
petabytes of data
– Scale out linearly by adding nodes
(node == commodity computer)
• Reliable
– Data replicated 3x
– Failover across nodes and racks
• Flexible
– Store all types of data in any format
• Economical
– Commodity hardware
– Open source software (via ASF)
– No vendor lock-in
(From Richard McDougall, VMware, Hadoop Summit 2012 talk)
Hadoop’s cost advantage

              SAN Storage       NAS Filers        Local Storage
Cost          $2 - $10/GB       $1 - $5/GB        $0.05/GB
$1M gets      0.5 petabytes     1 petabyte        20 petabytes
              1,000,000 IOPS    400,000 IOPS      10,000,000 IOPS
              1 GB/sec          2 GB/sec          800 GB/sec
Hadoop hardware
• 10 to 4500 node
clusters
–1-4 “master nodes”
–Interchangeable workers
• Typical node
–4-12 * 2-4TB SATA
–64GB RAM
–2 * 4-8 core, ~2GHz
–2 * 1Gb NICs
–Single power supply
–JBOD, not RAID, …
• Switches
–1-2 Gb to the node
–~20 Gb to the core
–Full bisection bandwidth
–Layer 2 or 3, simple
Zooming out: An Apache Hadoop Platform
Deployment options: Appliance | Cloud | OS / VM

HORTONWORKS DATA PLATFORM (HDP)
• PLATFORM SERVICES – Enterprise readiness: HA, DR, snapshots, security, …
• HADOOP CORE – Distributed storage & processing: HDFS, MAPREDUCE
• DATA SERVICES – Store, process and access data: HCATALOG, HIVE, PIG, HBASE, SQOOP, FLUME
• OPERATIONAL SERVICES – Manage & operate at scale: OOZIE, AMBARI
Zooming out: A Big Data Architecture

DATA SOURCES – Mobile data; OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media)
DATA SYSTEMS – Traditional repos (RDBMS, EDW, MPP) alongside the HORTONWORKS DATA PLATFORM
APPLICATIONS – Business analytics, custom applications, packaged applications
OPERATIONAL TOOLS – Manage & monitor
DEV & DATA TOOLS – Build & test
Motivation and History
[Chart spanning 2007–2010; source: The Datagraph Blog]
Eric Baldeschwieler - CTO Hortonworks
• 2011 – now Hortonworks - CTO
• 2006 – 2011 Yahoo! - VP Engineering, Hadoop
• 2003 – 2005 Yahoo! – Web Search Engineering
- Built systems that crawl & index the web
• 1996 – 2003 Inktomi – Web Search Engineering
- Built systems that crawl & index the web
• Previously
– UC Berkeley – Masters CS
– Video Game Development
– Digital Video & 3D rendering software
– Carnegie Mellon – BS Math/CS
Early history
• 1995 – 2005
–Yahoo! search team builds 4+ generations of systems to crawl &
index the WWW. 20 Billion pages!
• 2004
–Google publishes Google File System & MapReduce papers
• 2005
–Doug Cutting builds Nutch DFS & MapReduce, joins Yahoo!
–Yahoo! search commits to build open source DFS & MapReduce
– Compete / Differentiate via Open Source contribution!
– Attract scientists – Become known center of big data excellence
– Avoid building proprietary systems that will be obsolesced
– Gain leverage of wider community building one infrastructure
• 2006
–Hadoop is born!
– Dedicated team under E14 staffed at Yahoo!
– Nutch prototype used to seed new Apache Hadoop project
Hadoop at Yahoo!
Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/
Hortonworks – 100% Open Source
• We distribute the only 100%
Open Source Enterprise
Hadoop Distribution:
Hortonworks Data
Platform
• We engineer, test & certify
HDP for enterprise usage
• We employ the core
architects, builders and
operators of Apache Hadoop
• We drive innovation within
Apache Software
Foundation projects
• We are uniquely positioned
to deliver the highest quality
of Hadoop support
• We enable the ecosystem to
work better with Hadoop
Develop Distribute Support
We develop, distribute and support
the ONLY 100% open source
Enterprise Hadoop distribution
Endorsed by Strategic Partners
Headquarters: Palo Alto, CA
Employees: 200+ and growing
Investors: Benchmark, Index, Yahoo
CASE STUDY
YAHOO SEARCH ASSIST™
                   Before Hadoop   After Hadoop
Time               26 days         20 minutes
Language           C++             Python
Development Time   2-3 weeks       2-3 days

• Database for Search Assist™ is built using Apache Hadoop
• Several years of log-data
• 20 steps of MapReduce
Apache Hadoop Ecosystem History
• 2006 – present: …, early adopters – scale and productize Hadoop
• 2008 – present: other Internet companies – add tools / frameworks, enhance Hadoop
• 2010 – present: service providers – provide training, support, hosting
• 2011 – present: wide adoption – funds further development, enhancements
(Cloudera, MapR, Microsoft, IBM, EMC, Oracle, …)
Use cases
Use-case: Full genome sequencing
• The data
–1 full genome = 1TB (raw uncompressed)
–1M people sequenced = 1 Exabyte
–Cost per 1 person = $1000 and continues to drop
• Uses for Hadoop:
–Large scale compute applications:
– Map NGS data (“reads”) to a reference genome
– Used for drug development, personalized treatment
– Community-developed Hadoop-based software for gene matching: CloudBurst, Crossbow
–Store, manage and share genomics data in the bio-informatics
community
See: http://hortonworks.com/blog/big-data-in-genomics-and-cancer-treatment
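CloudBurst and Crossbow cast read mapping as MapReduce "seed-and-extend": reads and reference are cut into fixed-length seeds, and the shuffle brings identical seeds together so alignments can be extended where they match. A toy streaming mapper for the seed-emission half (a minimal sketch; the tab-separated record format and the seed length are assumptions for illustration, not the tools' actual formats):

```python
#!/usr/bin/env python
# Toy seed-emission mapper in the CloudBurst "seed-and-extend" style.
# Assumed input (an illustration, not CloudBurst's real format):
#   <read_id> TAB <sequence>
import sys

SEED_LEN = 20  # fixed seed length; real tools derive it from error bounds

for line in sys.stdin:
    read_id, seq = line.rstrip("\n").split("\t", 1)
    # Emit each non-overlapping seed keyed by its k-mer; the shuffle then
    # groups read seeds with reference seeds that share the k-mer, and the
    # reducer extends candidate alignments from there.
    for off in range(0, len(seq) - SEED_LEN + 1, SEED_LEN):
        print(f"{seq[off:off + SEED_LEN]}\t{read_id}:{off}")
```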
Use-case: Oil & gas
• Digital Oil Field:
–Data sizes: 2+ TB / day
–Application: safety/security, improve field performance
–Hadoop used for data storage and analytics
• Seismic image processing:
–Drill ship costs $1M/day
–One “shot” (in SEGY format) contains ~2.5GB
–Hadoop used to parallelize computation and store data post-processing
–Previously data was discarded immediately after processing!
– Now kept for reprocessing
– Research & development
Use-case: high-energy physics
• Collecting events from colliders
–“We have a very big digital camera”; each “event” = ~1MB
–Looking for rare events (need millions of events for statistical significance)
• Typical task: scan through events and look for particles with
a certain mass
–Analyze millions of events in parallel
–Hadoop used in streaming with C++ code to analyze events
• HDFS used for low cost storage
http://www.linuxjournal.com/content/the-large-hadron-collider
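That scan-and-select pattern maps directly onto Hadoop Streaming: HDFS splits the event set and each mapper runs the analysis program over its split. The deck's analyzers are C++; here is a minimal Python stand-in for the filter step (the one-event-per-line CSV layout, column order, and mass window are assumptions):

```python
#!/usr/bin/env python
# Minimal streaming event filter: keep events whose reconstructed mass
# falls inside a window. A Python stand-in for the C++ analyzers the deck
# mentions; the CSV layout (event_id,mass,...) is assumed for illustration.
import sys

MASS_LO, MASS_HI = 124.0, 126.0  # illustrative mass window

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    try:
        mass = float(fields[1])
    except (IndexError, ValueError):
        continue  # drop malformed events
    if MASS_LO <= mass <= MASS_HI:
        sys.stdout.write(line)
```

Run it map-only with the streaming jar, e.g. `-mapper filter.py` plus `-D mapred.reduce.tasks=0` (jar path and option placement vary by Hadoop version).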
Use-case: National Climate Assessment
• Rapid, Flexible, and Open Source
Big Data Technologies for the U.S.
National Climate Assessment
–Chris A. Mattmann
–Senior Computer Scientist, NASA JPL
–Chris and team have done a number of
projects with Hadoop.
• Goal
–Compare regional climate models to a
variety of satellite observations
–Traditionally models are compared to
other models, not to actual observations
–Normalize complex multi-format data to lat/long + observation values
• Hadoop
–Used Apache Hive to provide Scale-out
SQL warehouse of the data
–See paper or case study in
“Programming Hive – O’Reilly 2012”
Credit: Kathy Jacobs
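Once the observations are normalized to rows of lat/long, time, variable and value, the model-vs-observation comparison becomes plain SQL over a Hive table. A hedged sketch of that idea driven through the standard `hive -e` CLI; the table and column names are invented for illustration, not the actual JPL schema:

```python
#!/usr/bin/env python
# Sketch: per-grid-cell bias of a regional model against satellite
# observations. Table and column names are illustrative, not JPL's schema.
import subprocess

query = """
SELECT m.lat, m.lon, AVG(m.value - o.value) AS mean_bias
FROM   model_output m
JOIN   satellite_obs o
  ON   m.lat = o.lat AND m.lon = o.lon AND m.ts = o.ts
WHERE  m.variable = 'surface_temp'
  AND  o.variable = 'surface_temp'
GROUP BY m.lat, m.lon;
"""

# `hive -e` runs a query string against the warehouse (assumes Hive on PATH).
subprocess.run(["hive", "-e", query], check=True)
```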
Big Data
Transactions + Interactions + Observations
Apache Hadoop: Patterns of Use
Refine Explore Enrich
Enterprise
Data Warehouse
Operational Data Refinery
Hadoop as platform for ETL modernization
Capture
• Capture new unstructured data along with log
files all alongside existing sources
• Retain inputs in raw form for audit and
continuity purposes
Process
• Parse the data & cleanse
• Apply structure and definition
• Join datasets together across disparate data
sources
Exchange
• Push to existing data warehouse for
downstream consumption
• Feeds operational reporting and online systems
Unstructured Log files
Refinery
Structure and join
Capture and archive
Parse & Cleanse
Refine Explore Enrich
DB data
Upload
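The Process step is typically a map-only streaming pass: raw lines in, typed delimited records out, with the untouched originals kept in the raw zone for audit. A minimal sketch (the combined web-log layout is an assumption):

```python
#!/usr/bin/env python
# Map-only "parse & cleanse" pass: raw web-server log lines in, clean
# tab-separated (ip, timestamp, path, status) records out. The log format
# matched below is an assumption for illustration.
import re
import sys

LOG = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+) [^"]*" (\d{3})')

for line in sys.stdin:
    m = LOG.match(line)
    if not m:
        continue  # cleanse: drop lines that do not parse
    print("\t".join(m.groups()))
```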
Visualization Tools | EDW / Datamart
Explore
Big Data Exploration
Hadoop as agile, ad-hoc data mart
Capture
• Capture multi-structured data and retain inputs
in raw form for iterative analysis
Process
• Parse the data into queryable format
• Explore & analyze using Hive, Pig, Mahout and
other tools to discover value
• Label data and type information for
compatibility and later discovery
• Pre-compute stats, groupings, patterns in data
to accelerate analysis
Exchange
• Use visualization tools to facilitate exploration
and find key insights
• Optionally move actionable insights into EDW
or datamart
Capture and archive
upload JDBC / ODBC
Structure and join
Categorize into tables
Unstructured Log files DB data
Refine Explore Enrich
Optional
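"Categorize into tables" usually means laying a schema over files already sitting in HDFS so exploration can start without copying data. A hedged sketch using a Hive external table (the path and columns are illustrative):

```python
#!/usr/bin/env python
# Sketch: expose cleansed records already in HDFS as a queryable Hive table
# without moving them. Path and column names are illustrative.
import subprocess

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
  ip     STRING,
  ts     STRING,
  path   STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/data/refined/weblogs';
"""

subprocess.run(["hive", "-e", ddl], check=True)
```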
Online
Applications
Enrich
Application Enrichment
Deliver Hadoop analysis to online apps
Capture
• Capture data that was once
too bulky and unmanageable
Process
• Uncover aggregate characteristics across data
• Use Hive, Pig and MapReduce to identify patterns
• Filter useful data from mass streams (Pig)
• Micro or macro batch oriented schedules
Exchange
• Push results to HBase or other NoSQL alternative
for real time delivery
• Use patterns to deliver right content/offer to the
right person at the right time
Derive/Filter
Capture
Parse
NoSQL, HBase
Low Latency
Scheduled &
near real time
Unstructured Log files DB data
Refine Explore Enrich
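For the Exchange step, batch-computed aggregates land in a store keyed for online lookup. A minimal sketch using the happybase Thrift client for HBase (the table name, column family, and a running HBase Thrift server are assumptions):

```python
#!/usr/bin/env python
# Sketch: publish per-user results computed in Hadoop into HBase for
# low-latency serving. Assumes the happybase package and an HBase Thrift
# server; table, column family, and host are illustrative.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # assumed host
table = connection.table("user_recs")

recs = {"user42": ["storyA", "storyB"], "user43": ["storyC"]}
with table.batch() as batch:  # batched puts cut round trips
    for user, stories in recs.items():
        batch.put(user.encode(), {b"r:top": ",".join(stories).encode()})
```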
CASE STUDY
YAHOO! HOMEPAGE
Personalized
for each visitor
Result:
twice the engagement
+160% clicks
vs. one size fits all
+79% clicks
vs. randomly selected
+43% clicks
vs. editor selected
Recommended links News Interests Top Searches
CASE STUDY
YAHOO! HOMEPAGE
• Serving maps (users → interests), produced every five minutes
• Categorization models, rebuilt weekly
SCIENCE
HADOOP
CLUSTER
SERVING SYSTEMS
PRODUCTION
HADOOP
CLUSTER
USER
BEHAVIOR
ENGAGED USERS
CATEGORIZATION
MODELS (weekly)
SERVING
MAPS
(every 5 minutes)
USER
BEHAVIOR
» Identify user interests
using Categorization
models
» Machine learning to build
ever better categorization
models
Build customized home pages with latest data (thousands / second)
Futures & observations
Hadoop 2.0 Innovations - YARN
• Focus on scale and innovation
– Support 10,000+ computer clusters
– Extensible to encourage innovation
• Next generation execution
– Improves MapReduce performance
• Supports new frameworks beyond
MapReduce
– Do more with a single Hadoop cluster
– Low latency, Streaming, Services
– Science – MPI, Spark, Giraph
HDFS: Redundant, Reliable Storage
YARN: Cluster Resource Management
MapReduce | Tez | Streaming | Other
Stinger Initiative
• Community initiative around Hive
• Enables Hive to support interactive workloads
• Improves existing tools & preserves investments
Query Planner (Hive) + Execution Engine (Tez) + File Format (ORC file) = 100X+
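The file-format leg of that equation is a one-line storage clause in Hive; rewriting an existing text table into ORC looks like this (a hedged sketch, table names illustrative):

```python
#!/usr/bin/env python
# Sketch: rewrite a text-format Hive table into ORC, the columnar format
# introduced with the Stinger initiative. Table names are illustrative.
import subprocess

subprocess.run(["hive", "-e",
                "CREATE TABLE weblogs_orc STORED AS ORC "
                "AS SELECT * FROM weblogs;"],
               check=True)
```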
Data Lake Projects
• Keep raw data
–20+ PB projects
–Previously discarded
• Unify many data sources
–Pull from all over organization
• Produce derived views
–Automatic “ETL” for regular
downstream use cases
–New applications from unified data
• Support ad hoc exploration
–Prototype new use cases
–Answer unanticipated questions
–Agile rebuild from raw data
[Diagram: landing zone (NFS, JMS) → ingest → staging → core-general / core-secure / archive; data flow described in descriptor docs]
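One way to read "data flow described in descriptor docs": each feed carries a small document naming its landing location, archive target, and derived views, and a dispatcher expands it into ingest and ETL steps. A toy descriptor and dispatcher (the format and commands are invented for illustration):

```python
#!/usr/bin/env python
# Toy descriptor-driven ingest for a data lake. The descriptor format is
# invented to illustrate declaring flows instead of hand-wiring jobs.
import json
import subprocess

descriptor = json.loads("""
{
  "feed":    "weblogs",
  "landing": "/landing/weblogs",
  "archive": "/core/archive/weblogs",
  "views":   ["hive -e 'MSCK REPAIR TABLE weblogs;'"]
}
""")

# Keep the raw drop verbatim (audit copy), then build the derived views.
subprocess.run(["hadoop", "fs", "-cp",
                descriptor["landing"], descriptor["archive"]], check=True)
for cmd in descriptor["views"]:
    subprocess.run(cmd, shell=True, check=True)
```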
Interesting things on the Horizon
• Solid state storage and disk drive evolution
–So far LFF drives seem to be maintaining their economic
advantage (4TB drives now & 7TB next year!)
–SSDs are becoming ubiquitous and will become part of the
architecture
• In-RAM databases
–Bring them on, let’s port them to YARN!
–Hadoop complements these technologies, shines with huge data
• Atom / ARM processors
–This is great for Hadoop! But…
–Vendors are not yet designing the right machines (bandwidth to disk)
• Software Defined Networks
–This is great for Hadoop, more network for less!
Thank You!
Eric Baldeschwieler
CTO Hortonworks
Twitter @jeric14
Apache Foundation ↔ New Users: Contributions & Validation
Get Involved!
See Hadoop > Learn Hadoop > Do Hadoop
Full environment to evaluate Hadoop
Hands-on step-by-step tutorials to learn
STOP!
Bonus material follows
Hortonworks Approach
Identify and introduce enterprise requirements into the public domain
Work with the community to advance and
incubate open source projects
Apply Enterprise Rigor to provide the most
stable and reliable distribution
Community Driven Enterprise Apache Hadoop
Driving Enterprise Hadoop Innovation
[Chart: lines of code by company – Hortonworks, Yahoo!, Cloudera, Other – across AMBARI, HBASE, HCATALOG, HIVE, PIG and HADOOP CORE. Source: Apache Software Foundation]
Committer counts shown (Hortonworks / Cloudera): 19/9, 5/1, 1/0, 5/0, 3/7, 14/0
Hortonworks Process for Enterprise Hadoop
Upstream Community Projects: Apache Hadoop, HCatalog, Pig, HBase, Hive, Ambari and other Apache projects – design & develop, test & patch, release
Downstream Enterprise Product: Hortonworks Data Platform – integrate & test, package & certify, distribute
No lock-in: an integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
Virtuous cycle: development & fixed issues are done upstream, and stable project releases flow downstream
Hadoop and Cloud
• Can I run Hadoop in OpenStack or in my virtualization
infrastructure?
–Yes, but… it depends on your use-case and hardware choices
–We will see a lot of innovation in this space in coming years
– OpenStack Savanna – collaboration to bring Hadoop to OpenStack
• Zero procurement POC – Try Hadoop in cloud
–5-10 nodes – works great! (On private or public cloud)
–Many projects are done today in public clouds
• Occasional use (run Hadoop when cluster not busy)
–Where do you store the data when Hadoop is not running?
–>20 nodes  review your network and storage design
• Large scale, continuous deployment 100 – 4000 nodes
–Need to design your storage and network for Hadoop
Open source in the Architecture

APPLICATIONS: BI – Jaspersoft, Pentaho, …; NoSQL in apps – HBase, Cassandra, MongoDB, …; search apps – ElasticSearch, Solr, …
DATA SYSTEMS: HORTONWORKS DATA PLATFORM; DBs – Postgres, MySQL; search – ElasticSearch, Solr, …; ESB, ETL – ActiveMQ, Talend, Kettle
DATA SOURCES: DBs, search, repos
DEV & DATA TOOLS: Eclipse, OpenJDK, Spring, VirtualBox, …
OPERATIONAL TOOLS: Nagios, Ganglia, Chef, Puppet, …
CASE STUDY
YAHOO! WEBMAP
 What is a WebMap?
• Gigantic table of information about every web site,
page and link Yahoo! knows about
• Directed graph of the web
• Various aggregated views (sites, domains, etc.)
• Various algorithms for ranking, duplicate detection,
region classification, spam detection, etc.
 Why was it ported to Hadoop?
• Custom C++ solution was not scaling
• Leverage scalability, load balancing and resilience of
Hadoop infrastructure
• Focus on application vs. infrastructure
CASE STUDY
WEBMAP PROJECT RESULTS
 33% time savings over previous system on the
same cluster (and Hadoop keeps getting
better)
 Was largest Hadoop application, drove scale
• Over 10,000 cores in system
• 100,000+ maps, ~10,000 reduces
• ~70 hours runtime
• ~300 TB shuffling
• ~200 TB compressed output
 Moving data to Hadoop increased number of
groups using the data
Use-case: computational advertising
• A principled way to find “best match” ads, in context, for a
query (or page view)
• Lots of data:
–Search: billions of unique queries per hour
–Display: trillions of ads displayed per hour
–Billions of users
–Billions of ads
• Big business:
–$132B total advertising market (2015)
–$600B total worldwide market (2015)
• Challenges:
–A huge number of small transactions
–Cost of serving < revenue per search
Example: predicting CTR (search ads)
Rank = bid * CTR
Predict CTR for each ad to
determine placement, based on:
- Historical CTR
- Keyword match
- Etc…
Approach: supervised learning
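A minimal supervised-learning sketch of that idea: fit click probability from features such as historical CTR and keyword match, then rank by bid * predicted CTR. The features and data are invented, and scikit-learn stands in for whatever large-scale learner a production system would actually use:

```python
#!/usr/bin/env python
# Toy CTR model: rank = bid * predicted CTR. Features and data are invented;
# a real system trains on billions of impressions, not six rows.
import numpy as np
from sklearn.linear_model import LogisticRegression

# features: [historical_ctr, keyword_match]; label: was the ad clicked?
X = np.array([[0.10, 1], [0.02, 0], [0.08, 1],
              [0.01, 0], [0.12, 1], [0.03, 0]])
y = np.array([1, 0, 1, 0, 1, 0])
model = LogisticRegression().fit(X, y)

ads = {"ad1": (0.50, [0.09, 1]),   # ad -> (bid, features)
       "ad2": (0.90, [0.02, 0])}
ranked = sorted(
    ads.items(),
    key=lambda kv: kv[1][0] * model.predict_proba([kv[1][1]])[0, 1],
    reverse=True)
print([name for name, _ in ranked])  # highest bid * predicted CTR first
```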
Hadoop for advertising science @ Yahoo!
• Advertising science moved CTR prediction from “legacy”
(MyNA) systems to Hadoop
–Scientist productivity dramatically improved
–Platform for massive A/B testing for computational advertising
algorithmic improvements
• Hadoop enabled next-gen contextual advertising matching
platform
–Heavy compute process that is highly parallelizable
MapReduce
• MapReduce is a distributed computing programming model
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Strengths:
– Easy to use! Developer just writes a couple of
functions
– Moves compute to data
• Schedules work on HDFS node with data if possible
– Scans through data, reducing seeks
– Automatic reliability and re-execution on failure
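The Unix-pipeline analogy is literal in Hadoop Streaming: any two programs that read stdin and write tab-separated key/value lines can serve as Map and Reduce. A minimal word-count sketch:

```python
#!/usr/bin/env python
# mapper.py -- the "grep"/tokenize stage: emit (word, 1) per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- the "uniq -c" stage: input arrives sorted by key, so all
# counts for a word are contiguous and can be summed with one running total.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

Submit with the streaming jar, roughly: `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar path varies by install).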
HDFS in action

Client → NameNode → DataNode 1, DataNode 2, DataNode 3
• Big Data is put into HDFS (via RPC or REST)
• The data is broken into chunks and distributed to the DataNodes
• The DataNodes replicate the chunks
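That handshake is visible on the wire in WebHDFS: the NameNode answers a create request with a redirect naming a DataNode, and the client streams its bytes there. A hedged sketch with the `requests` library (the host, the era-typical port 50070, and the file names are assumptions):

```python
#!/usr/bin/env python
# Sketch of an HDFS write over WebHDFS: ask the NameNode where to write,
# then stream the data to the DataNode it names. Host/port/paths assumed.
import requests

NAMENODE = "http://namenode.example.com:50070"  # assumed address
path = "/user/eric/bigdata.bin"

# Step 1: the NameNode picks a DataNode and answers with a 307 redirect.
r = requests.put(f"{NAMENODE}/webhdfs/v1{path}?op=CREATE&overwrite=true",
                 allow_redirects=False)
datanode_url = r.headers["Location"]

# Step 2: stream the bytes to that DataNode; HDFS splits the file into
# blocks and replicates each one (3x by default) behind the scenes.
with open("bigdata.bin", "rb") as f:
    requests.put(datanode_url, data=f).raise_for_status()
```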
Editor's notes
1. I want to thank Chris for inviting me here today. Chris and team have done a number of projects with Hadoop. They are a great resource for Big Data projects. Chris is an Apache Board member and was a contributor to Hadoop even before we spun it out of the Nutch project.
2. http://grist.files.wordpress.com/2006/11/csxt_southbound_freight_train.jpg http://businessguide.rw/main/gallery/Fedex-Fleet.jpg
  3. Notes… credit http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Hadoop-Cluster.PNG
4. As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop – and HDP in particular – being introduced as a complement to the traditional approaches. It is not replacing the database but rather is a complement: and as such, must integrate easily with existing tools and approaches. This means it must interoperate with: existing applications (such as Tableau, SAS, Business Objects, etc.), existing databases and data warehouses for loading data to / from the data warehouse, development tools used for building custom applications, and operational tools for managing and monitoring.
5. Hadoop started to enhance Search. Science clusters launched in 2006 as early proof of concept. Science results drive new applications -> becomes core Hadoop business.
6. At Hortonworks today, our focus is very clear: we Develop, Distribute and Support a 100% open source distribution of Enterprise Apache Hadoop. We employ the core architects, builders and operators of Apache Hadoop and drive the innovation in the open source community. We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform. Given our operational expertise of running some of the largest Hadoop infrastructure in the world at Yahoo, our team is uniquely positioned to support you. Our approach is also uniquely endorsed by some of the biggest vendors in the IT market. Yahoo is both an investor and a customer, and most importantly, a development partner. We partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo’s infrastructure and using the same regression suite that they have used for years as they grew to have the largest production cluster in the world. Microsoft has partnered with Hortonworks to include HDP in both their off-premise offering on Azure and their on-premise offering under the product name HDInsight. This also includes integration with both Visual Studio for application development and with System Center for operational management of the infrastructure. Teradata includes HDP in their products in order to provide the broadest possible range of options for their customers.
7. Tell inception story: plan to differentiate Yahoo!, recruit talent, ensure that Y! was not built on a legacy private system. From YST.
  8. I want to thank Chris for inviting me here today. Chris and team have done a number of projects with Hadoop. They are a great resource for Big Data projects. Chris is an Apache Board member and was a contributor to Hadoop even before we spun it out of the Nutch project.
  9. Archival use case at a big bank:
  – 10K files a day == 400GB
  – Need to store it all in EBCDIC format for compliance
  – Need to also convert it into Hadoop for analytics
  – Compute a checksum for every record and keep a tally of which primary keys changed each day
  – Also, bring together financial, customer, and weblog data for new insights
  – Share with Palantir, Aster Data, Vertica, Teradata, and more…

Step One: Create tables or partitions. In step one of the dataflow, the mainframe or another orchestration and control program notifies HCatalog of its intention to create a table, or to add a partition if the table already exists. This uses standard SQL data definition language (DDL) such as CREATE TABLE and DESCRIBE TABLE (see http://incubator.apache.org/hcatalog/docs/r0.4.0/cli.html#HCatalog+DDL). Multiple tables need to be created, though: some are job-specific temporary tables while others are more permanent. Raw-format data can be stored in an HCatalog table partitioned by some date field (month or year, for example). The staged record data will almost certainly be stored in HCatalog partitioned by month (see http://incubator.apache.org/hcatalog/docs/r0.4.0/dynpartition.html). Any missing month in the table can then be easily detected and regenerated from the raw-format storage on the fly. In essence, HCatalog up-levels this architectural challenge from managing a bunch of manually created files with a loose naming convention to a strong yet abstract table structure, much like a mature database solution would have. (A DDL sketch appears at the end of these notes.)

Step Two: Parallel ingest. Before or after tables are defined in the system, we can start adding data in parallel using WebHDFS or DistCp. In the Teradata-Hortonworks Data Platform these architectural components work seamlessly with the standard HDFS NameNode to tell DFS clients which DataNodes to write to. For example, a file made up of 10,000 64-megabyte blocks could be transferred to a 100-node HDFS cluster using all 100 nodes at once: by asking WebHDFS for the write location of each block, a multi-threaded or chunking client application could write each 64MB block in parallel, 100 blocks or more at a time, effectively dividing the 10,000-block transfer into 100 waves of copying. 100 copy waves complete 100 times faster than 10,000 one-by-one block copies. Parallel ingest with HCatalog, WebHDFS and/or DistCp leads to massive speed gains. (A parallel-write sketch also appears at the end of these notes.) Critically, the system can copy chunked data directly into partitions of pre-defined HCatalog tables. This means that each month, staged record data can join the staging tables without dropping previous months, and each month's partition can itself be loaded using as many parallel ingest servers as the solution architecture requires to balance cost with performance.

Step Three: Notify on upload. Next, the parallel ingest system needs to notify the HCatalog engine that the files have been uploaded and, simultaneously, any end-user transformation or analytics workload waiting for the partition needs to be notified that the file is ready to support queries. By "ready" we mean that the partition is whole and completely copied into HDFS. HCatalog has built-in blocking and non-blocking notification APIs that use standard message buses to notify interested parties that a workload, be it MapReduce or an HDFS copy, is complete and valid (see: http://incubator.apache.org/hcatalog/docs/r0.4.0/notification.html).
The way this system works is that any job created through HCatalog is acknowledged with an output location. The messaging system later replies that a job is complete, and since the eventual output location was returned when the job was submitted, the calling application can immediately go to the target output file and find the data it needs. In this next-gen ETL use case, we use this notification system to immediately fire a Hive job to begin transformation whenever a partition is added to the raw or staged data tables. This makes it easier to build systems that depend on these transformations: they needn't poll for data, nor hard-code file locations for the sources and sinks of data moving through the dataflow.

Step Four: Fire off UDFs. Since HCatalog can notify interested parties of the completion of file I/O tasks, and since HCatalog stores file data underneath abstracted table and partition names and locations, invoking the core UDFs that transform the mainframe's data into standard SQL data types can be programmatic. In other words, when a partition is created and the data backing it is fully loaded into HDFS, a persistent Hive client can wake up, having been notified of the new data, and grab that data to load into Teradata.

Step Five: Invoke parallel transport (Q1, 2013). Coming in the first quarter of 2013 or soon thereafter, Teradata and the Hortonworks Data Platform will communicate using Teradata's parallel transport mechanism. This will provide the same performance benefits as parallel ingest, but for the final step in the dataflow. For now, systems integrators and/or Teradata and Hortonworks team members can implement a few DFS clients to load chunks or segments of the table data into Teradata in parallel.
  10. Example: high-tech surveys, covering customer satisfaction and product satisfaction. Surveys have multiple-choice and freeform sections; ingest and analyze the plain-text sections. Join cross-channel support requests and device telemetry back to the customer. Another example: a wireless carrier and the “golden path”.
  11. Example: retail custom homepage. Build clusters of related products; set up models in HBase that influence when user behaviors trigger recommendations, or inform users of custom recommendations when they arrive at the site. (A small HBase read/write sketch appears at the end of these notes.)
  12. Community-developed frameworks: machine learning / analytics (MPI, GraphLab, Giraph, Hama, Spark, …); services inside Hadoop (memcache, HBase, Storm, …); low-latency computing (CEP or stream processing).
  13. Community-developed frameworks: machine learning / analytics (MPI, GraphLab, Giraph, Hama, Spark, …); services inside Hadoop (memcache, HBase, Storm, …); low-latency computing (CEP or stream processing).
  14. Buzz about low latency access in Hadoop
  15. Attribution: http://www.flickr.com/photos/adavey/2919843490/sizes/o/in/photostream/
  16. Hortonworks Sandbox. Hortonworks accelerates Hadoop skills development with an easy-to-use, flexible and extensible platform to learn, evaluate and use Apache Hadoop. What it is: a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform, providing demos, videos, step-by-step hands-on tutorials, pre-built partner integrations and access to datasets. What it does: dramatically accelerates the process of learning Apache Hadoop. See it – demos and videos to illustrate use cases. Learn it – multi-level, step-by-step tutorials. Do it – hands-on exercises for faster skills development. How it helps: accelerates and validates the use of Hadoop within your unique data architecture; use your data to explore and investigate your use cases. ZERO to big data in 15 minutes.
  17. But beyond Core Hadoop, Hortonworkers are also deeply involved in the ancillary projects that are necessary for more general usage. As you can see, in both code count and committers, we contribute more to Core Hadoop than anyone else, and we are doing the same for the other key projects such as Pig, Hive, HCatalog and Ambari. This community leadership across both core Hadoop and the related open source projects is crucial in enabling us to play the critical role in turning Hadoop into Enterprise Hadoop.
  18. So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique: we start with the group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite, including all the IP for regression testing contributed by Yahoo, and [CLICK] contribute back all of the bug fixes to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP) that includes both Hadoop Core and the related projects required by the Enterprise user, and provide it to our customers. Through this application of enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks. It is also 100% in sync with the open source trees.
  19. As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop – and HDP in particular – being introduced as a complement to the traditional approaches. It is not replacing the database but rather complementing it, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with: existing applications such as Tableau, SAS, and Business Objects; existing databases and data warehouses, for loading data to and from the warehouse; development tools used for building custom applications; and operational tools for managing and monitoring.
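Referenced from note 9 above (Step One): a sketch of the table-creation DDL, issued here through a Hive JDBC connection so it stays in Java like the other examples. The table layout, connection URL and credentials are assumptions for illustration, not details from the talk.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class CreateStagedTable {
    public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the HiveServer2 driver
      // HiveServer2 endpoint and user are hypothetical; adjust for the actual cluster.
      Connection conn = DriverManager.getConnection(
          "jdbc:hive2://hiveserver:10000/default", "etl", "");
      try (Statement stmt = conn.createStatement()) {
        // Staged records live in a table partitioned by month, so a missing month
        // is just a missing partition that can be rebuilt from the raw-format table.
        stmt.execute("CREATE TABLE IF NOT EXISTS staged_records ("
            + " record_key STRING, payload STRING, record_checksum STRING)"
            + " PARTITIONED BY (load_month STRING)");
        stmt.execute("ALTER TABLE staged_records"
            + " ADD IF NOT EXISTS PARTITION (load_month = '2013-01')");
      } finally {
        conn.close();
      }
    }
  }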
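Also from note 9 (Step Two): a sketch of one worker in a parallel WebHDFS ingest. WebHDFS CREATE is a two-step PUT: the NameNode answers with a redirect to a DataNode, and the client writes the bytes there, so many workers can push different files or chunks at once. The host, port, path and user below are assumptions.

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class WebHdfsPut {
    // Writes one chunk as an HDFS file via WebHDFS; run many of these in
    // parallel (threads or separate ingest servers) for wave-style copying.
    public static void put(String nameNodeHost, String path, byte[] chunk) throws Exception {
      // Step 1: ask the NameNode where to write; capture the redirect ourselves.
      URL create = new URL("http://" + nameNodeHost + ":50070/webhdfs/v1" + path
          + "?op=CREATE&overwrite=true&user.name=etl");
      HttpURLConnection nn = (HttpURLConnection) create.openConnection();
      nn.setRequestMethod("PUT");
      nn.setInstanceFollowRedirects(false);
      String dataNodeUrl = nn.getHeaderField("Location"); // DataNode write target
      nn.disconnect();
      // Step 2: stream the bytes directly to the DataNode named in the redirect.
      HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
      dn.setRequestMethod("PUT");
      dn.setDoOutput(true);
      try (OutputStream out = dn.getOutputStream()) {
        out.write(chunk);
      }
      if (dn.getResponseCode() != 201) { // WebHDFS replies 201 Created on success
        throw new IllegalStateException("write failed for " + path);
      }
    }
  }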
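And for note 11: a sketch of serving precomputed recommendation models from HBase, using the classic client API of that era. The table name, column family, qualifiers and row keys are hypothetical; the point is that a batch job writes model output and the homepage does a low-latency point read.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RecommendationStore {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "recs"); // hypothetical table
      try {
        // Model side: store a cluster of related products for a user.
        Put put = new Put(Bytes.toBytes("user123"));
        put.add(Bytes.toBytes("model"), Bytes.toBytes("related"),
            Bytes.toBytes("sku42,sku99"));
        table.put(put);
        // Serving side: on page load, fetch the recommendation with a point read.
        Result r = table.get(new Get(Bytes.toBytes("user123")));
        System.out.println(Bytes.toString(
            r.getValue(Bytes.toBytes("model"), Bytes.toBytes("related"))));
      } finally {
        table.close();
      }
    }
  }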