2. Who am I?
http://www.mapr.com/company/events/speaking/pdb-10-16-12
• Keys Botzum
• kbotzum@maprtech.com
• Senior Principal Technologist, MapR Technologies
• MapR Federal and Eastern Region
3. Agenda
• What’s a Hadoop?
• What’s MapR?
• Enterprise Grade Hadoop
• Making Hadoop More Open
5. How to Scale?
Big Data has Big Problems
• Petabytes of data
• MTBF on 1000s of nodes is < 1 day
• Something is always broken
• There are limits to scaling Big Iron
• Sequential and random access just don’t scale
6. Example: Update 1% of 1TB
• Data consists of 10^10 records, each 100 bytes
• Task: Update 1% of these records
7. Approach 1: Just Do It
• Each update involves read, modify and write
• t = 1 seek + 2 disk rotations = 20ms
• 1% x 10^10 x 20 ms = 2 mega-seconds ≈ 23 days
• Total time dominated by seek and rotation times
8. Approach 2: The “Hard” Way
• Copy the entire database 1GB at a time
• Update records on the fly
• t = 2 x 1 GB / 100 MB/s + 20 ms ≈ 20 s per chunk
• 10^3 x 20 s = 20,000 s ≈ 5.6 hours
• 100x faster to do 100x more work!
• Moral: Read data sequentially even if you only want 1% of it
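The arithmetic on slides 7 and 8 can be checked directly, assuming the stated disk figures (20 ms per random update, 100 MB/s sequential transfer, 1 GB chunks):

```python
# Back-of-the-envelope check of the two approaches, using the
# disk figures stated on the slides.
records = 10**10          # 1 TB of 100-byte records
update_fraction = 0.01

# Approach 1: random read-modify-write, 20 ms per record
random_seconds = update_fraction * records * 0.020
print(random_seconds / 86400)      # about 23 days

# Approach 2: stream the whole database 1 GB at a time
# (read + write each 1 GB chunk at 100 MB/s, plus one 20 ms seek)
chunk_seconds = 2 * 1000 / 100 + 0.020
sequential_seconds = 1000 * chunk_seconds
print(sequential_seconds / 3600)   # about 5.6 hours
```

The sequential approach moves 100x more data yet finishes roughly 100x sooner, because transfer rate rather than seek time dominates.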
9. MapReduce: A Paradigm Shift
• Distributed computing platform
• Large clusters
• Commodity hardware
• Pioneered at Google
• BigTable, MapReduce and Google File System
• Commercially available as Hadoop
10. Hadoop
• Commodity hardware – thousands of nodes
• Handles Big Data – petabytes and more
• Sequential file access – each spindle provides data as fast as possible
• Sharding
• Data distributed evenly across cluster
• More spindles and CPUs working on different parts of data set
• Reliability – self-healing (mostly), self-balancing
• MapReduce
• Parallel computing framework
• Function shipping
§ Moves the computation to the data rather than the typical reverse
§ Takes sharding into account
• Hides most of the complexity from developers
11. Inside MapReduce
Input → Map → Shuffle and sort → Reduce → Output
• Input: "The time has come," the Walrus said, / "To talk of many things: / Of shoes—and ships—and sealing-wax …
• Map emits a (word, 1) pair per word: (the, 1), (time, 1), (has, 1), (come, 1), …
• Shuffle and sort groups counts by word: (come, [3,2,1]), (has, [1,5,2]), (the, [1,2,1]), (time, [10,1,3]), …
• Reduce sums each group: (come, 6), (has, 8), (the, 4), (time, 14), …
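The dataflow in this slide can be sketched in a few lines of single-process Python. This is an illustration of the phases only; real Hadoop runs each phase in parallel across the cluster:

```python
import re
from itertools import groupby

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in re.findall(r"[a-z]+", line.lower()):
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle and sort: bring all counts for the same word together
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, [count for _, count in group])

def reduce_phase(grouped):
    # Reduce: sum the count list for each word
    for word, counts in grouped:
        yield (word, sum(counts))

text = ['"The time has come," the Walrus said,',
        '"To talk of many things:']
result = dict(reduce_phase(shuffle_and_sort(map_phase(text))))
print(result["the"])  # 2 -- "The" and "the", case-folded
```

In Hadoop the shuffle is performed by the framework between the map and reduce tasks; the developer writes only the map and reduce functions.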
12. Agenda
• What’s a Hadoop?
• What’s MapR?
• Enterprise Grade Hadoop
• Making Hadoop More Open
13. The MapR Distribution for Apache Hadoop
• Commercial Hadoop Distribution
• Open, enterprise-grade distribution
• Primarily leveraging open source components
• Carefully targeted enhancements to make Hadoop more
open and enterprise-grade
• Growing fast and a recognized leader
14. MapR in the Cloud
• Available as a service with Amazon Elastic MapReduce (EMR)
• http://aws.amazon.com/elasticmapreduce/mapr
• Available as a service with Google Compute Engine
16. Agenda
• What’s a Hadoop?
• What’s MapR?
• Enterprise Grade Hadoop
• Making Hadoop More Open
17. MapR’s Complete Distribution for Apache Hadoop
• Integrated, tested, hardened and supported
• Integrated with Accumulo
• Runs on commodity hardware
• Open source with standards-based extensions for:
• Security
• File-based access
• Most SQL-based access
• Easiest integration
• High availability
• Best performance
• MapR Control System: MapR Heatmap™, LDAP/NIS integration, quotas, CLI, REST API, alerts and alarms
• Ecosystem components: Hive, Pig, Oozie, Sqoop, HBase, Whirr, Accumulo, Mahout, Cascading, Nagios, Ganglia, Flume, ZooKeeper
• MapR’s Storage Services™: Direct Access NFS, real-time streaming, volumes, mirrors, snapshots, data placement
• Architecture: no NameNode, high-performance direct shuffle, stateful failover and self healing
18. Easy Management at Scale
• Health monitoring
• Cluster administration
• Application resource provisioning
Same information and tasks available via command line and REST
19. MapR: Lights Out Data Center Ready
Dependable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• Five 9’s (99.999%) of uptime

Reliable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end checksumming
• Strong consistency
• Built-in compression
• Mirror across sites to meet Recovery Time Objectives
20. Storage Architecture
§ How does MapR manage storage, and how is this different from generic Hadoop?
27. MapR Mirroring/COOP Requirements
[Diagram: mirroring topologies – Production → Research over WAN; datacenter to datacenter; Production → Cloud]
Business Continuity
• Efficient design
§ Differential deltas are updated
§ Compressed and check-summed
§ WAN, remote seeding
• Easy to manage
§ Scheduled or on-demand
§ Consistent point-in-time
28. Thought Questions
• Consider a cluster with
• Petabytes of data
• Hundred or thousands of jobs running each day, creating new data
• Many users and teams all using this cluster
• How do I back this up?
• User “oops” protection
• How do I replicate data from one cluster to another in support of disaster
recovery?
• Protection from power outages, floods, fire, etc
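On MapR, the "oops" protection question is typically answered with volume snapshots. A minimal sketch driving the `maprcli` command-line tool from Python; the volume and snapshot names are hypothetical, and the flag spellings should be verified against your installed release's documentation:

```python
import subprocess

def snapshot_command(volume, name):
    # Builds the maprcli invocation for a point-in-time volume snapshot.
    # Subcommand and flag names follow MapR's documented CLI, but
    # verify them against your release before relying on this.
    return ["maprcli", "volume", "snapshot", "create",
            "-volume", volume, "-snapshotname", name]

def snapshot_volume(volume, name):
    # Runs the command; raises CalledProcessError on failure.
    subprocess.run(snapshot_command(volume, name), check=True)

# Example (hypothetical names): snapshot a project volume before a
# risky bulk job, so users can recover from an "oops".
cmd = snapshot_command("projects.alice", "pre-cleanup")
```

Cluster-to-cluster disaster recovery uses mirror volumes instead, scheduled or on-demand, as slide 27 outlines.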
30. Customer Support
• 24x7x365 “Follow-The-Sun” coverage
• Critical customer issues are worked on around the clock
• Dedicated team of Hadoop engineering experts
• Contacting MapR support
• Email: support@mapr.com (automatically opens a case)
• Phone: 1.855.669.6277
• Self service options:
§ http://answers.mapr.com/
§ Web Portal: http://mapr.com/support
31. Two MapR Editions – M3 and M5
M3 (Free):
§ Control System
§ NFS Access
§ Performance
§ Unlimited Nodes

M5 (Annual Subscription):
§ Control System
§ NFS Access
§ Performance
§ High Availability
§ Snapshots & Mirroring
§ 24x7 Support

Also available through: Google Compute Engine
32. Agenda
• What’s a Hadoop?
• What’s MapR?
• Enterprise Grade Hadoop
• Making Hadoop More Open
44. Customer Examples: Import/Export Data
• Network security vendor
• Network packet captures from switches are streamed into the cluster
• New pattern definitions are loaded into online IPS via NFS
• Online measurement company
• Clickstreams from application servers are streamed into the cluster
• SaaS company
• Exporting a database to Hadoop over NFS
• Ad exchange
• Bids and transactions are streamed into the cluster
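A common thread in these examples is that with Direct Access NFS, any program that writes ordinary files can stream data into the cluster; no Hadoop client libraries are needed on the producer side. A sketch, assuming a hypothetical NFS mount point for the cluster:

```python
import os

# Hypothetical mount point -- adjust for your site's NFS export.
MOUNT = "/mapr/my.cluster.com/ingest"

def append_event(event_line, path=None):
    # Plain POSIX append: the producer needs only a filesystem mount,
    # the same way the clickstream and packet-capture examples work.
    path = path or os.path.join(MOUNT, "clicks.log")
    with open(path, "a") as f:
        f.write(event_line + "\n")
```

Exports work the same way in reverse: a database dump or an IPS pattern file is just another file read or written through the mount.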
45. Customer Examples: Productivity and Operations
• Retailer
• Operational scripts are easier with NFS than HDFS + MapReduce
§ chmod/chown, file system searches/greps, perl, awk, tab-complete
• Consolidate object store with analytics
• Credit card company
• User and project home directories on Linux gateways
§ Local files, scripts, source code, …
§ Administrators manage quotas, snapshots/backups, …
• Large Internet company recommendation system
• Web servers serve MapReduce results (item relationships) directly from the cluster
• Email marketing company
• Object store with HBase and NFS
47. Latency Matters
• Ad-hoc analysis with interactive tools
• Real-time dashboards
• Event/trend detection and analysis
• Network intrusion analysis on the fly
• Fraud
• Failure detection and analysis
48. Big Data Processing
                     Batch processing    Interactive analysis     Stream processing
Query runtime        Minutes to hours    Milliseconds to minutes  Never-ending
Data volume          TBs to PBs          GBs to PBs               Continuous stream
Programming model    MapReduce           Queries                  DAG
Users                Developers          Analysts and developers  Developers
Google project       MapReduce           Dremel                   –
Open source project  Hadoop MapReduce    –                        Storm and S4
Introducing Apache Drill…
49. Innovations
• MapReduce
• Scalable IO and compute trumps efficiency with today's commodity hardware
• With large datasets, schemas and indexes are too limiting
• Flexibility is more important than efficiency
• An easy-to-use, scalable, fault-tolerant execution framework is key for large clusters
• Dremel
• Columnar storage provides significant performance benefits at scale
• Columnar storage with nesting preserves structure and can be very efficient
• Avoiding final record assembly as long as possible improves efficiency
• Optimizing for the query use case can avoid the full generality of MR and thus
significantly reduce latency. No need to start JVMs, just push compact queries to
running agents.
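The columnar-storage point can be made concrete with a toy in-memory contrast between row and column layouts. This illustrates only the access pattern; Dremel and Drill use nested columnar formats on disk:

```python
# Row layout: each record stored whole -- a scan of one field still
# touches every record.
rows = [{"user": "a", "clicks": 3, "country": "US"},
        {"user": "b", "clicks": 7, "country": "DE"},
        {"user": "c", "clicks": 2, "country": "US"}]

# Columnar layout: one contiguous sequence per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query over one field reads only that column -- no record assembly,
# which is the "avoid final record assembly" point above.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 12
```

On disk the win is larger still: a column of like-typed values compresses well, and columns a query never mentions are never read.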
• Apache Drill
• Open source project based upon Dremel’s ideas
• More flexibility and openness