"If you're using Hadoop in production, how do you manage it? Does the distribution you're using provide any tools to make the job easier? What are the pitfalls? Are there parts of the system that are less robust or that have problems more often? Are you running Hadoop on bare metal, or in a cloud environment, and is one easier than the other?"
MapR Senior Solutions Architect David Tucker speaks about the challenges and capabilities in managing a cluster. This talk was given at the SF Bay Area Large Scale Production Engineering Meetup (Sept 19, 2013).
We all know about Hadoop, so no need to get specific there.
Another area of ease of use is the MapR Control System and its Heatmap, which simplify health monitoring, cluster administration and application provisioning at scale. Each small rectangle in the UI represents a separate node. You can select a wide variety of elements to monitor, including custom services. MapR also includes alerts and alarms, so administrators are not required to monitor constantly. There are also filters and group operations to simplify actions.
With MapR, Hadoop is lights-out data center ready. MapR provides five nines (99.999%) of availability, including support for rolling upgrades, self-healing and automated stateful failover. MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business continuity features. MapR provides point-in-time recovery to protect against application and user errors. There is end-to-end checksumming, so data corruption is automatically detected and corrected by MapR’s self-healing capabilities. Mirroring across sites is fully supported. All these features support lights-out data center operations. Every two weeks an administrator can take a MapR report and a shopping cart full of drives and replace the failed drives.
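The idea behind end-to-end checksumming with self-healing can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MapR's implementation: the function names (`store`, `is_intact`, `read_with_self_heal`), the use of CRC32 and the three-replica layout are all hypothetical choices made for the example.

```python
import zlib


def store(data):
    """Keep a checksum next to the data (hypothetical replica record)."""
    return {"data": data, "crc": zlib.crc32(data)}


def is_intact(replica):
    """A replica is intact if its data still matches the stored checksum."""
    return zlib.crc32(replica["data"]) == replica["crc"]


def read_with_self_heal(replicas):
    """Return data from the first intact replica and repair corrupt ones."""
    good = next((r for r in replicas if is_intact(r)), None)
    if good is None:
        raise IOError("all replicas are corrupt")
    for i, replica in enumerate(replicas):
        if not is_intact(replica):
            replicas[i] = dict(good)  # overwrite the bad copy with a good one
    return good["data"]


# Three replicas of one block; one is silently corrupted on disk.
replicas = [store(b"block-0") for _ in range(3)]
replicas[0]["data"] = b"blocK-0"
data = read_with_self_heal(replicas)
```

After the read, the caller gets the correct bytes and the corrupted replica has been replaced, with no administrator involved.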
The NameNode in Hadoop today is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode. The NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster, and the maximum number of files you can support is 200M, and that is with an extremely high-end server. At Facebook, 50% of Hadoop processing goes to packing and unpacking files to try to work around this limitation. MapR scales uniformly.
MapR also uniquely provides full snapshots. No other Hadoop distribution provides this capability. Other distributions provide replication, which keeps additional copies to protect against data loss, but it does nothing to protect against application or user errors, which are replicated across the cluster. With MapR you have snapshots and point-in-time recovery. A user or administrator can simply open the snapshot directory and recover a full directory or an individual file. The snapshots use a redirect-on-write method, which provides this protection without duplicating the data. In other words, you can snapshot a 1 petabyte cluster in seconds with no additional data storage.
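To make the redirect-on-write idea concrete, here is a toy Python sketch. It is a simplified model, not MapR's on-disk format: the `Volume` class, its methods and the file names are hypothetical. The point it shows is that a snapshot copies only the name-to-block table, while writes after the snapshot go to fresh blocks and leave the old ones in place.

```python
class Volume:
    """Toy volume: a name table pointing at immutable data blocks."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes, shared by live data and snapshots
        self.live = {}       # file name -> block id
        self.snapshots = {}  # snapshot name -> frozen copy of the name table
        self._next_id = 0

    def write(self, name, data):
        # Redirect on write: new data lands in a new block, the old block is untouched.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[name] = block_id

    def read(self, name):
        return self.blocks[self.live[name]]

    def snapshot(self, snap_name):
        # Copies only metadata (the name table), never the data blocks.
        self.snapshots[snap_name] = dict(self.live)

    def restore_file(self, snap_name, name):
        # Point the live name back at the block recorded in the snapshot.
        self.live[name] = self.snapshots[snap_name][name]


vol = Volume()
vol.write("report.csv", b"v1")
vol.snapshot("monday")
vol.write("report.csv", b"v2 (bad edit)")
vol.restore_file("monday", "report.csv")
```

Taking the snapshot stored no extra data blocks, which is why the cost is independent of how much data the volume holds.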
MapR is also the only distribution for Apache Hadoop that provides wide-area replication and mirroring, allowing you to provide full business continuity. MapR’s Hadoop distribution allows you to automatically and transparently mirror your data to another cluster. The system synchronizes clusters incrementally, transferring only the changed data, which means very low overhead and higher performance. With MapR, you can also easily deploy a research cluster alongside a production cluster so that researchers, developers and analysts can experiment without impacting the production cluster. You can mirror between two geographically separated clusters for disaster recovery and implement your Recovery Time Objectives to assure business continuity. MapR’s mirroring also supports bulk data transfer to other clusters. Hadoop users today do not have a way to interoperate between private and public clouds. You can use MapR’s mirroring to synchronize data between a research cluster and your production cluster, or between a private and a public cloud.
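Incremental synchronization can be illustrated with a small Python sketch. This is not how MapR mirroring is implemented; it assumes, purely for the example, that each cluster is a dictionary of file name to contents and that changes are detected by comparing SHA-256 digests. The names `digest` and `incremental_mirror` are hypothetical.

```python
import hashlib


def digest(data):
    """Content fingerprint used to decide whether an entry has changed."""
    return hashlib.sha256(data).hexdigest()


def incremental_mirror(source, target):
    """Make target match source, transferring only new or changed entries.

    Returns the sorted names that actually had to be transferred.
    """
    transferred = []
    for name, data in source.items():
        if name not in target or digest(target[name]) != digest(data):
            target[name] = data
            transferred.append(name)
    # Drop entries that no longer exist on the source.
    for name in list(target):
        if name not in source:
            del target[name]
    return sorted(transferred)


production = {"a": b"1", "b": b"2", "c": b"3"}
dr_site = {"a": b"1", "b": b"old", "stale": b"x"}
sent = incremental_mirror(production, dr_site)
```

Only the changed file `b` and the new file `c` cross the wire; the unchanged file `a` is skipped, which is where the low overhead comes from.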
Snowden story: he got the documents because he was administering a file server that held classified information.
The MapR Control System also provides advanced job management capabilities, enabling an administrator to have complete visibility and control over the operation of the cluster, jobs and tasks. Unique capabilities of the MapR Control System: automated; comprehensive, covering hardware and software (Cloudera has no visibility into hardware faults); full visibility and control; supports lights-out operation.