Webinar: The Future of Hadoop

With a community of over 500 contributors, Apache Hadoop and related projects are evolving at an ever-increasing rate. Join the co-creator of Apache Hadoop, Doug Cutting, and Cloudera’s Chief Scientist, Jeff Hammerbacher, for a discussion of the most exciting new features being developed by the Apache Hadoop community.

  1. The Future of Hadoop
     Doug Cutting | A Founder of Apache Hadoop
     Jeff Hammerbacher | Chief Scientist, Cloudera
     Welcome to the webinar!
     Audio/Telephone: +1 (215) 383-1016
     Access Code: 421-634-457
     Audio Pin: Shown after joining the webinar
     Hadoop, HBase, Pig, Hive, Bigtop, Avro, Flume & Whirr are trademarks of the Apache Software Foundation
  2. Housekeeping
     ▪ All lines are on mute
     ▪ Ask questions at any time using the Questions panel on GoToMeeting
     ▪ Slides and recording will be available on www.cloudera.com/events
     ©2011 Cloudera, Inc. All Rights Reserved.
  3. Presentation Outline
     ▪ 1. Context
     ▪ 2. Apache Bigtop
     ▪ 3. Apache Hadoop Core
     ▪ 4. Apache HBase, Hive, and Pig
     ▪ 5. Other Components
     ▪ Questions and Discussion
  4. 1. Context
  5. Context: Data
     ▪ 1.8 ZB will be created and replicated in 2011
       ▪ Up 9x in the last five years
       ▪ More than 90% of this data is unstructured
       ▪ Enterprises have some liability for 80% of this data
       ▪ Enterprises will spend $4T on managing data in 2011
       ▪ Source: IDC Digital Universe Report 2011
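As a rough check on the growth figure above, the sketch below converts "9x in five years" into an implied annual rate and an implied starting volume. It assumes steady compound growth, which the IDC report itself may not claim, and the variable names are purely illustrative.

```python
# Figures taken from the slide; steady compounding is an assumption.
volume_2011_zb = 1.8   # zettabytes created and replicated in 2011
growth_factor = 9.0    # total growth over the period
years = 5

# Annual growth rate r such that (1 + r) ** years == growth_factor.
annual_rate = growth_factor ** (1 / years) - 1

# Implied volume five years earlier.
volume_five_years_ago_zb = volume_2011_zb / growth_factor

print(f"Implied annual growth: {annual_rate:.1%}")
print(f"Implied volume five years earlier: {volume_five_years_ago_zb:.2f} ZB")
```

Under that assumption, the data volume grows by roughly 55% per year, starting from about 0.2 ZB.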
  6. Context: Hadoop
     ▪ Apache Hadoop and related software are designed for this world
     ▪ Volume
       ▪ Commodity hardware and open source software lower cost and increase capacity
     ▪ Velocity
       ▪ Data ingest speed aided by append-only and schema-on-read design
     ▪ Variety
       ▪ Multiple tools to structure, process, and access data
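The "append-only and schema-on-read" point can be illustrated with a minimal sketch. This is not Hadoop code: the in-memory list stands in for files in HDFS, and `ingest`, `read_with_schema`, and the sample records are hypothetical names chosen for illustration. The idea is that ingest only appends raw bytes, while structure is applied later by whoever reads the data.

```python
import json

def ingest(log, line):
    """Append a raw record as-is: no parsing or validation at write time."""
    log.append(line)

def read_with_schema(log, schema):
    """Apply a schema at read time: parse each raw record and keep only
    the fields the reader asks for, casting them to the requested types.
    Fields missing from a record come back as None."""
    for line in log:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }

# Hypothetical stand-in for an append-only file.
log = []
ingest(log, '{"user": "ann", "bytes": "120"}')
ingest(log, '{"user": "bob", "agent": "curl"}')  # different shape, still accepted

# Two readers can impose different schemas on the same raw data.
rows = list(read_with_schema(log, {"user": str, "bytes": int}))
agents = list(read_with_schema(log, {"agent": str}))
```

Because nothing is rejected or transformed on the write path, ingest stays cheap, and new record shapes do not require a migration before they can be stored.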
  7. Context: Hadoop
  8. Context: HDFS and MapReduce
     ▪ Apache Hadoop = HDFS + MapReduce
       ▪ Similar to the kernel of an operating system
       ▪ Referred to as “Hadoop Core”
     ▪ Related components are often deployed with Hadoop
       ▪ For example: HBase, Hive, Pig, Oozie, Flume, Sqoop
       ▪ Together, these components form a “Hadoop Stack”
       ▪ Not all components must be deployed
  9. Context: Bigtop
     ▪ What standards should all components follow?
     ▪ How can we ensure all components of the stack work together?
     ▪ How can we find the right version of each component?
     ▪ How can we make it easy to install an additional component?
  10. 2. Apache Bigtop
  11. Apache Bigtop
     ▪ Now incubating at Apache
     ▪ Hadoop ecosystem-wide project, including:
       ▪ Interoperability testing of components
       ▪ Packaging of compatible versions of components
     ▪ Like Fedora, Debian, or CentOS for the Hadoop ecosystem
     ▪ Releases are not a single artifact
       ▪ Rather a set of interdependent, compatible components
  12. Apache Bigtop
     ▪ Current components
       ▪ Hadoop
       ▪ HBase
       ▪ Hive
       ▪ Pig
       ▪ Oozie
       ▪ Sqoop
       ▪ Flume
       ▪ ZooKeeper
       ▪ Whirr
  13. Apache Bigtop
     ▪ Outputs
       ▪ Source
       ▪ RPM
       ▪ Deb
     ▪ Tests
       ▪ Integration
       ▪ Package
       ▪ Smoke
     ▪ Release 0.1.0 under vote now!
  14. 3. Apache Hadoop Core
  15. Apache Hadoop Core
     ▪ Current stable releases based on branches from 0.20
     ▪ Upcoming release: 0.22
       ▪ Includes both security and new implementation of append
       ▪ Not expected to be run at scale or commercially supported
       ▪ Nearly ready for vote
     ▪ Upcoming release: 0.23
       ▪ Build and dependency management moved to Maven
       ▪ Branch to happen soon
  16. HDFS
     ▪ Robustness
       ▪ HDFS-1073: Checkpointing of image and edits log
     ▪ Availability
       ▪ HDFS-1623: High availability
     ▪ Performance
       ▪ HDFS-941: Faster random reads
       ▪ HDFS-2080: Faster checksums
  17. HDFS
     ▪ Scalability
       ▪ HDFS-1052: Federation of the NameNode
       ▪ Source of diagram: http://www.hortonworks.com/an-introduction-to-hdfs-federation/
  18. MapReduce
     ▪ Modularity
       ▪ MAPREDUCE-279: MapReduce 2.0
         ▪ Break JobTracker into ResourceManager and ApplicationMaster
         ▪ Replace TaskTracker with NodeManager
       ▪ Source of diagram: http://www.odbms.org/download/dean-keynote-ladis2009.pdf
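The split described for MapReduce 2.0 can be pictured with a toy model. The class names below come from the slide, but every method and field (`allocate`, `launch`, `free_slots`, `run`) is a hypothetical simplification and not the Hadoop API. The sketch only aims to show the division of responsibility: one cluster-wide ResourceManager hands out capacity, while each job gets its own ApplicationMaster that decides what to run.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    """Per-machine agent: runs whatever tasks it is handed (toy model)."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)

    def launch(self, task):
        self.free_slots -= 1
        self.running.append(task)

class ResourceManager:
    """Cluster-wide: tracks capacity only, knows nothing about job logic."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self):
        # Return the first node with spare capacity, or None if the cluster is full.
        for node in self.nodes:
            if node.free_slots > 0:
                return node
        return None

class ApplicationMaster:
    """Per-job: owns scheduling and tracking for a single application."""
    def __init__(self, job_name, tasks, resource_manager):
        self.job_name = job_name
        self.tasks = tasks
        self.resource_manager = resource_manager

    def run(self):
        placed = {}
        for task in self.tasks:
            node = self.resource_manager.allocate()
            if node is None:
                break  # no capacity left; a real system would wait and retry
            node.launch(f"{self.job_name}/{task}")
            placed[task] = node.name
        return placed

nodes = [NodeManager("node1", free_slots=1), NodeManager("node2", free_slots=2)]
rm = ResourceManager(nodes)
placed = ApplicationMaster("wordcount", ["map0", "map1", "reduce0"], rm).run()
```

Since the ResourceManager in this model never inspects what a task does, the same allocation path could serve frameworks other than MapReduce.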
  19. MapReduce
     ▪ Potential New Frameworks
       ▪ MAPREDUCE-2719: Distributed shell
       ▪ MAPREDUCE-2720: Distributed Java commands
       ▪ MPI: Communication-intensive parallelism
       ▪ Fast scans and aggregations
         ▪ OpenDremel
       ▪ Bulk Synchronous Parallel
         ▪ Giraph, Golden Orb, Hama, et al.
       ▪ Actor Model (streaming)
         ▪ S4, Akka, Storm, et al.
  20. 4. HBase, Hive, and Pig
  21. Apache HBase
     ▪ Upcoming release: 0.92.0
     ▪ Server-side triggers
       ▪ HBASE-2000: Coprocessors
     ▪ Availability
       ▪ HBASE-1730/4213: Online schema changes
     ▪ Performance
       ▪ HBASE-3857: HFile 2.0
     ▪ HBase book in September!
  22. Apache Hive
     ▪ Upcoming release: 0.8
     ▪ Data transfer
       ▪ HIVE-306: INSERT INTO
       ▪ HIVE-1918: EXPORT/IMPORT
     ▪ Indexes
       ▪ HIVE-1644: Automatically use indexes
       ▪ HIVE-1803: Bitmap indexes
     ▪ Data formats
       ▪ HIVE-895: Avro support
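For readers unfamiliar with the bitmap indexes mentioned above, here is a minimal sketch of the general technique. It says nothing about how HIVE-1803 implements it: the function names and sample columns are hypothetical, and a Python integer is used as the bitset. Each distinct column value gets a bitmap with one bit per row, so predicates on several columns can be combined with bitwise operations before any row is read.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is set
    when row i holds that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical four-row table with two low-cardinality columns.
country = ["US", "DE", "US", "FR"]
status = ["ok", "ok", "err", "ok"]

country_index = build_bitmap_index(country)
status_index = build_bitmap_index(status)

# WHERE country = 'US' AND status = 'ok'
us_and_ok = rows_of(country_index["US"] & status_index["ok"])

# WHERE country = 'US' OR country = 'FR'
us_or_fr = rows_of(country_index["US"] | country_index["FR"])
```

This approach tends to suit columns with few distinct values, since the index needs one bitmap per value.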
  23. Apache Pig
     ▪ Recent release: 0.9
     ▪ Scripting
       ▪ PIG-1479: Embedding Pig in Python
       ▪ PIG-1793: Macro expansion
     ▪ Debugging
       ▪ PIG-1712: ILLUSTRATE rework
     ▪ Data formats
       ▪ PIG-1748: Avro support
  24. 5. Other Components
  25. Other Components
     ▪ Apache Incubator
       ▪ Sqoop, Flume, and Oozie now incubating
       ▪ Whirr graduated to a top-level Apache project
     ▪ Apache Avro
       ▪ Interoperability with Protocol Buffers and Thrift
       ▪ Column-oriented file format
       ▪ Python MapReduce implementation
     ▪ Apache ZooKeeper
       ▪ Multi-update
       ▪ Kerberos authentication of clients
