4. What’s a Distribution?
How many of you get your Apache
httpd from apache.org?
Pretty much everyone uses Linux
distributions to get software
CDH is a Hadoop distribution in the
same way that Ubuntu is a Linux
distribution
5. What is CDH?
Apache Hadoop and its ecosystem,
packaged up and easier to install
RPM, Debian, and tarball installs
Better Linux citizenship
Maintained and tested patch series on
top of upstream
Ecosystem compatibility guarantees
7. CDH - Included Packages
Apache Hadoop (MR, HDFS, and
Common)
Apache Pig
Apache Hive
Cloudera Desktop
HBase and ZooKeeper (contributed by
HBase team)
... more to come
8. Installation Options
APT and Yum repositories
apt-get install hadoop
yum install hadoop
hadoop-conf-pseudo package to get
started
tarball
9. CDH on Amazon EC2
hadoop-ec2 launch-cluster
todd-cluster 20
Support for HDFS on EBS volumes
(better performance than S3)
Cloudera Desktop automatically
installed and launched
Great if your data is already on EBS or
S3
Soon to come: VMware (vCloud) and
Rackspace
11. Linux citizenship
Hadoop should act like other software
you’re used to
Configuration using alternatives in
/etc
Logs in /var/log
Start/stop with init.d services
12. Patches in CDH
Get bug fixes early
Backport “Safe” new features
Sqoop, MRUnit
Fair Scheduler on 0.18
/metrics servlet
S3 fixes
etc...
Backport “Really Safe” performance
patches
13. What exactly am I getting?
Hadoop in CDH is still Apache 2.0 licensed
Read the changelog:
...hadoop-0.20/cloudera/CHANGES.cloudera.txt
Read the patches:
...hadoop-0.20/cloudera/patches/
Build it yourself:
...hadoop-0.20/cloudera/do-release-build
16. Is this a fork?
No way!
All functionality patches submitted
upstream (some build-system patches
only apply to our build)
We employ 2 committers full-time, plus
several contributors
We regularly meet and work with other
community members from Yahoo!,
Facebook, etc.
17. My one commercial plug
...gotta pay the bills
We provide paid support for CDH
Someone to call if your cluster is down
Access to knowledgeable Hadoop
engineers
Configuration and tuning help
Process design reviews
Prioritize patches you need (and hot
fixes for critical issues)
</salesman>
19. Versions of CDH
Debian versioning scheme
stable
no new features, lots of “soak time”
comparable to RHEL 5, Ubuntu LTS, or
Debian stable
recommended for critical production
deployments
20. Versions of CDH
Debian versioning scheme
testing
considered usable - testing, not
untested!
has whiz-bang features and newer
versions
recommended for shops who like the
bleeding edge, or for those in PoC/dev
stage
21. Versions of CDH
CDH1 (stable)
Released March ’09
Hadoop 0.18.3, Hive 0.3, Pig 0.2
Will become oldstable this winter
CDH2 (testing)
Released June ’09
Hadoop 0.18.3, Hadoop 0.20.1, Pig 0.5,
Hive 0.4, HBase 0.20
Can install 0.18 and 0.20 at the same
time
Will become stable this winter
22. CDH2 Package Versioning
hadoop-0.18-0.18.3+65-1.cloudera.noarch.rpm
A hadoop package based on Apache Hadoop
0.18.3 with 65 patches
hadoop-0.20-0.20.0+4.4-1.cloudera.noarch.rpm
A hadoop package based on Apache Hadoop
0.20.0 with 4 patches in testing and 4
security/critical fixes
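To make the naming scheme concrete, here is a minimal Python sketch that splits one of these RPM file names into its parts. It is an illustration only, not a Cloudera tool: the names `CdhPackage` and `parse_rpm_name` and the regular expression are made up for this example, and reading "+4.4" as "patch count, then fix count" simply follows the two examples above. When the suffix has no second number (as in "+65"), the sketch reports the fix count as unknown rather than guessing.

```python
import re
from typing import NamedTuple, Optional


class CdhPackage(NamedTuple):
    """Parts of a CDH RPM file name (hypothetical helper type)."""
    name: str                # e.g. "hadoop-0.18"
    upstream_version: str    # Apache Hadoop release the package is based on
    patches: int             # number after the "+"
    fixes: Optional[int]     # number after the ".", or None if absent
    release: str             # e.g. "1.cloudera"
    arch: str                # e.g. "noarch"


# Assumed layout: <name>-<upstream>+<patches>[.<fixes>]-<release>.<arch>.rpm
_PATTERN = re.compile(
    r"^(?P<name>.+)-(?P<upstream>\d+(?:\.\d+)*)\+(?P<level>\d+(?:\.\d+)?)"
    r"-(?P<release>[^-]+)\.(?P<arch>[^.]+)\.rpm$"
)


def parse_rpm_name(filename: str) -> CdhPackage:
    """Split a CDH-style RPM file name into its components."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a recognised CDH package name: {filename!r}")
    # "65" -> ("65", "", ""), "4.4" -> ("4", ".", "4")
    patches, _, fixes = match.group("level").partition(".")
    return CdhPackage(
        name=match.group("name"),
        upstream_version=match.group("upstream"),
        patches=int(patches),
        fixes=int(fixes) if fixes else None,
        release=match.group("release"),
        arch=match.group("arch"),
    )


if __name__ == "__main__":
    for rpm in (
        "hadoop-0.18-0.18.3+65-1.cloudera.noarch.rpm",
        "hadoop-0.20-0.20.0+4.4-1.cloudera.noarch.rpm",
    ):
        print(parse_rpm_name(rpm))
```

Under these assumptions the first file name yields upstream version 0.18.3 with 65 patches, and the second yields upstream version 0.20.0 with 4 patches and 4 fixes.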
23. Where do I get CDH?
http://archive.cloudera.com/