Jon Bloom, a senior consultant from Agile Bay, Inc., will present on big data and Hadoop. The session agenda includes definitions of big data and Hadoop, a comparison of BI and Hadoop, and a demo. Hadoop is an open source Apache project for distributed computing across commodity servers using batch processing. It includes HDFS, YARN, MapReduce and an ecosystem of tools like HBase and Hive. While Hadoop extends BI capabilities to massive datasets, BI is suited for datasets up to hundreds of gigabytes versus Hadoop which handles terabytes to petabytes of data.
5. Terms and Acronyms
Hadoop:
Apache project (open source) project to develop software for
reliable, scalable, distributed computing.
Cluster:
A group of computers (nodes) linked together to perform a highly-available
and high computation work
HDFS
distributed file system that provides high-throughput access to application
data.
YARN
A framework for job scheduling and cluster resource management.
MapReduce
A system for parallel processing of large data sets.
12. BI vs. Hadoop
Hadoop not a replacement of BI
Extends BI capabilities
BI = Scale up to 100s of Gigabytes
Hadoop = From 100s of Gygabytes to Terabytes
(1,000s og Gygabytes) and Terabytes (1,000,000
Gigabytes)