The document is an introduction to analytics and big data using Hadoop presented by Geoff Fawkes. It discusses the challenges of large amounts of data and how Hadoop addresses them through its HDFS distributed file system and MapReduce programming model. It provides examples of how companies use Hadoop for applications like analyzing customer behavior from set-top cable boxes or performing sentiment analysis on product reviews. The presentation recommends further reading on analytics, big data, and data science topics.
Housekeeping: Keep your mobile devices on, turn up the ringer volume really loud, tweet, check in on Foursquare, update your Facebook as I speak – we now live in a multi-tasking world so I’m OK with interruptions.
Ask questions. If I don’t have the answer, someone else may, and you can drop me an email after.
How many pages?!
Introductory presentation for new hires at Teradata. Mixture of business and engineering concepts. Scratch the surface – references at the end of presentation.
Zettabyte = 10 to the power of 21 bytes
Teradata used Tableau
Baidu is a Chinese-language search engine, comparable to Google.
William Gibson, author, poet quote. Coined the term “cyberspace” in his 1982 short story “Burning Chrome” and popularized it in his 1984 novel Neuromancer. Predicted the rise and popularity of reality TV.
Structured Data – defined format, such as XML document or database tables
Semi-Structured Data – there may be a schema, but it is often ignored, e.g. a spreadsheet, in which cells/fields can store any type of data
Unstructured Data – no particular internal structure, e.g. plain text, an image file, a Twitter feed.
80% of Big Data is unstructured.
If Gartner says so, it must be right ;>)
Motivations for Hadoop:
Huge dependency on network and huge bandwidth demands
Scaling up and down is not a smooth process
Partial failures are difficult to handle
A lot of processing power is spent on transporting data
Data synchronization is required during exchange
As a developer you should not have to handle these issues in your application. These are the problems that Hadoop solves, leaving you to focus on business logic.
Basic I/O problem – while the storage capacity of hard drives has increased, access speed (the rate at which data can be read) has not kept pace.
E.g. 1 TB drives are normal, but at a transfer rate of about 100 MB/s it would take more than 2.5 hours to read all the data on the drive.
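The drive-read arithmetic can be checked with a short calculation. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes) and a sustained rate of 100 megabytes per second; the function name and the 100-drive comparison are illustrative additions, not figures from the slides.

```python
# Back-of-the-envelope read time for a full drive.
# Assumptions: 1 TB = 10**12 bytes, sustained rate = 100 MB/s (decimal megabytes).
DRIVE_BYTES = 10**12
RATE_BYTES_PER_SEC = 100 * 10**6


def read_time_hours(total_bytes, rate_bytes_per_sec, drives=1):
    """Hours to read total_bytes spread evenly over `drives` disks read in parallel."""
    seconds = total_bytes / (rate_bytes_per_sec * drives)
    return seconds / 3600


single = read_time_hours(DRIVE_BYTES, RATE_BYTES_PER_SEC)
parallel = read_time_hours(DRIVE_BYTES, RATE_BYTES_PER_SEC, drives=100)

print(f"one drive: {single:.2f} h")
print(f"100 drives in parallel: {parallel * 60:.2f} min")
```

Under these assumptions a single drive needs a little under 2.8 hours, while the same data spread over 100 drives read in parallel takes under two minutes. That gap is the motivation for distributing data across many disks.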
The world continues to move towards commodity hardware.
Commercial companies focused on developing and supporting Hadoop: Hortonworks, Cloudera, Amazon Web Services (AWS)
In simpler terms, Hadoop is a framework that lets many machines work together to analyze large sets of data.
Hadoop framework supports reliability and data motion. MapReduce divides an application’s retrieval of data into many small fragments of work, each executed or re-executed on a node in the cluster. Data is stored on many compute nodes, providing very high aggregate bandwidth across the cluster for HDFS. Node failures are automatically handled by the framework, through parallelism, heartbeat, checksum and replication.
The Hadoop platform consists of: the Hadoop kernel (implemented in Java), MapReduce (jobs can be written in many programming languages) and HDFS (Hadoop Distributed File System). HDFS can be accessed natively through a Java API for applications to use (a C language wrapper is also available).
Ext3 – third extended file system, commonly used with the Linux kernel; supported
XFS – journaling file system supporting 64-bit addressing and parallel I/O
Blocks – a disk block is typically 512 bytes, a file system block is a few kilobytes, and an HDFS block is 64 MB by default (often configured to 128 MB). An HDFS block is much larger than a disk block to minimize the cost of disk seeks relative to transfer time.
HDFS files are write-once. Once written, they are closed and cannot be changed. A typical single file in HDFS is gigabytes to terabytes in size.
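To see why a large block reduces the relative cost of seeks, the sketch below compares the share of read time spent seeking for different block sizes. The 10 ms seek time, the 100 MB/s transfer rate, and the 4 KB file system block are assumed, illustrative values rather than numbers from the presentation.

```python
# Assumed, illustrative disk characteristics (not from the presentation).
SEEK_SECONDS = 0.010              # roughly 10 ms per seek
TRANSFER_BYTES_PER_SEC = 100e6    # roughly 100 MB/s sustained transfer


def seek_overhead(block_bytes):
    """Fraction of total read time spent seeking when every block costs one seek."""
    transfer_seconds = block_bytes / TRANSFER_BYTES_PER_SEC
    return SEEK_SECONDS / (SEEK_SECONDS + transfer_seconds)


sizes = [
    ("512 B disk block", 512),
    ("4 KB file system block", 4 * 1024),
    ("64 MB HDFS block", 64 * 1024**2),
    ("128 MB HDFS block", 128 * 1024**2),
]

for label, size in sizes:
    print(f"{label:>24}: {seek_overhead(size):.2%} of read time spent seeking")
```

With these assumed numbers, small blocks spend almost all of their time seeking, while a 64 MB block spends only a percent or two.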
Terminology. A set of machines is a Hadoop cluster, using Master-Slave architecture.
Each Hadoop instance has a single NameNode and a cluster of DataNodes. The NameNode is the software that maintains the file system structure and metadata for the DataNodes. A DataNode is the software that stores and retrieves blocks of data. There can be up to 4,000 slave DataNodes per NameNode.
On the master side, the JobTracker schedules MapReduce tasks and tracks their execution. On each DataNode, a TaskTracker runs the MapReduce tasks assigned to it. The NameNode does not require a lot of disk space, but requires a lot of RAM (it is the brains of the instance). A DataNode does not require a lot of RAM, but requires a lot of disk space.
Failover – the transition from the active NameNode to a standby NameNode, coordinated by a failover controller that relies on ZooKeeper.
HDFS is designed to run on commodity hardware. Low cost servers running Linux/Apache.
The philosophy of the cluster design is to bring computing as close as possible to the data.
All HDFS communication protocols are layered on top of the TCP/IP protocol. NameNode and Datanodes can be located anywhere.
A single instance is a single HDFS cluster.
Hardware failure and data corruption are the norm, rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some components of HDFS are always non-functional (dead).
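Checksums are one way silent corruption can be detected. The sketch below is a toy illustration of the idea using Python's standard `zlib.crc32`; it is not HDFS code, and the 512-byte chunk size and helper names are assumptions made for this example.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum; an assumed value for this sketch


def checksums(data, chunk=CHUNK):
    """Return one CRC32 value per fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data, expected, chunk=CHUNK):
    """Indexes of chunks whose current CRC32 differs from the stored value."""
    current = checksums(data, chunk)
    return [i for i, (now, then) in enumerate(zip(current, expected)) if now != then]


block = bytes(range(256)) * 8      # 2048 bytes of sample data, i.e. 4 chunks
stored = checksums(block)          # checksums recorded when the block was written

damaged = bytearray(block)
damaged[700] ^= 0xFF               # flip the bits of one byte, which lies in chunk 1

print(corrupt_chunks(block, stored))            # undamaged copy: no mismatches
print(corrupt_chunks(bytes(damaged), stored))   # damaged copy: chunk 1 is flagged
```

A reader that finds a mismatch can discard the bad copy and fetch the block from another replica, which is why checksums and replication work together.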
By default, each block is replicated 3 times (the replication factor can be changed by the application in configuration). Replica placement is heavily studied for optimization. HDFS’s rack-aware policy is to put one replica on a node in the local rack and distribute the other replicas to other nodes and other racks, with the goal of balancing data reliability against network traffic between racks.
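The placement idea can be sketched in a few lines. The function below is a hypothetical, simplified policy written for illustration: first replica on the writing node, remaining replicas spread across other racks. It is not the actual HDFS placement algorithm, and the rack and node names are invented.

```python
from itertools import zip_longest


def place_replicas(writer, topology, replication=3):
    """Choose nodes for one block's replicas (simplified, hypothetical policy).

    topology maps rack name -> list of node names; writer is the node writing
    the block. The first replica stays on the writer, the next ones go to nodes
    in other racks (alternating racks), then to other nodes in the local rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    remote_racks = [nodes for rack, nodes in topology.items() if rack != local_rack]
    # Interleave remote racks so consecutive picks land on different racks.
    remote = [n for group in zip_longest(*remote_racks) for n in group if n is not None]
    local_others = [n for n in topology[local_rack] if n != writer]
    candidates = [writer] + remote + local_others
    return candidates[:replication]


topology = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
    "rack3": ["node5"],
}

print(place_replicas("node1", topology))
```

In this toy topology a block written from node1 ends up on three different racks, so losing a whole rack still leaves two copies.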
Separate from file operations, the NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.
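The heartbeat mechanism can be modeled as a table of last-seen times with a timeout. The class below is a toy model, not Hadoop code; the 30-second timeout and the node names are arbitrary values chosen for the example.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode (toy model, not Hadoop code)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes whose most recent heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("datanode-a", now=0)
monitor.heartbeat("datanode-b", now=0)
monitor.heartbeat("datanode-a", now=25)   # only datanode-a keeps reporting

print(monitor.dead_nodes(now=40))
```

A node that stops reporting is treated as dead, at which point the blocks it held need to be re-replicated from the surviving copies.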
If the NameNode itself crashes, its metadata has to be restored from a backup on disk. The ZooKeeper tool provides NameNode failover coordination, giving high availability through active/passive NameNodes.
The UNIX analogy is a large distributed pipeline
Map Server/Function 1, Map Server/Function 2, Map Server/Function 3: each process in parallel
In MapReduce, every input is viewed as a Key-Value pair. Eg. Key=Sentence 1, Value=“John has a red car, which has no radio”.
Step 1 – Each sentence is given to a Map, and each word is counted in a wave. In this example, there are 3 Map jobs.
Step 2 – Shuffle and sort simply moves the words to server locations, where all the unique keys are brought together.
Step 3 – The words on each server are aggregated, and reduced. In this example Reduce is performed across two waves. Final output on lower right.
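The three steps above can be sketched in plain Python as a single-process word count. This only mimics the data flow on one machine; it does not use the Hadoop API. Sentence 1 is the example from the slide, while Sentences 2 and 3 are made up for illustration.

```python
from collections import defaultdict


def map_phase(key, sentence):
    """Step 1: emit a (word, 1) pair for every word in the sentence."""
    for word in sentence.lower().replace(",", "").split():
        yield word, 1


def shuffle(pairs):
    """Step 2: bring all values for the same key together, sorted by key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Step 3: aggregate the values collected for one key."""
    return word, sum(counts)


sentences = {
    "Sentence 1": "John has a red car, which has no radio",
    "Sentence 2": "Mary has a red bicycle",   # hypothetical input
    "Sentence 3": "Bill has no car",          # hypothetical input
}

mapped = [pair for key, text in sentences.items() for pair in map_phase(key, text)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(word, values) for word, values in grouped.items())

print(counts)
```

In a real cluster each call to `map_phase` would run on a different machine and the shuffle would move data over the network, but the shape of the computation is the same.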
As a developer you have to start thinking about your data storage problem in a distributed way, instead of in a monolithic way.
Step 1 – data is broken into file splits of 64 MB (or 128 MB) and the blocks are moved to different DataNodes
Step 2 – Once all the blocks are moved, the Hadoop framework passes on your program to each DataNode
Step 3 – Job Tracker then starts scheduling the programs on individual Datanodes
Step 4 – Once all the Datanodes are done, the output (yellow) is written back
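The splitting and scheduling steps can be sketched as follows. This is a simplified illustration: splits are dealt out round-robin, whereas a real scheduler would take into account where the blocks actually live. The function name and node names are hypothetical.

```python
import math

SPLIT_BYTES = 64 * 1024**2  # 64 MB split size, as in Step 1


def plan_job(file_bytes, datanodes, split_bytes=SPLIT_BYTES):
    """Return {datanode: [split indexes]} with splits dealt out round-robin."""
    n_splits = math.ceil(file_bytes / split_bytes)
    plan = {node: [] for node in datanodes}
    for split in range(n_splits):
        plan[datanodes[split % len(datanodes)]].append(split)
    return plan


# A 1 GB file (1024**3 bytes) on a three-node cluster.
plan = plan_job(1024**3, ["datanode-1", "datanode-2", "datanode-3"])

for node, splits in plan.items():
    print(node, splits)
```

Here the 1 GB file becomes 16 splits, and each node processes five or six of them in parallel before the results are written back.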
Also built on top of Hadoop are the helper applications:
Hive – interactive SQL query and modeling using a data warehouse view of HDFS. Projects a table structure on the dataset and then manipulates it with HiveQL.
Pig – Data flow for tedious MapReduce jobs. A language for expressing data analysis and infrastructure processes.
HBase – Columnar NoSQL store for billions of rows
HCatalog – Table and schema management
ZooKeeper – coordination of failover from the NameNode to its backup
Ambari – management tool
Download commercial implementations: Hortonworks (Sandbox is a single node download), Cloudera, Amazon services
Question is not “Why should I care about Big Data”, but rather, how can I get closer to Big Data and start taking advantage of it.
Thanks to Peter Smith and Michel Ng for organizing. If you have a topic you would like to present on, see Peter – contribute your expertise to the tech ecosystem in Vancouver
Send me questions via LinkedIn; a copy will be posted to my profile
Hootsuite, Quickmobile, a few others in Vancouver looking for analytics developers – have a look