An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem: a detailed explanation of how HDFS, MapReduce, and HBase work, and a discussion of how to tune HBase performance. Finally, a look at industry trends, including the challenges Bloomberg faces, and is solving, in using Hadoop for financial data.
7. Big Data Origins
• Indexing the web requires lots of storage
• Petabytes of data!
• Economic problem – reliable servers expensive!
• Solution:
• Cram in as many cheap machines as possible
• Replace them when they fail
• Solve reliability via software!
8. Big Data Origins Cont’d
• DBs are slow and expensive
• Lots of unneeded features
RDBMS          | NoSQL
---------------|----------------------
ACID           | Eventual consistency
Strongly typed | No type checking
Complex joins  | Get/Put
RAID storage   | Commodity hardware
9. Big Data Origins Cont’d
• Google publishes papers about:
• GFS (2003)
• MapReduce (2004)
• BigTable (2006)
• Hadoop, originally developed at Yahoo, accepted as an Apache top-level project in 2008
13. Hadoop Ecosystem
• HDFS – Hadoop Distributed File System
• Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
• Hive: provides SQL-like access to your Big Data.
• HBase: the Hadoop database.
• HCatalog: for defining and sharing schemas.
• Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters.
• ZooKeeper: an open-source server which enables highly reliable distributed coordination.
• Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
• Mahout: a scalable machine learning library.
14. HDFS
• Hadoop Distributed File System
• The basis for all the other tools, which are built on top of it
• Allows for distributed workloads (see the sketch below)
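As a rough illustration of how applications sit on top of HDFS, here is a minimal sketch of writing and reading a file through the Hadoop FileSystem API; the path /user/demo/hello.txt is just a hypothetical example, and the cluster address is picked up from the usual core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write a small file, then read it back
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("hello, HDFS");
    }
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
  }
}
```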
18. MapReduce demo
• To run a MapReduce job, you can use:
• A custom Java application (see the word-count sketch below)
• Pig – a nice interface
• Hadoop Streaming + any executable, e.g. Python
• Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
• Hive – SQL over MapReduce – “we put the SQL in NoSQL”
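A minimal sketch of the "custom Java application" route, using the classic word-count example with the org.apache.hadoop.mapreduce API; the input and output paths come from the command line and are purely illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```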
19. HBase
• Database running on top of HDFS
• NoSQL – a key/value store (see the Get/Put sketch below)
• Distributed
• Good for sparse point requests, rather than scans as in MapReduce
• Sorted
• Strongly consistent for operations on a single row
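A minimal sketch of key/value access through the HBase Java client; the table name "packages", column family "info", and row key "row1" are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("packages"))) {  // hypothetical table

      // Put: write one cell (row, column family, qualifier, value)
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("shipped"));
      table.put(put);

      // Get: read the cell back by row key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```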
20. HBase Architecture
[Diagram: Client; ZooKeeper quorum of three ZK peers; active and standby HMaster; Meta Region Server; three RegionServers; all backed by HDFS]
21. HBase Read
[Same architecture diagram]
The client requests the Meta Region Server address.
22. HBase Architecture
[Same architecture diagram]
The client determines which RegionServer to contact and caches that information.
23. HBase Architecture
[Same architecture diagram]
The client requests data from the RegionServer, which gets the data from HDFS.
25. HMaster
• Only one active master at a time – ensured by ZooKeeper
• Keeps track of all table metadata
• Used in table creation, modification, and deletion.
• Not used for reads
26. Region Server
• This is the worker node of HBase
• Performs Gets, Puts, and Scans for the regions it handles
• Multiple regions are handled by each Region Server
• On startup
• Registers with ZooKeeper
• HMaster assigns it regions
• Physical blocks on HDFS may or may not be on the same machine
• Regions are split if they get too big
• Data is stored in a format called HFile
• Caching is what gives good read performance; the cache works on blocks, not rows
27. HBase Write – step 1
[Diagram: Region Server containing a MemStore and HFiles, with its WAL on HDFS]
The Region Server persists the write at the end of the WAL.
28. HBase Write – step 2
[Same diagram]
The Region Server saves the write in a sorted map in memory, in the MemStore.
29. HBase Write – offline
[Same diagram]
When the MemStore reaches a configurable size, it is flushed to an HFile (see the configuration sketch below).
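The flush threshold is controlled by the hbase.hregion.memstore.flush.size property. It normally lives in hbase-site.xml; the programmatic form below is only a sketch, using the common 128 MB default.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushTuning {
  // Returns a configuration with the MemStore flush threshold set explicitly.
  public static Configuration withFlushSize() {
    Configuration conf = HBaseConfiguration.create();
    // MemStore size (in bytes) at which a flush to a new HFile is triggered; 128 MB is the usual default.
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
    return conf;
  }
}
```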
30. Minor Compaction
• Flushing a MemStore to an HFile may trigger a minor compaction
• Combines many small HFiles into one large one
• Saves disk reads
• May block further MemStore flushes, so try to keep minor compactions to a minimum
31. Major Compaction
• Happens at configurable times for the system (see the configuration sketch below)
• e.g. once a week, on weekends
• Defaults to once every 24 hours
• Resource-intensive
• Don't set it to "never"
• Reads in all HFiles and makes sure there is one HFile per region per column family
• Purges deleted records
• Ensures that HDFS files are local
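A sketch of the relevant compaction knobs, assuming they would normally be set in hbase-site.xml rather than in code: hbase.hstore.compactionThreshold for when minor compactions kick in, and hbase.hregion.majorcompaction for the major-compaction interval in milliseconds.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionTuning {
  // Returns a configuration with an explicit compaction schedule.
  public static Configuration compactionSchedule() {
    Configuration conf = HBaseConfiguration.create();
    // Minimum number of HFiles in a store before a minor compaction is considered (default 3).
    conf.setInt("hbase.hstore.compactionThreshold", 3);
    // Interval between automatic major compactions, in milliseconds (24 hours here; do not set it to 0, i.e. "never").
    conf.setLong("hbase.hregion.majorcompaction", 24L * 60 * 60 * 1000);
    return conf;
  }
}
```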
32. Tuning your DB – HBase Keys
• Row key – a byte array
• Best performance for single-row Gets
• Best caching performance
• Key design:
• Should distribute well – usually accomplished by hashing the natural key (see the sketch below)
• MD5
• SHA-1
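A minimal sketch of hashing a natural key with MD5 to produce a well-distributed row key; the natural key "customer-12345" is just an illustrative example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeys {
  // Hash the natural key so rows spread evenly across regions instead of hot-spotting
  // on one RegionServer when keys are monotonically increasing.
  public static byte[] hashedRowKey(String naturalKey) throws Exception {
    return MessageDigest.getInstance("MD5")
        .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws Exception {
    byte[] rowKey = hashedRowKey("customer-12345"); // hypothetical natural key
    System.out.println(rowKey.length + " bytes");   // 16-byte MD5 digest
  }
}
```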
33. Tuning your DB – BlockCache
• Each Region Server has a BlockCache where it stores file blocks that it has already read
• Every read served from cached blocks improves performance
• You don't want your blocks to be much bigger than your rows
• Modes of caching:
• A 2-level LRU cache, by default
• Other options: BucketCache – can use DirectByteBuffers to manage off-heap RAM – better garbage-collection behavior on the Region Server (see the configuration sketch below)
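A sketch of the cache-related settings, assuming the usual property names: hfile.block.cache.size for the fraction of heap given to the on-heap LRU block cache, and hbase.bucketcache.ioengine / hbase.bucketcache.size for an off-heap BucketCache. These normally live in hbase-site.xml; the values below are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CacheTuning {
  // Returns a configuration with explicit block-cache settings.
  public static Configuration blockCacheConfig() {
    Configuration conf = HBaseConfiguration.create();
    // Fraction of the Region Server heap used by the on-heap LRU block cache (~0.4 is typical).
    conf.setFloat("hfile.block.cache.size", 0.4f);
    // Optional off-heap BucketCache: engine and size (in MB), to reduce GC pressure on the heap.
    conf.set("hbase.bucketcache.ioengine", "offheap");
    conf.setInt("hbase.bucketcache.size", 4096); // illustrative size
    return conf;
  }
}
```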
34. Tuning your DB – Columns and Column Families
• All columns in a column family are accessed together for reads
• Different column families are stored in different HFiles
• All column families are written out when any one MemStore is full
• Example (see the table-creation sketch below):
• Storing package tracking information:
• Need package shipping info
• Need to store each location in the path
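A sketch of how the package-tracking example might be modeled as two column families using the HBase 2.x Admin API; the table name "packages" and family names "shipinfo" and "locations" are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreatePackagesTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      TableDescriptorBuilder table = TableDescriptorBuilder.newBuilder(TableName.valueOf("packages"));
      // "shipinfo": small, frequently read shipping details, read together
      table.setColumnFamily(ColumnFamilyDescriptorBuilder.of("shipinfo"));
      // "locations": one column per scan point; kept in its own HFiles so shipping reads don't touch it
      table.setColumnFamily(ColumnFamilyDescriptorBuilder.of("locations"));
      admin.createTable(table.build());
    }
  }
}
```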
35. Tuning your DB – Bloom Filters
• Can be set on rows or on row + column
• Keep an extra index of the keys present in each HFile
• Slow down reads and writes a bit
• Increase storage
• Save time checking whether keys exist
• Turn on if it is likely that clients will request missing data (see the sketch below)
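A sketch of enabling a row-level Bloom filter on a column family via the HBase 2.x descriptor builder; the "shipinfo" family is the hypothetical one from the previous example.

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterExample {
  // BloomType.ROW indexes row keys only; BloomType.ROWCOL also indexes columns,
  // at the cost of extra storage.
  public static ColumnFamilyDescriptor shipInfoWithBloom() {
    return ColumnFamilyDescriptorBuilder
        .newBuilder(Bytes.toBytes("shipinfo")) // hypothetical family name
        .setBloomFilterType(BloomType.ROW)
        .build();
  }
}
```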
36. Tuning your DB – Short-Circuit Reads
• HDFS normally exposes a service interface for reads
• If the file is actually local, it is much faster to just read the HFile directly off the disk (see the configuration sketch below)
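A sketch of the HDFS client settings that enable short-circuit local reads (dfs.client.read.shortcircuit and dfs.domain.socket.path). These are normally set in hdfs-site.xml on both clients and DataNodes; the socket path shown is just a typical example.

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfig {
  // Returns a configuration with short-circuit local reads enabled.
  public static Configuration shortCircuit() {
    Configuration conf = new Configuration();
    // Let the client bypass the DataNode and read local block files directly.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Unix domain socket shared between the DataNode and local clients (illustrative path).
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
    return conf;
  }
}
```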
38. Big Data in Finance – the challenges
• Real-time financial analysis
• Reliability
• “medium-data”
39. What Bloomberg is Working on
• Working with Hortonworks on fixing real-time issues in
Hadoop
• Creating a framework for reliably serving real-time data
• Presenting at Hadoop World and Hadoop Summit
• Open-source Chef recipes for running a Hadoop cluster on OpenStack-managed VMs
Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
Name node is the manager, data node is the worker
Job Tracker = Resource Manager
Task Tracker = Node Manager
The number of reducers depends on the range of keys and is set by the user – you'd want it to correspond to the set of possible key values. So, if the keys are single ASCII characters, you won't want more than 256 reducers. You also don't want more reducers than you have data nodes.
Remember, HBase stores everything as files on the file system (HDFS).
The ZooKeeper quorum should be an odd number, since a majority is needed for consensus. A znode is the name of each attribute managed by ZooKeeper.
All columns in a column family are read for a get – but not all column families unless specified
Although there is a separate MemStore per column family, as soon as one is full, all of them are written to HFiles. Note also that deletes are handled with a marker and are only really purged at a major compaction.