Overview of Hadoop and HDFS
1. www.oralytics.com
t : @brendantierney
e : brendan.tierney@oralytics.com
Introduction, Background to Hadoop and HDFS
Brendan Tierney
2. What is Big Data?
O’Reilly Radar definition:
• Big data is when the size of the data itself becomes part of the problem
EMC/IDC definition:
• Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis
McKinsey definition:
• Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse
http://www.oreilly.com/data/free/big-data-now-2012.csp
http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf
3. Big Data
Some companies continue to generate large amounts of data:
• Facebook ~ 6 billion messages per day
• eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage
• Satellite images by Skybox Imaging ~ 1 Terabyte per day
• These numbers are probably out of date before I finish writing this slide
Important: this is true for some companies, not all companies.
Hadoop forms part of their data management architecture; it will not replace existing DBs etc.
4. Basic idea
• The basic idea behind the phrase Big Data is that everything we do is increasingly leaving a digital trace (data) which we can use and analyse
• Big Data therefore refers to our ability to make use of ever increasing volumes of data
Traditional data storage methods can be a challenge!
Why?
9. Hadoop
• Existing tools were not designed to handle such large amounts of data
• "The Apache™ Hadoop™ project develops open-source software for reliable,
scalable, distributed computing.”
• http://hadoop.apache.org
• – Process Big Data on clusters of commodity hardware
• – Vibrant open-source community
• – Many products and tools reside on top of Hadoop
10. Who is using Hadoop in Ireland?
Big websites
Big telcos
Big Banks
Big Financial
CERN
Big ….
11. Access Speeds?
1990: typical drive ~1370MB, transfer speed ~4.4MB/s - read the whole drive in ~5 minutes
2010: typical drive ~1TB, transfer speed ~100MB/s - read the whole drive in ~2.5 hrs
Hadoop: 100 drives working at the same time can read 1TB of data in about 2 minutes
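A quick back-of-envelope check of the figures above (an illustrative snippet, not from the slides; the drive sizes and transfer speeds are the assumed values quoted on the slide):

```java
public class DriveReadTime {
    public static void main(String[] args) {
        double mb1990 = 1370,      mbPerSec1990 = 4.4;   // ~1370MB drive at ~4.4MB/s
        double mb2010 = 1_000_000, mbPerSec2010 = 100;   // ~1TB drive (taken as 1,000,000MB) at ~100MB/s

        System.out.printf("1990 drive : %.1f minutes%n", mb1990 / mbPerSec1990 / 60);         // ~5 minutes
        System.out.printf("2010 drive : %.1f hours%n",   mb2010 / mbPerSec2010 / 3600);       // roughly 2.5-3 hours
        System.out.printf("100 drives : %.1f minutes%n", mb2010 / (mbPerSec2010 * 100) / 60); // under 2 minutes
    }
}
```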
12. Scaling issue
13. Scaling issue
• Scale-Up
• It is harder and more expensive to scale up (“It Depends” needs to be applied)
• Add additional resources to an existing node (CPU, RAM)
• Moore’s Law can’t keep up with data growth
• New units must be purchased if the required resources cannot be added
• Also known as scaling vertically
• Scale-Out
• Add more nodes/machines to an existing distributed application
• The software layer is designed for node additions or removal
• Hadoop takes this approach - a set of nodes are bonded together as a single distributed system
• Very easy to scale down as well
14. Hadoop Principles
• Scale-Out rather than Scale-Up
• Bring code to data rather than data to code
• Deal with failures – they are common
• Abstract complexity of distributed and concurrent applications
• Self managing
• Auto parallel processing
15. Big Data – Example Applications
Not all of these are using Hadoop or require Hadoop!
16. Hadoop Cluster
• A set of "cheap" commodity hardware
• Networked together
• Resides in the same location
• Set of servers in a set of racks in a data center
• “Cheap” Commodity Server Hardware
• No need for super-computers, use commodity unreliable hardware
• Not desktops
Yes, you can build a Hadoop cluster using Raspberry Pis
17. Abstracting Complexity
• Distributed Computing is HARD WORK
• Hadoop abstracts many complexities in distributed and concurrent applications
• Defines a small number of components
• Provides simple and well-defined interfaces for interaction between these components
• Frees the developer from worrying about system-level challenges
• race conditions, data starvation
• processing pipelines, data partitioning, code distribution, etc.
• Allows developers to focus on application development and business logic
18. Hadoop vs RDBMS
• Always keep the phrase “It Depends” in mind when discussing Big Data
• Hadoop != RDBMS
• Hadoop will not replace RDBMS
• Hadoop is part of your data management architecture
• and only if it is needed!
19. RDBMS vs Hadoop
                      RDBMS                      Hadoop
Data size             Gigabytes                  Petabytes
Access                Interactive & batch        Batch
Updates               Read & write many times    Write once, read many times
Integrity             High                       Low
Scaling               Non-linear                 Linear
Data representation   Structured                 Unstructured, semi-structured
20-25. Current trends for Hadoop
26. Working together
• Hadoop and RDBMS frequently complement each other within an architecture
• For example, a website that
• has a small number of users
• produces a large amount of audit logs
27-29. Hadoop Ecosystem
30. Hadoop Distributions
• Large number of independent products (Apache projects)
• Can be challenging to get all/some of these to work together
• We will be working with Hadoop, installing and using some of these products
• Hadoop Distributions aim to resolve version incompatibilities
• The distribution vendor will
• Integration-test a set of Hadoop products
• Package Hadoop products in various installation formats
• Linux packages, tarballs, etc.
• Distributions may provide additional scripts to execute Hadoop
• Some vendors may choose to backport features and bug fixes made by Apache
• Typically vendors will employ Hadoop committers, so the bugs they find will make it into Apache’s repository
31. Hadoop Distributions
• Cloudera Distribution for Hadoop (CDH)
• Check out the pre-built VM with most of Cloudera products (Hadoop, etc)
• http://www.cloudera.com/downloads/quickstart_vms/5-8.html
• MapR Distribution
• Check out the MapR Sandbox VM
• https://www.mapr.com/products/mapr-sandbox-hadoop
• Hortonworks Data Platform (HDP)
• Check out the Hortonworks Sandbox VM
• http://hortonworks.com/products/sandbox/
• Oracle Big Data Appliance
• Check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed and configured for you to use
• http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
32. Hadoop - “move-code-to-data” approach
• Data is distributed among the nodes as it is initially stored in the system
• Data is replicated multiple times on the system for increased reliability & availability
• Master allocates work to nodes
• Computation happens on the nodes where the data is stored - data locality
• Nodes work in parallel, each on their own part of the overall dataset
• Nodes are independent and self-sufficient - shared-nothing architecture
• If a node fails, the master detects the failure and re-assigns the work to other nodes
• If a failed node restarts, it is automatically added back into the system and assigned new tasks
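To make the “move code to data” idea concrete, here is a minimal sketch of the classic MapReduce word count (illustrative, not taken from the slides). The jar containing the small mapper and reducer classes below is what gets shipped to the cluster; each map task then runs against an HDFS block stored locally on its node. Input and output paths are supplied on the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The mapper runs on the nodes that hold the input blocks (data locality).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);        // emit (word, 1)
                }
            }
        }
    }

    // The reducer sums the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);      // this jar is the "code" that is moved to the data
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```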
33. HDFS
• A distributed file system modelled on the Google File System (GFS) [http://research.google.com/archive/gfs.html]
• Data is split into blocks, typically 64MB or 128MB in size, spread across many nodes
• Works better on large files >= 1 HDFS block in size
• Each block is replicated to a number of nodes (typically 3)
• ensures reliability and availability
• Files in HDFS are write once - no random writes to files allowed
• HDFS is optimised for large streaming reads of files - no random access to files allowed
• see Hive later on for more DBMS-type access to HDFS files...
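As a small illustration of the block size and replication factor mentioned above: the standard HDFS client properties are dfs.blocksize and dfs.replication. This is a minimal sketch of reading or overriding them from a Java client; the values shown are only illustrative defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsDefaults {
    public static void main(String[] args) {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml if on the classpath
        conf.setIfUnset("dfs.blocksize", "134217728");   // 128MB blocks (illustrative)
        conf.setIfUnset("dfs.replication", "3");         // three copies of each block (illustrative)
        System.out.println("Block size  : " + conf.get("dfs.blocksize"));
        System.out.println("Replication : " + conf.get("dfs.replication"));
    }
}
```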
34. HDFS is good for
• Storing large files
• Terabytes, Petabytes, etc...
• Millions rather than billions of files
• 100MB or more per file
• Streaming data
• Unstructured data => in practice, data with mixed structure
• Write once, read many times patterns
• Schema on Read (RDBMS = schema on write) - see the sketch after this list
• Huge time saving at data write time
• BUT !!!
• Optimised for streaming reads rather than random reads
• “Cheap” commodity hardware
• No need for super-computers, use less reliable commodity hardware
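A minimal sketch of “schema on read”: the raw line below could be stored in HDFS exactly as it was produced, with no schema declared up front; the structure is only imposed at read time by whatever code parses it. The field layout is invented for the example.

```java
public class SchemaOnRead {
    public static void main(String[] args) {
        String rawLine = "2016-09-01T10:15:00,GET,/index.html,200";  // stored as-is, no schema declared at write time
        String[] fields = rawLine.split(",");                        // schema applied only while reading
        String timestamp = fields[0];
        String method    = fields[1];
        String path      = fields[2];
        int status       = Integer.parseInt(fields[3]);
        System.out.println(method + " " + path + " -> " + status + " at " + timestamp);
    }
}
```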
35. HDFS is not so good at
• Low-latency reads
• High throughput rather than low latency for small chunks of data
• HBase and other DBs can address this issue (?)
• Large numbers of small files
• Better for millions of large files instead of billions of small files
• Block size of 128MB or 256MB
• For example, each file can be 100MB or more
• Multiple writers
• Single writer per file
• Writes only at the end of the file, no support for arbitrary offsets
• Time needed for replication
36. HDFS
• Two types of nodes in an HDFS cluster
• NameNode - the master node
• DataNodes - slave or worker nodes
• NameNode manages the file system
• keeps track of the metadata - which blocks make up a file (using 2 files - the namespace image and the edit log)
• knows on which DataNodes the blocks are stored
• DataNodes do the work
• store the blocks
• retrieve blocks when requested to (by the client or the NameNode)
• poll and report back to the NameNode periodically with the list of blocks that they are storing
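A minimal sketch (assuming an HDFS client on the classpath and an illustrative file path) of asking the NameNode which DataNodes hold a file's blocks. The answer comes from the NameNode's metadata; no data blocks are read.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big-file.csv");     // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers from its in-memory metadata.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + " length " + b.getLength()
                    + " hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```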
37. HDFS
• When a client application wants to read a file...
• it communicates with the NameNode to determine which blocks make up the file, and on which DataNodes those blocks reside
• it then communicates directly with the DataNodes
• The NameNode is the single point of failure of a Hadoop system
• back up the metadata periodically to remote NFS (set up as part of the Hadoop configuration)
• use a Secondary NameNode
• not the same as the NameNode
• periodically merges the namespace image with the edit log and maintains a copy
39. Files and Blocks
• Files are split into blocks (the single unit of storage)
• Managed by the NameNode, stored by the DataNodes
• Transparent to the user
• Replicated across machines at load time
• The same block is stored on multiple machines
• Good for fault-tolerance and access
• Can lead to inconsistent reads
• Default replication is 3
Have you ever experienced inconsistent reads?
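The default replication factor of 3 mentioned above can also be changed per file from a client. A minimal sketch using the standard FileSystem.setReplication call; the path and the new factor are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big-file.csv");   // hypothetical file
        boolean ok = fs.setReplication(file, (short) 5);    // ask for 5 copies of each block instead of the default 3
        System.out.println("Replication change accepted: " + ok);
        fs.close();
    }
}
```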
40. HDFS File Writes
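The original slide shows the write path as a diagram. As a sketch of the client side only: create() asks the NameNode to start the file, and the bytes are then streamed to a pipeline of DataNodes. The path used here is illustrative.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/user/demo/hello.txt");              // hypothetical path
        try (FSDataOutputStream stream = fs.create(out, true)) {  // overwrite if present
            stream.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }                                                          // close() completes the write; the file is now write-once
        fs.close();
    }
}
```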
41. HDFS File Reads
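Similarly for the read path (a sketch, using the file written in the previous example): open() obtains the block locations from the NameNode, and the data is then read directly from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // stream the file contents to stdout
        }
        fs.close();
    }
}
```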
42. Who is using Hadoop in Ireland?
• List of Cloudera customers in Ireland
• Citi
• Allianz
• Deutsche Bank
• Ulster Bank
• dun & bradstreet
• Ryanair
• BT
• Vodafone
• Novartis
• airbnb
• Dell
• Intel
• Rockwell Automation
• Revenue
• Adecco
• Experian
• M&S
43. Discuss
Hadoop is not FREE :-)
vs
Hadoop is not FREE :-(
44. Something to think about