www.oralytics.com 
t : @brendantierney 
e : brendan.tierney@oralytics.com 

 

Introduction, Background to Hadoop and HDFS
Brendan Tierney

What is Big Data?
O’Reilly Radar definition:
•  Big data is when the size of the data itself becomes part of the problem
EMC/IDC definition:
•  Big Data technologies describe a new generation of technologies and
architectures, designed to economically extract value from very large
volumes of a wide variety of data, by enabling high velocity capture,
discovery and/or analysis
McKinsey definition:
•  Big Data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyse
http://www.oreilly.com/data/free/big-data-now-2012.csp
http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf

Big Data
Some companies continue to generate large amounts of data:
•  Facebook ~ 6 billion messages per day
•  eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage
•  Satellite images by Skybox Imaging ~ 1 Terabyte per day
•  These numbers were probably out of date before I finished writing this slide
Important: this applies to some companies, not all companies
Big Data is part of their data management architecture. It will not replace existing DBs etc.

Basic idea
•  The basic idea behind the phrase Big Data is that everything we do is increasingly
leaving a digital trace (data) which we can use and analyse
•  Big Data therefore refers to our ability to make use of ever increasing volumes of
data
Traditional data storage methods can be a challenge
Why?

Big Data

2013

2014
Where is Predictive Analytics?

2015

Hadoop
•  Existing tools were not designed to handle such large amounts of data
•  "The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing."
•  http://hadoop.apache.org
•  Process Big Data on clusters of commodity hardware
•  Vibrant open-source community
•  Many products and tools reside on top of Hadoop

Who is using Hadoop in Ireland ?
Big websites

Big telcos

Big Banks

Big Financial

CERN

Big ….

Access Speeds?
1990:
•  Typical drive ~1370MB
•  Transfer speed ~ 4.4MB/s
•  Read the whole drive in ~5 mins
2010:
•  Typical drive ~1TB
•  Transfer speed ~ 100MB/s
•  Read the whole drive in more than 2.5 hrs
Hadoop: 100 drives working at the same time can read 1TB of data in about 2 minutes
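The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1TB = 1,000,000MB) and that the data is spread evenly across the drives; the function name `read_time_seconds` is made up for this illustration.

```python
def read_time_seconds(capacity_mb, transfer_mb_per_s, drives=1):
    """Seconds needed to read capacity_mb when the data is spread evenly
    over `drives` disks that are all read in parallel."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    return capacity_mb / (transfer_mb_per_s * drives)


TB_IN_MB = 1_000_000  # assumption: decimal units, 1TB = 1,000,000MB

drive_1990 = read_time_seconds(1370, 4.4)
drive_2010 = read_time_seconds(TB_IN_MB, 100)
cluster = read_time_seconds(TB_IN_MB, 100, drives=100)

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"2010 drive: {drive_2010 / 3600:.1f} hours")
print(f"100 drives in parallel: {cluster / 60:.1f} minutes")
```

Capacity grew by a factor of roughly 700 while transfer speed grew by a factor of roughly 20, which is why reading in parallel from many drives matters.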

Scaling issue
•  Scale-Up (also known as scaling vertically)
•  Add additional resources to an existing node (CPU, RAM)
•  Harder and more expensive ("It Depends" needs to be applied)
•  Moore’s Law can’t keep up with data growth
•  New units must be purchased if the required resources cannot be added
•  Scale-Out
•  Add more nodes/machines to an existing distributed application
•  Software layer is designed for node additions or removal
•  Hadoop takes this approach: a set of nodes is bonded together as a single distributed system
•  Very easy to scale down as well

Hadoop Principles
•  Scale-Out rather than Scale-Up
•  Bring code to data rather than data to code
•  Deal with failures – they are common
•  Abstract complexity of distributed and concurrent applications
•  Self managing
•  Auto parallel processing

Big Data – Example Applications
Not all of these are using Hadoop or require Hadoop!

Hadoop Cluster
•  A set of "cheap" commodity hardware
•  Networked together
•  Resides in the same location
•  Set of servers in a set of racks in a data center
•  “Cheap” Commodity Server Hardware
•  No need for super-computers, use commodity unreliable hardware
•  Not desktops
Yes, you can build a Hadoop cluster using Raspberry Pis

Abstracting Complexity
•  Distributed Computing is HARD WORK
•  Hadoop abstracts many complexities in distributed and concurrent applications
•  Defines a small number of components
•  Provides simple, well-defined interfaces for the interactions between these components
•  Frees the developer from worrying about system-level challenges
•  race conditions, data starvation
•  processing pipelines, data partitioning, code distribution, etc.
•  Allows developers to focus on application development and business logic

Hadoop vs RDBMS
•  Always keep the phrase “It Depends” in mind when discussing Big Data
•  Hadoop != RDBMS
•  Hadoop will not replace RDBMS
•  Hadoop is part of your data management architecture
•  and only if it is needed !

                      RDBMS                      Hadoop
Data size             Gigabytes                  Petabytes
Access                Interactive & Batch        Batch
Updates               Read & write many times    Write once, read many times
Integrity             High                       Low
Scaling               Non Linear                 Linear
Data representation   Structured                 Unstructured, semi-structured

Current trends for Hadoop

Working together
•  Hadoop and RDBMS frequently complement each other within an architecture
•  For example, a website that
•  has a small number of users
•  produces a large amount of audit logs

Hadoop Ecosystem

Hadoop Distributions
•  Large number of independent products (Apache projects) 
•  Can be challenging to get all/some of these to work together
•  We will be working with Hadoop, installing and using some products
•  Hadoop Distributions aim to resolve version incompatibilities
•  A distribution vendor will
•  Integration test a set of Hadoop products
•  Package Hadoop products in various installation formats
•  Linux packages, tarballs, etc.
•  Distributions may provide additional scripts to execute Hadoop
•  Some vendors may choose to backport features and bug fixes made by Apache
•  Typically vendors will employ Hadoop committers, so fixes for the bugs they find will make it into Apache’s repository

Hadoop Distributions
•  Cloudera Distribution for Hadoop (CDH)
•  Check out the pre-built VM with most of Cloudera products (Hadoop, etc)
•  http://www.cloudera.com/downloads/quickstart_vms/5-8.html
•  MapR Distribution
•  Check out the MapR Sandbox VM
•  https://www.mapr.com/products/mapr-sandbox-hadoop 
•  Hortonworks Data Platform (HDP)
•  Check out the Hortonworks Sandbox VM
•  http://hortonworks.com/products/sandbox/ 
•  Oracle Big Data Appliance
•  Check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed and configured for you to use
•  http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html

Hadoop - “move-code-to-data” approach
•  Data is distributed among the nodes as it is initially stored in the system
•  Data is replicated multiple times on the system for increased reliability & availability
•  Master allocates work to nodes 
•  Computation happens on the nodes where the data is stored - data locality
•  Nodes work in parallel each on their own part of the overall dataset
•  Nodes are independent and self-sufficient - shared-nothing architecture
•  If a node fails, master detects the failure and re-assigns work to other nodes
•  If a failed node restarts, it is automatically added back into the system and
assigned new tasks
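A toy sketch of the master's job may make the steps above concrete. It is not Hadoop's scheduler: the function `assign_tasks`, the node names and the least-loaded tie-break are all assumptions made for illustration. Each task is placed on a live node that already holds a replica of its block, and only the tasks of a failed node are moved.

```python
from collections import defaultdict


def assign_tasks(block_replicas, live_nodes, previous=None):
    """Give each block's task to a live node that already stores the block.

    Tasks in `previous` stay where they are if their node is still alive;
    everything else goes to the least-loaded live replica holder."""
    load = defaultdict(int)
    assignment = {}
    # Keep existing work on nodes that are still alive.
    for block, node in (previous or {}).items():
        if node in live_nodes:
            assignment[block] = node
            load[node] += 1
    # Place the remaining tasks next to their data (data locality).
    for block in sorted(block_replicas):
        if block in assignment:
            continue
        candidates = [n for n in block_replicas[block] if n in live_nodes]
        if not candidates:
            raise RuntimeError(f"no live replica holds {block}")
        node = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment


# Hypothetical cluster: every block is stored on three of the four nodes.
block_replicas = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
    "blk_3": ["node-a", "node-c", "node-d"],
    "blk_4": ["node-a", "node-b", "node-d"],
}
live = {"node-a", "node-b", "node-c", "node-d"}

plan = assign_tasks(block_replicas, live)
# node-b fails: only its task is re-assigned, to another replica holder.
replan = assign_tasks(block_replicas, live - {"node-b"}, previous=plan)
print(plan)
print(replan)
```

Because every block has replicas on several nodes, the failed node's task can still run next to a copy of its data, so no data has to be moved.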

HDFS
•  A distributed file system modelled on the Google File System (GFS)

[http://research.google.com/archive/gfs.html]
•  Data is split into blocks, typically 64MB or 128MB in size, spread across many
nodes
•  Works better on large files >= 1 HDFS block in size
•  Each block is replicated to a number of nodes (typically 3)
•  ensures reliability and availability
•  Files in HDFS are write once - no random writes to files allowed
•  HDFS is optimised for large streaming reads of files rather than random access to files
•  see HIVE later on for more DBMS-type access to HDFS files....
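A small sketch of the block and replica bookkeeping described above. The function names are hypothetical, and the round-robin placement is an assumption made purely for illustration; real HDFS uses its own placement policy.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Sizes (in MB) of the blocks a file is split into.
    The last block only holds whatever data is left over."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Round-robin replica placement (illustrative assumption only)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


sizes = split_into_blocks(300)  # a 300MB file with 128MB blocks
placement = place_replicas(len(sizes), ["n1", "n2", "n3", "n4", "n5"])
raw_storage_mb = sum(sizes) * 3  # every block is stored three times

print(sizes)
print(placement)
print(raw_storage_mb)
```

The 300MB file becomes two full blocks and one partial block, and three-way replication means the cluster stores 900MB of raw data for it.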

HDFS is good for
•  Storing large files
•  Terabytes, Petabytes, etc...
•  Millions rather than billions of files
•  100MB or more per file
•  Streaming data
•  Unstructured data (in reality, data with a mix of structures)
•  Write once and read-many times patterns
•  Schema on Read (RDBMS = schema on write)
•  Huge time saving at data write time
•  BUT !!!
•  Optimized for streaming reads rather than random reads
•  “Cheap” Commodity Hardware
•  No need for super-computers, use less reliable commodity hardware
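Schema on read can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not of any Hadoop API: the sample log and the field names are made up. The raw text is stored untouched, and a schema is applied only when the data is read, so the same data can be read with different schemas.

```python
import csv
import io

# Raw data is written as-is; nothing is validated at write time.
RAW_LOG = (
    "2016-01-01,alice,login\n"
    "2016-01-01,bob,purchase,19.99\n"
)


def read_with_schema(raw_text, fields):
    """Apply field names while reading; missing trailing values become None."""
    for row in csv.reader(io.StringIO(raw_text)):
        padded = row + [None] * (len(fields) - len(row))
        yield dict(zip(fields, padded))


# Two different schemas applied to the same stored bytes.
events = list(read_with_schema(RAW_LOG, ("date", "user", "action")))
purchases = list(read_with_schema(RAW_LOG, ("date", "user", "action", "amount")))

print(events)
print(purchases)
```

The write is cheap because nothing is checked up front; the cost of interpreting and validating the data moves to every read.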

HDFS is not so good at
•  Low-latency reads
•  Designed for high throughput rather than low latency for small chunks of data
•  HBase and other DBs can address this issue (?)
•  Large numbers of small files
•  Better for millions of large files instead of billions of small files
•  Block size of 128M or 256M
•  For example each file can be 100MB or more
•  Multiple writers
•  Single writer per file
•  Writes only at the end of the file, no support for writing at an arbitrary offset
•  Time needed for replication
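One way to see why many small files hurt is to estimate the metadata the NameNode has to hold in memory. The sketch below assumes a commonly quoted rule of thumb of about 150 bytes per file or block object; that figure and the function names are assumptions for illustration, not measured values.

```python
import math

BYTES_PER_OBJECT = 150  # assumption: rule-of-thumb size of one file or block entry


def namenode_objects(num_files, file_size_mb, block_size_mb=128):
    """Metadata objects needed: one per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    return num_files * (1 + blocks_per_file)


def namenode_memory_gb(num_files, file_size_mb, block_size_mb=128):
    """Approximate NameNode memory for the metadata, in GB."""
    objects = namenode_objects(num_files, file_size_mb, block_size_mb)
    return objects * BYTES_PER_OBJECT / 1e9


# The same 10TB of data, stored two ways.
small_files = namenode_memory_gb(num_files=10_000_000, file_size_mb=1)
large_files = namenode_memory_gb(num_files=10_000, file_size_mb=1000)

print(f"10,000,000 files of 1MB: {small_files:.2f} GB of metadata")
print(f"10,000 files of 1000MB: {large_files:.4f} GB of metadata")
```

Under these assumptions the small-file layout needs over two hundred times more NameNode memory for exactly the same amount of data.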

HDFS
•  Two types of nodes in an HDFS cluster
•  NameNode - the master node
•  DataNodes - slave or worker nodes
•  NameNode manages the file system
•  keeps track of the metadata - which blocks make up a file (using 2 files - the namespace image and the edit log)
•  knows on which DataNodes the blocks are stored
•  DataNodes do the work
•  store the blocks
•  retrieve blocks when requested to (by the client or the NameNode)
•  report back to the NameNode periodically with the list of blocks that they are storing
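The division of labour above can be sketched as a toy model. The class and method names are invented for this illustration and are not Hadoop's API: the NameNode keeps the file-to-block mapping and learns where each block lives from the reports the DataNodes send in.

```python
class NameNode:
    """Toy model of the NameNode's two pieces of metadata."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        """Replace what is known about one DataNode with its latest report."""
        for nodes in self.block_locations.values():
            nodes.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Blocks of a file, in order, with the DataNodes holding each one."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


namenode = NameNode()
namenode.add_file("/logs/day1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_1"])
namenode.receive_block_report("dn3", ["blk_2"])
before = namenode.locate("/logs/day1")

# dn1 later reports that it no longer holds blk_1.
namenode.receive_block_report("dn1", ["blk_2"])
after = namenode.locate("/logs/day1")
print(before)
print(after)
```

In this model the block locations are never written by hand; they are rebuilt entirely from what the DataNodes report.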

HDFS
•  When a client application wants to read a file...
•  it communicates with the NameNode to determine which blocks make up the file, and on which DataNodes the blocks reside
•  it then communicates directly with the DataNodes
•  NameNode is the single point of failure of a Hadoop system
•  back up its metadata periodically to remote NFS (set up as part of Hadoop configuration)
•  use Secondary NameNode
•  not the same as the NameNode, and not a standby copy of it
•  periodically merges the namespace image with the edit log and maintains a copy
HDFS Architecture [from Hadoop in Practice, Alex Holmes]
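The read path described on this slide can be sketched from the client's point of view. Everything here is a toy model with made-up names, not the HDFS client API: the client asks the NameNode once for the block list and locations, then fetches each block directly from a DataNode, trying another replica if one DataNode is unavailable.

```python
class DataNode:
    """Toy DataNode: stores block contents and can be marked as down."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.alive = True

    def read_block(self, block_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.blocks[block_id]


def read_file(path, namenode_lookup, datanodes):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    data = []
    for block_id, locations in namenode_lookup(path):
        for name in locations:
            try:
                data.append(datanodes[name].read_block(block_id))
                break
            except ConnectionError:
                continue  # try the next replica
        else:
            raise IOError(f"all replicas of {block_id} are unavailable")
    return b"".join(data)


# Hypothetical cluster state: two blocks, two replicas each.
datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
datanodes["dn1"].blocks["blk_1"] = b"hello "
datanodes["dn2"].blocks["blk_1"] = b"hello "
datanodes["dn2"].blocks["blk_2"] = b"world"
datanodes["dn3"].blocks["blk_2"] = b"world"
block_map = {"/logs/day1": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]}

first_read = read_file("/logs/day1", block_map.__getitem__, datanodes)
datanodes["dn1"].alive = False  # one DataNode fails
second_read = read_file("/logs/day1", block_map.__getitem__, datanodes)
print(first_read, second_read)
```

The NameNode only hands out metadata in this sketch; the file contents never pass through it.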

Files and Blocks
•  Files are split into blocks (single unit of storage)
•  Managed by the NameNode, stored by DataNodes
•  Transparent to user
•  Replicated across machines at load time
•  Same block is stored on multiple machines
•  Good for fault-tolerance and access
•  Can lead to inconsistent reads 
•  Default replication is 3
Have you ever experienced inconsistent reads?

HDFS File Writes

HDFS File Reads

Who is using Hadoop in Ireland ?
•  List of Cloudera customers in Ireland
•  Citi
•  Allianz
•  Deutsche Bank
•  Ulster Bank
•  dun & bradstreet

•  Ryanair
•  BT
•  Vodafone
•  Novartis
•  airbnb
•  Dell
•  Intel
•  Rockwell Automation
•  Revenue
•  Adecco
•  Experian
•  M&S


Discuss

Hadoop is not FREE :)
vs
Hadoop is not FREE :(

Something to think about

Contenu connexe

Tendances

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 

Tendances (20)

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop
HadoopHadoop
Hadoop
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Hadoop
HadoopHadoop
Hadoop
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 

En vedette

Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
SQL: The one language to rule all your data
SQL: The one language to rule all your dataSQL: The one language to rule all your data
SQL: The one language to rule all your dataBrendan Tierney
 
Predictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable productPredictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable productBrendan Tierney
 
OUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th JanuaryOUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th JanuaryBrendan Tierney
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceBhupesh Chawda
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseBrendan Tierney
 
OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016Brendan Tierney
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Random number generators
Random number generatorsRandom number generators
Random number generatorsBob Landstrom
 
Open Canary - novahackers
Open Canary - novahackersOpen Canary - novahackers
Open Canary - novahackersChris Gates
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...Chris Gates
 
Home Arcade setup (NoVA Hackers)
Home Arcade setup (NoVA Hackers)Home Arcade setup (NoVA Hackers)
Home Arcade setup (NoVA Hackers)Chris Gates
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemAnshul Bhatnagar
 
shared-ownership-21_FINAL
shared-ownership-21_FINALshared-ownership-21_FINAL
shared-ownership-21_FINALChristoph Sinn
 

En vedette (20)

Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
SQL: The one language to rule all your data
SQL: The one language to rule all your dataSQL: The one language to rule all your data
SQL: The one language to rule all your data
 
Predictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable productPredictive analytics: Mining gold and creating valuable product
Predictive analytics: Mining gold and creating valuable product
 
OUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th JanuaryOUG Ireland Meet-up 12th January
OUG Ireland Meet-up 12th January
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle Database
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland Meet-up - Updates from Oracle Open World 2016
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Random number generators
Random number generatorsRandom number generators
Random number generators
 
Open Canary - novahackers
Open Canary - novahackersOpen Canary - novahackers
Open Canary - novahackers
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
Building a Successful Internal Adversarial Simulation Team - Chris Gates & Ch...
 
Home Arcade setup (NoVA Hackers)
Home Arcade setup (NoVA Hackers)Home Arcade setup (NoVA Hackers)
Home Arcade setup (NoVA Hackers)
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
shared-ownership-21_FINAL
shared-ownership-21_FINALshared-ownership-21_FINAL
shared-ownership-21_FINAL
 

Similaire à Overview of Hadoop and HDFS

Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Vantara
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Enterprise Hadoop with Hortonworks and Nimble Storage
Enterprise Hadoop with Hortonworks and Nimble StorageEnterprise Hadoop with Hortonworks and Nimble Storage
Enterprise Hadoop with Hortonworks and Nimble StorageHortonworks
 
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013Kai Wähner
 
Yahoo! Hack Europe
Yahoo! Hack EuropeYahoo! Hack Europe
Yahoo! Hack EuropeHortonworks
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
 
Level Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationLevel Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationInside Analysis
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integrationibi
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopHortonworks
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overviewRohit Jain
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...BigDataEverywhere
 
Hortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopHortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopMark Ginnebaugh
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
 

Similaire à Overview of Hadoop and HDFS (20)

Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop Solution
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
201305 hadoop jpl-v3
201305 hadoop jpl-v3201305 hadoop jpl-v3
201305 hadoop jpl-v3
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World



Overview of Hadoop and HDFS

  • 1. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Introduction, Background to Hadoop and HDFS. Brendan Tierney
  • 2. What is Big Data?
      • O’Reilly Radar definition: Big data is when the size of the data itself becomes part of the problem
      • EMC/IDC definition: Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis
      • McKinsey definition: Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse
      • Sources:
          • http://www.oreilly.com/data/free/big-data-now-2012.csp
          • http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
          • http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
          • http://csrc.nist.gov/groups/SMA/forum/documents/june2012presentations/fcsm_june2012_cooper_mell.pdf
  • 3. Big Data
      • Some companies continue to generate large amounts of data:
          • Facebook ~ 6 billion messages per day
          • eBay ~ 2 billion page views a day, ~ 9 Petabytes of storage
          • Satellite images by Skybox Imaging ~ 1 Terabyte per day
          • These numbers were probably out of date before I finished writing this slide
      • Important: this applies to some companies, not all companies
      • Big Data is part of their data management architecture. It will not replace existing DBs etc.
  • 4. Basic idea
      • The basic idea behind the phrase Big Data is that everything we do is increasingly leaving a digital trace (data) which we can use and analyse
      • Big Data therefore refers to our ability to make use of ever-increasing volumes of data
      • Traditional data storage methods can be a challenge! Why?
  • 5. Big Data
  • 6. 2013
  • 7. 2014 Where is Predictive Analytics?
  • 8. 2015
  • 9. Hadoop
      • Existing tools were not designed to handle such large amounts of data
      • "The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing." (http://hadoop.apache.org)
          • Process Big Data on clusters of commodity hardware
          • Vibrant open-source community
          • Many products and tools reside on top of Hadoop
  • 10. Who is using Hadoop in Ireland? Big websites, big telcos, big banks, big financial, CERN, Big ….
  • 11. Access Speeds?
      • 1990: typical drive ~1370MB, transfer speed ~4.4MB/s, read the whole drive in ~5 mins
      • 2010: typical drive ~1TB, transfer speed ~100MB/s, read the whole drive in ~2.5 hrs
      • Hadoop: 100 drives working at the same time can read 1TB of data in ~2 minutes
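The access-speed figures on slide 11 can be checked with a few lines of arithmetic. This is only a sketch: the function name and the use of decimal units (1TB = 1,000,000MB) are assumptions, since the slide does not say which units it uses, and the results agree with the slide's rounded figures only approximately.

```python
def read_time_seconds(capacity_mb: float, transfer_mb_per_s: float, drives: int = 1) -> float:
    """Seconds needed to scan capacity_mb spread evenly over `drives` disks read in parallel."""
    return capacity_mb / (transfer_mb_per_s * drives)


MB_PER_TB = 1_000_000  # assumption: decimal units

t_1990 = read_time_seconds(1370, 4.4)                      # one 1990-era drive
t_2010 = read_time_seconds(MB_PER_TB, 100)                 # one 2010-era drive
t_hadoop = read_time_seconds(MB_PER_TB, 100, drives=100)   # 100 drives in parallel

print(f"1990 drive: {t_1990 / 60:.1f} min")
print(f"2010 drive: {t_2010 / 3600:.1f} h")
print(f"100 drives: {t_hadoop / 60:.1f} min")
```

Drive capacity grew by a factor of roughly 700 between the two dates while transfer speed grew by a factor of roughly 23, which is why reading a whole drive takes so much longer and why spreading the read across many drives helps.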
  • 12. Scaling issue
  • 13. Scaling issue
      • Scale-Up
          • It is harder and more expensive to scale up (“It Depends” needs to be applied)
          • Add additional resources to an existing node (CPU, RAM)
          • Moore’s Law can’t keep up with data growth
          • New units must be purchased if the required resources cannot be added
          • Also known as scaling vertically
      • Scale-Out
          • Add more nodes/machines to an existing distributed application
          • Software layer is designed for node additions or removal
          • Hadoop takes this approach: a set of nodes are bonded together as a single distributed system
          • Very easy to scale down as well
  • 14. Hadoop Principles
      • Scale-Out rather than Scale-Up
      • Bring code to data rather than data to code
      • Deal with failures – they are common
      • Abstract complexity of distributed and concurrent applications
      • Self managing
      • Auto parallel processing
  • 15. Big Data – Example Applications. Not all of these are using Hadoop or require Hadoop!
  • 16. Hadoop Cluster
      • A set of "cheap" commodity hardware
          • Networked together
          • Resides in the same location
          • Set of servers in a set of racks in a data center
      • “Cheap” commodity server hardware
          • No need for super-computers, use commodity unreliable hardware
          • Not desktops
      • Yes, you can build a Hadoop cluster using Raspberry Pis
  • 17. Abstracting Complexity
      • Distributed computing is HARD WORK
      • Hadoop abstracts many complexities in distributed and concurrent applications
          • Defines a small number of components
          • Provides simple and well-defined interfaces for interactions between these components
      • Frees the developer from worrying about system-level challenges
          • race conditions, data starvation
          • processing pipelines, data partitioning, code distribution, etc.
      • Allows developers to focus on application development and business logic
  • 18. Hadoop vs RDBMS
      • Always keep the phrase “It Depends” in mind when discussing Big Data
      • Hadoop != RDBMS
      • Hadoop will not replace RDBMS
      • Hadoop is part of your data management architecture, and only if it is needed!
  • 19. Hadoop vs RDBMS comparison
          Attribute              RDBMS                      Hadoop
          Data size              Gigabytes                  Petabytes
          Access                 Interactive & Batch        Batch
          Updates                Read & write many times    Write once, read many times
          Integrity              High                       Low
          Scaling                Non Linear                 Linear
          Data representation    Structured                 Unstructured, semi-structured
  • 20. Current trends for Hadoop
  • 21. Current trends for Hadoop
  • 22. Current trends for Hadoop
  • 23. Current trends for Hadoop
  • 24.
  • 25. Current trends for Hadoop
  • 26. Working together
      • Hadoop and RDBMS frequently complement each other within an architecture
      • For example, a website that
          • has a small number of users
          • produces a large amount of audit logs
  • 27. Hadoop Ecosystem
  • 28. Hadoop Ecosystem
  • 29. Hadoop Ecosystem
  • 30. Hadoop Distributions
      • Large number of independent products (Apache projects)
          • Can be challenging to get all/some of these to work together
          • We will be working with Hadoop, installing and using some products
      • Hadoop distributions aim to resolve version incompatibilities
      • Distribution vendor will
          • Integration-test a set of Hadoop products
          • Package Hadoop products in various installation formats
              • Linux packages, tarballs, etc.
      • Distributions may provide additional scripts to execute Hadoop
      • Some vendors may choose to backport features and bug fixes made by Apache
      • Typically vendors will employ Hadoop committers so that fixes for the bugs they find make it into Apache’s repository
  • 31. Hadoop Distributions
      • Cloudera Distribution for Hadoop (CDH)
          • Check out the pre-built VM with most of the Cloudera products (Hadoop, etc)
          • http://www.cloudera.com/downloads/quickstart_vms/5-8.html
      • MapR Distribution
          • Check out the MapR Sandbox VM
          • https://www.mapr.com/products/mapr-sandbox-hadoop
      • Hortonworks Data Platform (HDP)
          • Check out the Hortonworks Sandbox VM
          • http://hortonworks.com/products/sandbox/
      • Oracle Big Data Appliance
          • Check out a pre-built VM with Hadoop, Oracle and lots of other tools all installed and configured for you to use
          • http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
  • 32. Hadoop - “move-code-to-data” approach
      • Data is distributed among the nodes as it is initially stored in the system
      • Data is replicated multiple times on the system for increased reliability & availability
      • Master allocates work to nodes
      • Computation happens on the nodes where the data is stored - data locality
      • Nodes work in parallel, each on their own part of the overall dataset
      • Nodes are independent and self-sufficient - shared-nothing architecture
      • If a node fails, the master detects the failure and re-assigns work to other nodes
      • If a failed node restarts, it is automatically added back into the system and assigned new tasks
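The move-code-to-data idea on slide 32 can be illustrated with a toy scheduler. This is a minimal sketch, not Hadoop's scheduler: the function `assign_tasks`, the node names and the block ids are all hypothetical, and on a failure it simply recomputes the whole plan, whereas a real master would re-assign only the affected work.

```python
from collections import defaultdict


def assign_tasks(block_locations: dict[str, list[str]], live_nodes: set[str]) -> dict[str, str]:
    """Map each block to one live node that already stores a replica of it (data locality).

    The least-loaded replica holder is chosen so that work spreads across the cluster.
    Raises RuntimeError if every replica of a block sits on a failed node.
    """
    load: dict[str, int] = defaultdict(int)
    assignment: dict[str, str] = {}
    for block, holders in sorted(block_locations.items()):
        candidates = [node for node in holders if node in live_nodes]
        if not candidates:
            raise RuntimeError(f"no live replica for {block}")
        node = min(candidates, key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment


# Hypothetical cluster: three blocks, each replicated on two of three nodes.
locations = {
    "blk_1": ["node_a", "node_b"],
    "blk_2": ["node_b", "node_c"],
    "blk_3": ["node_a", "node_c"],
}
plan = assign_tasks(locations, {"node_a", "node_b", "node_c"})

# node_b fails: the work is re-planned onto nodes holding the surviving replicas.
plan_after_failure = assign_tasks(locations, {"node_a", "node_c"})
print(plan)
print(plan_after_failure)
```

Because every block has a replica on another node, losing `node_b` does not require copying any data before the work can continue.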
  • 33. HDFS
      • A distributed file system modelled on the Google File System (GFS) [http://research.google.com/archive/gfs.html]
      • Data is split into blocks, typically 64MB or 128MB in size, spread across many nodes
          • Works better on large files >= 1 HDFS block in size
      • Each block is replicated to a number of nodes (typically 3)
          • ensures reliability and availability
      • Files in HDFS are write once - no random writes to files allowed
      • HDFS is optimised for large streaming reads of files rather than random access
          • see HIVE later on for more DBMS-type access to HDFS files
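The block splitting and replication described on slide 33 can be sketched as follows. The helper names and DataNode names are hypothetical, and the round-robin placement is a deliberate simplification: real HDFS placement also takes rack topology into account.

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> list[int]:
    """Return the size in MB of each block of a file; only the last block may be partial."""
    if file_size_mb <= 0:
        return []
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> list[list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (simplified placement)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return [
        [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    ]


blocks = split_into_blocks(300)  # a 300MB file with 128MB blocks
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
for index, holders in enumerate(placement):
    print(f"block {index}: {holders}")
```

With a replication factor of 3, any single node can fail and every block still has two copies elsewhere.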
  • 34. HDFS is good for
      • Storing large files
          • Terabytes, Petabytes, etc...
          • Millions rather than billions of files
          • 100MB or more per file
      • Streaming data
          • Unstructured data => really mixed structured data
          • Write-once, read-many-times patterns
          • Schema on Read (RDBMS = schema on write)
              • Huge time saving at data write time
              • BUT !!!
          • Optimized for streaming reads rather than random reads
      • “Cheap” commodity hardware
          • No need for super-computers, use less reliable commodity hardware
  • 35. HDFS is not so good at
      • Low-latency reads
          • High-throughput rather than low latency for small chunks of data
          • HBase and other DBs can address this issue (?)
      • Large numbers of small files
          • Better for millions of large files instead of billions of small files
          • Block size of 128M or 256M
          • For example each file can be 100MB or more
      • Multiple writers
          • Single writer per file
          • Writes only at the end of file, no support for arbitrary offset
          • Time needed for replication
  • 36. HDFS
      • Two types of nodes in an HDFS cluster
          • NameNode - the master node
          • DataNodes - slave or worker nodes
      • NameNode manages the file system
          • keeps track of the metadata - which blocks make up a file (using 2 files - namespace image and the edit log)
          • knows on which DataNodes the blocks are stored
      • DataNodes do the work
          • store the blocks
          • retrieve blocks when requested to (by the client or the NameNode)
          • poll and report back to the NameNode periodically with the list of blocks that they are storing
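The division of labour on slide 36 can be sketched as a toy in-memory model. The class below is hypothetical and is not the Hadoop API: it keeps the file-to-blocks metadata and rebuilds its view of block locations from the periodic block reports the DataNodes send.

```python
from collections import defaultdict


class NameNode:
    """Toy model: file -> blocks metadata, plus block -> DataNodes learned from reports."""

    def __init__(self) -> None:
        self.file_blocks: dict[str, list[str]] = {}
        self.block_nodes: dict[str, set[str]] = defaultdict(set)

    def add_file(self, path: str, blocks: list[str]) -> None:
        """Record which blocks make up a file (the namespace metadata)."""
        self.file_blocks[path] = list(blocks)

    def receive_block_report(self, datanode: str, blocks: list[str]) -> None:
        """Replace everything previously known about this DataNode with its latest report."""
        for holders in self.block_nodes.values():
            holders.discard(datanode)
        for block in blocks:
            self.block_nodes[block].add(datanode)

    def locate(self, path: str) -> list[tuple[str, list[str]]]:
        """Return each block of the file with the DataNodes currently known to store it."""
        return [(b, sorted(self.block_nodes.get(b, ()))) for b in self.file_blocks[path]]


nn = NameNode()
nn.add_file("/logs/a.log", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
nn.receive_block_report("dn1", ["blk_2"])  # dn1 no longer reports blk_1
print(nn.locate("/logs/a.log"))
```

Keeping block locations as something learned from reports, rather than stored permanently, means the NameNode's view follows what the DataNodes actually hold.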
  • 37. HDFS
      • When a client application wants to read a file...
          • it communicates with the NameNode to determine which blocks make up the file, and on which DataNodes the blocks reside
          • it then communicates directly with the DataNodes
      • NameNode is the single point of failure of a Hadoop system
          • back up periodically to remote NFS (set up as part of Hadoop configuration)
          • use Secondary NameNode
              • not the same as the NameNode
              • periodically merges namespace with edit log and maintains a copy
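The read path on slide 37 can be sketched with plain dictionaries standing in for the NameNode and the DataNodes. Everything here is hypothetical (function name, paths, block ids); the point is only the two steps: look up the block list and locations first, then fetch each block directly from a DataNode, falling back to another replica if one is unreachable.

```python
def read_file(path: str,
              namenode_metadata: dict[str, list[tuple[str, list[str]]]],
              datanodes: dict[str, dict[str, bytes]]) -> bytes:
    """Read a file block by block.

    namenode_metadata: path -> ordered list of (block_id, [DataNode names holding it]).
    datanodes: name -> {block_id: bytes}, containing only the nodes that are reachable.
    """
    data = bytearray()
    for block_id, holders in namenode_metadata[path]:  # step 1: ask the NameNode
        for name in holders:                           # step 2: go directly to a DataNode
            store = datanodes.get(name)
            if store is not None and block_id in store:
                data += store[block_id]
                break
        else:
            raise IOError(f"no reachable replica of {block_id}")
    return bytes(data)


metadata = {"/f": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]}
reachable = {  # dn1 is down, so blk_1 is served by dn2 instead
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}
content = read_file("/f", metadata, reachable)
print(content)
```

The file contents never pass through the NameNode in this model, which matches the slide's point that the client talks to the DataNodes directly once it knows where the blocks are.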
  • 38. [from Hadoop in Practice, Alex Holmes] HDFS Architecture
  • 39. Files and Blocks
      • Files are split into blocks (single unit of storage)
          • Managed by NameNode, stored by DataNode
          • Transparent to user
      • Replicated across machines at load time
          • Same block is stored on multiple machines
          • Good for fault-tolerance and access
          • Can lead to inconsistent reads
          • Default replication is 3
      • Have you ever experienced inconsistent reads?
  • 40. HDFS File Writes
  • 41. HDFS File Reads
  • 42. Who is using Hadoop in Ireland?
      • List of Cloudera customers in Ireland: Citi, Allianz, Deutsche Bank, Ulster Bank, Dun & Bradstreet, Ryanair, BT, Vodafone, Novartis, Airbnb, Dell, Intel, Rockwell Automation, Revenue, Adecco, Experian, M&S
  • 43. Discuss: Hadoop is not FREE :-) vs Hadoop is not FREE :-(
  • 44. Something to think about