BIG DATA AND HADOOP 
History, Technical Deep Dive, and Industry 
Trends 
Esther Kundin 
Bloomberg LP
About Me
Big Data – What Is It?
Outline 
• What Is Big Data? 
• A History Lesson 
• Hadoop – Dive into the details
• HDFS 
• MapReduce 
• HBase 
• Industry Trends 
• Questions
What is Big Data?
A History Lesson
Big Data Origins 
• Indexing the web requires lots of storage 
• Petabytes of data! 
• Economic problem – reliable servers expensive! 
• Solution: 
• Cram in as many cheap machines as possible 
• Replace them when they fail 
• Solve reliability via software!
Big Data Origins Cont’d 
• DBs are slow and expensive 
• Lots of unneeded features 
RDBMS             NoSQL
ACID              Eventual consistency
Strongly-typed    No type checking
Complex joins     Get/Put
RAID storage      Commodity hardware
Big Data Origins Cont’d 
• Google publishes papers about: 
• GFS (2003)
• MapReduce (2004) 
• BigTable (2006) 
• Hadoop, which began in the Nutch project and was developed largely at Yahoo, became an Apache top-level project in 2008
Translation 
Google paper    Hadoop equivalent
GFS             HDFS
MapReduce       Hadoop MapReduce
BigTable        HBase
Why Hadoop? 
• Huge and growing ecosystem of services 
• Pace of development is swift 
• Tons of money and talent pouring in
Diving into the details!
Hadoop Ecosystem
• HDFS – Hadoop Distributed File System
• Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
• Hive: provides SQL-like access to your Big Data.
• HBase: the Hadoop database.
• HCatalog: for defining and sharing schemas.
• Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters.
• ZooKeeper: an open-source server which enables highly reliable distributed coordination.
• Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
• Mahout: a scalable machine learning library.
HDFS 
• Hadoop Distributed File System 
• Basis for all the other tools, which are built on top of it
• Allows for distributed workloads
HDFS details
HDFS Demo
MapReduce
MapReduce demo 
• To run, you can use:
• Custom Java application
• Pig – nice interface
• Hadoop Streaming + any executable, like Python
• Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
• Hive – SQL over MapReduce – “we put the SQL in NoSQL”
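Hadoop Streaming feeds input lines to a mapper and then feeds the sorted mapper output to a reducer. The sketch below imitates that flow for a word count in plain Python, with no cluster involved. The names `mapper`, `reducer`, and `run_local` are illustrative, and the `sorted` call is only a stand-in for Hadoop's shuffle-and-sort phase.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum counts per word; expects pairs sorted by key, as after the shuffle."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    print(run_local(["big data and hadoop", "hadoop and hbase"]))
```

On a real cluster the mapper and reducer would be separate scripts that read standard input and print tab-separated key/value lines, and the Hadoop Streaming jar would run them via its `-mapper` and `-reducer` options.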
HBase 
• Database running on top of HDFS 
• NoSQL – key/value store
• Distributed
• Good for random reads of individual rows, rather than full scans as in MapReduce
• Sorted 
• Strongly consistent for reads and writes of a single row (unlike many NoSQL stores, which are eventually consistent)
HBase Architecture
[Diagram: Client; ZK Quorum of three ZK Peers; two HMasters; Meta Region Server; three RegionServers; HDFS underneath]
HBase Read – step 1
[Diagram: HBase architecture, as on the previous slide]
Client requests the Meta Region Server address from ZooKeeper.
HBase Read – step 2
[Diagram: HBase architecture, as on the previous slide]
Client determines which RegionServer to contact and caches that data.
HBase Read – step 3
[Diagram: HBase architecture, as on the previous slide]
Client requests data from the RegionServer, which gets the data from HDFS.
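The read path above can be sketched as a lookup in a sorted list of region start keys, with the client caching what it learns. This is a toy model rather than the HBase client API: `MetaTable`, `Client`, and the server names are hypothetical, and real region metadata carries more than a start key and a server.

```python
import bisect


class MetaTable:
    """Toy stand-in for the meta region: maps region start keys to servers."""

    def __init__(self, regions):
        # regions: list of (start_key, server_name) sorted by start_key;
        # the first region starts at the empty key b"".
        self.start_keys = [start for start, _ in regions]
        self.servers = [server for _, server in regions]
        self.lookups = 0

    def locate(self, row_key):
        self.lookups += 1
        # The owning region is the last one whose start key is <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        has_next = index + 1 < len(self.start_keys)
        end_key = self.start_keys[index + 1] if has_next else None
        return self.start_keys[index], end_key, self.servers[index]


class Client:
    def __init__(self, meta):
        self.meta = meta
        self.cache = []  # cached (start_key, end_key, server) entries

    def find_server(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return server  # cache hit: no meta lookup needed
        entry = self.meta.locate(row_key)
        self.cache.append(entry)
        return entry[2]


if __name__ == "__main__":
    meta = MetaTable([(b"", "rs1"), (b"m", "rs2"), (b"t", "rs3")])
    client = Client(meta)
    print(client.find_server(b"monkey"), client.find_server(b"melon"))
    print("meta lookups:", meta.lookups)
```

Because the client caches region boundaries, the second lookup in the usage example is answered without another trip to the meta table.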
HBase Demo
HMaster 
• Only one active master at a time – ensured by ZooKeeper
• Keeps track of all table metadata 
• Used in table creation, modification, and deletion. 
• Not used for reads
Region Server 
• This is the worker node of HBase 
• Performs Gets, Puts, and Scans for the regions it handles 
• Multiple regions are handled by each Region Server 
• On startup 
• Registers with ZooKeeper
• HMaster assigns it regions
• Physical blocks on HDFS may or may not be on the same machine
• Regions are split if they get too big
• Data is stored in a format called HFile
• Caching is what gives good read performance; the cache works on blocks, not rows
HBase Write – step 1
[Diagram: Region Server containing the WAL (on HDFS), the MemStore, and three HFiles]
Region Server persists the write at the end of the WAL.
HBase Write – step 2
[Diagram: Region Server containing the WAL (on HDFS), the MemStore, and three HFiles]
Region Server saves the write in a sorted map in memory, the MemStore.
HBase Write – offline
[Diagram: Region Server containing the WAL (on HDFS), the MemStore, and three HFiles]
When the MemStore reaches a configurable size, it is flushed to an HFile.
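The three write slides can be condensed into a toy model: append to the WAL, update the MemStore, and flush to a new HFile once a threshold is reached. This is only a sketch under simplifying assumptions. `RegionStore` is a hypothetical class, the threshold counts entries rather than bytes, and a plain dict sorted at flush time stands in for the sorted in-memory map.

```python
class RegionStore:
    """Toy model of one store: WAL, MemStore, and flushed HFiles."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log; lives on HDFS in HBase
        self.memstore = {}   # in-memory map; sorted when flushed (assumption)
        self.hfiles = []     # each HFile is an immutable sorted list
        self.flush_threshold = flush_threshold  # entries here, not bytes

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # step 1: persist to the WAL
        self.memstore[row_key] = value     # step 2: update the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # offline step: flush to an HFile

    def flush(self):
        """Write the MemStore out as a new sorted, immutable HFile."""
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):  # newest file wins
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


if __name__ == "__main__":
    store = RegionStore(flush_threshold=3)
    for key, value in [("c", "3"), ("a", "1"), ("b", "2"), ("d", "4")]:
        store.put(key, value)
    print(store.hfiles, store.memstore)
```

Note that a read may have to check the MemStore and every HFile, which is why the number of HFiles matters for read performance.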
Minor Compaction 
• Writing a MemStore to an HFile may trigger a Minor Compaction
• Combines many small HFiles into one larger one
• Saves disk reads
• May block further MemStore flushes, so try to keep them to a minimum
Major Compaction 
• Happens at configurable times for the system 
• E.g., once a week on weekends
• Defaults to once every 24 hrs in older HBase releases; later releases default to 7 days
• Resource-intensive 
• Don’t set it to “never” 
• Reads in all HFiles and makes sure there is one HFile per Region per column family
• Purges deleted records 
• Ensures that HDFS files are local
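A compaction is essentially a merge of sorted files. The sketch below merges several sorted HFiles into one, keeps only the newest value for each key, and drops deleted records. It is an illustration, not HBase's implementation: `major_compact` is a hypothetical function, `TOMBSTONE` is an assumed delete marker, and cell versions and timestamps are ignored.

```python
import heapq

TOMBSTONE = None  # illustrative delete marker (assumption)


def major_compact(hfiles):
    """Merge sorted HFiles (oldest first) into one, dropping deleted rows."""
    # Tag each entry with the negated file age so that, for equal keys,
    # the entry from the newest file sorts first.
    tagged = [
        [(key, -age, value) for key, value in hfile]
        for age, hfile in enumerate(hfiles)
    ]
    merged = []
    seen_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged, key=lambda entry: entry[:2]):
        if key == seen_key:
            continue  # an older version of a key that was already handled
        seen_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged


if __name__ == "__main__":
    older = [("a", "1"), ("b", "2"), ("c", "3")]
    newer = [("b", TOMBSTONE), ("c", "30"), ("d", "4")]
    print(major_compact([older, newer]))
```

Because every input is already sorted, the merge streams through the files once, which is why compaction is heavy on I/O but simple in logic.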
Tuning your DB - HBase Keys 
• Row Key – byte array 
• Best performance for Single Row Gets 
• Best Caching Performance 
• Key Design – 
• Distributes well – usually accomplished by hashing natural key 
• MD5 
• SHA1
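One common way to hash a natural key is to prefix it with a few characters of its digest, so that sequential keys spread across regions while the original key stays readable. The slide does not say which scheme was used, so the prefix layout and the `salted_row_key` name below are assumptions.

```python
import hashlib


def salted_row_key(natural_key: str, prefix_len: int = 4) -> bytes:
    """Prefix the natural key with part of its MD5 digest (assumed layout)."""
    encoded = natural_key.encode("utf-8")
    digest = hashlib.md5(encoded).hexdigest()
    return digest[:prefix_len].encode("ascii") + b"_" + encoded


if __name__ == "__main__":
    for natural in ["order-0001", "order-0002", "order-0003"]:
        print(salted_row_key(natural))
```

The trade-off is that hashed keys no longer sort in natural order, so range scans over the natural key stop being contiguous.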
Tuning your DB - BlockCache 
• Each region server has a BlockCache where it stores file 
blocks that it has already read 
• Every read served from a cached block avoids a disk read
• Don’t want your blocks to be much bigger than your rows
• Modes of caching:
• 2-level LRU cache, by default
• Other options: BucketCache – can use DirectByteBuffers to manage off-heap RAM – better garbage-collection behavior on the region server
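The idea of an LRU block cache can be sketched in a few lines. This is a single-level cache, simpler than the caches named above. `BlockCache` here is a hypothetical class, and `read_block` is an assumed callback standing in for a read from disk or HDFS.

```python
from collections import OrderedDict


class BlockCache:
    """Toy single-level LRU cache keyed by block id."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity      # maximum number of cached blocks
        self.read_block = read_block  # callable: block_id -> block data
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        block = self.read_block(block_id)      # slow path: read from storage
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return block


if __name__ == "__main__":
    cache = BlockCache(2, lambda block_id: f"block-{block_id}")
    for block_id in [1, 2, 1, 3, 2]:
        cache.get(block_id)
    print("hits:", cache.hits, "misses:", cache.misses)
```

Since the unit of caching is a whole block, a block much larger than the rows being read fills the cache with data nobody asked for.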
Tuning your DB - Columns and Column Families
• All columns in a column family are accessed together for reads
• Different column families are stored in different HFiles
• All column families are flushed together when any MemStore is full
• Example: 
• Storing package tracking information: 
• Need package shipping info 
• Need to store each location in the path
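One possible layout for the package-tracking example keeps the shipping info in one column family and the path in another, so a read of one does not touch the other. The family names `info` and `path`, the qualifiers, and the nested-dict model below are all illustrative assumptions, not the layout used in the talk.

```python
from collections import defaultdict

# table[row_key][column_family][qualifier] = value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    """Store one cell under the given row and column family."""
    table[row_key][family][qualifier] = value


def get_family(row_key, family):
    """Read one column family without touching the others."""
    return dict(table[row_key][family])


# Shipping info: small, written once, read often.
put("pkg42", "info", "recipient", "A. Smith")
put("pkg42", "info", "address", "1 Main St")
# Path: one new column per location as the package moves.
put("pkg42", "path", "0001", "Newark")
put("pkg42", "path", "0002", "Chicago")

if __name__ == "__main__":
    print(get_family("pkg42", "info"))
    print(get_family("pkg42", "path"))
```

Whether two families are worth it depends on the access pattern, since all families are flushed together but read separately.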
Tuning your DB – Bloom Filters 
• Can be set at the row level (ROW) or the row-plus-column level (ROWCOL)
• Keep an extra index of available keys 
• Slows down reads and writes a bit 
• Increases storage 
• Saves time checking if keys exist 
• Turn on if it is likely that client will request missing data
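A Bloom filter answers "is this key definitely absent?" from a small bit array, which is how it saves the cost of looking for missing keys. The sketch below is a generic Bloom filter, not HBase's implementation. The `BloomFilter` class, its sizes, and the use of seeded SHA-1 hashes are assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: bytes):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    bloom = BloomFilter()
    bloom.add(b"row1")
    print(bloom.might_contain(b"row1"), bloom.might_contain(b"row9"))
```

A `False` answer lets the server skip the file entirely, while a `True` answer still requires a real lookup.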
Tuning your DB – Short-Circuit Reads 
• HDFS normally serves reads through a service interface
• If the file is actually local, it is much faster to read the HFile directly off the disk
Current Industry Trends
Big Data in Finance – the challenges 
• Real-Time financial analysis 
• Reliability 
• “medium-data”
What Bloomberg is Working on 
• Working with Hortonworks on fixing real-time issues in 
Hadoop 
• Creating a framework for reliably serving real-time data 
• Presenting at Hadoop World and Hadoop Summit 
• Open source Chef recipes for running a Hadoop cluster on OpenStack-managed VMs
Questions? 
• Thank you!

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

  • 1. BIG DATA AND HADOOP History, Technical Deep Dive, and Industry Trends Esther Kundin Bloomberg LP
  • 4. Outline • What Is Big Data? • A History Lesson • Hadoop – Dive in to the details • HDFS • MapReduce • HBase • Industry Trends • Questions
  • 5. What is Big Data?
  • 7. Big Data Origins • Indexing the web requires lots of storage • Petabytes of data! • Economic problem – reliable servers expensive! • Solution: • Cram in as many cheap machines as possible • Replace them when they fail • Solve reliability via software!
  • 8. Big Data Origins Cont’d • DBs are slow and expensive • Lots of unneeded features RDBMS NoSQL ACID Eventual consistency Strongly-typed No type checking Complex Joins Get/Put RAID storage Commodity hardware
  • 9. Big Data Origins Cont’d • Google publishes papers about: • GFS (2000) • MapReduce (2004) • BigTable (2006) • Hadoop, originally developed at Yahoo, accepted as Apache top-level project in 2008
  • 10. Translation GFS HDFS MapReduce Hadoop MapReduce BigTable HBASE
  • 11. Why Hadoop? • Huge and growing ecosystem of services • Pace of development is swift • Tons of money and talent pouring in
  • 12. Diving into the details!
  • 13. Hadoop Ecosystem • HDFS – Hadoop Distributed File System • Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data. • Hive: provides SQL-like access to your Big Data. • HBase: Hadoop database. • HCatalog: for defining and sharing schemas. • Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters. • ZooKeeper: an open-source server which enables highly reliable distributed coordination. • Sqoop: for efficiently transferring bulk data between Hadoop and relational databases. • Oozie: a workflow scheduler system to manage Apache Hadoop jobs. • Mahout: scalable machine learning library.
  • 14. HDFS • Hadoop Distributed File System • Basis for all other tools, built on top of it • Allows for distributed workloads
  • 18. MapReduce demo • To run, can use: • Custom Java application • Pig – nice interface • Hadoop Streaming + any executable, like Python • Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ • Hive – SQL over MapReduce – “we put the SQL in NoSQL”
  • 19. HBase • Database running on top of HDFS • NoSQL – key/value store • Distributed • Good for sparse requests, rather than scans like MapReduce • Sorted • Eventually Consistent
  • 20. HBase Architecture [diagram: Client; ZK Quorum of three ZK Peers; two HMasters; Meta Region Server; three RegionServers; HDFS]
  • 21. HBase Read [same diagram] • Client requests the Meta Region Server address
  • 22. HBase Architecture [same diagram] • Client determines which RegionServer to contact and caches that data
  • 23. HBase Architecture [same diagram] • Client requests data from the Region Server, which gets data from HDFS
  • 25. HMaster • Only one main master at a time – ensured by ZooKeeper • Keeps track of all table metadata • Used in table creation, modification, and deletion • Not used for reads
  • 26. Region Server • This is the worker node of HBase • Performs Gets, Puts, and Scans for the regions it handles • Multiple regions are handled by each Region Server • On startup • Registers with ZooKeeper • HMaster assigns it regions • Physical blocks on HDFS may or may not be on the same machine • Regions are split if they get too big • Data is stored in a format called HFile • Cache of data is what gives good performance. Cache is based on blocks, not rows
  • 27. HBase Write – step 1 [diagram: Region Server; WAL (on HDFS); MemStore; three HFiles] • Region Server persists the write at the end of the WAL
  • 28. HBase Write – step 2 [same diagram] • Region Server saves the write in a sorted map in memory in the MemStore
  • 29. HBase Write – offline [same diagram] • When the MemStore reaches a configurable size, it is flushed to an HFile
  • 30. Minor Compaction • Writing a MemStore to an HFile may trigger a Minor Compaction • Combines many small HFiles into one large one • Saves disk reads • May block further MemStore flushes, so try to keep to a minimum
  • 31. Major Compaction • Happens at configurable times for the system • E.g., once a week on weekends • Defaults to once every 24 hrs • Resource-intensive • Don’t set it to “never” • Reads in all HFiles and makes sure there is one HFile per Region per column family • Purges deleted records • Ensures that HDFS files are local
  • 32. Tuning your DB - HBase Keys • Row Key – byte array • Best performance for Single Row Gets • Best Caching Performance • Key Design – • Distributes well – usually accomplished by hashing natural key • MD5 • SHA1
  • 33. Tuning your DB - BlockCache • Each region server has a BlockCache where it stores file blocks that it has already read • Every read served from an already-cached block is faster • Don’t want your blocks to be much bigger than your rows • Modes of caching: • 2-level LRU cache, by default • Other options: BucketCache – can use DirectByteBuffers to manage off-heap RAM – better Garbage Collection stats on the region server
  • 34. Tuning your DB - Columns and Column Families • All columns in a column family are accessed together for reads • Different column families are stored in different HFiles • All column families are written at once when any MemStore is full • Example: • Storing package tracking information: • Need package shipping info • Need to store each location in the path
  • 35. Tuning your DB – Bloom Filters • Can be set on rows or columns • Keep an extra index of available keys • Slows down reads and writes a bit • Increases storage • Saves time checking if keys exist • Turn on if it is likely that client will request missing data
  • 36. Tuning your DB – Short-Circuit Reads • HDFS exposes a service interface • If the file is actually local, it is much faster to just read the HFile directly off of the disk
  • 38. Big Data in Finance – the challenges • Real-Time financial analysis • Reliability • “medium-data”
  • 39. What Bloomberg is Working on • Working with Hortonworks on fixing real-time issues in Hadoop • Creating a framework for reliably serving real-time data • Presenting at Hadoop World and Hadoop Summit • Open source Chef recipes for running a Hadoop cluster on OpenStack-managed VMs
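
Slide 32 recommends row keys that distribute well, usually by hashing the natural key with MD5 or SHA1. A minimal Python sketch of that idea follows. The function name `hashed_row_key`, the `|` separator, and the 8-character prefix length are illustrative assumptions, not something the slides specify.

```python
import hashlib


def hashed_row_key(natural_key: str, prefix_len: int = 8) -> bytes:
    """Build a row key (a byte array) that spreads sequential natural keys.

    Assumption for illustration: the key is "<md5 prefix>|<natural key>".
    The hash prefix decides where the row sorts, so adjacent natural keys
    (for example increasing package IDs) do not all land in one region.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}|{natural_key}".encode("utf-8")


if __name__ == "__main__":
    for package_id in ("pkg-0001", "pkg-0002", "pkg-0003"):
        print(hashed_row_key(package_id))
```

The trade-off is that rows no longer sort by natural key, so this layout suits single-row Gets (the access pattern slide 32 calls best) rather than range scans over the natural key.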

Editor's Notes

  1. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
  2. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
  3. NameNode is the manager, DataNode is the worker
  4. Job Tracker = Resource Manager; Task Tracker = Node Manager. Number of jobs depends on the range of keys. The number of reducers is set by the user – you’d want it to correspond to the set of possible values. So, if the values are ASCII, you won’t want reducers to exceed 256. You also don’t want them to exceed the number of data nodes you have.
  5. Remember, HBase treats everything as a file system
  6. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  7. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  8. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  9. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  10. All columns in a column family are read for a get – but not all column families unless specified
  11. Although there is a separate MemStore per column family, as soon as one is full, all of them are written to HFiles. Note also that deletes are handled with a marker, and only really purged at a major compaction
  12. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
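
The write path in slides 27 to 29, together with the note above that deletes are only markers until a major compaction, can be sketched as a toy in-memory model. This is not HBase code: the class `ToyRegion`, its method names, and the entry-count flush threshold are assumptions made for illustration, and the model covers a single column family with no minor compaction.

```python
TOMBSTONE = object()  # delete marker stored in place of a value


class ToyRegion:
    """Toy model of one region with a single column family (illustrative only)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log, standing in for the WAL on HDFS
        self.memstore = {}   # recent writes held in memory
        self.hfiles = []     # immutable sorted (key, value) lists, oldest first
        self.flush_threshold = flush_threshold  # assumed: counted in entries

    def put(self, key, value):
        self.wal.append((key, value))   # step 1: persist at the end of the WAL
        self.memstore[key] = value      # step 2: save the write in the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)        # a delete is just a marker write

    def flush(self):
        """Write the MemStore out as a new sorted HFile."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Check the MemStore first, then HFiles from newest to oldest."""
        if key in self.memstore:
            value = self.memstore[key]
        else:
            value = None
            for hfile in reversed(self.hfiles):
                entries = dict(hfile)
                if key in entries:
                    value = entries[key]
                    break
        return None if value is TOMBSTONE else value

    def major_compact(self):
        """Merge all HFiles into one and purge deleted records."""
        self.flush()
        merged = {}
        for hfile in self.hfiles:       # oldest first, so newer values win
            merged.update(hfile)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.hfiles = [live]
```

In this model a deleted key still occupies space in the HFiles until `major_compact()` runs, which is one way to see why the slides advise against setting major compaction to "never".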