INTRODUCTION TO
HADOOP
Explaining a complex product in 20 minutes or less…
INTRODUCTION
Keith R. Davis
Data Architect – NEMSIS Project
University of Utah, School of Medicine
keith.davis@hsc.utah.edu
WHAT IS HADOOP?
Hadoop is an open source Apache software project that enables
the distributed processing of large data sets across clusters of
commodity servers.
A QUICK BIT OF HISTORY…
• (2004) Google publishes the GFS and MapReduce papers
• (2005) Apache Nutch search project rewritten to use MapReduce
• (2006) Hadoop was factored out of the Apache Nutch project
• (2006) Development was sponsored by Yahoo
• (2008) Becomes a top-level Apache project
• (Trivia) Why is it called Hadoop?
• It was named after the principal architect’s son's toy elephant!
WHO IS USING HADOOP?
And more…
HOW IS HADOOP DIFFERENT FROM A
TRADITIONAL RDBMS?
• Data is not stored in tables
• Hadoop supports only forward parsing
• Hadoop doesn’t guarantee ACID properties
• Hadoop takes code to the data
• Scales horizontally vs. vertically
WHAT’S THE BIG DEAL?
Hadoop is:
• Easily scalable – New cluster nodes can be added as needed
• Cost effective – Hadoop brings massively parallel computing to commodity servers
• Flexible – Hadoop is schema-less and can absorb any type of data
• Fault tolerant – Shared-nothing architecture guards against data loss and process failure
WHEN SHOULD I USE HADOOP?
Use Hadoop when:
• You need to process terabytes of unstructured data
• Running batch jobs is acceptable
• You have access to a lot of cheap hardware
DO NOT use Hadoop when you need to:
• Perform calculations with little or no data (Pi to one million places)
• Process data in a transactional manner
• Get interactive ad-hoc results (this is changing)
BASIC ARCHITECTURE
Hadoop consists of two primary services:
1. Reliable storage through HDFS (Hadoop Distributed File System)
2. Parallel data processing using a technique known as MapReduce
HOW IT WORKS: HDFS WRITE STEP #1
(FILE SPLITS)
[Diagram: Input Data (CSV) is split into Block #1, Block #2, and Block #3]
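The file-split step can be sketched in a few lines of Python. This is only an illustration of the idea, not HDFS code: the function name `split_into_blocks`, the sample CSV, and the 12-byte block size are all assumptions made for the example (real HDFS blocks are many megabytes in size).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into consecutive fixed-size blocks.

    The final block may be shorter than block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Hypothetical 30-byte CSV input; a tiny block size keeps the output readable.
csv_data = b"id,name\n1,alice\n2,bob\n3,carol\n"
blocks = split_into_blocks(csv_data, block_size=12)

for number, block in enumerate(blocks, start=1):
    print(f"Block #{number}: {block!r}")
```

Note that the split is by size, not by record, so a CSV row can straddle two blocks; concatenating the blocks in order restores the original file.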
HOW IT WORKS: HDFS WRITE STEP #2
(REPLICATION)
[Diagram: Blocks #1, #2, and #3 are each stored twice, spread across Node #1, Node #2, and Node #3]
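A minimal sketch of the replication idea, assuming a simple round-robin placement and a replication factor of 2 to match the diagram. The helper names `place_replicas` and `blocks_available` are hypothetical, and real HDFS uses its own placement policy rather than this rotation.

```python
def place_replicas(block_ids, nodes, replication):
    """Assign each block to `replication` distinct nodes, rotating round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {node: [] for node in nodes}
    for index, block_id in enumerate(block_ids):
        for copy in range(replication):
            # Consecutive copies go to consecutive nodes, so no node holds
            # the same block twice.
            node = nodes[(index + copy) % len(nodes)]
            placement[node].append(block_id)
    return placement


def blocks_available(placement, failed_node):
    """Return the set of blocks still readable after one node fails."""
    return {
        block_id
        for node, held in placement.items()
        if node != failed_node
        for block_id in held
    }


nodes = ["Node #1", "Node #2", "Node #3"]
placement = place_replicas([1, 2, 3], nodes, replication=2)
print(placement)
print(blocks_available(placement, failed_node="Node #2"))
```

With two copies of every block on different nodes, losing any single node still leaves at least one copy of each block.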
HOW IT WORKS: MAP/REDUCE
[Diagram labels: Client; Job Scheduler; Data Nodes; HDFSFileSystem(input); Mappers; Reducers; HDFSFileSystem(output)]
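The map and reduce phases can be imitated in a single Python process. This word-count sketch only illustrates the programming model; it does not use the Hadoop API, and the names `mapper`, `shuffle`, `reducer`, and `run_job` are chosen for the example.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


counts = run_job(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster the mappers run in parallel on the nodes that hold the input blocks, and the shuffle moves data between machines; here everything happens in one loop for clarity.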
LOOKS COMPLICATED!
Not to worry, there are many ways to access the power of MapReduce:
• Hadoop Java API (If you like Java and low level stuff)
• Pig (If you are a script wiz and LINQ doesn’t scare you)
• Hive (You know some SQL and coding isn’t your thing)
• RHadoop (If R is your thing)
• SAS/ACCESS (If SAS is your thing)
HIVE: THE EASY WAY TO GET DATA OUT
• Supports the concepts of databases, tables, and partitions through the use of
metadata (think of views over delimited text files)
• Supports a restricted version of SQL (no updates or deletes)
• Supports joins between tables - INNER, OUTER (FULL, LEFT, and RIGHT)
• Supports UNION to combine multiple SELECT statements
• Provides a rich set of data types and predefined functions
• Allows the user to create custom scalar and aggregate functions
• Executes queries via MapReduce
• Provides JDBC and ODBC drivers for integration with other applications
• Hive is NOT a replacement for a traditional RDBMS as it is not ACID compliant
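To give a feel for how a join could be carried out as a MapReduce job, here is a conceptual reduce-side inner join in plain Python. It is a sketch of the general technique, not Hive's actual query plan; the tables `users` and `orders`, the column names, and the helper functions are all hypothetical.

```python
from collections import defaultdict


def map_tagged(table_name, rows, key_field):
    """Map phase: emit (join key, (source table, row)) for every row."""
    for row in rows:
        yield row[key_field], (table_name, row)


def reduce_inner_join(tagged_rows, left_name, right_name):
    """Reduce phase for one key: pair every left row with every right row."""
    left = [row for name, row in tagged_rows if name == left_name]
    right = [row for name, row in tagged_rows if name == right_name]
    for left_row in left:
        for right_row in right:
            yield {**left_row, **right_row}


def inner_join(left_name, left_rows, right_name, right_rows, key_field):
    """Group tagged rows from both tables by key, then join within each group."""
    groups = defaultdict(list)
    for key, tagged in map_tagged(left_name, left_rows, key_field):
        groups[key].append(tagged)
    for key, tagged in map_tagged(right_name, right_rows, key_field):
        groups[key].append(tagged)
    joined = []
    for tagged_rows in groups.values():
        joined.extend(reduce_inner_join(tagged_rows, left_name, right_name))
    return joined


users = [{"user_id": 1, "name": "alice"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "lamp"},
]
result = inner_join("users", users, "orders", orders, "user_id")
for row in result:
    print(row)
```

Keys that appear in only one table produce no output, which is what makes this an inner join; an outer join would emit the unmatched rows as well.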
HIVE: MATH AND STATS FUNCTIONS
If you use HIVE to create sample sets for your analysis, here are a few standard
functions you may find useful:
round(), floor(), ceil(), rand(), exp(), ln(), log10(), log2(), log(), pow(), sqrt(), bin(),
hex(), unhex(), conv(), abs(), pmod(), sin(), asin(), cos(), acos(), tan(), atan(),
degrees(), radians(), positive(), negative(), sign(), e(), pi(), count(), sum(), avg(),
min(), max(), variance(), var_samp(), stddev_pop(), stddev_samp(), covar_pop(),
covar_samp(), corr(), percentile(), percentile_approx(), histogram_numeric(),
collect_set()
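The population and sample variants in this list differ only in the divisor (n versus n - 1). The sketch below reproduces that difference with Python's standard `statistics` module on a made-up sample. It assumes, per the Hive documentation as I understand it, that variance() and stddev_pop() are the population forms and var_samp() and stddev_samp() are the sample forms; check this against your Hive version.

```python
import statistics

# Hypothetical sample of eight measurements (mean is 5.0).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population forms divide by n (assumed to match variance() / stddev_pop()).
population_variance = statistics.pvariance(values)
population_stddev = statistics.pstdev(values)

# Sample forms divide by n - 1 (assumed to match var_samp() / stddev_samp()).
sample_variance = statistics.variance(values)
sample_stddev = statistics.stdev(values)

print("population variance:", population_variance)
print("population stddev:  ", population_stddev)
print("sample variance:    ", sample_variance)
print("sample stddev:      ", sample_stddev)
```

When a query result is itself a sample drawn from a larger data set, the sample forms are usually the ones to report.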
RESOURCES
• Cloudera (Easy Setup) - http://www.cloudera.com/content/cloudera/en/home.html
• NoSQL - http://nosql-database.org/
• Emulab - http://www.emulab.net/
• Apache Hadoop - http://hadoop.apache.org/#Getting+Started
• RHadoop - https://github.com/RevolutionAnalytics/RHadoop/wiki
• SAS/ACCESS - http://www.sas.com/software/data-management/access/index.html
THANK YOU!

Hadoop intro

  • 1. INTRODUCTION TO HADOOP Explaining a complex product in 20 minutes or less…
  • 2. INTRODUCTION Keith R. Davis Data Architect – NEMSIS Project University of Utah, School of Medicine keith.davis@hsc.utah.edu
  • 3. WHAT IS HADOOP? Hadoop is an open source Apache software project that enables the distributed processing of large data sets across clusters of commodity servers.
  • 4. A QUICK BIT OF HISTORY… • (2004) Google publishes the GFS and MapReduce papers • (2005) Apache Nutch search project rewritten to use MapReduce • (2006) Hadoop was factored out of the Apache Nutch project • (2006) Development was sponsored by Yahoo • (2008) Becomes a top-level Apache project • (Trivia) Why is it called Hadoop? • It was named after the principal architect’s son's toy elephant!
  • 5. WHO IS USING HADOOP? And more…
  • 6. HOW IS HADOOP DIFFERENT FROM A TRADITIONAL RDBMS? • Data is not stored in tables • Haoop supports only forward parsing • Hadoop doesn’t guarantee ACID properties • Hadoop takes code to the data • Scales horizontally vs. vertically
  • 7. WHAT’S THE BIG DEAL? Hadoop is: • Easily Scalable– New cluster nodes can be added as needed • Cost effective– Hadoop brings massively parallel computing to commodity servers • Flexible– Hadoop is schema-less, and can absorb any type of data • Fault tolerant– Share nothing architecture prevents data loss and process failure
  • 8. WHEN SHOULD I USE HADOOP? Use Hadoop when you need to: • Process a terabytes of unstructured data • Running batch jobs is acceptable • You have access to a lot of cheap hardware DO NOT use Hadoop when you need to: • Perform calculations with little or no data (Pi to one million places) • Process data in a transactional manner • Have interactive ad-hoc results (this is changing)
  • 9. BASIC ARCHITECTURE Hadoop consists of two primary services: 1. Reliable storage though HDFS (Hadoop Distributed File System) 2. Parallel data processing using a technique known as MapReduce
  • 10. HOW IT WORKS: HDFS WRITE STEP #1 (FILE SPLITS) Input Data (CSV) Block #2 Block #1 Block #3
  • 11. HOW IT WORKS: HDFS WRITE STEP #2 (REPLICATION) Block #1 Block #2 Block #1 Block #3 Block #3 Block #2 Node #1 Node #2 Node #3
  • 12. HOW IT WORKS: MAP/REDUCE Client Job Scheduler Data Node Data Node Data Node Data Node ... ... HDFSFileSystem(input) HDFSFileSystem(output) Mapper Mapper Mapper Reducer Reducer Mapper Mapper Mapper Reducer Reducer
  • 13. LOOKS COMPLICATED! Not to worry, there are many ways to access the power of MapReduce: • Hadoop Java API (If you like Java and low level stuff) • Pig (If you are a script wiz and LINQ doesn’t scare you) • Hive (You know some SQL and coding isn’t your thing) • RHadoop (If R is your thing) • SAS/ACCESS (If SAS is your thing)
  • 14. HIVE: THE EASY WAY TO GET DATA OUT • Supports the concepts of databases, tables, and partitions through the use of metadata (think of views over delimited text files) • Supports a restricted version of SQL (no updates or deletes) • Supports joins between tables - INNER, OUTER (FULL, LEFT, and RIGHT) • Supports UNION to combine multiple SELECT statements • Provides a rich set of data types and predefined functions • Allows the user to create custom scalar and aggregate functions • Executes queries via MapReduce • Provides JDBC and ODBC drivers for integration with other applications • Hive is NOT a replacement for a traditional RDBMS as it is not ACID compliant
  • 15. HIVE: MATH AND STATS FUNCTIONS If you use Hive to create sample sets for your analysis, here are a few standard functions you may find useful: round(), floor(), ceil(), rand(), exp(), ln(), log10(), log2(), log(), pow(), sqrt(), bin(), hex(), unhex(), conv(), abs(), pmod(), sin(), asin(), cos(), acos(), tan(), atan(), degrees(), radians(), positive(), negative(), sign(), e(), pi(), count(), sum(), avg(), min(), max(), variance(), var_samp(), stddev_pop(), stddev_samp(), covar_pop(), covar_samp(), corr(), percentile(), percentile_approx(), histogram_numeric(), collect_set()
  • 16. RESOURCES • Cloudera (Easy Setup) - http://www.cloudera.com/content/cloudera/en/home.html • NoSQL - http://nosql-database.org/ • Emulab - http://www.emulab.net/ • Apache Hadoop - http://hadoop.apache.org/#Getting+Started • RHadoop - https://github.com/RevolutionAnalytics/RHadoop/wiki • SAS/ACCESS - http://www.sas.com/software/data-management/access/index.html
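
The two HDFS write steps on slides 10 and 11 (split the file into blocks, then replicate each block across nodes) can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: the 128 MB block size is the typical value quoted in the notes, the replication factor of 2 mirrors the slide diagram, and the round-robin placement and the function names (`split_into_blocks`, `place_replicas`, `surviving_blocks`) are invented for illustration. It is not the placement policy HDFS actually uses.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, assumed typical HDFS block size


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) in bytes of each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=2):
    """Toy round-robin placement: block i goes to `replication` consecutive nodes.

    Hypothetical policy for illustration only, not the real HDFS one.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(num_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after failed_node is lost."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB CSV file
    placement = place_replicas(len(blocks), ["node1", "node2", "node3"])
    print(len(blocks), "blocks:", placement)
    # Losing any single node leaves every block readable from another replica.
    print(surviving_blocks(placement, "node2") == set(placement))
```

Because every block lives on two different nodes, the loss of any one node in this model still leaves at least one copy of each block, which is the property the replication slide is pointing at.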

Editor's notes

  1. I have over 18 years of data architecture experience within the health care domain. I have worked in private, government and education sectors.
  2. Apache Nutch is an open source web-search software project that stems from the Apache Lucene project. Apache Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.
  3. eBay - A 532-node cluster with 4256 cores and about 5.3 PB of raw storage. Used for search optimization and research. Facebook - A 1100-machine cluster with 8800 cores and about 12 PB of raw storage. Used for reporting and machine learning.
  4. Data Not in Tables - At best some of the database layers mimic this, but deep in the bowels of HDFS, there are no tables, no primary keys, no indexes. Everything is a flat file with predetermined delimiters. HDFS is optimized to recognize the <Key, Value> mode of storage. Everything maps down to <Key, Value> pairs. Forward Parsing - You are either reading ahead or appending to the end. There is no concept of 'Update' or 'Delete'. Partitioning data across multiple files lets you reprocess files to simulate updates and deletions. ACID Properties - Atomicity (transactions), Consistency (the database conforms to all rules), Isolation (each transaction does not affect another), and Durability (permanent storage for committed transactions). Consistency in particular is not guaranteed: Hadoop offers what is called 'eventual consistency', meaning data will be saved eventually, but because of the highly asynchronous nature of the file system you are not guaranteed at what point it will finish. So HDFS-based systems are NOT ideal for OLTP architectures. Code to the Data - In traditional systems you fire a query to get data and then write code to manipulate it. In MapReduce, you write code, send it to Hadoop's data store, and get back the manipulated data. Essentially you are sending code to the data. Horizontal Scaling - Traditional databases like SQL Server scale better vertically, so more cores, more memory, and faster cores are the way to scale. Hadoop, however, scales horizontally by design. Keep throwing hardware at it and it will scale.
  5. Scalable - New nodes can be added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top. Cost Effective - Running on commodity servers yields a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data. Flexible - The data can be structured or unstructured and can come from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide. Fault Tolerant - When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
  6. Data is copied into HDFS (just like any file system operation) and is split into blocks. Typical block size: UNIX = 4 KB vs. HDFS = 128 MB.
  7. Each data block is replicated to multiple machines, which allows for node failure without data loss. (Point to how this would work)
  8. The client application submits a job to be executed. The job scheduler allocates mappers and reducers to process the input data. The mappers then filter and hash the data, which basically produces a giant hash table of key-value pairs. Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation time, depending on network bandwidth, CPU speeds, data produced, and time taken by map and reduce computations. The reducers perform any aggregations on the mapped data and return a single reduced result set to the client.
Notes: Map function - The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other. If the application is doing a word count, the map function would break the line into words and output a key/value pair for each word. Each output pair would contain the word as the key and the number of instances of that word in the line as the value. Partition function - Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer. A typical default is to hash the key and use the hash value modulo the number of reducers. It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers (reducers assigned more than their share of data) to finish. Reduce function - The framework calls the application's Reduce function once for each unique key, in sorted order. The Reduce function can iterate through the values that are associated with that key and produce zero or more outputs. In the word count example, the Reduce function takes the input values, sums them, and generates a single output of the word and the final sum.
  9. Pig is syntactically similar to LINQ. Most of you will probably want to use Hive to create sample sets, or the RHadoop R packages.
  10. Emulab at the U of U is a great way to set up and play with a Hadoop cluster. Your instructor will need to create a project and grant you access to it.
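
The word count flow described in note 8 (map, partition, shuffle, reduce) can be sketched in plain Python. This is a single-process illustration under stated assumptions, not Hadoop's API: the function names are invented here, the "shards" are ordinary in-memory dictionaries, and `zlib.crc32` stands in for the key hash because Python's built-in `hash()` for strings varies between runs.

```python
from collections import Counter, defaultdict
import zlib


def map_line(line):
    """Map: emit (word, occurrences of that word in this line) pairs."""
    return list(Counter(line.lower().split()).items())


def partition(key, num_reducers):
    """Partition: hash the key, then take it modulo the number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    """Shuffle: move each pair to its reducer's shard and group values by key."""
    shards = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        shards[partition(key, num_reducers)][key].append(value)
    return shards


def reduce_counts(key, values):
    """Reduce: sum all partial counts for one word."""
    return key, sum(values)


def word_count(lines, num_reducers=2):
    """Run map, shuffle, and reduce one after another in a single process."""
    mapped = [pair for line in lines for pair in map_line(line)]
    results = {}
    for shard in shuffle(mapped, num_reducers):
        for key in sorted(shard):  # each reducer sees its keys in sorted order
            word, total = reduce_counts(key, shard[key])
            results[word] = total
    return results


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog and the fox"]))
```

In a real cluster the mappers and reducers run in parallel on different nodes and the shuffle moves data over the network; here everything runs sequentially so that the data flow is easy to follow.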