Hadoop 101: Back to School
St. Louis Hadoop Users Group
Wednesday, September 6, 2017
Photo by JJ Thompson on Unsplash
Agenda
1. The V’s of Big Data
2. Hadoop Foundation
3. Hadoop Projects
a. Flume, Hive, Sqoop, Spark, Storm, and Kafka
4. Use Cases
5. Cloud
6. Getting your own environment setup
The V’s of Big Data
Photo by Bruno Martins on Unsplash
The V’s of Big Data
1. Volume - quantity of data, too much for one machine
2. Variety - tweets, videos, IoT, databases, logs
3. Velocity - batch, streaming from many devices
4. Variability - meaning of data changes, ex: sentiment
5. Veracity - data quality, accuracy
Hadoop Goals
● Scalability
● Reliability
● Cost
● Parallel processing
Hadoop Support among distros
● Commercial offerings from Amazon, Cloudera, Hortonworks, IBM, & MapR - Merv Adrian’s blog
● Five supporters
○ Apache HDFS, Apache MapReduce, Apache YARN, Apache Avro, Apache Flume, Apache HBase, Apache Hive, Apache Oozie, Apache Parquet, Apache Pig, Apache Solr, Apache Spark, Apache Sqoop, Apache ZooKeeper
● Four supporters
○ Apache Kafka, Apache Mahout, Hue
● Three supporters
○ Apache DataFu, Apache Impala, Cascading
● Be careful about versions!
○ Ex: Spark 1.6 vs Spark 2.x, Sqoop1 vs Sqoop2
38 - Total number of projects on the Apache Software Foundation “big data” list
Not counting Apache Hive, Apache HBase + others!
Apache Hadoop - Hadoop Distributed File System (HDFS)
● Store data across many machines
● Designed to store large files
○ Files are split into blocks
○ Blocks are replicated across different nodes in the cluster
● Many other Hadoop projects store their data in HDFS
● Using HDFS
○ Indirectly via other services (Hive, HBase, Spark, etc)
○ Access it directly using the command line:
■ hdfs dfs -help
■ hdfs dfs -ls
■ hdfs dfs -mkdir /tmp/something
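To make the block and replica idea concrete, here is a minimal sketch in plain Python (it does not call HDFS). It assumes a 128 MB block size and a replication factor of 3, both configurable values, and uses a simple round-robin as a stand-in for the real placement policy. The function and node names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB (configurable)
REPLICATION = 3                 # assumed replication factor (configurable)

def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]} for a file of file_size bytes."""
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for b in range(num_blocks):
        # Round-robin stand-in for the real placement policy:
        # pick `replication` distinct nodes for each block.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

# A 1 GB file on a hypothetical four-node cluster.
layout = place_blocks(1024 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
for block, replicas in layout.items():
    print(block, replicas)
```

Losing any single node in this sketch still leaves at least two copies of every block, which is the point of replication.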
Apache MapReduce
● Framework for processing data in HDFS
● Largely being replaced by higher level frameworks like Spark, Hive, etc.
● Core concepts are still important
○ A Job is split into multiple tasks to execute in parallel
○ Map - a transformation, filter, and/or sorting
○ Reduce - summarization like count, average, etc.
● Using MapReduce
○ Write a Java app using MapReduce API
○ Submit to run on the cluster
bin/hadoop jar hadoop-mapreduce-examples-<ver>.jar wordcount -files cachefile.txt -libjars mylib.jar -archives myarchive.zip input output
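The map and reduce concepts above can be sketched as a word count in plain Python. This is only an in-memory illustration of the flow (map, group by key, reduce), not the MapReduce Java API. All function names are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    # Map: transform one input record into (key, value) pairs.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, which the framework does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: summarize all values for one key.
    return key, sum(values)

lines = ["Hadoop stores data", "Spark processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, each call to `map_phase` or `reduce_phase` could run as a separate task on a different node, which is where the parallelism comes from.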
Apache Flume
● Tool for reliably ingesting data into Hadoop
● Core concepts
○ Agent - JVM processing event flow
○ Source - input - events from files, Avro, Thrift, Twitter, Kafka, etc.
○ Channel - passive store until event is consumed by the sink
○ Sink - output - to HDFS or another agent
● Using Flume
○ Create configuration file (Java properties file)
○ Start the Flume agent on nodes using the command line
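The source, channel, and sink roles can be illustrated with a toy agent in plain Python. This is a sketch of the event flow only; a real Flume agent is a JVM process driven by a properties file, and every name below is hypothetical.

```python
from collections import deque

def source(lines):
    # Source: turn raw input into events.
    for line in lines:
        yield {"body": line}

class Channel:
    # Channel: passive buffer that holds events until the sink takes them.
    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None

def run_agent(lines, sink):
    # Agent: moves events from the source into the channel,
    # then drains the channel into the sink.
    channel = Channel()
    for event in source(lines):
        channel.put(event)
    while (event := channel.take()) is not None:
        sink(event)

delivered = []  # stands in for HDFS or a downstream agent
run_agent(["GET /index.html", "GET /about.html"], delivered.append)
print(delivered)
```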
Apache Hive
● Query files in HDFS with “SQL”
● Schema on read
● Supports a variety of file formats
○ Plain text - delimited files like CSV, TSV
○ Columnar file formats - ORC, Parquet
○ Avro
○ JSON (with a serde)
● Using Hive
○ Command line with hive from the edge node
○ beeline (command line tool) - uses JDBC
○ Web UI like Hue or Ambari
○ SQuirreL or other clients
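Schema on read means the files are stored as-is and the table definition is applied only when a query reads them. A minimal sketch in plain Python, assuming a small CSV and a made-up three-column schema. Mapping an unparseable value to `None` mirrors how a mismatched field is commonly surfaced as NULL rather than rejected at load time; treat that detail as an assumption about your Hive setup.

```python
import csv
import io

# Raw delimited text, as it might sit in a file in HDFS (hypothetical data).
RAW = "1,alice,34.5\n2,bob,not_a_number\n"

# The schema lives apart from the data and is applied only when reading.
SCHEMA = [("id", int), ("name", str), ("score", float)]

def read_with_schema(raw, schema):
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for (column, cast), value in zip(schema, fields):
            try:
                row[column] = cast(value)
            except ValueError:
                # A value that does not fit the declared type becomes None
                # instead of failing the whole read.
                row[column] = None
        rows.append(row)
    return rows

rows = read_with_schema(RAW, SCHEMA)
print(rows)
```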
Apache Sqoop
● Move data between Hadoop and structured data stores like relational databases
○ Import - From RDBMS to Hadoop
○ Export - From Hadoop to RDBMS
● Uses JDBC to connect to the database and can write files to HDFS and/or Hive
● Using Sqoop
○ Use the command line tool from the edge node
$ sqoop import \
    --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
    --split-by a.id --target-dir /user/foo/joinresults
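The `$CONDITIONS` placeholder and `--split-by` work together: the split column's value range is divided among parallel tasks, and each task runs the query with its own range in place of `$CONDITIONS`. A rough sketch of that idea in plain Python, assuming four tasks and an id range of 1 to 1000. The exact predicates Sqoop generates may differ, and `split_conditions` is a hypothetical helper.

```python
def split_conditions(column, lo, hi, num_mappers=4):
    """Divide [lo, hi] into contiguous ranges, one WHERE predicate per mapper."""
    size = (hi - lo + 1) / num_mappers
    bounds = [lo + round(i * size) for i in range(num_mappers)] + [hi + 1]
    return [
        f"{column} >= {bounds[i]} AND {column} < {bounds[i + 1]}"
        for i in range(num_mappers)
    ]

query = "SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS"
per_mapper = [
    query.replace("$CONDITIONS", condition)
    for condition in split_conditions("a.id", 1, 1000)
]
for sql in per_mapper:
    print(sql)
```

A split column with evenly distributed values keeps the tasks balanced; a skewed column leaves some tasks with most of the rows.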
Apache Spark
● Framework for batch and streaming (micro-batch) data processing
● Faster (in memory!) and easier to use than MapReduce
● Modules
○ Spark SQL for SQL and structured data processing
○ MLlib for machine learning
○ GraphX for graph processing
○ Spark Streaming for stream processing
● Using Spark
○ Write a Spark application using Python, Scala, or Java APIs, then “submit” the application to the cluster
○ Use pyspark, the Python REPL (read-eval-print loop)
○ Use spark-shell, the Scala REPL
○ Use a notebook like Jupyter or Zeppelin
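Micro-batch streaming means incoming events are collected into small time windows and each window is processed as a tiny batch job. The sketch below shows that grouping in plain Python rather than with the Spark API. The event data and the one-second interval are assumptions for illustration.

```python
from collections import Counter

def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

# Hypothetical page-view events: (seconds since start, page).
events = [(0.2, "/home"), (0.9, "/cart"), (1.1, "/home"), (2.7, "/home")]

# Each micro-batch is processed as a whole, here with a simple count.
per_batch_counts = [Counter(batch) for batch in micro_batches(events, interval=1)]
print(per_batch_counts)
```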
Apache Storm
● Framework for processing streaming data in real-time
● Message at a time, not micro-batch
● Concepts
○ Tuples – an ordered list of elements
○ Streams – an unbounded sequence of tuples
○ Spouts – bring data in, create tuples
○ Bolts – process streams of data
○ Topologies – network of spouts and bolts
● Using Storm
○ Write Java code to build a Storm topology
○ Submit uber jar to the cluster with storm CLI
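The spout, bolt, and topology concepts can be mimicked with Python generators, where each tuple flows through the chain one at a time rather than in batches. This is an analogy only, not the Storm API, and all names are hypothetical.

```python
def sentence_spout():
    # Spout: brings data in and emits tuples (here, one-field tuples).
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)

def split_bolt(stream):
    # Bolt: consumes one tuple at a time and emits zero or more new tuples.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    # Bolt: keeps running state and emits an updated (word, count) tuple per input.
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Topology: wire the spout into the chain of bolts.
emitted = list(count_bolt(split_bolt(sentence_spout())))
print(emitted)
```

Because generators are lazy, each word is counted as soon as it is emitted, which is the message-at-a-time behavior the slide contrasts with micro-batching.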
Apache Kafka
● Publish-subscribe messaging for streaming data
● Installed on a cluster, data stored locally on disk
● Core concepts
○ Topics - streams of (key, value) records, split across partitions and stored in order within each partition
○ Producer - puts data on topics
○ Consumer(s) - read data off topics
● Data is retained for a limited amount of time
● Consumers can read data from a given offset
● Using Kafka
○ Client API to produce/consume data, or use it from another service to persist streaming data
○ Command line utilities for debugging
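Topics, partitions, and offsets can be modeled with a toy in-memory log. The sketch below is plain Python, not the Kafka client API. It uses a CRC32 hash as a stand-in for the real partitioner, and the `Topic` class and sensor data are hypothetical.

```python
import zlib

class Topic:
    def __init__(self, num_partitions):
        # Each partition is an append-only, ordered list of (key, value) records.
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Records with the same key always land in the same partition.
        index = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Reading does not remove records; a consumer just picks a start offset.
        return self.partitions[partition][offset:]

topic = Topic(num_partitions=3)
p, first = topic.produce("sensor-1", "21.5C")
topic.produce("sensor-1", "21.7C")
topic.produce("sensor-1", "22.0C")

# Re-read everything after the first record for this key's partition.
replay = topic.consume(p, first + 1)
print(replay)
```

Since reads leave the log untouched, several consumers can each track their own offset and replay data independently until it ages out of retention.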
Use Case #1 - Website Analytics
Photo by Igor Ovsyannykov on Unsplash
Quiz #1 Answers
Blue lines are Flume agents used to ingest web logs from servers into Hadoop
Orange line is Sqoop used to move data from Hadoop to a relational database
Use Case #2 - Data Warehouse Augmentation
Photo by Samuel Zeller on Unsplash
Quiz #2 Answers
Blue lines are Sqoop used to move data from relational database to Hadoop
Orange lines would be Hive to query the data in Hadoop with SQL
Use Case #3 - IoT
Quiz #3 Answers
Blue lines are Kafka, good intermediary between IoT devices and your stream processor
Orange lines could be Spark Streaming or Storm to process the data
Cloud
● Cloud offerings of Hadoop: Azure HDInsight, Amazon EMR, Google Cloud Dataproc
● Roll your own with Infrastructure as a Service
● Pros: Quicker time to market, easier to scale, integration with other cloud services
● Separation of storage and compute
○ Sacrifice storage performance for faster/easier scalability
Getting Started
● Useful skills
○ Java - troubleshooting errors
○ Linux - command line, ssh
● Locally
○ PC with 16 GB of RAM
○ VirtualBox, PuTTY, Browser
○ Sandbox from Hortonworks / Cloudera
● Cloud
○ Images available on Azure/Amazon
● Learning
○ Hadoop weekly email newsletter https://hadoopweekly.com/
○ YouTube, Slideshare
Links
Hadoop Apache Project Commercial Support Tracker April 2016
http://blogs.gartner.com/merv-adrian/2016/04/27/hadoop-apache-project-commercial-support-tracker-april-2016/
HDFS http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
MapReduce http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Flume https://flume.apache.org/FlumeUserGuide.html
Kafka http://kafka.apache.org/intro
Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Sqoop http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Hadoop Ecosystem Table https://hadoopecosystemtable.github.io/
Sandboxes https://hortonworks.com/products/sandbox/ https://www.cloudera.com/downloads/quickstart_vms/5-12.html
Thanks!
Contact me:
Kit Menke
@kitmenke
kmenke@1904labs.com

Contenu connexe

Tendances

Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
알쓸신잡
알쓸신잡알쓸신잡
알쓸신잡youngick
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practicesHadoop User Group
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confSujee Maniyam
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick GuideAsim Jalis
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 

Tendances (20)

Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop sqoop
Hadoop sqoop Hadoop sqoop
Hadoop sqoop
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
알쓸신잡
알쓸신잡알쓸신잡
알쓸신잡
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Hadoop
Hadoop Hadoop
Hadoop
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
 
Hadoop
HadoopHadoop
Hadoop
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 

Similaire à 9/2017 STL HUG - Back to School

Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataLuiz Henrique Zambom Santana
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationGeorge Long
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Impala turbocharge your big data access
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data accessOphir Cohen
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentationAmrut Patil
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014spinningmatt
 
Apache spark its place within a big data stack
Apache spark  its place within a big data stackApache spark  its place within a big data stack
Apache spark its place within a big data stackJunjun Olympia
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Geoffrey Fox
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 

Similaire à 9/2017 STL HUG - Back to School (20)

Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big Data
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Impala turbocharge your big data access
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data access
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Big data processing using hadoop poster presentation
Big data processing using hadoop poster presentationBig data processing using hadoop poster presentation
Big data processing using hadoop poster presentation
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStack
 
Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014Hadoop and OpenStack - Hadoop Summit San Jose 2014
Hadoop and OpenStack - Hadoop Summit San Jose 2014
 
Apache spark its place within a big data stack
Apache spark  its place within a big data stackApache spark  its place within a big data stack
Apache spark its place within a big data stack
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 

Plus de Adam Doyle

Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering RolesAdam Doyle
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowAdam Doyle
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAdam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop DevelopmentAdam Doyle
 
The new big data
The new big dataThe new big data
The new big dataAdam Doyle
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020Adam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAAdam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackAdam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does dataAdam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsAdam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingAdam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleAdam Doyle
 

Plus de Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 

Dernier

Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Dernier (20)

Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一

9/2017 STL HUG - Back to School

  • 1. Hadoop 101: Back to School St. Louis Hadoop Users Group Wednesday, September 6, 2017 Photo by JJ Thompson on Unsplash
  • 2. Agenda
    1. The V’s of Big Data
    2. Hadoop Foundation
    3. Hadoop Projects
       a. Flume, Hive, Sqoop, Spark, Storm, and Kafka
    4. Use Cases
    5. Cloud
    6. Getting your own environment set up
  • 3. The V’s of Big Data. Photo by Bruno Martins on Unsplash
  • 4. The V’s of Big Data
    1. Volume - quantity of data, too much for one machine
    2. Variety - tweets, videos, IoT, databases, logs
    3. Velocity - batch, streaming from many devices
    4. Variability - meaning of data changes, ex: sentiment
    5. Veracity - data quality, accuracy
  • 5. Hadoop Goals
    ● Scalability
    ● Reliability
    ● Cost
    ● Parallel processing
  • 6. Hadoop Support among distros
    ● Commercial offerings from Amazon, Cloudera, Hortonworks, IBM, & MapR - Merv Adrian’s blog
    ● Five supporters
      ○ Apache HDFS, Apache MapReduce, Apache YARN, Apache Avro, Apache Flume, Apache HBase, Apache Hive, Apache Oozie, Apache Parquet, Apache Pig, Apache Solr, Apache Spark, Apache Sqoop, Apache ZooKeeper
    ● Four supporters
      ○ Apache Kafka, Apache Mahout, Hue
    ● Three supporters
      ○ Apache DataFu, Apache Impala, Cascading
    ● Be careful about versions!
      ○ Ex: Spark 1.6 vs Spark 2.x, Sqoop1 vs Sqoop2
  • 7. 38: total number of projects on the Apache Software Foundation “big data” list, not counting Apache Hive, Apache HBase + others!
  • 8. Apache Hadoop - Hadoop Distributed File System (HDFS)
    ● Store data across many machines
    ● Designed to store large files
      ○ Files are split into blocks
      ○ Blocks are replicated across different nodes in the cluster
    ● Many other Hadoop projects store their data in HDFS
    ● Using HDFS
      ○ Indirectly via other services (Hive, HBase, Spark, etc.)
      ○ Access it directly using the command line:
        ■ hdfs dfs -help
        ■ hdfs dfs -ls
        ■ hdfs dfs -mkdir /tmp/something
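
To make the "split into blocks, then replicate" idea from the HDFS slide concrete, here is a minimal Python sketch. It is a toy model, not HDFS code: the function names are hypothetical, the 128 MB block size and replication factor of 3 are the common Hadoop 2.x defaults used here as assumptions, and the round-robin placement stands in for HDFS's real rack-aware policy.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block a file would be split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])
    print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
```

Because each block lives on several nodes, losing one node still leaves copies of every block elsewhere in the cluster, which is the reliability goal listed earlier in the deck.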
  • 9. Apache MapReduce
    ● Framework for processing data in HDFS
    ● Largely being replaced by higher level frameworks like Spark, Hive, etc.
    ● Core concepts are still important
      ○ A Job is split into multiple tasks to execute in parallel
      ○ Map - a transformation, filter, and/or sorting
      ○ Reduce - summarization like count, average, etc.
    ● Using MapReduce
      ○ Write a Java app using MapReduce API
      ○ Submit to run on the cluster
        bin/hadoop jar hadoop-mapreduce-examples-<ver>.jar wordcount -files cachefile.txt -libjars mylib.jar -archives myarchive.zip input output
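
The map and reduce concepts on the MapReduce slide can be sketched in plain Python as a single-process word count. This is not the Hadoop MapReduce API (which is Java); the function names are hypothetical, and the explicit `shuffle` step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: transform one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key (here, a count)."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line (or input split) would be mapped by a separate task.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["to be or not", "to be"]))
```

The map calls are independent of one another, and so are the reduce calls for different keys, which is what lets a job be split into tasks that run in parallel.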
  • 10. Apache Flume
    ● Tool for reliably ingesting data into Hadoop
    ● Core concepts
      ○ Agent - JVM processing event flow
      ○ Source - input - events from files, Avro, Thrift, Twitter, Kafka, etc.
      ○ Channel - passive store until event is consumed by the sink
      ○ Sink - output - to HDFS or another agent
    ● Using Flume
      ○ Create configuration file (Java properties file)
      ○ Start Flume agent on nodes using command line
  • 11. Apache Hive
    ● Query files in HDFS with “SQL”
    ● Schema on read
    ● Supports a variety of file formats
      ○ Plain text - delimited files like CSV, TSV
      ○ Columnar file formats - ORC, Parquet
      ○ Avro
      ○ JSON (with a serde)
    ● Using Hive
      ○ Command line with hive from the edge node
      ○ beeline (command line tool) - uses JDBC
      ○ Web UI like Hue or Ambari
      ○ SQuirreL or other clients
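
"Schema on read" from the Hive slide means the raw file is stored untouched and column names and types are applied only when the data is read. The Python sketch below illustrates that idea with a delimited text file; it does not use Hive, and `read_with_schema` and the sample schema are hypothetical names chosen for illustration.

```python
import csv
import io

# Raw delimited data, stored exactly as it arrived (no schema enforced on write).
RAW_FILE = "1,alice,34\n2,bob,27\n"

# The schema lives outside the file: (column name, type to cast to).
PEOPLE_SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(raw_text, schema):
    """Apply column names and types to each row at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


if __name__ == "__main__":
    rows = list(read_with_schema(RAW_FILE, PEOPLE_SCHEMA))
    # Roughly what "SELECT name FROM people WHERE age > 30" would return.
    print([row["name"] for row in rows if row["age"] > 30])
```

The same raw file could be read again later with a different schema, which is the flexibility schema on read trades against the up-front validation of a traditional database.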
  • 12. Apache Sqoop
    ● Move data between Hadoop and structured data stores like relational databases
      ○ Import - From RDBMS to Hadoop
      ○ Export - From Hadoop to RDBMS
    ● Uses JDBC to connect to the database and can write files to HDFS and/or Hive
    ● Using Sqoop
      ○ Use the command line tool from the edge node
        $ sqoop import --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' --split-by a.id --target-dir /user/foo/joinresults
  • 13. Apache Spark
    ● Framework for batch and streaming (micro-batch) data processing
    ● Faster (in memory!) and easier to use than MapReduce
    ● Modules
      ○ Spark SQL for SQL and structured data processing
      ○ MLlib for machine learning
      ○ GraphX for graph processing
      ○ Spark Streaming
    ● Using Spark
      ○ Write a Spark application using Python, Scala, or Java APIs, then “submit” the application to the cluster
      ○ Use pyspark, Python REPL (read-eval-print loop)
      ○ Use spark-shell, Scala REPL
      ○ Notebook like Jupyter, Zeppelin
  • 14. Apache Storm
    ● Framework for processing streaming data in real-time
    ● Message at a time, not micro-batch
    ● Concepts
      ○ Tuples - an ordered list of elements
      ○ Streams - an unbounded sequence of tuples
      ○ Spouts - bring data in, create tuples
      ○ Bolts - process streams of data
      ○ Topologies - network of spouts and bolts
    ● Using Storm
      ○ Write Java code to build a Storm topology
      ○ Submit uber jar to the cluster with storm CLI
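
The spout, bolt, and topology vocabulary from the Storm slide can be illustrated with Python generators. This is only an analogy under stated assumptions: real topologies are written against Storm's Java API and run distributed across a cluster, and every name below is hypothetical.

```python
def sentence_spout(sentences):
    """Spout: brings data in and emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) for every tuple seen."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(sentences):
    """Topology: wire the spout into the chain of bolts."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    for emitted in run_topology(["a b", "a"]):
        print(emitted)
```

Because generators are lazy, each tuple flows through both bolts before the next one is pulled from the spout, which mirrors the "message at a time, not micro-batch" point on the slide.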
  • 15. Apache Kafka
    ● Publish-subscribe messaging for streaming data
    ● Installed on a cluster, data stored locally on disk
    ● Core concepts
      ○ Topics - stream of records (key, value) stored in order, split up across partitions
      ○ Producer - puts data on topics
      ○ Consumer(s) - read data off topics
    ● Data is retained for a limited amount of time
    ● Consumers can read data from a given offset
    ● Using Kafka
      ○ Client API to produce/consume data, or from another service to persist data for streaming
      ○ Command line utilities for debugging
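
The topic, partition, and offset concepts on the Kafka slide can be modeled in a few lines of Python. This is an in-memory toy, not the Kafka client API: the `Topic` class and its methods are hypothetical, and the CRC32-based partitioner is an assumption chosen to keep the example deterministic, not the hash Kafka itself uses.

```python
import zlib


class Topic:
    """Toy topic: an ordered, append-only list of records per partition."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append a record; records with the same key land in the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def consume(self, partition, offset, max_records=10):
        """Read records starting at a given offset; reading does not remove them."""
        return self.partitions[partition][offset:offset + max_records]


if __name__ == "__main__":
    topic = Topic(num_partitions=2)
    partition, first_offset = topic.produce("sensor-1", "21.5C")
    topic.produce("sensor-1", "21.7C")
    # A consumer can replay from the first offset or resume from a later one.
    print(topic.consume(partition, first_offset))
    print(topic.consume(partition, first_offset + 1))
```

Since consuming leaves the records in place, several consumers can read the same partition independently, each tracking its own offset.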
  • 16. Use Case #1 - Website Analytics. Photo by Igor Ovsyannykov on Unsplash
  • 17. Quiz #1 Answers
    ● Blue lines are Flume agents used to ingest web logs from servers into Hadoop
    ● Orange line is Sqoop used to move data from Hadoop to a relational database
  • 18. Use Case #2 - Data Warehouse Augmentation. Photo by Samuel Zeller on Unsplash
  • 19. Quiz #2 Answers
    ● Blue lines are Sqoop used to move data from a relational database to Hadoop
    ● Orange lines would be Hive to query the data in Hadoop with SQL
  • 20. Use Case #3 - IoT
  • 21. Quiz #3 Answers
    ● Blue lines are Kafka, a good intermediary between IoT devices and your stream processor
    ● Orange lines could be Spark Streaming or Storm to process the data
  • 22. Cloud
    ● Cloud offerings of Hadoop: Azure HDInsight, Amazon EMR, Google Cloud Dataproc
    ● Roll your own with Infrastructure as a Service
    ● Pros: Quicker time to market, easier to scale, integration with other cloud services
    ● Separation of storage and compute
      ○ Sacrifice storage performance for faster/easier scalability
  • 23. Getting Started
    ● Useful skills
      ○ Java - troubleshooting errors
      ○ Linux - command line, ssh
    ● Locally
      ○ PC with 16 GB of RAM
      ○ VirtualBox, PuTTY, Browser
      ○ Sandbox from Hortonworks / Cloudera
    ● Cloud
      ○ Images available on Azure/Amazon
    ● Learning
      ○ Hadoop weekly email newsletter https://hadoopweekly.com/
      ○ YouTube, Slideshare
  • 24. Links
    ● Hadoop Apache Project Commercial Support Tracker April 2016: http://blogs.gartner.com/merv-adrian/2016/04/27/hadoop-apache-project-commercial-support-tracker-april-2016/
    ● HDFS: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
    ● MapReduce: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
    ● Flume: https://flume.apache.org/FlumeUserGuide.html
    ● Kafka: http://kafka.apache.org/intro
    ● Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
    ● Sqoop: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
    ● Hadoop Ecosystem Table: https://hadoopecosystemtable.github.io/
    ● Sandboxes: https://hortonworks.com/products/sandbox/ and https://www.cloudera.com/downloads/quickstart_vms/5-12.html