9/2017 STL HUG - Back to School
1. Hadoop 101: Back to School
St. Louis Hadoop Users Group
Wednesday, September 6, 2017
Photo by JJ Thompson on Unsplash
2. Agenda
1. The V’s of Big Data
2. Hadoop Foundation
3. Hadoop Projects
a. Flume, Hive, Sqoop, Spark, Storm, and Kafka
4. Use Cases
5. Cloud
6. Getting your own environment set up
3. The V’s of Big Data
Photo by Bruno Martins on Unsplash
4. The V’s of Big Data
1. Volume - quantity of data, too much for one machine
2. Variety - tweets, videos, IoT sensor data, databases, logs
3. Velocity - batch, streaming from many devices
4. Variability - the meaning of data changes over time, e.g., sentiment
5. Veracity - data quality, accuracy
6. Hadoop Support Among Distros
● Commercial offerings from Amazon, Cloudera, Hortonworks, IBM, & MapR - Merv Adrian’s blog
● Five supporters
○ Apache HDFS, Apache MapReduce, Apache YARN, Apache Avro, Apache Flume, Apache HBase, Apache Hive,
Apache Oozie, Apache Parquet, Apache Pig, Apache Solr, Apache Spark, Apache Sqoop, Apache Zookeeper
● Four supporters
○ Apache Kafka, Apache Mahout, Hue
● Three supporters
○ Apache DataFu, Apache Impala, Cascading
● Be careful about versions!
○ Ex: Spark 1.6 vs Spark 2.x, Sqoop1 vs Sqoop2
7. 38
Total number of projects on the Apache Software Foundation “big data” list
Not counting Apache Hive, Apache HBase + others!
8. Apache Hadoop - Hadoop Distributed File System (HDFS)
● Store data across many machines
● Designed to store large files
○ Files are split into blocks
○ Blocks are replicated across different nodes in the cluster
● Many other Hadoop projects store their data in HDFS
● Using HDFS
○ Indirectly via other services (Hive, HBase, Spark, etc)
○ Access it directly using the command line:
■ hdfs dfs -help
■ hdfs dfs -ls
■ hdfs dfs -mkdir /tmp/something
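For example, a quick session that copies a local file into the directory created above and then inspects how HDFS stored it (the file name is illustrative):
# copy a local file into HDFS
hdfs dfs -put weblog.txt /tmp/something/
# list the directory; the replication factor appears in the listing
hdfs dfs -ls /tmp/something
# show how the file was split into blocks and where the replicas live
hdfs fsck /tmp/something/weblog.txt -files -blocks -locations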
9. Apache Hadoop - MapReduce
● Framework for processing data in HDFS
● Largely being replaced by higher level frameworks like Spark, Hive, etc.
● Core concepts are still important
○ A Job is split into multiple tasks to execute in parallel
○ Map - a transformation, filter, and/or sorting
○ Reduce - summarization such as count, sum, or average
● Using MapReduce
○ Write a Java app using MapReduce API
○ Submit to run on the cluster
bin/hadoop jar hadoop-mapreduce-examples-<ver>.jar wordcount \
  -files cachefile.txt -libjars mylib.jar -archives myarchive.zip \
  input output
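That wordcount job is roughly the following Java program, condensed from the Apache Hadoop MapReduce tutorial:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in this task's input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}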
10. Apache Flume
● Tool for reliably ingesting data into Hadoop
● Core concepts
○ Agent - a JVM process hosting the event flow (sources, channels, sinks)
○ Source - input - events from files, Avro, Thrift, Twitter, Kafka, etc.
○ Channel - passive store until event is consumed by the sink
○ Sink - output - to HDFS or another agent
● Using Flume
○ Create configuration file (Java properties file)
○ Start the Flume agent on each node from the command line (see the sketch below)
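A minimal sketch of such a configuration file for one agent (the names a1/r1/c1/k1, the log path, and the HDFS path are illustrative):
# one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source: tail a web server log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1
# channel: hold events in memory until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# sink: write events to HDFS, one directory per day
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
Then start the agent: flume-ng agent --conf conf --conf-file weblog.conf --name a1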
11. Apache Hive
● Query files in HDFS with “SQL”
● Schema on read
● Supports a variety of file formats
○ Plain text - delimited files like CSV, TSV
○ Columnar file formats - ORC, Parquet
○ Avro
○ JSON (with a SerDe)
● Using Hive
○ Command line with hive from the edge node
○ beeline (command line tool) - uses JDBC
○ Web UI like Hue or Ambari
○ SQuirreL or other clients
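As a sketch of schema on read, you can lay a table over delimited files that already sit in HDFS (table name, columns, and path are illustrative) and query the raw files directly:
-- map a table onto existing CSV files; nothing is copied or converted
CREATE EXTERNAL TABLE weblogs (
  ip STRING,
  ts STRING,
  url STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/weblogs';

-- the files are parsed only when the query runs
SELECT url, COUNT(*) AS hits
FROM weblogs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;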
12. Apache Sqoop
● Move between Hadoop and structured data stores like relational databases
○ Import - From RDBMS to Hadoop
○ Export - From Hadoop to RDBMS
● Uses JDBC to connect to the database and can write files to HDFS and/or Hive tables
● Using Sqoop
○ Use the command line tool from the edge node
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults
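Going the other direction, a minimal export sketch (the connection string, table, and directory are illustrative):
$ sqoop export \
  --connect jdbc:mysql://db.example.com/reports \
  --table daily_summary \
  --export-dir /user/foo/summary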
13. Apache Spark
● Framework for batch and streaming (micro-batch) data processing
● Faster (in memory!) and easier to use than MapReduce
● Modules
○ Spark SQL for SQL and structured data processing
○ MLlib for machine learning
○ GraphX for graph processing
○ Spark Streaming for micro-batch stream processing
● Using Spark
○ Write a Spark application using Python, Scala, or Java APIs, then “submit” it to the cluster (see the sketch after this list)
○ Use pyspark, the Python REPL (read-eval-print loop)
○ Use spark-shell, the Scala REPL
○ Notebooks like Jupyter or Zeppelin
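As a sketch of what such an application looks like, here is word count again in the Java API, adapted from the standard Spark examples (input and output paths come in as arguments):
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("word count").getOrCreate();
    // read lines of text from HDFS
    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // split lines into words
        .mapToPair(word -> new Tuple2<>(word, 1))                   // emit (word, 1) pairs
        .reduceByKey((a, b) -> a + b);                              // sum counts per word
    counts.saveAsTextFile(args[1]);
    spark.stop();
  }
}
Submit it with: spark-submit --class SparkWordCount --master yarn wordcount.jar /input /output
Compare this to the MapReduce version above: the same job in a fraction of the code.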
14. Apache Storm
● Framework for processing streaming data in real-time
● Processes one message at a time, not micro-batches
● Concepts
○ Tuples – an ordered list of elements
○ Streams – an unbounded sequence of tuples
○ Spouts – bring data in, create tuples
○ Bolts – process streams of data
○ Topologies – network of spouts and bolts
● Using Storm
○ Write Java code to build a Storm topology
○ Submit an uber JAR to the cluster with the storm CLI
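A compact sketch of a topology built from those concepts (the spout fakes sensor readings with random numbers; the class names and alert threshold are illustrative):
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class ReadingTopology {
  // Spout: brings data in, emitting one tuple per reading
  public static class ReadingSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }
    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(Math.random() * 100)); // stand-in for a real device reading
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("reading"));
    }
  }

  // Bolt: processes the stream one tuple at a time
  public static class AlertBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      double reading = tuple.getDouble(0);
      if (reading > 90) System.out.println("ALERT: " + reading);
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("readings", new ReadingSpout(), 2);                        // 2 spout executors
    builder.setBolt("alerts", new AlertBolt(), 4).shuffleGrouping("readings");  // bolt subscribes to the spout
    Config conf = new Config();
    conf.setNumWorkers(2);
    StormSubmitter.submitTopology("reading-alerts", conf, builder.createTopology());
  }
}
Package it as an uber JAR and run: storm jar readings.jar ReadingTopology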
15. Apache Kafka
● Publish-subscribe messaging for streaming data
● Installed on a cluster, data stored locally on disk
● Core concepts
○ Topic - a stream of records (key, value) stored in order, split across partitions
○ Producer - puts data on topics
○ Consumer(s) - read data off topics
● Data is retained for a limited amount of time
● Consumers can read data from a given offset
● Using Kafka
○ Client API to produce/consume data, or indirectly via another service that uses Kafka to persist streaming data
○ Command line utilities for debugging
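A minimal producer sketch in Java (the broker address and topic name are illustrative), with the console-consumer utility used to watch the topic:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    // key = device id, value = reading; records with the same key land on the same partition
    producer.send(new ProducerRecord<>("sensor-readings", "device-42", "72.5"));
    producer.close();
  }
}
Watch it from the command line: kafka-console-consumer.sh --bootstrap-server broker1:9092 --topic sensor-readings --from-beginning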
16. Use Case #1 - Website Analytics
Photo by Igor Ovsyannykov on Unsplash
17. Quiz #1 Answers
Blue lines are Flume agents used to ingest web logs from the servers into Hadoop
Orange line is Sqoop, used to move data from Hadoop to a relational database
18. Use Case #2 - Data Warehouse Augmentation
Photo by Samuel Zeller on Unsplash
19. Quiz #2 Answers
Blue lines are Sqoop, used to move data from the relational database to Hadoop
Orange lines are Hive, used to query the data in Hadoop with SQL
21. Quiz #3 Answers
Blue lines are Kafka, a good intermediary between IoT devices and your stream processor
Orange lines could be Spark Streaming or Storm to process the data
22. Cloud
● Cloud offerings of Hadoop: Azure HDInsight, Amazon EMR, Google Cloud Dataproc
● Roll your own with Infrastructure as a Service
● Pros: Quicker time to market, easier to scale, integration with other cloud services
● Separation of storage and compute
○ Sacrifice storage performance for faster/easier scalability
23. Getting Started
● Useful skills
○ Java - troubleshooting errors
○ Linux - command line, ssh
● Locally
○ PC with 16 GB of RAM
○ VirtualBox, PuTTY, a browser
○ Sandbox from Hortonworks / Cloudera
● Cloud
○ Images available on Azure/Amazon
● Learning
○ Hadoop Weekly email newsletter https://hadoopweekly.com/
○ YouTube, Slideshare