This document provides an overview of big data and Hadoop. It defines big data as large volumes of structured, semi-structured and unstructured data that is growing exponentially and is too large for traditional databases to handle. It discusses the 4 V's of big data - volume, velocity, variety and veracity. The document then describes Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. It outlines the key components of Hadoop including HDFS, MapReduce, YARN and related modules. The document also discusses challenges of big data, use cases for Hadoop and provides a demo of configuring an HDInsight Hadoop cluster on Azure.
1. Big Data
Hadoop
Tomy Rhymond | Sr. Consultant | HMB Inc. | ttr@hmbnet.com | 614.432.9492
“Torture the data, and it will confess to anything.”
– Ronald Coase, economist and Nobel Prize laureate
"The goal is to turn data into information, and information
into insight." – Carly Fiorina
"Data are becoming the new raw material of business." – Craig Mundie,
Senior Advisor to the CEO at Microsoft.
“In God we trust. All others must bring data.” – W. Edwards
Deming, statistician, professor, author, lecturer, and consultant.
2. Agenda
• Data
• Big Data
• Hadoop
• Microsoft Azure HDInsight
• Hadoop Use Cases
• Demo
• Configure Hadoop Cluster / Azure Storage
• C# MapReduce
• Load and Analyze with Hive
• Use Pig Script to Analyze Data
• Excel Power Query
3. Houston, we have a “Data” problem.
• An IDC estimate puts the size of the “digital universe” at 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010.
• By 2020, emerging markets will supplant the developed world as the main producer of the world’s data.
• This flood of data is coming from many sources.
• The New York Stock Exchange generates about 1 terabyte of trade data per day.
• Facebook hosts approximately one petabyte of storage.
• The Large Hadron Collider produces about 15 petabytes of data per year.
• The Internet Archive stores around 2 petabytes of data, growing at a rate of 20 terabytes per month.
• Mobile devices and social networks contribute to the exponential growth of the data.
4. 85% Unstructured, 15% Structured
• The data as we traditionally know it is structured.
• Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data is readily searchable.
• Not all data we collect conforms to a specific, pre-defined data model.
• It tends to be the human-generated and people-oriented content that does not fit neatly into database tables.
• 85 percent of business-relevant information originates in unstructured form, primarily text.
• Lack of structure makes compilation a time- and energy-consuming task.
• These data sets are so large and complex that they are difficult to process using on-hand management tools or traditional data processing applications.
• This type of data is being generated by everything around us at all times.
• Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it.
5. Data Types
Relational Data – SQL Data
Unstructured Data – Twitter Feed
Semi-Structured Data – JSON
Unstructured Data – Amazon Review
6. So What is Big Data?
• Big Data is a popular term used to describe the exponential growth and
availability of data, both structured and unstructured.
• Capturing and managing lots of information; working with many new types of data.
• Exploiting these masses of information and new types of data in applications to extract meaningful value from big data.
• The process of applying serious computing to seriously massive and often highly
complex sets of information.
• Big data is arriving from multiple sources at an alarming velocity, volume and
variety.
• More data leads to more accurate analyses, and more accurate analyses may lead to more confident decision making.
7. The 4 V’s of Big Data
[Infographic: Volume – scale of data; Velocity – analysis of data; Variety – different forms of data; Veracity – uncertainty of data. Sample figures: an IDC estimate of 40 zettabytes of data by 2020; 500 million tweets per day; 2.5 quintillion bytes of data created each day; 200 billion photos and about 1 petabyte of storage at Facebook; an estimated 6 billion people with a cell phone; poor data quality costing businesses $600 billion a year and poor data costing the US economy an estimated $3.1 trillion a year; 1 in 3 leaders do not trust the information they use to make decisions.]
Volume: We currently see exponential growth in data storage because data is now much more than text. There are videos,
music and large images on our social media channels. It is very
common for enterprises to have terabytes and even petabytes of
storage.
Velocity: Velocity describes the frequency at which data is
generated, captured and shared. Recent developments mean that
not only consumers but also businesses generate more data in
much shorter cycles.
Variety: Today’s data no longer fits into neat, easy to consume
structures. New types include content, geo-spatial, hardware data
points, location based, log data, machine data, metrics, mobile,
physical data points, process, RFID etc.
Veracity: This refers to the uncertainty of the data available.
Veracity isn’t just about data quality; it’s about data
understandability, and it has an impact on confidence in the data.
8. Big Data vs Traditional Data
             Traditional                  Big Data
Data Size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear
9. Data Storage
• Storage capacity of hard drives has increased massively over the years.
• On the other hand, the access speeds of the drives have not kept up.
• A drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s,
• so all of its data could be read in about 5 minutes.
• Today, one-terabyte drives are the norm, but the transfer rate is around 100 MB/s,
• so it takes more than two and a half hours to read all the data.
• Writing is even slower.
• The obvious way to reduce the time is to read from multiple disks at once.
• With 100 disks, each holding one hundredth of the data and working in parallel, we could read all the data in under 2 minutes.
• Move computing to the data rather than bringing the data to the computing.
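A quick back-of-the-envelope check of those read times, written as a minimal C# sketch (it simply recomputes the figures quoted above; no Hadoop involved):

using System;

class DriveReadTime
{
    static void Main()
    {
        // 1990 drive: 1,370 MB at 4.4 MB/s
        double oldSeconds = 1370 / 4.4;
        Console.WriteLine($"1990 drive: {oldSeconds / 60:F1} minutes");          // ~5.2 minutes

        // Modern drive: 1 TB (~1,000,000 MB) at 100 MB/s
        double modernSeconds = 1000000 / 100.0;
        Console.WriteLine($"1 TB drive: {modernSeconds / 3600:F1} hours");       // ~2.8 hours

        // The same terabyte split across 100 disks read in parallel
        Console.WriteLine($"100 disks:  {modernSeconds / 100 / 60:F1} minutes"); // ~1.7 minutes
    }
}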
10. Why big data should matter to you
• The real issue is not that you are acquiring large amounts of data. It's what you do with
the data that counts. The hopeful vision is that organizations will be able to take data
from any source, harness relevant data and analyze it to find answers that enable
• cost reductions
• time reductions
• new product development and optimized offerings
• smarter business decision making.
• By combining big data and high-powered analytics, it is possible to:
• Determine root causes of failures, issues and defects in near-real time, potentially saving billions of
dollars annually.
• Send tailored recommendations to mobile devices while customers are in the right area to take
advantage of offers.
• Quickly identify customers who matter the most.
• Generate retail coupons at the point of sale based on the customer's current and past purchases.
11. OK, I Got Big Data, Now What?
• The huge influx of data raises many challenges.
• Process of inspecting, cleaning, transforming, and modeling data with the
goal of discovering useful information
• To analyze and extract meaningful value from these massive amounts of
data, we need optimal processing power.
• We need parallel processing, which requires many pieces of hardware.
• When we use many pieces of hardware, the chance that one will fail is fairly high.
• A common way to avoid data loss is through replication.
• Redundant copies of data are kept
• Data analysis tasks need to combine data
• The data from one disk may need to be combined with data from the 99 other disks.
12. Challenges of Big Data
• Information Growth
• Over 80% of the data in the enterprise consists of unstructured data, growing at a much
faster pace than traditional data
• Processing Power
• The approach to use single, expensive, powerful computer to crunch information
doesn’t scale for Big Data
• Physical Storage
• Capturing and managing all this information can consume enormous resources
• Data Issues
• Lack of data mobility, proprietary formats and interoperability obstacles can make
working with Big Data complicated
• Costs
• Extract, transform and load (ETL) processes for Big Data can be expensive and
time-consuming
13. Hadoop
• Apache™ Hadoop® is an open source software project that enables the
distributed processing of large data sets across clusters of commodity
servers.
• It is designed to scale up from a single server to thousands of machines, with
a very high degree of fault tolerance.
• All the modules in Hadoop are designed with a fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common
and thus should be automatically handled in software by the framework.
• Hadoop consists of the Hadoop Common package, which provides filesystem
and OS level abstractions, a MapReduce engine and the Hadoop Distributed
File System (HDFS).
14. History of Hadoop
• Hadoop is not an acronym; it’s a made-up name.
• Named after a stuffed yellow elephant belonging to the son of Doug Cutting, the project’s creator.
• 2002-2004 : Nutch Project - web-scale, open source, crawler-based search
engine.
• 2003-2004: Google released GFS (Google File System) & MapReduce
• 2005-2006: Added GFS (Google File System) and MapReduce implementations to Nutch
• 2006-2008: Yahoo hired Doug Cutting and his team. They spun out storage and
processing parts of Nutch to form Hadoop.
• 2009: Sorted 500 GB in 59 seconds (on 1,400 nodes) and 100 TB in 173
minutes (on 3,400 nodes)
15. Hadoop Modules
• Hadoop Common: The common utilities that support the other Hadoop
modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of
large data sets.
• Other Related Modules:
• Cassandra - scalable multi-master database with no single points of failure.
• HBase - A scalable, distributed database that supports structured data storage for large tables.
• Pig - A high-level data-flow language and execution framework for parallel computation.
• Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Zookeeper - A high-performance coordination service for distributed applications.
16. HDFS – Hadoop Distributed File System
• The heart of Hadoop is the HDFS.
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware.
• HDFS is designed on the following assumptions and goals:
• Hardware failure is the norm rather than the exception.
• HDFS is designed more for batch processing than for interactive use by users.
• Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes
to terabytes in size.
• HDFS applications use a write-once, read-many access model. A file, once created,
written and closed, need not be changed.
• A computation requested by an application is much more efficient if it is executed near
the data it operates on. In other words, moving computation is cheaper than moving
data.
• Easily portable from one platform to another.
17. Hadoop Architecture
• NameNode:
• The NameNode stores the filesystem metadata, i.e., which file maps to which block locations and which blocks are stored on which DataNode.
• Secondary NameNode:
• The NameNode is a single point of failure; the Secondary NameNode periodically merges the namespace image with the edit log to produce a checkpoint that can be used if the NameNode fails (it is not a hot standby).
• DataNode:
• The data node is where the actual data resides.
• All datanodes send a heartbeat message to the namenode
every 3 seconds to say that they are alive.
• The data nodes can talk to each other to rebalance data,
move and copy data around and keep the replication high.
• Job Tracker/Task Tracker:
• The primary function of the job tracker is resource
management (managing the task trackers), tracking resource
availability and task life cycle management (tracking its
progress, fault tolerance etc.)
• The task tracker has a simple function of following the
orders of the job tracker and updating the job tracker with
its progress status periodically.
[Diagram: HDFS architecture – a NameNode (with a Secondary NameNode) holding metadata; DataNodes spread across two racks, each storing replicated blocks (B1–B6); clients issuing metadata operations to the NameNode and reading blocks directly from DataNodes; replication traffic flowing between DataNodes.]
18. HDFS - InputSplit
• InputFormat
• Split the input blocks and files into logical
chunks of type InputSplit, each of which is
assigned to a map task for processing.
• RecordReader
• A RecordReader uses the data within the
boundaries created by the input split to
generate key/value pairs.
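To make the split/record-reader relationship concrete, here is a toy C# sketch (an illustration of the idea only, not the actual Hadoop API): an input split is just a byte range over a file, and a line-oriented record reader turns that range into (byte offset, line) key/value pairs for a map task.

using System;
using System.Collections.Generic;
using System.IO;

class ToySplitReader
{
    // An input split is simply a byte range over the input file.
    struct Split { public long Start; public long Length; }

    // Mimics a line-oriented RecordReader: yields (byte offset, line) pairs for one split.
    // The real reader also handles lines that straddle split boundaries; this sketch does not.
    static IEnumerable<KeyValuePair<long, string>> ReadRecords(string path, Split split)
    {
        using (var stream = File.OpenRead(path))
        {
            stream.Seek(split.Start, SeekOrigin.Begin);
            var reader = new StreamReader(stream);
            long offset = split.Start;
            string line;
            while (offset < split.Start + split.Length && (line = reader.ReadLine()) != null)
            {
                yield return new KeyValuePair<long, string>(offset, line);
                offset += line.Length + Environment.NewLine.Length;
            }
        }
    }

    static void Main()
    {
        var path = "input.txt";                                    // placeholder input file
        var split = new Split { Start = 0, Length = new FileInfo(path).Length };
        foreach (var kv in ReadRecords(path, split))
            Console.WriteLine($"({kv.Key}, {kv.Value})");          // e.g. (0, first line of the file)
    }
}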
19. MapReduce
• Hadoop MapReduce is a software framework for easily
writing applications which process vast amounts of data
(multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
• It is this programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a
Hadoop cluster.
• The term MapReduce actually refers to two separate and
distinct tasks that Hadoop programs perform.
• The first is the map job, which takes a set of data and
converts it into another set of data, where individual
elements are broken down into tuples (key/value
pairs).
• The reduce job takes the output from a map as input
and combines those data tuples into a smaller set of
tuples.
[Diagram: Hadoop MapReduce flow – big data is split across map tasks that emit key/value pairs such as (tattoo, 1), which the reduce phase combines into the final result.]
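To make the two phases concrete, here is a minimal, self-contained C# word-count sketch that runs the map, shuffle/group and reduce steps in memory with LINQ. It illustrates the programming model only; on a real cluster (or on HDInsight, as in the demo later) the framework distributes the map and reduce tasks and performs the shuffle between them, and the input lines below are made-up sample data.

using System;
using System.Collections.Generic;
using System.Linq;

class WordCountMapReduce
{
    // Map: break one input line into (word, 1) pairs.
    static IEnumerable<KeyValuePair<string, int>> Map(string line) =>
        line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => new KeyValuePair<string, int>(word.ToLowerInvariant(), 1));

    // Reduce: sum the counts that the shuffle phase grouped under one word.
    static KeyValuePair<string, int> Reduce(string word, IEnumerable<int> counts) =>
        new KeyValuePair<string, int>(word, counts.Sum());

    static void Main()
    {
        string[] inputLines =
        {
            "big data needs big clusters",
            "hadoop processes big data"
        };

        var results = inputLines
            .SelectMany(Map)                              // map phase
            .GroupBy(kv => kv.Key, kv => kv.Value)        // shuffle/sort: group values by key
            .Select(g => Reduce(g.Key, g));               // reduce phase

        foreach (var kv in results.OrderByDescending(kv => kv.Value))
            Console.WriteLine($"({kv.Key}, {kv.Value})"); // (big, 3), (data, 2), ...
    }
}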
20. Hadoop Distributions
• Microsoft Azure HDInsight
• IBM InfoSphere BigInsights
• Hortonworks
• Amazon Elastic MapReduce
• Cloudera CDH
21. Hadoop Meets The Mainframe
• BMC
• Control-M for Hadoop is an extension of BMC’s larger Control-M product suite that was
born in 1987 as an automated mainframe job scheduler.
• Compuware
• APM is an application performance management suite that spans the arc of enterprise
data center computing, from the mainframe to distributed commodity servers.
• Syncsort
• Syncsort offers Hadoop Connectivity to move data between Hadoop and other platforms
including the mainframe, Hadoop Sort Acceleration, and Hadoop ETL for cross-platform
data integration.
• Informatica
• HParser can run its data transformation services as a distributed application on Hadoop’s
MapReduce engine.
22. Azure HDInsight
• HDInsight makes Apache Hadoop available as a service in the cloud.
• Process, analyze, and gain new insights from big data using the power
of Apache Hadoop
• Drive decisions by analyzing unstructured data with Azure HDInsight,
a big data solution powered by Apache Hadoop.
• Build and run Hadoop clusters in minutes.
• Analyze results with Power Pivot and Power View in Excel.
• Choose your language, including Java and .NET. Query and transform
data through Hive.
23. Azure HDInsight
• Scale elastically on demand
• Crunch all data – structured, semi-structured, unstructured
• Develop in your favorite language
• No hardware to acquire or maintain
• Connect on-premises Hadoop clusters with the cloud
• Use Excel to visualize your Hadoop data
• Includes NoSQL transactional capabilities
25. HDInsight
• The combination of Azure Storage and HDInsight provides an ideal
framework for running MapReduce jobs.
• Creating an HDInsight cluster is quick and easy: log in to Azure,
select the number of nodes, name the cluster, and set
permissions.
• The cluster is available on demand, and once a job is completed,
the cluster can be deleted but the data remains in Azure Storage.
• Use PowerShell to submit MapReduce jobs.
• Use C# to create MapReduce programs.
• Supports Pig Latin, Avro, Sqoop and more.
26. Use cases
• A 360 degree view of the customer
• Businesses want to know how to utilize social media postings to improve
revenue.
• Utilities: Predict power consumption
• Marketing: Sentiment analysis
• Customer service: Call monitoring
• Retail and marketing: Mobile data and location-based targeting
• Internet of Things (IoT)
• Big Data Service Refinery
27. Demo
• Configure HDInsight Cluster
• Create Mapper and Reducer Program using Visual Studio C#
• Upload Data to Blob Storage using Azure Storage Explorer
• Run Hadoop Job
• Export output to Power Query for Excel
• Hive Example with HDInsight
• Pig Script with HDInsight
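If you would rather script the “upload data to blob storage” step than click through Azure Storage Explorer, a minimal sketch with the classic Microsoft.WindowsAzure.Storage SDK (the WindowsAzure.Storage NuGet package) looks roughly like this; the connection string, container name and file paths are placeholders, not values from the demo.

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class UploadSampleData
{
    static void Main()
    {
        // Placeholder credentials -- substitute the storage account linked to the HDInsight cluster.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");

        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("hdinsight-input");
        container.CreateIfNotExists();

        // HDInsight jobs can then read the uploaded blob via its blob-storage path.
        var blob = container.GetBlockBlobReference("example/data/input.txt");
        using (var file = File.OpenRead("input.txt"))
        {
            blob.UploadFromStream(file);
        }
    }
}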
28. The 4 V’s of Big Data (infographic recap of slide 7)
29. Resources for HDInsight for Windows Azure
Microsoft: HDInsight
• Welcome to Hadoop on Windows Azure - the welcome page for the Developer Preview for the Apache Hadoop-based Services for Windows Azure.
• Apache Hadoop-based Services for Windows Azure How To Guide - Hadoop on Windows Azure documentation.
• Big Data and Windows Azure - Big Data scenarios that explore what you can build with Windows Azure.
Microsoft: Windows and SQL Database
• Windows Azure home page - scenarios, free trial sign-up, development tools and documentation that you need to get started building applications.
• MSDN SQL - MSDN documentation for SQL Database.
• Management Portal for SQL Database - a lightweight and easy-to-use database management tool for managing SQL Database in the cloud.
• Adventure Works for SQL Database - Download page for SQL Database sample database.
Microsoft: Business Intelligence
• Microsoft BI PowerPivot- a powerful data mashup and data exploration tool.
• SQL Server 2012 Analysis Services - build comprehensive, enterprise-scale analytic solutions that deliver actionable insights.
• SQL Server 2012 Reporting - a comprehensive, highly scalable solution that enables real-time decision making across the enterprise.
Apache Hadoop:
• Apache Hadoop - software library providing a framework that allows for the distributed processing of large data sets across clusters of computers.
• HDFS - Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• Map Reduce - a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
Hortonworks:
• Sandbox - Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
30. About Me
Tomy Rhymond
Sr. Consultant, HMB, Inc.
ttr@hmbnet.com
http://tomyrhymond.wordpress.com
@trhymond
614.432.9492 (m)