3. Objectives
• By the end of this chapter, you will be able to:
–Understand the concepts of big data processing
–Explain the building blocks of Hadoop MapReduce
–Understand the concepts of YARN
–Compare Big Data with traditional data
4. Introduction
• Data processing is the manipulation of data by a computer.
– It includes the conversion of raw data to machine-readable form, flow of
data through the CPU and memory to output devices, and formatting or
transformation of output.
– Any use of computers to perform defined operations on data can be
included under data processing
• Big data processing is a set of techniques or programming models
to access large-scale data to extract useful information for
supporting and providing decisions.
• Big Data processing matters because it helps companies generate valuable insights.
– Companies use Big Data to refine their marketing campaigns and techniques.
– Companies also use it in machine learning projects to train models, in predictive
modeling, and in other advanced analytics applications.
13. Hadoop MapReduce
• Big data processing is not handled by a single machine, so MapReduce is
used to process big data in a parallel and
distributed manner.
• MapReduce is a framework for writing applications that process huge
amounts of data, in parallel, on large clusters of commodity hardware, in a
reliable manner.
• It is a processing technique and a programming model for distributed computing,
based on Java.
16. Hadoop MapReduce….
• The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
– Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
– The Reduce task takes the output from a map as input and combines those
data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce
task is always performed after the map task.
• MapReduce makes it easy to scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called
mappers and reducers.
23. • A MapReduce program executes in three stages, namely the map
stage, the shuffle stage, and the reduce stage.
–Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of a file or directory
and is stored in the Hadoop Distributed File System (HDFS).
–The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of
data.
–Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in HDFS.
MapReduce Execution Stages
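To make the map stage concrete, here is a minimal sketch of a WordCount mapper using the Hadoop MapReduce Java API. The class and file names are illustrative, not from the slides: the framework feeds the mapper one line at a time and the mapper emits (word, 1) key-value pairs.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for the classic WordCount job: each input line is tokenized
// into words, and a (word, 1) key-value pair is emitted for every word.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The framework passes the file to this method line by line;
        // 'key' is the byte offset of the line, 'value' is the line text.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit an intermediate (word, 1) pair
        }
    }
}
```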
33. • Input reader
– The input reader reads the incoming data and splits it into data
blocks of an appropriate size (typically 64 MB to 128 MB). Each data block is
associated with a Map function.
– Once the input reader reads the data, it generates the corresponding key-value pairs.
– The input files reside in HDFS. The input data can be in any form.
• Map function
– The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs. The map input and output types may
differ from each other.
Data Flow in MapReduce (Phases)
34. • Partition function
– The partition function assigns the output of each Map function to the
appropriate reducer. Given a key and a value, it returns the index of the
reducer that should receive them.
• Shuffling and Sorting
– Data is shuffled between and within nodes so that it moves from the
map outputs to the reduce function. Sometimes, the
shuffling of data can take considerable time.
– A sorting operation is performed on the input to each Reduce function:
the keys are compared using a comparison function and arranged in
sorted order.
Data Flow in MapReduce…
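As an illustration of the partition function, the sketch below is a custom Hadoop Partitioner. The class name is illustrative; Hadoop's default HashPartitioner behaves essentially the same way, hashing the key to pick a reducer index.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A partition function: given a map-output key and value, it returns
// the index of the reducer that should receive the pair.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // Mask off the sign bit so the index is never negative,
        // then take the remainder to land in [0, numReducers).
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```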
35. • Reduce function
– The Reduce function is invoked once for each unique key. The keys arrive
in sorted order, and the Reduce function iterates over the values associated
with each key to generate the corresponding output.
• Output writer
– Once the data has flowed through all the above phases, the output writer
executes. Its role is to write the Reduce output to stable
storage.
Data Flow in MapReduce…
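A minimal sketch of the matching WordCount reducer: the framework calls it once per unique key, with all of that key's values available as an iterable, and the output writer then stores what it emits.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer for WordCount: called once for each unique word, with all
// the counts emitted by the mappers for that word. It sums them and
// writes the final (word, total) pair for the output writer to store.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get(); // iterate the values for this key
        }
        context.write(key, new IntWritable(sum));
    }
}
```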
37. • MapReduce Mapper Class
– In MapReduce, the role of the Mapper class is to map the input key-value
pairs to a set of intermediate key-value pairs.
– It transforms the input records into intermediate records.
– The intermediate records associated with a given output key are passed to
the Reducer for the final output.
• MapReduce Reducer Class
– Its role is to reduce the set of intermediate values that share a key.
– Its implementations can access the Configuration for the job via the
JobContext.getConfiguration() method.
• MapReduce Job Class
– The Job class is used to configure the job, submit it, control its
execution, and query its state. Once the job is submitted, any set method
throws IllegalStateException.
MapReduce API
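A minimal driver sketch showing how the Job class wires the pieces together and submits the job. The mapper and reducer are the hypothetical WordCount classes sketched earlier; the input and output paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the Mapper and Reducer classes; after submission,
        // calling any of these set methods throws IllegalStateException.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job, then block while reporting its state.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```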
38. • YARN (Yet Another Resource Negotiator) takes Hadoop programming to the next
level beyond Java-only MapReduce, and lets other applications such as HBase and
Spark work on the cluster.
• Different YARN applications can co-exist on the same cluster, so MapReduce,
HBase, and Spark can all run at the same time, bringing great benefits for
manageability and cluster utilization.
• The JobTracker and TaskTracker were used in previous versions of Hadoop;
they were responsible for managing resources and tracking job
progress.
• Hadoop 2.0 introduced the ResourceManager and NodeManager to
overcome the shortfalls of the JobTracker and TaskTracker.
Overview of YARN, a Component of Hadoop 2.0
47. • Client: For submitting MapReduce jobs.
• Resource Manager: To manage the use of resources across the
cluster
• Node Manager: For launching and monitoring the compute
containers on machines in the cluster.
• MapReduce Application Master: Coordinates the tasks running the
MapReduce job.
– The application master and the MapReduce tasks run in containers that
are scheduled by the resource manager, and managed by the node
managers.
Components of YARN
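To illustrate the client/ResourceManager interaction, here is a hedged sketch using Hadoop's YarnClient API to ask the ResourceManager for the applications it currently knows about. This is one way a client can observe the cluster, not an example from the slides.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// A client talking to the ResourceManager: it connects using the
// cluster configuration and lists the applications the RM knows about.
public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml
        yarnClient.start();

        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(),
                    app.getName(),
                    app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```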
52. • There are three main types of schedulers in Hadoop:
–FIFO (First In First Out) Scheduler
–Capacity Scheduler
–Fair Scheduler
Types of Scheduling
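With the Capacity and Fair Schedulers, jobs are submitted to named queues. As a hedged sketch (the queue name "analytics" is hypothetical and must exist in the cluster's scheduler configuration), a MapReduce client can direct its job to a queue via the standard mapreduce.job.queuename property:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmissionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Direct this job to a named scheduler queue. The queue
        // "analytics" is hypothetical; it must be defined in the
        // cluster's Capacity or Fair Scheduler configuration.
        conf.set("mapreduce.job.queuename", "analytics");

        Job job = Job.getInstance(conf, "queued job");
        // ... set mapper/reducer and input/output paths as usual ...
    }
}
```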
53. • Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes
and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000
tasks.
• Utilization: The Node Manager manages a pool of resources, rather
than a fixed number of designated slots, thus increasing
utilization.
• Multitenancy: Different versions of MapReduce can run on YARN,
which makes the process of upgrading MapReduce more manageable.
Benefits of YARN
55. Contents
• Introduction
• Abstracting Hadoop MapReduce jobs with Pig
• Performing ad hoc Big Data querying with Hive
• Creating business value from extracted data
56. Objectives
• By the end of this chapter, you will be able to:
–Identify different tools and techniques for big data
–Understand the concepts of Pig
–Understand the concepts of Hive
57. Introduction
• Abstracting Hadoop MapReduce jobs with Pig
–Communicating with Hadoop in Pig Latin
–Executing commands using the Grunt Shell
–Streamlining high-level processing
• Performing ad hoc Big Data querying with Hive
–Persisting data in the Hive Metastore
–Performing queries with HiveQL
–Investigating Hive file formats
• Creating business value from extracted data
–Mining data with Mahout
–Visualizing processed results with reporting tools, BI
–Querying in real time with Impala
59. Abstracting Hadoop MapReduce jobs with Pig
• Pig was initially developed by Yahoo to make programming on Hadoop easier.
– Apache Pig can process extensive datasets because it works on
top of Hadoop. It is used to analyze massive datasets by
representing them as data flows.
– Apache Pig also raises the level of abstraction for processing enormous
datasets.
– Pig Latin is the scripting language that developers use to work with the
Pig framework; scripts are executed by the Pig runtime (see the sketch
after this list).
• Features of Pig:
– Easy to program
– Rich set of operators
– Ability to handle various kinds of data
– Extensibility
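As a sketch of how Pig Latin abstracts MapReduce, the Java snippet below embeds a small Pig Latin data flow using Pig's PigServer API; the input file and field names are hypothetical. Pig compiles the whole flow into MapReduce jobs only when the result is stored.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Runs a small Pig Latin data flow from Java. Each registered query
// adds one step to the flow; Pig compiles the flow into jobs lazily.
public class PigSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // or MAPREDUCE

        // Hypothetical input: a tab-separated file of (name, age) rows.
        pig.registerQuery("users = LOAD 'users.tsv' "
                + "AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER users BY age >= 18;");
        pig.registerQuery("grouped = GROUP adults BY age;");
        pig.registerQuery("counts = FOREACH grouped "
                + "GENERATE group AS age, COUNT(adults) AS n;");

        // Triggers compilation of the flow and writes the output.
        pig.store("counts", "adults_by_age");
    }
}
```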
69. Objectives
• By the end of this chapter, you will be able to:
–Understand the concepts of big data strategy
–Explain a Big Data strategy and considerations
70. Introduction
• Strategy is a plan of action or policy designed to achieve an overall aim.
• Big Data is Worthless Without a Big Data Strategy.
• However, a Big Data strategy cannot be seen as something separate from the
organizational strategy; it should be firmly embedded in it.
• When we say a Big Data strategy, this effectively means a business
strategy that includes Big Data.
• It defines and lays out a comprehensive vision across the enterprise and
sets a foundation for the organization to employ data-related or data-
dependent capabilities.
• A well-defined and comprehensive Big Data strategy makes the benefits
of Big Data actionable for the organization.
• It sets out the steps that an organization should execute in order to
become a “Data Driven Enterprise”.
71. Introduction..
• The Big Data strategy incorporates guiding principles to accomplish
– the data-driven vision;
– it directs the organization to select specific business goals and is the starting
point for data-driven planning across the enterprise.
• Big data holds many promises, such as
– gaining valuable customer insights
– predicting the future
– generating new revenue streams, etc.
• An effective big data strategy is therefore essential.
72. Big Data Considerations
• You can’t process the amount of data that you want to because of the
limitations of your current platform.
• You can’t include new/contemporary data sources (for example, social media,
RFID, sensor, Web, GPS, or textual data) because they do not comply with
the data storage schema.
• You need to (or want to) integrate data as quickly as possible to be
current on your analysis.
• You want to work with a schema-on-demand data storage paradigm
because of the variety of data types involved.
• The data is arriving so fast at your organization’s doorstep that your
traditional analytics platform cannot handle it.
73. Critical Success Factors for Big Data Analytics
• A clear business need (alignment with the vision and the strategy).
• Strong, committed sponsorship (executive champion)
• Alignment between the business and IT strategy.
• A fact-based decision-making culture.
• A strong data infrastructure.
• The right analytics tools.
• The right people with the right skills
75. Business Problems Addressed by Big Data
Analytics
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling/up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities
76. What are Big Data Objectives?
• The technologies and concepts behind big data allow
organizations to achieve a variety of objectives.
• Like many new information technologies,
–big data can bring about
• dramatic cost reductions,
• substantial improvements in the time required to perform a
computing task,
• or new product and service offerings
77. Defining a Big Data Strategy
• A good Big Data strategy will explore the following subject
domains and align them to organizational objectives:
1. Identify an Opportunity & Economic Value of Data
2. Defining Big Data Architecture
3. Selecting Big Data Technologies
4. Understanding Big Data Science
5. Developing Big Data Analytics
6. Institutionalize Big Data
78. Defining a Big Data Strategy
1. Identify an Opportunity & Economic Value of Data
–Catalog existing data sources available inside the organization, tapped
or untapped.
–Invent new ways of capturing data, integrate your data sources with
external communities. Develop semantics and metadata for
association, clustering, classification and trending.
–Identify and create opportunities to integrate and fuse data with
partners’ datasets in industries like Telecom, Travel, Financial, Healthcare,
and Entertainment, etc.
–Conceptualize the data insights and possible data sciences needed to extract
valuable data, e.g., associations, simulation, regression, correlation,
segmentation, trending, and prediction.
79. Defining a Big Data Strategy
Identify an Opportunity & Economic Value of Data…
–Identify the scope of data access, i.e., who can explore data and who gets
access to data insights.
–Identify possibilities for monetizing data to generate revenue from the
insights gained, such as generating leads, running campaigns, upsell/cross-sell
opportunities, data streaming, data APIs, and improving staff productivity &
customer service.
–Identify the ethical and legal codes associated with data under exploration
with respect to industry standards, organizational culture, data
policies, data privacy, and regulatory and legal requirements.
–Establish data requirements: What type of data do you need? Is it diverse
enough? How will you source it and store it?
80. Defining a Big Data Strategy
2. Defining Big Data Architecture
– Defining business problems & classifying the associated data, such as
Market Sentiment Analysis, Churn Prediction, or Fraud Detection.
– Defining a Data Acquisition Strategy.
– Selecting a Hadoop Framework.
– Defining a Big Data Life Cycle Management Framework.
– Choosing Big Data stores: traditional or NoSQL, and polyglot
persistence.
– Defining the Big Data Infrastructure & Platform Taxonomy.
– Identifying Big Data Analytics Frameworks and associated Machine
Learning Sciences.
– Developing a Data Monetization Strategy to exploit data’s value internally
within the enterprise, or externally.
81. Defining a Big Data Strategy
3. Selecting Big Data Technologies
• Having the appropriate infrastructure in place to support the data you need is
essential.
• Be sure to consider the four layers of data: collecting, storing,
processing/analysing, and communicating insights from the data.
– Internet Technologies
– Machine learning
– Commodity Hardware
– Distributed processing
– Leverage a cloud-based approach to reduce time to market, reduce risk,
and gain better SLAs out of the box
82. Defining a Big Data Strategy
4. Understanding Big Data Science
– Data Science is the ongoing process of discovering information from data. It is a
process that never stops, and often one question leads to another new question.
It focuses on real-world problems and tries to explain them.
– Machine Learning
• Supervised
• Unsupervised, hybrid
– Common Algorithms
• Classification
• Clustering
• Associations & Correlations
• Text Mining
• Linear Regression (sketched below)
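As one concrete instance of the algorithms listed above, here is a minimal simple-linear-regression sketch (ordinary least squares) in Java. The observations are synthetic, purely for illustration.

```java
public class LinearRegressionSketch {
    public static void main(String[] args) {
        // Hypothetical observations, e.g., advertising spend vs. revenue.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1};

        double xMean = mean(x), yMean = mean(y);
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xMean) * (y[i] - yMean);
            den += (x[i] - xMean) * (x[i] - xMean);
        }
        double slope = num / den;             // least-squares slope
        double intercept = yMean - slope * xMean;
        System.out.printf("y = %.3f * x + %.3f%n", slope, intercept);
    }

    private static double mean(double[] v) {
        double s = 0;
        for (double d : v) s += d;
        return s / v.length;
    }
}
```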
83. Defining a Big Data Strategy
5. Developing Big Data Analytics
• Big Data applications vary by industry. Businesses are trying to find
value in monetizing data, or using it to improve efficiency and customer
experience.
• It is important to consider the different types of big data analytics:
– Descriptive
– Diagnostic
– Predictive
– Prescriptive
84. Defining a Big Data Strategy
6. Institutionalize Big Data
• Each enterprise will tailor Big Data to meet the objectives of its
particular vision.
– Discovery (Opportunity, Requirements, Best Fit)
– Proof of Concept (to Evaluate Business Value)
– Provision Infra (here Big Data Elasticity comes into play)
– Ingest (Source the data)
– Process (Transform, Analyze, Data Science)
– Publish (Share the learnings)
• Governance: assess the current state of data quality, security, access,
ownership, ethics, and data privacy within the organization.
• Considering skills and capacity is also required.
85. Defining a Big Data Strategy
Generally, when defining a big data strategy, it is important to:
–Establish your Big Data needs
–Meet business goals with timely data
–Evaluate commercial Big Data tools
–Manage organizational expectations
86. Enabling analytical innovation in a Big Data
• Data can drive innovation in two ways.
– Data can motivate ideation, development, execution and
evaluation of new innovations.
– And it can underpin, or be a central component of new products,
services, operations or business models.
• Recent advances in machine learning build on the vast amount of digitized
data.
– For the first time, a machine powered by analytics was able to win against
the best human player in the world in the game “Go.”
• Self-driving cars rely on the large number of digitized images that have
improved vision recognition systems dramatically.
87. Enabling analytical innovation in a Big Data..
How does big data fuel innovation?
• “Analytics is really great at finding linkages or hidden patterns we
may not easily observe by mining through a ton of data.”
• “Analytics can really drive the creation of ‘recombinations’, or
combining a diverse set of existing technologies in a new way.”
• “We can use lessons learned from past generations of IT and
analytics technologies to inform us about what the future could look
like.”
– Focusing on business importance
– Framing the problem
– Selecting the correct tools
– Achieving timely results
89. Contents
• Introduction
• Selecting suitable vendors and hosting options
• Balancing costs against business value
• Keeping ahead of the curve
90. Objectives
• By the end of this chapter, you will be able to:
–Understand the concepts and criteria for selecting suitable
vendors and hosting options
– Explain balancing costs against business value
91. Introduction
• To be sure, big data solutions are in great demand.
• Today, enterprise leaders know that their big data is one of
their most valuable resources and one they can’t afford to
ignore.
• As a result, they are looking for hardware and software that
can
–help them store, manage and analyze their big data.
• Experts suggest that a good way to start the process of
selecting a big data solution is
–to determine exactly what kind of solution you need.
99. Reading
• What are the criteria for selecting suitable vendors
and hosting options?
• Hardware
• Software
• Professional services
• How to balance costs against the business
value generated from big data
• The knowledge and skills required to keep ahead of
the curve
100. Common Types of Big Data Solution
• Enterprise vendors offer a wide array of different types of big
data solutions.
• The kind of big data application that is right for an organization
will depend on its goals.
• The best approach is to define the goals clearly at the outset
and then go looking for products that will help to reach those
goals.
102. Big Data Solution Hosting Options
• On-Premises vs. Cloud-Based Big Data Applications
–Will the organization host big data software in its own data center or use a
cloud-based solution?
• Proprietary vs. Open Source Big Data Applications
–Does the organization have skilled professionals to get open source solutions
up and running and configured for its needs?
–Will it need to purchase support or consulting services? (Consider those
expenses when figuring out total cost of ownership.)
• Batch vs. Streaming Big Data Applications
–Does the organization want to analyze data in real time or in batches?
–For both real-time and batch data processing, consider a Lambda architecture.
103. Selection Criteria or Success Factors
• Integration with Legacy Technology
• Performance
• Scalability
• Usability
• Visualization
• Flexibility
• Security
• Support
–Even experienced IT professionals sometimes find it
difficult to deploy, maintain and use complex big data
applications.
104. • Ecosystem: a big data platform that integrates with many
other popular tools, and a vendor with strong partnerships with
other providers
• Self-Service Capabilities
• Total Cost of Ownership
• Estimated time to value
• Artificial Intelligence and Machine Learning
– How innovative are the various big data solution vendors?
–AI and machine learning research is advancing at an incredible rate
and becoming a mainstream part of big data analytics solutions.
Selection Criteria or Success Factors…