COLLEGE OF COMPUTING AND INFORMATICS
MSc. in Information Technology
Processing Big Data
Nov, 2021
Assosa, Ethiopia
Contents
• Introduction
• Integrating disparate data stores
• Employing Hadoop MapReduce
• Building blocks of Hadoop MapReduce
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts of big data processing
–Explain the building blocks of Hadoop MapReduce
–Understand the concepts of YARN
–Compare Big Data processing with traditional data processing
Introduction
• Data processing is the manipulation of data by a computer.
– It includes the conversion of raw data to machine-readable form, the flow of
data through the CPU and memory to output devices, and the formatting or
transformation of output.
– Any use of computers to perform defined operations on data can be
included under data processing.
• Big data processing is a set of techniques or programming models
to access large-scale data to extract useful information for
supporting and providing decisions.
• This is because Big Data helps companies to generate valuable insights.
– Companies use Big Data to refine their marketing campaigns and techniques.
– Companies use it in machine learning projects to train models, build predictive
models, and power other advanced analytics applications.
Algorithms used to analyse big data include:
• Association
• Classification
• Integration
Integrating disparate data stores
Types of Data Integration Tools include
Types of Data Integration Tools include..
Types of Data Integration Tools include…
Examples of Data Integration
Common Data Integration Approaches
Common Data Integration Approaches..
Hadoop MapReduce
• Big data processing cannot be handled by a single machine. MapReduce is
therefore used to process big data in a parallel and
distributed manner.
• MapReduce is a framework with which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware, in a
reliable manner.
• It is a processing technique and a programming model for distributed computing
based on Java.
Hadoop MapReduce..
Hadoop MapReduce…
Hadoop MapReduce….
• The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
– Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
– The Reduce task takes the output from a map as its input and combines those
data tuples into a smaller set of tuples.
• As the name MapReduce implies, the reduce
task is always performed after the map job.
• This model makes it easy to scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called
mappers and reducers.
MapReduce-Advantages
Parallel Processing
MapReduce-Advantages..
Data Locality - Moving Processing to Storage
MapReduce - Traditional vs MapReduce Way
Election Vote Counting
MapReduce - Traditional vs MapReduce Way
Election Vote Counting - Traditional Way
MapReduce - Traditional vs MapReduce Way
Election Vote Counting - MapReduce Way
• A MapReduce program executes in three stages, namely the map
stage, the shuffle stage, and the reduce stage.
–Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of a file or directory
and is stored in the Hadoop Distributed File System (HDFS).
–The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of
data.
–Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.
MapReduce-Execution Stages
MapReduce Process
Anatomy of MapReduce Program
MapReduce: Map-Shuffle-Reduce
MapReduce-Example word count Process
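• A minimal sketch of the word count mapper and reducer using the standard Hadoop MapReduce Java API; the class names and the whitespace tokenization are illustrative choices:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each input line is split into words; every word is
// emitted as an intermediate (word, 1) key-value pair.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // one intermediate (key, value) pair
      }
    }
  }
}

// Reduce stage: after the shuffle groups pairs by key, the reducer
// sums the counts for each word and writes the final (word, total).
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```

• Each context.write in the mapper emits one intermediate pair; the shuffle stage then groups all pairs for the same word before the reducer runs.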
• How Hadoop runs MapReduce Jobs?
Introduction to MapReduce…..
• Input reader
– The input reader reads the incoming data and splits it into data
blocks of the appropriate size (64 MB to 128 MB). Each data block is
associated with a Map function.
– Once the input reader reads the data, it generates the corresponding key-value pairs.
– The input files reside in HDFS. The input data can be in any form.
• Map function
– The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs. The map input and output types may
be different from each other.
Data Flow in MapReduce (Phases)
• Partition function
– The partition function assigns the output of each Map function to the
appropriate reducer. It is given the key and value, and it
returns the index of the reducer.
• Shuffling and Sorting
– The data is shuffled between and within nodes so that it moves out of
the map phase and gets ready for processing by the reduce function. Sometimes, the
shuffling of data can take much computation time.
– The sorting operation is performed on the input data for the Reduce function.
Here, the data is compared using a comparison function and arranged in
sorted form.
Data Flow in MapReduce..
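• To make the partition function concrete, a minimal sketch of a custom Hadoop Partitioner; the hash-based logic mirrors Hadoop's default HashPartitioner, and the class name is illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each (key, value) pair produced by the map function to a
// reducer by returning an index in [0, numPartitions).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the index is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```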
• Reduce function
– The Reduce function is assigned to each unique key. These keys are
already arranged in sorted order. The Reduce function iterates over the values
associated with each key and generates the corresponding output.
• Output writer
– Once the data has flowed through all of the above phases, the output writer executes.
The role of the output writer is to write the Reduce output to stable
storage.
Data Flow in MapReduce…
MapReduce - Characteristics
• MapReduce Mapper Class
– In MapReduce, the role of the Mapper class is to map the input key-value
pairs to a set of intermediate key-value pairs.
– It transforms the input records into intermediate records.
– These intermediate records are associated with a given output key and passed to
the Reducer for the final output.
• MapReduce Reducer Class
– Its role is to reduce the set of intermediate values.
– Its implementations can access the Configuration for the job via the
JobContext.getConfiguration() method.
• MapReduce Job Class
– The Job class is used to configure the job and submit it. It also controls the
execution and queries the job state. Once the job is submitted, the set methods
throw IllegalStateException.
MapReduce API
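• Tying these classes together, a sketch of a driver that uses the Job class to configure and submit the word count job; the input and output paths come from the command line, and the mapper and reducer classes are the ones sketched earlier:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);      // map stage
    job.setReducerClass(WordCountReducer.class);    // reduce stage
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output

    // Submit the job and wait for it to finish; set* methods called
    // after this point would throw IllegalStateException, as noted above.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```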
• YARN (Yet Another Resource Negotiator) takes programming to the next level beyond Java,
and makes the cluster interactive, letting other applications such as HBase and Spark work on
it.
• Different YARN applications can co-exist on the same cluster, so MapReduce,
HBase, and Spark can all run at the same time, bringing great benefits for
manageability and cluster utilization.
• The JobTracker and TaskTracker were used in previous versions of Hadoop,
where they were responsible for handling resources and checking progress.
• However, Hadoop 2.0 has the ResourceManager and NodeManager to
overcome the shortfalls of the JobTracker and TaskTracker.
Overview of YARN-Component of Hadoop 2.0
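• A minimal sketch using the YARN client API to list the NodeManagers and the applications the ResourceManager is tracking; it assumes a yarn-site.xml on the classpath, and the class name is illustrative:

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterInfo {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml
    yarnClient.start();

    // NodeManagers currently registered with the ResourceManager.
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    System.out.println("Running nodes: " + nodes.size());

    // Applications (MapReduce, Spark, HBase, ...) sharing the cluster.
    for (ApplicationReport app : yarnClient.getApplications()) {
      System.out.println(app.getApplicationId() + " " + app.getApplicationType());
    }
    yarnClient.stop();
  }
}
```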
Limitations of Hadoop 1.0 (MR 1)
Needs of YARN
YARN as a Solution
• Client: For submitting MapReduce jobs.
• Resource Manager: To manage the use of resources across the
cluster.
• Node Manager: For launching and monitoring the compute
containers on machines in the cluster.
• MapReduce Application Master: Coordinates and checks the tasks running the
MapReduce job.
– The application master and the MapReduce tasks run in containers that
are scheduled by the resource manager and managed by the node
managers.
Components of YARN
YARN Application Workflow in MapReduce
• There are mainly 3 types of Schedulers in Hadoop:
–FIFO (First In First Out) Scheduler
–Capacity Scheduler
–Fair Scheduler
Types of Scheduling
• Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes
and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000
tasks.
• Utilization: The Node Manager manages a pool of resources, rather
than a fixed number of designated slots, thus increasing
utilization.
• Multitenancy: Different versions of MapReduce can run on YARN,
which makes the process of upgrading MapReduce more manageable.
Benefits of YARN
Tools & Techniques to Analyze
Big Data
Dec, 2021
Assosa, Ethiopia
Contents
• Introduction
• Abstracting Hadoop MapReduce jobs with Pig
• Performing ad hoc Big Data querying with Hive
• Creating business value from extracted data
Objectives
• At the end of this chapter, you will be able to:
–Identify different tools and techniques for big data
–Understand the concepts of Pig
–Understand the concepts of Hive
Introduction
• Abstracting Hadoop MapReduce jobs with Pig
–Communicating with Hadoop in Pig Latin
–Executing commands using the Grunt Shell
–Streamlining high-level processing
• Performing ad hoc Big Data querying with Hive
–Persisting data in the Hive Metastore
–Performing queries with HiveQL
–Investigating Hive file formats
• Creating business value from extracted data
–Mining data with Mahout
–Visualizing processed results with reporting tools, BI
–Querying in real time with Impala
Big Data Hadoop Projects
Abstracting Hadoop MapReduce jobs with Pig
• Pig was initially developed by Yahoo to ease programming on Hadoop.
– Apache Pig has the capability to process extensive datasets as it works on
top of Hadoop. It is used for analyzing massive datasets by
representing them as data flows.
– Apache Pig also raises the level of abstraction for processing enormous
datasets.
– Pig Latin is the scripting language that the developer uses for working on
the Pig framework; it runs on the Pig runtime.
• Features of Pig:
– Easy to program
– Rich set of operators
– Ability to handle various kinds of data
– Extensibility
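• A minimal sketch of embedding Pig Latin in Java through the PigServer API; the input file, schema, and output path are hypothetical:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // LOCAL runs against the local filesystem; MAPREDUCE targets a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Each registerQuery call adds one Pig Latin statement to the plan.
    pig.registerQuery("lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // Triggers execution: Pig compiles the script into MapReduce jobs.
    pig.store("counts", "wordcount_out");
    pig.shutdown();
  }
}
```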
Performing ad hoc Big Data querying with Hive
Performing ad hoc Big Data querying with Hive..
Why Hive?
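• A minimal sketch of ad hoc HiveQL querying through the Hive JDBC driver (HiveServer2); the connection URL, credentials, and the sales table are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAdHocQuery {
  public static void main(String[] args) throws Exception {
    // Older driver versions need explicit registration.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC endpoint; host and port are deployment-specific.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // Table and column names are hypothetical.
      ResultSet rs = stmt.executeQuery(
          "SELECT product, SUM(amount) AS total " +
          "FROM sales GROUP BY product ORDER BY total DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("product") + "\t" + rs.getLong("total"));
      }
    }
  }
}
```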
Tools for big data
The tools fall into the following categories:
• The first row is NoSQL storage
• The second row is common big data storage and management
Tools for big data..
Tools for big data…
Tools for big data….
Tools for big data-Advantages
Developing a Big Data Strategy
Dec, 2021
Assosa, Ethiopia
Contents
• Introduction
• Overview of Big Data Strategy
• Defining Big Data Strategy
• Enabling analytical innovation
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts of a big data strategy
–Explain a Big Data strategy and its considerations
Introduction
• Strategy is a plan of action or policy designed to achieve an overall aim.
• Big Data is Worthless Without a Big Data Strategy.
• However, it cannot be seen as something separate from the
organizational strategy, and should be firmly embedded in it.
• When we say a Big Data strategy, this effectively means a business
strategy that includes Big Data.
• It defines and lays out a comprehensive vision across the enterprise and
sets a foundation for the organization to employ data-related or data-
dependent capabilities.
• A well-defined and comprehensive Big Data strategy makes the benefits
of Big Data actionable for the organization.
• It sets out the steps that an organization should execute in order to
become a “Data Driven Enterprise”.
Introduction..
• The Big Data strategy incorporates guiding principles to
accomplish the data-driven vision; it
– directs the organization to select specific business goals and is the starting
point for data-driven planning across the enterprise.
• Big data holds many promises, such as
– gaining valuable customer insights
– predicting the future
– generating new revenue streams, etc.
• An effective big data strategy is therefore essential.
Big Data Considerations
• You can’t process the amount of data that you want to because of the
limitations of your current platform.
• You can’t include new/contemporary data sources (for example, social media,
RFID, sensor, Web, GPS, and textual data) because they do not comply with
your data storage schema.
• You need to (or want to) integrate data as quickly as possible to be
current in your analysis.
• You want to work with a schema-on-demand data storage paradigm
because of the variety of data types involved.
• The data is arriving so fast at your organization’s doorstep that your
traditional analytics platform cannot handle it.
Critical Success Factors for Big Data Analytics
• A clear business need (alignment with the vision and the strategy).
• Strong, committed sponsorship (executive champion)
• Alignment between the business and IT strategy.
• A fact-based decision-making culture.
• A strong data infrastructure.
• The right analytics tools.
• Right people with right skills
Business Problems Addressed by Big Data
Analytics
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling/up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities
What are Big Data Objectives?
• The technologies and concepts behind big data allow
organizations to achieve a variety of objectives.
• Like many new information technologies,
–big data can bring about
• dramatic cost reductions,
• substantial improvements in the time required to perform a
computing task,
• or new product and service offerings
Defining a Big Data Strategy
• A good Big Data Strategy will explore following subject
domain, and align it to their organizational objectives:
1. Identify an Opportunity & EconomicValue of Data
2. Defining Big Data Architecture
3. Selecting Big DataTechnologies
4. Understanding Big Data Science
5. Developing Big Data Analytics
6. Institutionalize Big Data
Defining a Big Data Strategy
1. Identify an Opportunity & Economic Value of Data
–Catalog existing data sources available inside the organization, tapped
or untapped.
–Invent new ways of capturing data; integrate your data sources with
external communities. Develop semantics and metadata for
association, clustering, classification and trending.
–Identify and create opportunities to integrate and fuse data with
partners' datasets in industries like the Telecom, Travel, Financial, Healthcare,
and Entertainment industries, etc.
–Conceptualize the data insights and possible data sciences to extract
valuable data, e.g. associations, simulation, regression, correlation,
segmentation, trending, and predictive modeling, etc.
Defining a Big Data Strategy
Identify an Opportunity & Economic Value of Data…
–Identify the scope of data access, i.e. who can explore data and who gets
access to data insights.
–Identify possibilities for monetizing data to generate revenue from the data
insights gained, like generating leads, campaigns, upsell/cross-sell
opportunities, data streaming, data APIs, improving staff productivity &
customer service, etc.
–Identify the ethical and legal code associated with data under exploration
with respect to industry standards, organizational culture, data
policies, data privacy, and regulatory and legal requirements.
–Data requirements, like: What type of data do you need? Is it diverse
enough? How will you source it and store it?
Defining a Big Data Strategy
2. Defining Big Data Architecture
– Defining Business Problems & Classification of associated data, such as
Market Sentiment Analysis, Churn Prediction, or Fraud Detection.
– Defining a Data Acquisition Strategy.
– Selecting a Hadoop Framework.
– Big Data Life Cycle Management Framework.
– Choosing Big Data stores: traditional or NoSQL, and polyglot
persistence.
– Defining Big Data Infrastructure & Platform Taxonomy.
– Identifying Big Data Analytics Frameworks and associated Machine
Learning Sciences.
– Developing a Data Monetization Strategy to exploit the data's value internally
within the enterprise, or externally.
Defining a Big Data Strategy
3. Selecting Big Data Technologies
• Having the appropriate infrastructure in place to support the data you need is
essential.
• Be sure to consider the four layers of data: collecting, storing,
processing/analysing, and communicating insights from the data.
– Internet Technologies
– Machine learning
– Commodity Hardware
– Distributed processing
– Leverage a cloud-based approach to reduce time to market, reduce risk,
and gain better SLAs out of the box.
Defining a Big Data Strategy
4. Understanding Big Data Science
– Data Science is the ongoing process of discovering information from data. It is a
process that never stops, and often one question leads to another new question.
It focuses on real-world problems and tries to explain them.
– Machine Learning
• Supervised
• Unsupervised, hybrid
– Common Algorithms
• Classification
• Clustering
• Associations & Correlations
• Text Mining
• Linear Regression
Defining a Big Data Strategy
5. Developing Big Data Analytics
• Big Data applications vary by industry. Businesses are trying to find
value in monetizing data, or in using it to improve efficiency and customer
experience.
• Considering the different types of big data analytics is important
– Descriptive
– Diagnostic
– Predictive
– Prescriptive
Defining a Big Data Strategy
6. Institutionalize Big Data
• Each enterprise will tailor Big Data to meet the objectives of its
particular vision.
– Discovery (Opportunity, Requirements, Best Fit)
– Proof of Concept (to Evaluate Business Value)
– Provision Infrastructure (here Big Data elasticity comes into play)
– Ingest (Source the data)
– Process (Transform, Analyze, Data Science)
– Publish (Share the learnings)
• Governance: assess the current state of data quality, security, access,
ownership, ethics and data privacy within the organization.
• Considering skills and capacity is also required.
Defining a Big Data Strategy
Generally, when defining a big data strategy, the following are very important:
–Establishing your Big Data needs
–Meeting business goals with timely data
–Evaluating commercial Big Data tools
–Managing organizational expectations
Enabling analytical innovation in Big Data
• Data can drive innovation in two ways.
– Data can motivate ideation, development, execution and
evaluation of new innovations.
– And it can underpin, or be a central component of new products,
services, operations or business models.
• Recent advances in machine learning build on the vast amount of digitized
data.
– For the first time, a machine powered by analytics was able to win against
the best human player in the world in the game “Go.”
• Self-driving cars rely on the large number of digitized images that
have improved vision recognition systems dramatically.
Enabling analytical innovation in Big Data..
How does big data fuel innovation?
• “Analytics is really great at finding linkages or hidden patterns we
may not easily observe by mining through a ton of data.”
• “Analytics can really drive the creation of ‘recombination’, or
combining a diverse set of existing technologies in a new way.”
• “We can use lessons learned from past generations of IT and
analytics technologies to inform us about what the future could look
like.”
– Focusing on business importance
– Framing the problem
– Selecting the correct tools
– Achieving timely results
Implementing a Big Data
Solution
Jan, 2022
Assosa, Ethiopia
Contents
• Introduction
• Selecting suitable vendors and hosting options
• Balancing costs against business value
• Keeping ahead of the curve
Objectives
• At the end of this chapter, you will be able to:
–Understand the concepts of and criteria for selecting suitable
vendors and hosting options
– Explain balancing costs against business value
Introduction
• To be sure, big data solutions are in great demand.
• Today, enterprise leaders know that their big data is one of
their most valuable resources and one they can’t afford to
ignore.
• As a result, they are looking for hardware and software that
can
–help them store, manage and analyze their big data.
• Experts suggest that a good way to start the process of
selecting a big data solution is
–to determine exactly what kind of solution you need.
Big Data Market Share
Big Data Market Share..
Big Data Software Market Shares
Top Big Data Software Provider Companies
• SAP
• Splunk
• Oracle
• IBM
• Microsoft
Big Data Professional Service Market Share
Top Big Data Professional Service Provider
Companies
• IBM
• Accenture
• Palantir
• Teradata
Big Data centered Industry Landscape
(Figure: Big Data at the center, linked to IoT, Cloud, Mobile, and Bio)
Reading
• What are the criteria for selecting suitable vendors
and hosting options?
• Hardware
• Software
• Professional service
• How should costs be balanced against the business
value generated from big data?
• What knowledge and skills are required to keep ahead of
the curve?
Common Types of Big Data Solution
• Enterprise vendors offer a wide array of different types of big
data solutions.
• The kind of big data application that is right for an organization
will depend on its goals.
• The best approach is to define the goals clearly at the outset
and then go looking for products that will help to reach those
goals.
Big Data Solution Hosting Options
• On-Premise vs Cloud-Based Big Data Applications
–Do you want to host big data software in the organization's data center or use a
cloud-based solution?
• Proprietary vs Open Source Big Data Applications
–Does the organization have skilled professionals to get open source solutions up
and running and configured for its needs?
–Does it need to purchase support or consulting services? (Consider those
expenses when figuring out total cost of ownership.)
• Batch vs Streaming Big Data Applications
– Does the organization want to analyze data in real time or in batches?
–The Lambda architecture supports both real-time and batch data processing.
Selection Criteria or Success Factors
• Integration with Legacy Technology
• Performance
• Scalability
• Usability
• Visualization
• Flexibility
• Security
• Support
–Even experienced IT professionals sometimes find it
difficult to deploy, maintain and use complex big data
applications.
• Ecosystem: look for a big data platform that integrates with a lot of
other popular tools, and a vendor with strong partnerships with
other providers.
• Self-Service Capabilities
• Total Cost of Ownership
• Estimated time to value
• Artificial Intelligence and Machine Learning
– How innovative are the various big data solution vendors?
–AI and machine learning research are advancing at an incredible rate
and becoming a mainstream part of big data analytics solutions.
Selection Criteria or Success Factors..
Source: https://www.datamation.com/big-data/how-to-select-a-big-data-application/