The document provides an overview of big data analytics using Hadoop. It discusses how Hadoop allows for distributed processing of large datasets across computer clusters. The key Hadoop components discussed are HDFS for storage and MapReduce for parallel processing. HDFS provides a distributed, fault-tolerant file system in which data is replicated across multiple nodes. MapReduce lets users write jobs that process large amounts of data in parallel on a Hadoop cluster. Examples of how companies use Hadoop for applications such as customer analytics and log file analysis are also provided.
Big Data Analytics - Hadoop
1. Big Data Analytics - Hadoop
Vishwajeet Jadeja
MSc Statistics
Department of Statistics
The Maharaja Sayajirao University of Baroda
2. Introduction
• Big data burst upon the scene in the first decade of the 21st century.
• Big data analytics is the process of examining large data sets
containing a variety of data types -- i.e., “big data” -- to uncover
hidden patterns, unknown correlations, market trends, customer
preferences and other useful business information.
• The analytical findings can lead to more effective marketing, new
revenue opportunities, better customer service, improved operational
efficiency, competitive advantages over competing organizations and
other business benefits.
• “Data analytics” is used to describe statistical and mathematical data
analysis that clusters, segments, scores and predicts what scenarios are
most likely to happen.
3. Introduction (contd.)
• Arguably, firms like Google, eBay, LinkedIn, and Facebook were built
around big data from the beginning.
• Analytics on big data have to coexist with analytics on other types of data.
• Hadoop clusters have to do their work alongside IBM mainframes.
• Data scientists must somehow get along and work jointly with mere
quantitative analysts.
• Firms that have long handled massive volumes of data are beginning to
enthuse about the ability to handle a new type of data—voice or text or log
files or images or video.
5. Examples
• A retail bank is getting a handle on its multi-channel customer interactions
for the first time by analyzing log files.
• A hotel firm is analyzing customer lines with video analytics.
• A health insurer is able to better predict customer dissatisfaction by
analyzing speech-to-text data from call center recordings.
In short, these companies can have a much more complete picture of their
customers and operations by combining unstructured and structured data.
6. Objectives for Big Data
• Cost Reduction from Big Data Technologies
• Time Reduction from Big Data
• Developing New Big Data-Based Offerings
• Supporting Internal Business Decisions
7. Cost Reduction from Big Data Technologies
• Some organizations pursuing big data believe strongly that MIPS and
terabyte storage for structured data are now most cheaply delivered through
big data technologies like Hadoop clusters.
• Organizations that were focused on cost reduction made the decision to
adopt big data tools primarily within the IT organization on largely
technical and economic criteria.
8. Time Reduction from Big Data
• The second common objective of big data technologies and solutions is time
reduction.
• A key objective involving time reduction is the ability to interact with the
customer in real time, using analytics and data derived from the customer
experience.
• If the customer has “left the building,” targeted offers and services are likely
to be much less effective.
• This means rapid data capture, aggregation, processing, and analytics.
9. Developing New Big Data-Based Offerings
• One of the most ambitious things an organization can do with big data is to
employ it in developing new product and service offerings based on data.
• Many of the companies that employ this approach are online firms, which
have an obvious need to employ data-based products and services.
• Examples: LinkedIn, Google, etc.
10. Supporting Internal Business Decisions
• The primary purpose behind traditional, “small data” analytics was to
support internal business decisions. What offers should be presented to a
customer? Which customers are most likely to stop being customers soon?
How much inventory should be held in the warehouse? How should we
price our products?
• These types of decisions employ big data when there are new, less
structured data sources that can be applied to the decision.
• Business decisions with big data can also involve other traditional areas for
analytics such as supply chains, risk management, or pricing. The factor that
makes these big data problems, rather than small, is the use of external data
to improve the analysis.
11. Hadoop: Introduction
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.
• It is developed under the Apache Software Foundation; Hadoop 1.0 was
released in 2011.
• Written in Java.
12. History
• Inventor: Doug Cutting, creator of Apache Lucene.
• The Origin of the Name “Hadoop”:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless, and not used elsewhere: those are my
naming criteria.” ---Doug Cutting.
• Google’s GFS paper, published in 2003, solved the storage problem.
• In February 2006, the code moved out of Nutch to form an independent
subproject of Lucene called Hadoop.
13. History (contd.)
• At around the same time, Doug Cutting joined Yahoo!.
• In January 2008, Hadoop was made its own top-level project at Apache,
confirming its success and its diverse, active community.
• In February 2008, Yahoo! announced that its production search index was
being generated by a 10,000-core Hadoop cluster.
14. What we’ve got: Hadoop!
• Fault-tolerant file system.
• Hadoop Distributed File System (HDFS)
• Modeled on Google File system
• Takes computation to data
• Data Locality
• Scalability:
• Program remains same for 10, 100, 1000,… nodes
• Corresponding performance improvement
• Parallel computation using MapReduce
• Other components – Pig, HBase, Hive, ZooKeeper
15. HDFS: Hadoop Distributed File System
• Filesystems that manage the storage across a network of machines are
called distributed filesystems.
• Hadoop comes with a distributed filesystem called HDFS, which stands for
Hadoop Distributed Filesystem.
• HDFS is designed to hold very large amounts of data (terabytes or even
petabytes) and to provide high-throughput access to this information.
• It ties many small, reasonably priced machines together into a single
cost-effective computer cluster.
16. HDFS: Hadoop Distributed File System
(contd.)
• Data and application processing are protected against hardware failure.
• If a node goes down, jobs are automatically redirected to other nodes to
make sure the distributed computing does not fail.
• It automatically stores multiple copies of all data.
• It provides a simplified programming model that lets users read from and
write to the distributed filesystem quickly (see the sketch below).
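As an illustration of that programming model, here is a minimal sketch of writing and reading a file through Hadoop's org.apache.hadoop.fs.FileSystem API; the cluster address and file path are placeholders:
```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical cluster address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write: HDFS splits the file into blocks and replicates them
    // across DataNodes automatically.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back; the client fetches blocks from whichever
    // DataNodes hold replicas.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```
Note that the client never deals with blocks or replicas directly; it sees one logical file, which is exactly the simplification the slide describes.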
17. (Diagram slide: HDFS architecture)
18. The master node: NameNode
Functions:
• Manages the file system: mapping files to blocks and blocks to DataNodes
• Maintains the status of DataNodes
• Heartbeat
• Each DataNode sends a heartbeat at regular intervals
• If no heartbeat is received, the DataNode is declared dead (see the sketch
after this list)
• Blockreport
• DataNode sends list of blocks on it
• Used to check health of HDFS
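As a rough conceptual sketch (not Hadoop's actual NameNode code), liveness checking amounts to remembering each DataNode's last heartbeat time and declaring the node dead once that timestamp is older than a timeout; the class name and the 10-minute timeout below are illustrative assumptions:
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {
  private static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed 10-minute cutoff
  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

  // Called whenever a DataNode reports in.
  void onHeartbeat(String dataNodeId) {
    lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
  }

  // A node is considered dead if its last heartbeat is too old;
  // the blocks it held must then be re-replicated elsewhere.
  boolean isDead(String dataNodeId) {
    Long last = lastHeartbeat.get(dataNodeId);
    return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
  }
}
```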
19. NameNode Functions
• Replication
• On DataNode failure
• On disk failure
• On block corruption
• Data integrity
• Checksum for each block (see the example after this list)
• Stored in a hidden file
• Rebalancing: balancer tool
• Addition of new nodes
• Decommissioning
• Deletion of some files
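For illustration, a client can set a file's replication factor and retrieve its checksum through the public FileSystem API; a minimal sketch, with a hypothetical file path (replication is otherwise managed automatically by the NameNode):
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/data.csv"); // hypothetical file

    // Ask the NameNode to keep 3 replicas of this file's blocks.
    fs.setReplication(file, (short) 3);

    // Fetch the file's checksum; HDFS computes checksums per block
    // and verifies them on every read to detect corruption.
    FileChecksum checksum = fs.getFileChecksum(file);
    System.out.println(checksum);
  }
}
```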
20. HDFS Robustness
• Safemode
• At startup: no replication is possible
• The NameNode receives Heartbeats and Blockreports from DataNodes
• Only a percentage of blocks are checked for the defined replication factor
• If all is well => exit Safemode
• Replicate blocks wherever necessary (see the sketch below)
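For reference, Safemode can also be queried or exited programmatically, equivalent to the hdfs dfsadmin -safemode commands; a minimal sketch assuming a Hadoop 2.x client, where setSafeMode is exposed on DistributedFileSystem:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafemodeCheck {
  public static void main(String[] args) throws Exception {
    // The cast assumes fs.defaultFS points at an hdfs:// cluster.
    FileSystem fs = FileSystem.get(new Configuration());
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // SAFEMODE_GET only reports the current state.
    boolean inSafemode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
    System.out.println("In safemode: " + inSafemode);

    // An administrator can force an exit once the cluster is healthy.
    if (inSafemode) {
      dfs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE);
    }
  }
}
```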
21. MapReduce
• It is a powerful paradigm for parallel computation.
• Hadoop uses MapReduce to execute jobs on files in HDFS.
• Hadoop will intelligently distribute computation over cluster.
• It is a programming model and an associated implementation for processing
and generating large data sets.
• A MAP function processes an input key/value pair to generate a set of
intermediate key/value pairs.
• A REDUCE function merges all intermediate values associated with the same
intermediate key (see the WordCount sketch below).
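The canonical illustration of these two functions is WordCount, used in Hadoop's own MapReduce tutorial; a minimal version against the org.apache.hadoop.mapreduce API (Hadoop 2.x style):
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // MAP: for each input line, emit a (word, 1) intermediate pair per word.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // REDUCE: sum all intermediate counts for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
Hadoop moves the map tasks to the nodes holding the input blocks (data locality), shuffles the intermediate pairs by key, and runs the reducers in parallel across the cluster.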