HADOOP INTRODUCTION
BY SUNITHA MUTCHINTHALA
Agenda
• What is Data & Big Data
• Sources of big data
• Facts and Figures of Big Data
• Processing Big Data - Hadoop
• Hadoop Introduction: Hadoop’s history and advantages
• Hadoop Architecture in detail
• Hadoop in industry
Data: Data is a collection of raw facts and figures. It is
called raw because it has not yet been processed.
Examples of Data:
• Student data on admission forms - a bundle of admission
forms contains each student's name, father’s name, address,
course, photograph, etc.
• Census data of citizens - during a census, data is collected
on all citizens, such as the number of persons living in a
home, literacy, number of children, caste, religion, etc.
Information: Processed data is called information. When
raw facts and figures are processed and arranged in a
proper order, they become information. Information carries
meaning and is useful in decision-making.
Examples of information:
• Student address labels - Stored student data can be
used to print address labels. These labels are used to
send notices and other information to students at their
home addresses.
• Census report, total population - Census data is used to
produce reports on the total population of a country,
the literacy rate, and the population of males, females,
children, aged persons, and persons in different
categories such as caste, religion, age group, etc.
Data to Information:
Ex: The data collected in a survey report is: ‘HYD20M’
If we process the above data, we can decode it into
information about a person as follows:
HYD is the city name ‘Hyderabad’,
20 is the age,
M represents ‘MALE’.
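The decoding step above can be sketched in a few lines of Python. This is only an illustration: the fixed layout (three letters, digits, one letter) and the lookup tables are assumptions, and only the codes HYD and M come from the slide.

```python
import re

# Hypothetical lookup tables; only "HYD" and "M" appear on the slide.
CITY_CODES = {"HYD": "Hyderabad"}
GENDER_CODES = {"M": "Male", "F": "Female"}

def decode_survey_code(code: str) -> dict:
    """Turn a raw code such as 'HYD20M' into labelled information."""
    # Assumed layout: 3-letter city code, age in digits, 1-letter gender code.
    match = re.fullmatch(r"([A-Z]{3})(\d{1,3})([A-Z])", code)
    if match is None:
        raise ValueError(f"unrecognised survey code: {code!r}")
    city, age, gender = match.groups()
    return {
        "city": CITY_CODES.get(city, city),
        "age": int(age),
        "gender": GENDER_CODES.get(gender, gender),
    }

print(decode_survey_code("HYD20M"))
# {'city': 'Hyderabad', 'age': 20, 'gender': 'Male'}
```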
Data to Decisions
International System of Units (SI)
Kilobyte   KB  10^3 bytes
Megabyte   MB  10^6 bytes
Gigabyte   GB  10^9 bytes
Terabyte   TB  10^12 bytes
Petabyte   PB  10^15 bytes
Exabyte    EB  10^18 bytes
Zettabyte  ZB  10^21 bytes
Yottabyte  YB  10^24 bytes
Units of data: When dealing with big data, sizes are expressed in units such
as megabytes, gigabytes and terabytes. The SI units above are the system used
to represent data sizes.
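As a small illustration of these units, the sketch below converts a raw byte count into the largest SI unit that fits. The function name `format_bytes` is made up for this example; each step between units is a factor of 10^3.

```python
# SI units of data: each step up is a factor of 10^3.
SI_UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def format_bytes(num_bytes: float) -> str:
    """Express a byte count in the largest SI unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in SI_UNITS:
        if value < 1000 or unit == SI_UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(format_bytes(1_500_000))  # 1.5 MB
print(format_bytes(10**15))     # 1 PB
```

For example, 2.5 quintillion bytes (2.5 x 10^18) comes out as 2.5 EB.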
What is Big Data?
Big Data, as the name suggests, is a huge collection
of data. It cannot be processed by traditional
methods because most of the data is generated in
unstructured form.
According to Gartner:
Big data is high-volume, high-velocity
and high-variety information assets that
demand innovative forms of processing for
enhanced insight and decision making.
Types of Big Data:
1. Structured Data:
Any data that can be stored, accessed and processed in
a fixed format is termed 'structured' data.
2. Unstructured Data:
Any data with an unknown form or structure is classified
as unstructured data. Besides being huge in size,
unstructured data poses multiple challenges when it is
processed to derive value from it.
3. Semi-Structured Data:
Semi-structured data can contain both forms of data.
It looks structured in form, but it is not actually
defined by, e.g., a table definition in a relational
DBMS. An example of semi-structured data is data
represented in an XML file.
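The three types can be contrasted with a short Python sketch. All sample records below are hypothetical; the point is only that the CSV rows share fixed columns, the XML records are tagged but may carry different fields, and the free text has no predefined form.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in a relational table (hypothetical sample).
structured = io.StringIO("name,city,age\nRavi,Hyderabad,20\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tagged, but without a fixed table definition;
# each record may carry different fields.
semi_structured = """
<students>
  <student><name>Ravi</name><city>Hyderabad</city></student>
  <student><name>Asha</name><age>21</age><hobby>chess</hobby></student>
</students>
""".strip()
students = [
    {child.tag: child.text for child in student}
    for student in ET.fromstring(semi_structured)
]

# Unstructured: free text with no predefined form.
unstructured = "Ravi from Hyderabad called about his admission form."

print(rows[0]["city"])            # every row has the same columns
print(students)                   # records with differing fields
print(len(unstructured.split()))  # only generic operations apply directly
```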
TYPES OF BIGDATA
3Vs of Big Data:
1. Volume: Volume refers to the ‘amount of
data’, which is growing day by day at a very fast
pace. The data generated by humans, machines
and their interactions on social media alone is
massive.
2. Variety: Variety refers to heterogeneous
sources and the nature of data, both structured
and unstructured.
3. Velocity: The term 'velocity' refers to the
speed at which data is generated. How fast data
is generated and processed to meet demand
determines the real potential of the data.
3 V’S OF BIGDATA
Sources of Big Data:
Sensor networks
Social media
Public web
Purchase records
Medical records
Airlines
Scientific research
And so on.
Facts and Figures of Big Data:
• The big data growth we’ve been witnessing is only
natural. We constantly generate data. On Google alone,
we submit 40,000 search queries per second. That
amounts to 1.2 trillion searches yearly!
• Each minute, 300 new hours of video show up on
YouTube. That’s why there’s more than 1 billion
gigabytes (1 exabyte) of data on its servers!
• People share more than 100 terabytes of data on
Facebook daily. Every minute, users send 31 million
messages and view 2.7 million videos.
• Big data usage statistics indicate people take about 80%
of photos on their smartphones. Considering that this
year alone over 1.4 billion devices will be shipped
worldwide, we can only expect this percentage to grow.
• Smart devices (for example, fitness trackers, sensors,
Amazon Echo) produce 5 quintillion bytes of data daily.
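The yearly search figure above follows from simple arithmetic, which the sketch below reproduces. It assumes a 365-day year and constant rates; the per-day video figure is derived here from the quoted per-minute rate and is not stated on the slide.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
MINUTES_PER_DAY = 60 * 24

# 40,000 searches per second, sustained for a year.
searches_per_year = 40_000 * SECONDS_PER_YEAR
print(f"{searches_per_year:,}")  # 1,261,440,000,000 (about 1.26 trillion)

# 300 hours of video uploaded per minute, sustained for a day.
video_hours_per_day = 300 * MINUTES_PER_DAY
print(f"{video_hours_per_day:,}")  # 432,000
```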
FACTS ABOUT BIGDATA
Key Stats:
• The big data market is expected to reach an estimated
value of $103 billion by 2023
• An estimated 97.2% of companies are starting to
invest in big data technology
• Every day, internet users create 2.5 quintillion
bytes of data
• IDC’s Digital Universe Study from 2012 found that
just 0.5% of data was actually being analyzed.
• It is estimated that by 2021 every person will
generate about 1.7 megabytes of data per second.
• Companies like Netflix leverage Big Data to save
US$1 billion per year on customer retention.
• 80-90% of the data we generate today is
unstructured.
INTERESTING FACTS
• According to studies, the human brain can
store about 2.5 petabytes of data.
• We generate 2.5 quintillion bytes of data
daily.
• By 2020, every person will generate 1.7
megabytes in just a second.
DATA ANALYTICS
Gartner is predicting that companies that aren’t
investing heavily in analytics by the end of 2020
may not be in business in 2021. (It is assumed
small businesses, such as self-employed
handymen, gardeners, and many artists, are not
included in this prediction.)
DATA ANALYTICS
• Data Analytics is the process of examining raw data (data sets) with
the purpose of drawing conclusions about that information,
increasingly with the aid of specialized systems and software.
• Data Analytics involves applying an algorithmic or mechanical
process to derive insights. For example, running through a number
of data sets to look for meaningful correlations between them.
• It is used in a number of industries to allow organizations and
companies to make better decisions as well as to verify or disprove
existing theories or models.
• The focus of Data Analytics lies in inference, which is the process of
deriving conclusions that are solely based on what the researcher
already knows.
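Looking for correlations between data sets can be sketched with a hand-written Pearson correlation coefficient. The data sets below (ad spend, sales, discounts) are invented for illustration; a value near +1 or -1 suggests a strong linear relationship, a value near 0 suggests none.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical monthly figures.
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 55]
discounts = [5, 3, 8, 1, 6]

print(round(pearson(ad_spend, sales), 3))      # close to 1: strong relationship
print(round(pearson(ad_spend, discounts), 3))  # close to 0: weak relationship
```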
Big Data and Analytics
• Surprisingly, 99.5% of collected data never gets used
or analysed. So much potential wasted!
• Less than 50% of the structured data collected from IoT is
used in decision making.
• Predictive analytics are becoming more and more crucial for
success. 79% of executives believe that failing to embrace
big data will lead to bankruptcy. This explains why 83% of
companies invest in big data projects.
• Fortune 1000 companies can gain more than $65
million additional net income, only by increasing their data
accessibility by 10%.
• Healthcare could also vastly benefit from big data analytics
adoption. As much as $300 billion can be saved yearly!
• Companies that harness big data’s full power
could increase their operating margins by up to 60%!
TYPES OF DATA ANALYTICS
1. DESCRIPTIVE ANALYTICS:
• The simplest way to define descriptive analytics is that it answers
the question “What has happened?”
• This type of analytics analyses real-time and historical data for
insights on how to approach the future.
• The main objective of descriptive analytics is to find out the reasons
behind previous success or failure.
• The ‘past’ here refers to any particular time at which an event
occurred, which could be a month ago or even just a minute ago.
• The vast majority of big data analytics used by organizations falls
into the category of descriptive analytics. 90% of organizations today
use descriptive analytics, which is the most basic form of analytics.
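A minimal sketch of descriptive analytics is a summary of what has already happened. The daily sales figures below are hypothetical; the summary uses only Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical daily sales figures for one week (day 0 = Monday).
daily_sales = [120, 135, 128, 90, 160, 210, 198]

summary = {
    "total": sum(daily_sales),
    "mean": round(mean(daily_sales), 1),
    "median": median(daily_sales),
    "best_day": daily_sales.index(max(daily_sales)),
}
print(summary)
# {'total': 1041, 'mean': 148.7, 'median': 135, 'best_day': 5}
```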
2. DIAGNOSTIC ANALYTICS
• At this stage, historical data can be measured against other
data to answer the question of why something happened.
• Companies go for diagnostic analytics because it gives deep insight
into a particular problem. At the same time, a company should
have detailed information at its disposal; otherwise data has to
be collected separately for every issue, which is
time-consuming.
• E.g.: examples from different industries: a healthcare
provider compares patients’ responses to a promotional
campaign in different regions; a retailer drills sales down
to subcategories.
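The retailer's drill-down can be sketched as a grouped total. The sales records and the helper name `drill_down` are made up for this illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (category, subcategory, amount).
sales = [
    ("Electronics", "Phones", 500),
    ("Electronics", "Laptops", 300),
    ("Electronics", "Phones", 450),
    ("Clothing", "Shoes", 120),
    ("Clothing", "Shirts", 80),
]

def drill_down(records, category):
    """Break one category's sales down into totals per subcategory."""
    totals = defaultdict(int)
    for cat, subcategory, amount in records:
        if cat == category:
            totals[subcategory] += amount
    return dict(totals)

print(drill_down(sales, "Electronics"))  # {'Phones': 950, 'Laptops': 300}
```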
3. PREDICTIVE ANALYTICS
• Predictive analytics tells what is likely to happen. It uses the
findings of descriptive and diagnostic analytics to detect tendencies,
clusters and exceptions, and to predict future trends, which makes it a
valuable tool for forecasting.
• Despite the numerous advantages that predictive analytics brings, it is
essential to understand that a forecast is just an estimate whose
accuracy depends heavily on data quality and the stability of the
situation, so it requires careful treatment and continuous
optimization.
• E.g.: A management team can weigh the risks of investing in their
company’s expansion based on cash flow analysis and
forecasting. Organizations like Walmart, Amazon and other retailers
leverage predictive analytics to identify sales trends based on
customers' purchase patterns, forecast customer behavior and
inventory levels, predict which products customers are
likely to purchase together so that they can offer personalized
recommendations, and predict the amount of sales at the end of the
quarter or year.
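A very simple form of forecasting is to fit a straight line to past values and extend it one step. The sketch below uses ordinary least squares on hypothetical quarterly sales; real predictive models are usually far richer, so treat this as an illustration of the idea only.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales for the last four quarters.
quarterly_sales = [100, 110, 120, 130]

slope, intercept = fit_line(quarterly_sales)
forecast = intercept + slope * len(quarterly_sales)  # next quarter
print(forecast)  # 140.0
```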
4. PRESCRIPTIVE ANALYTICS
• The purpose of prescriptive analytics is to literally prescribe what action to take to
eliminate a future problem or take full advantage of a promising trend. It is a
combination of data, mathematical models and various business rules.
• The data for prescriptive analytics can be both internal (within the organization)
and external (like social media data).
• Prescriptive analytics uses sophisticated tools and technologies, like
machine learning, business rules and algorithms, which make it complex to
implement and manage. That is why, before deciding to adopt prescriptive
analytics, a company should compare the required effort with the expected added value.
• Prescriptive analytics is comparatively complex in nature, and many companies
are not yet using it in day-to-day business activities because it is difficult to
manage. Large-scale organizations use prescriptive analytics for scheduling
inventory in the supply chain, optimizing production, etc. to improve customer
experience.
• An example of prescriptive analytics: a multinational company was able to identify
opportunities for repeat purchases based on customer analytics and sales history.
NEED FOR BIG DATA ANALYTICS
• The new benefits that big data analytics brings to the table,
however, are speed and efficiency. Whereas a few years ago a
business would have gathered information, run analytics and
unearthed information that could be used for future
decisions, today that business can identify insights for
immediate decisions. The ability to work faster, and stay
agile, gives organizations a competitive edge they didn’t
have before.
• Big data analytics helps organizations harness their data and
use it to identify new opportunities. That, in turn, leads to
smarter business moves, more efficient operations, higher
profits and happier customers in the following ways:
• Cost reduction: Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data, and they
can identify more efficient ways of doing business.
• Faster, better decision making: With the speed of Hadoop and in-memory analytics,
combined with the ability to analyze new sources of data, businesses are able to
analyze information immediately and make decisions based on what they’ve learned.
• New products and services: With the ability to gauge customer needs and satisfaction
through analytics comes the power to give customers what they want. Davenport
points out that with big data analytics, more companies are creating new products to
meet customers’ needs.
• End users can visualize data: While the business intelligence software market is
relatively mature, a big data initiative is going to require next-level data visualization
tools, which present BI data in easy-to-read charts, graphs and slideshows. Due to the
vast quantities of data being examined, these applications must be able to offer
processing engines that let end users query and manipulate information quickly, even
in real time in some cases.
Use case: Why is Hadoop needed?
Problem: An e-commerce site XYZ (with 100 million
users) wants to offer a $100 gift voucher to its top 10
customers who spent the most in the previous year.
Moreover, it wants to find the buying trends of these
customers so that the company can suggest more items
related to them.
Issues:
A huge amount of unstructured data needs to be
stored, processed and analyzed.
Solution:
• Apache Hadoop is not only a storage system but a platform
for data storage as well as processing.
• Storage: To store this huge amount of data, Hadoop uses HDFS
(Hadoop Distributed File System), which uses commodity
hardware to form clusters and store data in a distributed
fashion. It works on the write-once, read-many-times principle.
• Processing: The MapReduce paradigm is applied to data
distributed over the network to find the required output.
• Analyze: Pig and Hive can be used to analyze the data.
• Cost: Hadoop is open source, so licensing cost is not an issue.
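On a single machine, the "top 10 customers" question reduces to summing purchases per customer and keeping the largest totals, as in the sketch below. The purchase records and the helper `top_spenders` are hypothetical; with 100 million users the same two steps (aggregate, then select) would run as distributed jobs over data stored in HDFS rather than in one Python process.

```python
import heapq
from collections import Counter

def top_spenders(purchases, n=10):
    """Sum spending per customer, then return the n largest totals."""
    totals = Counter()
    for customer_id, amount in purchases:
        totals[customer_id] += amount
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])

# Hypothetical purchase records: (customer id, amount spent).
purchases = [("c1", 250), ("c2", 90), ("c1", 100), ("c3", 400), ("c2", 30)]

print(top_spenders(purchases, n=2))  # [('c3', 400), ('c1', 350)]
```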
INTRODUCTION TO HADOOP
Processing Big Data - Hadoop
Designed to answer the question: “How can big data be
processed at reasonable cost and in reasonable time?”
Answer: we have a savior to deal with Big Data
challenges, and it's Hadoop.
Apache Hadoop
• Hadoop is an open-source software framework used for
storing and processing Big Data in a distributed manner on
large clusters of commodity hardware. Hadoop is developed
by the Apache Software Foundation (ASF) and licensed under
the Apache License 2.0.
• Created by Doug Cutting and Mike Cafarella in 2005.
• Cutting named the program after his son’s toy elephant.
• With its distributed processing, Hadoop handles large
volumes of structured and unstructured data more
efficiently than the traditional enterprise data warehouse.
Hadoop makes it possible to run applications on systems
with thousands of commodity hardware nodes, and to
handle thousands of terabytes of data. Organizations are
adopting Hadoop because it is open-source software
and can run on commodity hardware.
History of Hadoop
HADOOP’S DEVELOPERS
Doug Cutting
2005: Doug Cutting and Michael J.
Cafarella developed Hadoop to support
distribution for the Nutch search engine
project.
The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache
Software Foundation.
GOOGLE ORIGINS (timeline figure: 2003, 2004, 2006)
SOME HADOOP MILESTONES
• 2008 - Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of
data in 209 seconds, compared to the previous record of 297 seconds)
• 2009 - Avro and Chukwa became new members of the Hadoop
family
• 2010 - Hadoop's HBase, Hive and Pig subprojects completed, adding
more computational power to the Hadoop framework
• 2011 - ZooKeeper completed
• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha released;
Ambari, Cassandra and Mahout added
HADOOP EVOLUTION IN ONE SHOT
FEATURES OF HADOOP
• Abstract and facilitate the storage and processing of
large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• Suitable for Big Data Analysis
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
• Move computation rather than data
WHO USES HADOOP?
HADOOP FRAMEWORK TOOLS
HADOOP’S ARCHITECTURE
Hadoop Distributed File System (HDFS):
• Tailored to the needs of MapReduce
• Targeted towards many reads of file streams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x and later)
• Location awareness of DataNodes in the network
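The effect of block size and replication on storage can be worked out with a little arithmetic. The sketch below uses the 64 MB block size and 3x replication quoted above; the function name `hdfs_footprint` is made up, and the calculation assumes the last, partly filled block only occupies the space it actually needs.

```python
import math

BLOCK_SIZE_MB = 64      # block size quoted on the slide
REPLICATION_FACTOR = 3  # default replication quoted on the slide

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_storage_mb

print(hdfs_footprint(1000))  # (16, 48, 3000)
```

A 1000 MB file is therefore split into 16 blocks, stored as 48 block replicas, and consumes about 3000 MB of raw disk across the cluster.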
HADOOP’S ARCHITECTURE
NameNode:
• Stores metadata for the files, like the directory structure of a
typical FS.
• The server holding the NameNode instance is crucial,
as there is only one (a single point of failure in Hadoop 1.x).
• Keeps a transaction log for file deletes/adds, etc. The log
covers only metadata, not whole blocks or file streams.
• Handles creation of more replica blocks when necessary
after a DataNode failure
HADOOP’S ARCHITECTURE
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode decides replica placement: with 3 replicas, two share one rack and the third goes to a different rack
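The interplay between DataNode block reports and NameNode re-replication can be illustrated with a toy model. This is not Hadoop code: the class `ToyNameNode`, its methods, and the node and block names are all invented for the sketch, which only tracks metadata (which node holds which block) and flags blocks that fall below the replication factor after a failure.

```python
from collections import defaultdict

REPLICATION_FACTOR = 3

class ToyNameNode:
    """Toy model: tracks which DataNodes hold each block (metadata only)."""

    def __init__(self):
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        # A DataNode notifies the NameNode of the blocks it stores.
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def handle_datanode_failure(self, datanode):
        # Forget every replica that lived on the failed node.
        for holders in self.block_locations.values():
            holders.discard(datanode)

    def under_replicated(self):
        # Blocks that need more replicas created elsewhere.
        return sorted(
            block_id
            for block_id, holders in self.block_locations.items()
            if len(holders) < REPLICATION_FACTOR
        )

namenode = ToyNameNode()
for node in ("dn1", "dn2", "dn3"):
    namenode.receive_block_report(node, ["blk_1", "blk_2"])

print(namenode.under_replicated())  # []
namenode.handle_datanode_failure("dn3")
print(namenode.under_replicated())  # ['blk_1', 'blk_2']
```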
HADOOP’S ARCHITECTURE:
MAPREDUCE ENGINE
HADOOP’S ARCHITECTURE
MapReduce Engine:
• JobTracker & TaskTracker
• The JobTracker splits a job into smaller tasks (“Map”) and
sends them to the TaskTracker process on each node
• Each TaskTracker reports back to the JobTracker node on
task progress, sends data (“Reduce”) or requests
new tasks
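The Map and Reduce steps can be illustrated with the classic word count, written here as plain Python running in one process. The function names and input lines are made up; in Hadoop the map tasks would run on the nodes holding each input split, and the framework would perform the shuffle between the two phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits, one line each.
splits = ["big data needs hadoop", "hadoop stores big data"]

mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
# {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'stores': 1}
```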
HADOOP’S ARCHITECTURE
• None of these components are necessarily limited to using
HDFS
• Many other distributed file-systems with quite different
architectures work
• Many other software packages besides Hadoop's
MapReduce platform make use of HDFS
WHY USE HADOOP?
• Need to process multi-petabyte datasets
• Data may not have a strict schema
• Expensive to build reliability into each application
• Nodes fail every day
• Need common infrastructure
• Very large distributed file system
• Assumes commodity hardware
• Optimized for batch processing
• Runs on heterogeneous OS

Contenu connexe

Tendances

Big Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesBig Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New Challenges
Editor IJCATR
 
GSAMPerspectives7-BigData-Edition
GSAMPerspectives7-BigData-EditionGSAMPerspectives7-BigData-Edition
GSAMPerspectives7-BigData-Edition
Gang Li
 
Supply chain management
Supply chain managementSupply chain management
Supply chain management
muditawasthi
 
GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378
Parag Kapile
 

Tendances (20)

Big data unit i
Big data unit iBig data unit i
Big data unit i
 
IRJET - Big Data: Evolution Cum Revolution
IRJET - Big Data: Evolution Cum RevolutionIRJET - Big Data: Evolution Cum Revolution
IRJET - Big Data: Evolution Cum Revolution
 
El big data analytics donde menos te lo esperas - Alex Rayón
El big data analytics donde menos te lo esperas - Alex RayónEl big data analytics donde menos te lo esperas - Alex Rayón
El big data analytics donde menos te lo esperas - Alex Rayón
 
Big data upload
Big data uploadBig data upload
Big data upload
 
Bigdata
BigdataBigdata
Bigdata
 
Big Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesBig Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New Challenges
 
Big data.
Big data.Big data.
Big data.
 
Sample
Sample Sample
Sample
 
Societal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data StackSocietal Impact of Applied Data Science on the Big Data Stack
Societal Impact of Applied Data Science on the Big Data Stack
 
What is big data
What is big dataWhat is big data
What is big data
 
Data set Introduction to Big Data
Data set   Introduction to Big DataData set   Introduction to Big Data
Data set Introduction to Big Data
 
How does big data impact you
How does big data impact youHow does big data impact you
How does big data impact you
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Big Data and Analytics - 2016 CFO
Big Data and Analytics - 2016 CFOBig Data and Analytics - 2016 CFO
Big Data and Analytics - 2016 CFO
 
What is Big Data
What is Big Data What is Big Data
What is Big Data
 
GSAMPerspectives7-BigData-Edition
GSAMPerspectives7-BigData-EditionGSAMPerspectives7-BigData-Edition
GSAMPerspectives7-BigData-Edition
 
Big Data Challenges faced by Organizations
Big Data Challenges faced by OrganizationsBig Data Challenges faced by Organizations
Big Data Challenges faced by Organizations
 
Supply chain management
Supply chain managementSupply chain management
Supply chain management
 
GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378
 
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
Quick view Big Data, brought by Oomph!, courtesy of our partner Sonovate
 

Similaire à Bigdata Hadoop introduction

Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
saranya270513
 

Similaire à Bigdata Hadoop introduction (20)

big-data.pdf
big-data.pdfbig-data.pdf
big-data.pdf
 
new.pptx
new.pptxnew.pptx
new.pptx
 
Big Data Analytics: Challenges And Applications For Text, Audio, Video, And S...
Big Data Analytics: Challenges And Applications For Text, Audio, Video, And S...Big Data Analytics: Challenges And Applications For Text, Audio, Video, And S...
Big Data Analytics: Challenges And Applications For Text, Audio, Video, And S...
 
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
 
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
 
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
 
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
 
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
 
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
BIG DATA ANALYTICS: CHALLENGES AND APPLICATIONS FOR TEXT, AUDIO, VIDEO, AND S...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
Big Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptxBig Data Analytics_Unit1.pptx
Big Data Analytics_Unit1.pptx
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
 
Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.
 
365 Data Science
365 Data Science365 Data Science
365 Data Science
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth Enhancement
 
Unit III.pdf
Unit III.pdfUnit III.pdf
Unit III.pdf
 
Know The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdfKnow The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdf
 

Dernier

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 

Dernier (20)

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand

Bigdata Hadoop introduction

  • 2. Agenda What is Data & Big Data Sources of big data Facts and Figures of Big Data  Processing Big Data – Hadoop  Hadoop Introduction: Hadoop’s history and advantages Hadoop Architecture in detail Hadoop in industry
  • 3. Data: Data is a collection of raw facts and figures. Because it is unprocessed, data is described as raw. Examples of data: • Student data on admission forms: a bundle of admission forms contains each student’s name, father’s name, address, course, photograph, etc. • Census data on citizens: during a census, data is collected on all citizens, such as the number of persons living in a home, whether they are literate, the number of children, caste, religion, etc.
  • 4. Information: Processed data is called information. When raw facts and figures are processed and arranged in a proper order, they become information. Information carries meaning and is useful in decision-making. Examples of information: • Students’ address labels: stored student data can be used to print address labels, which are used to send notices to students at their home addresses. • Census report, total population: census data is used to report the total population of a country, the literacy rate, the populations of males, females, children and aged persons, and the number of persons in categories such as caste, religion and age group.
  • 5. Data to Information: Example: the data collected in a survey report is ‘HYD20M’. Processing this code yields information about a person: HYD is the city name ‘Hyderabad’, 20 is the age, and M represents ‘MALE’.
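The step from the raw code to usable information can be sketched in a few lines of Python. This is only an illustration: the fixed layout (three letters, digits, one letter), the lookup tables and the function name `decode_survey_code` are assumptions, and only the HYD and M entries come from the slide.

```python
import re

# Hypothetical lookup tables; only "HYD" and "M" appear in the slide.
CITY_CODES = {"HYD": "Hyderabad"}
GENDER_CODES = {"M": "Male", "F": "Female"}

def decode_survey_code(code):
    """Split a code such as 'HYD20M' into city, age and gender."""
    # Assumed layout: 3-letter city code, 1-3 digit age, 1-letter gender.
    match = re.fullmatch(r"([A-Z]{3})(\d{1,3})([A-Z])", code)
    if match is None:
        raise ValueError(f"unrecognised survey code: {code!r}")
    city, age, gender = match.groups()
    return {
        "city": CITY_CODES.get(city, city),      # fall back to the raw code
        "age": int(age),
        "gender": GENDER_CODES.get(gender, gender),
    }

print(decode_survey_code("HYD20M"))
```

The raw string is the data; the dictionary returned by the function is the information that a decision can be based on.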
  • 7. Units of data: When dealing with big data, we express sizes in units such as megabytes, gigabytes and terabytes. The International System of Units (SI) prefixes used to represent data are:
      Kilobyte   KB  10^3 bytes
      Megabyte   MB  10^6 bytes
      Gigabyte   GB  10^9 bytes
      Terabyte   TB  10^12 bytes
      Petabyte   PB  10^15 bytes
      Exabyte    EB  10^18 bytes
      Zettabyte  ZB  10^21 bytes
      Yottabyte  YB  10^24 bytes
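To make the unit ladder concrete, the following sketch converts a raw byte count into the largest SI unit that keeps the value at 1 or above. The function name `format_bytes` is an assumption made for this illustration; the unit list and the factor of 1000 per step follow the SI table.

```python
SI_UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def format_bytes(n):
    """Express a byte count using decimal (SI) units, 1000 per step."""
    value = float(n)
    for unit in SI_UNITS:
        # Stop once the value fits below 1000, or at the largest unit.
        if value < 1000 or unit == SI_UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(format_bytes(10**12))        # one terabyte
print(format_bytes(2.5 * 10**18))  # 2.5 quintillion bytes
```

With this scale, "2.5 quintillion bytes" is 2.5 exabytes.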
  • 8. What is Big Data?
  • 9. Big Data, as the name suggests, is a huge collection of data. It cannot be processed by traditional methods because most of the data is generated in unstructured form. According to Gartner, big data is high-volume, high-velocity and high-variety information assets that demand innovative platforms for enhanced insight and decision making.
  • 10. Types of Big Data: 1. Structured data: any data that can be stored, accessed and processed in a fixed format is termed ‘structured’ data. 2. Unstructured data: any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it is processed to derive value from it. 3. Semi-structured data: semi-structured data can contain both forms of data. It looks structured, but its structure is not defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
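A short sketch may help show what "structured in form but without a table definition" means for XML. The sample document and the helper `fields_per_record` are hypothetical; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: the two <student> elements carry different fields,
# which a fixed relational table definition would not allow.
XML_DOC = """
<students>
  <student><name>Asha</name><course>BSc</course></student>
  <student><name>Ravi</name><phone>555-0100</phone><city>Hyderabad</city></student>
</students>
"""

def fields_per_record(xml_text):
    """Return, for each record, the tag names it happens to carry."""
    root = ET.fromstring(xml_text.strip())
    return [[child.tag for child in record] for record in root]

print(fields_per_record(XML_DOC))
```

Each record is self-describing through its tags, yet no schema forces the records to share the same columns.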
  • 12. 3Vs of Big Data: 1. Volume: volume refers to the amount of data, which is growing day by day at a very fast pace. The data generated by humans, machines and their interactions on social media alone is massive. 2. Variety: variety refers to the heterogeneous sources and nature of data, both structured and unstructured. 3. Velocity: velocity refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential in the data.
  • 13. 3 V’S OF BIGDATA
  • 14. Sources of Big Data: • Sensor networks • Social media • Public web • Purchase records • Medical records • Airlines • Scientific research • and so on.
  • 15. Facts and Figures of Big Data: • The big data growth we’ve been witnessing is only natural. We constantly generate data. On Google alone, we submit 40,000 search queries per second. That amounts to 1.2 trillion searches yearly! • Each minute, 300 new hours of video show up on YouTube. That’s why there’s more than 1 billion gigabytes (1 exabyte) of data on its servers! • People share more than 100 terabytes of data on Facebook daily. Every minute, users send 31 million messages and view 2.7 million videos. • Big data usage statistics indicate that people take about 80% of photos on their smartphones. Considering that over 1.4 billion devices will be shipped worldwide this year alone, we can only expect this percentage to grow. • Smart devices (for example, fitness trackers, sensors, Amazon Echo) produce 5 quintillion bytes of data daily.
  • 17. Key Stats: • The big data market is estimated to reach a value of $103 billion by 2023. • An estimated 97.2% of companies are starting to invest in big data technology. • Every day, internet users create 2.5 quintillion bytes of data. • IDC’s Digital Universe Study from 2012 found that just 0.5% of data was actually being analyzed. • It is estimated that by 2021 every person will generate about 1.7 megabytes of data per second. • Companies like Netflix leverage big data to save US$1 billion per year on customer retention. • 80-90% of the data we generate today is unstructured.
  • 18. INTERESTING FACTS: • According to studies, the human brain can store about 2.5 petabytes of data. • We generate 2.5 quintillion bytes of data daily. • By 2020, every person will generate 1.7 megabytes in just a second.
  • 19. DATA ANALYTICS Gartner is predicting that companies that aren’t investing heavily in analytics by the end of 2020 may not be in business in 2021. (It is assumed small businesses, such as self-employed handymen, gardeners, and many artists, are not included in this prediction.)
  • 20. DATA ANALYTICS: • Data analytics is the process of examining raw data (data sets) in order to draw conclusions about the information it contains, increasingly with the aid of specialized systems and software. • Data analytics involves applying an algorithmic or mechanical process to derive insights, for example running through a number of data sets to look for meaningful correlations between them. • It is used in a number of industries to allow organizations and companies to make better decisions and to verify or disprove existing theories or models. • The focus of data analytics lies in inference, which is the process of deriving conclusions that are based solely on what the researcher already knows.
  • 21. Big Data and Analytics: • Surprisingly, 99.5% of collected data never gets used or analyzed. So much potential wasted! • Less than 50% of the structured data collected from IoT is used in decision making. • Predictive analytics is becoming more and more crucial for success. 79% of executives believe that failing to embrace big data will lead to bankruptcy. This explains why 83% of companies invest in big data projects. • Fortune 1000 companies can gain more than $65 million in additional net income just by increasing their data accessibility by 10%. • Healthcare could also benefit vastly from adopting big data analytics. As much as $300 billion can be saved yearly! • Companies that harness big data’s full power could increase their operating margins by up to 60%!
  • 22. TYPES OF DATA ANALYTICS
  • 23. 1. DESCRIPTIVE ANALYTICS: • The simplest way to define descriptive analytics is that it answers the question “What has happened?” • This type of analytics examines both real-time and historical data for insights on how to approach the future. • The main objective of descriptive analytics is to find out the reasons behind previous success or failure. • The ‘past’ here refers to any time at which an event occurred, which could be a month ago or even just a minute ago. • The vast majority of big data analytics used by organizations falls into the category of descriptive analytics. 90% of organizations today use descriptive analytics, which is the most basic form of analytics.
  • 24. 2. DIAGNOSTIC ANALYTICS: • At this stage, historical data can be measured against other data to answer the question of why something happened. • Companies use diagnostic analytics because it gives deep insight into a particular problem. At the same time, a company should have detailed information at its disposal; otherwise data has to be collected separately for every issue, which is time-consuming. • Examples from different industries: a healthcare provider compares patients’ responses to a promotional campaign in different regions; a retailer drills sales down to subcategories.
  • 25. 3. PREDICTIVE ANALYTICS: • Predictive analytics tells what is likely to happen. It uses the findings of descriptive and diagnostic analytics to detect tendencies, clusters and exceptions, and to predict future trends, which makes it a valuable tool for forecasting. • Despite the numerous advantages that predictive analytics brings, it is essential to understand that a forecast is just an estimate whose accuracy depends heavily on data quality and the stability of the situation, so it requires careful treatment and continuous optimization. • Examples: a management team can weigh the risks of investing in their company’s expansion based on cash flow analysis and forecasting. Retailers such as Walmart and Amazon leverage predictive analytics to identify sales trends from customers’ purchase patterns, forecast customer behavior and inventory levels, predict which products customers are likely to purchase together so that they can offer personalized recommendations, and predict the amount of sales at the end of the quarter or year.
  • 26. 4. PRESCRIPTIVE ANALYTICS: • The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate a future problem or take full advantage of a promising trend. It is a combination of data, mathematical models and various business rules. • The data for prescriptive analytics can be both internal (within the organization) and external (like social media data). • Prescriptive analytics uses sophisticated tools and technologies, like machine learning, business rules and algorithms, which make it complex to implement and manage. That is why, before deciding to adopt prescriptive analytics, a company should compare the required effort with the expected added value. • Prescriptive analytics is comparatively complex in nature, and many companies are not yet using it in day-to-day business activities because it is difficult to manage. Large-scale organizations use prescriptive analytics for scheduling inventory in the supply chain, optimizing production and similar tasks in order to optimize customer experience. • An example of prescriptive analytics: a multinational company was able to identify opportunities for repeat purchases based on customer analytics and sales history.
  • 28. NEED FOR BIG DATA ANALYTICS: • The new benefits that big data analytics brings to the table are speed and efficiency. Whereas a few years ago a business would have gathered information, run analytics and unearthed information that could be used for future decisions, today that business can identify insights for immediate decisions. The ability to work faster and stay agile gives organizations a competitive edge they didn’t have before. • Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers in the following ways:
  • 29. NEED FOR BIG DATA ANALYTICS: • Cost reduction: Big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can identify more efficient ways of doing business. • Faster, better decision making: With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses are able to analyze information immediately and make decisions based on what they’ve learned. • New products and services: With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers’ needs. • End users can visualize data: While the business intelligence software market is relatively mature, a big data initiative is going to require next-level data visualization tools, which present BI data in easy-to-read charts, graphs and slideshows. Due to the vast quantities of data being examined, these applications must be able to offer processing engines that let end users query and manipulate information quickly, even in real time in some cases.
  • 30. Use case: Why is Hadoop needed? Problem: An e-commerce site XYZ (with 100 million users) wants to offer a $100 gift voucher to the top 10 customers who spent the most in the previous year. It also wants to find the buying trends of these customers so that it can suggest more items related to their purchases. Issue: A huge amount of unstructured data needs to be stored, processed and analyzed.
  • 31. Solution: • Apache Hadoop is not only a storage system but a platform for both data storage and processing. • Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and stores data in a distributed fashion. It works on the write-once, read-many-times principle. • Processing: The MapReduce paradigm is applied to the data distributed over the network to produce the required output. • Analysis: Pig and Hive can be used to analyze the data. • Cost: Hadoop is open source, so software licensing cost is no longer an issue.
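The processing step of this use case can be sketched as a map phase followed by a reduce phase. This is a single-machine Python illustration of the idea, not Hadoop code: the `purchases` records, the function names and the in-memory execution are all assumptions, whereas on a cluster the map and reduce work would be spread over many nodes.

```python
from collections import defaultdict
import heapq

# Hypothetical purchase records: (customer_id, amount_spent).
purchases = [("c1", 120.0), ("c2", 40.0), ("c1", 30.0), ("c3", 200.0), ("c2", 15.5)]

def map_phase(records):
    """Map: emit one (customer, amount) pair per purchase record."""
    for customer, amount in records:
        yield customer, amount

def reduce_phase(pairs):
    """Reduce: sum the amounts emitted for each customer."""
    totals = defaultdict(float)
    for customer, amount in pairs:
        totals[customer] += amount
    return dict(totals)

def top_spenders(records, n=10):
    """Return the n customers with the highest total spend."""
    totals = reduce_phase(map_phase(records))
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])

print(top_spenders(purchases, n=2))
```

Because the reduce step only needs the pairs for one customer at a time, the work can be partitioned by customer ID across machines, which is what makes the approach scale to 100 million users.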
  • 32. INTRODUCTION TO HADOOP: Processing Big Data with Hadoop. Hadoop was designed to answer the question: “How can big data be processed at reasonable cost and in reasonable time?” Answer: we have a savior to deal with Big Data challenges, and it’s Hadoop.
  • 33. Apache Hadoop: • Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is a project of the Apache Software Foundation (ASF) and is released under the Apache License 2.0. • Created by Doug Cutting and Mike Cafarella in 2005. • Cutting named the program after his son’s toy elephant. • With its distributed processing, Hadoop handles large volumes of structured and unstructured data more efficiently than the traditional enterprise data warehouse. Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes and to handle thousands of terabytes of data. Organizations are adopting Hadoop because it is open-source software and can run on commodity hardware.
  • 35. HADOOP’S DEVELOPERS: • 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo. • 2006: Yahoo gave the project to the Apache Software Foundation.
  • 37. SOME HADOOP MILESTONES: • 2008: Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds) • 2009: Avro and Chukwa became new members of the Hadoop family • 2010: Hadoop’s HBase, Hive and Pig subprojects completed, adding more computational power to the Hadoop framework • 2011: ZooKeeper completed • 2013: Hadoop 1.1.2 and Hadoop 2.0.3 alpha released; Ambari, Cassandra and Mahout added
  • 39. FEATURES OF HADOOP • Abstract and facilitate the storage and processing of large and/or rapidly growing data sets • Structured and non-structured data • Simple programming models • Suitable for Big Data Analysis • High scalability and availability • Use commodity (cheap!) hardware with little redundancy • Fault-tolerance • Move computation rather than data
  • 44. HADOOP’S ARCHITECTURE • Hadoop Distributed Filesystem • Tailored to needs of MapReduce • Targeted towards many reads of filestreams • Writes are more costly • High degree of data replication (3x by default) • No need for RAID on normal nodes • Large block size (64 MB by default in Hadoop 1.x; 128 MB by default from Hadoop 2.x) • Location awareness of DataNodes in network
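The effect of block size and replication on storage can be worked out with simple arithmetic. The sketch below assumes the 64 MB block size and 3x replication quoted on the slide; the function name `hdfs_footprint` is made up for this illustration and is not part of any Hadoop API.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, raw storage in MB) for one file.

    The last block only occupies the bytes it actually holds, so raw
    storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

print(hdfs_footprint(1000))                     # 1000 MB file, slide defaults
print(hdfs_footprint(1000, block_size_mb=128))  # same file with 128 MB blocks
```

Larger blocks mean fewer blocks per file, and therefore less metadata for the NameNode to track, while the raw storage cost is driven by the replication factor alone.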
  • 45. HADOOP’S ARCHITECTURE NameNode: • Stores metadata for the files, like the directory structure of a typical FS. • The server holding the NameNode instance is quite crucial, as there is only one. • Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks or file-streams, only metadata. • Handles creation of more replica blocks when necessary after a DataNode failure
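The NameNode's role in re-replication can be illustrated with a toy model of its block map. Everything here is hypothetical (the block IDs, node names and the function `under_replicated`); it only sketches the bookkeeping idea of comparing live replicas against the target and is not how the real NameNode is implemented.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return {block_id: missing_copies} for blocks with too few live replicas."""
    missing = {}
    for block_id, nodes in block_locations.items():
        # Only replicas on DataNodes that are still alive count.
        live = [node for node in nodes if node in live_nodes]
        if len(live) < target:
            missing[block_id] = target - len(live)
    return missing

# Hypothetical metadata: which DataNodes hold each block.
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}

# dn1 has failed, so blk_1 is left with two live replicas.
print(under_replicated(block_locations, live_nodes={"dn2", "dn3", "dn4"}))
```

Note that only metadata is consulted: the block contents themselves stay on the DataNodes.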
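  • 46. HADOOP’S ARCHITECTURE DataNode: • Stores the actual data in HDFS • Can run on any underlying filesystem (ext3/4, NTFS, etc) • Notifies NameNode of what blocks it has • NameNode directs rack-aware replica placement: with the default replication factor of 3, replicas are spread over two racks (the slide describes 2 copies in the local rack and 1 elsewhere; the default policy in current HDFS places 1 copy in the writer’s rack and 2 on different nodes of one remote rack)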
  • 49. HADOOP’S ARCHITECTURE MapReduce Engine: • JobTracker & TaskTracker • JobTracker splits the job into smaller tasks (“Map”) and sends them to the TaskTracker process on each node • TaskTracker reports back to the JobTracker node on job progress, sends data (“Reduce”) or requests new tasks
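The division of labor between splitting, per-node map tasks and the final reduce can be sketched with a word count, the customary MapReduce example. This runs in a single Python process; the function names and the round-robin splitting are assumptions for illustration, and the map task here also combines counts locally before reporting them.

```python
from collections import Counter

def split_input(lines, num_tasks):
    """Divide the input lines into splits, one per map task."""
    return [lines[i::num_tasks] for i in range(num_tasks)]

def map_task(split):
    """Map task (with local combining): count words within one split only."""
    counts = Counter()
    for line in split:
        counts.update(line.lower().split())
    return counts

def reduce_task(partials):
    """Reduce step: merge the partial counts reported by every map task."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

def run_job(lines, num_tasks=2):
    """Stand-in for the JobTracker: split, run map tasks, then reduce."""
    partials = [map_task(split) for split in split_input(lines, num_tasks)]
    return reduce_task(partials)

print(run_job(["big data", "Big Hadoop", "data data"]))
```

Each call to `map_task` touches only its own split, which is why the tasks can run on different nodes close to where the data is stored.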
  • 50. HADOOP’S ARCHITECTURE • None of these components are necessarily limited to using HDFS • Many other distributed file-systems with quite different architectures work • Many other software packages besides Hadoop's MapReduce platform make use of HDFS
  • 51. WHY USE HADOOP? • Need to process multi-petabyte datasets • Data may not have a strict schema • Expensive to build reliability into each application • Nodes fail every day • Need common infrastructure • Very large distributed file system • Assumes commodity hardware • Optimized for batch processing • Runs on heterogeneous OS