3. Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit-card transactions
– Social networks
– Sensor data
– IoT data
9. Big Data
• Exabytes and zettabytes of data
• Big Data is not about the size of the data;
it's about the value within it
10. Data in an Enterprise
• Existing OLTP databases
• User-generated data
• Logs
• System-generated data
11. The Structure of Big Data
• Structured
– Most traditional data sources
• Semi-structured
– XML, JSON
• Unstructured
– Facebook logs, web chats, YouTube videos
12. What to Do with This Data?
• Aggregation and statistics
– Data warehousing and OLAP
• Indexing, searching, and querying
– Keyword-based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data mining
– Statistical modeling
13. Challenges with Big Data
• Data quality: the 4th V, i.e. veracity
• Discovery: finding insights in Big Data is like finding a
needle in a haystack
• Storage:
– "Where do we store it?"
– Need to scale up or down on demand
• Analytics
– We are often unaware of the kind of data we are dealing
with, which makes analyzing it even more difficult
• Security
• Lack of talent
15. Scale Up vs Scale Out
• Traditional systems are typically "scaled up" (not scaled
out) by getting bigger/more powerful hardware
• Scaling up is harder and more expensive than scaling out
16. Apache Hadoop
• The Apache Hadoop software library is a framework
that allows for the distributed processing of large data
sets across clusters of commodity hardware.
• Concept: Big Data
• Technique: MapReduce
• Hadoop is an ecosystem framework developed in Java.
18. Why Hadoop
• An open-source project to manage "Big Data"
• Not just a single project, but a set of projects
that work together
• Deals with the 4 V's
• Traditional data stores are expensive to scale and,
by design, difficult to distribute
• Transforms commodity hardware into a coherent
storage service that lets you store petabytes of data,
and a coherent processing service to process that data
efficiently
20. Hadoop History
• In 2003, Doug Cutting launches the Nutch project to
handle billions of searches and index millions of
web pages.
• Later, in Oct 2003, Google releases its paper on GFS
(the Google File System).
• In Dec 2004, Google releases its paper on
MapReduce.
• In 2005, Nutch uses GFS and MapReduce to perform
its operations.
• In 2006, Yahoo creates Hadoop, based on GFS and
MapReduce, with Doug Cutting and team.
• In 2007, Yahoo starts using Hadoop on a 1000-node
cluster.
22. MapReduce
• A processing framework
• Java-based
• Designed for batch processing
• A high-performance, fault-tolerant data-processing
system
23. MapReduce in 41 words
Goal: count the number of books in the library.
• Map:
– You count up shelf #1,
– I count up shelf #2.
(The more people we get, the faster this part goes)
• Reduce:
We all get together and add up our individual
counts.
(Cf. http://www.chrisstucchio.com/blog/2011/mapreduce_explained.html)
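The library analogy above can be sketched in a few lines of plain Python (the shelf contents are invented for illustration):

```python
from functools import reduce

# Each shelf is counted independently -- the "map" phase.
# The more people (or machines) counting shelves, the faster this goes.
shelves = [["book"] * 12, ["book"] * 7, ["book"] * 20]  # hypothetical shelves
per_shelf_counts = list(map(len, shelves))  # [12, 7, 20]

# Everyone gets together and adds up their counts -- the "reduce" phase.
total = reduce(lambda a, b: a + b, per_shelf_counts)
print(total)  # 39
```

The map step parallelizes across workers; only the small per-shelf counts need to be brought together at the end.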
24. MapReduce
• MapReduce is a programming model for processing and
generating large data sets.
• MapReduce was used to completely regenerate Google's
index of the World Wide Web.
• Hadoop allows applications to run using the
MapReduce algorithm.
• Users implement an interface of 2 functions:
– Map
– Reduce
• Map(in-key, in-value) → (out-key, intermediate-value) list
• Reduce(out-key, intermediate-value list) → out-value list
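A minimal word-count sketch of these two interfaces in plain Python (this is not the Hadoop Java API; the function and variable names are illustrative, and the shuffle that Hadoop performs between the phases is simulated by hand):

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Map(in-key, in-value) -> list of (out-key, intermediate-value)."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Reduce(out-key, intermediate-value list) -> out-value list."""
    return [sum(intermediate_values)]

# Hypothetical input: document name -> document text.
documents = {"doc1": "big data big value", "doc2": "big cluster"}

# Shuffle: group all intermediate values by their out-key.
grouped = defaultdict(list)
for key, value in documents.items():
    for out_key, inter in map_fn(key, value):
        grouped[out_key].append(inter)

result = {k: reduce_fn(k, vs)[0] for k, vs in grouped.items()}
print(result)  # {'big': 3, 'data': 1, 'value': 1, 'cluster': 1}
```

In Hadoop, the map calls run in parallel across the cluster, the framework handles the grouping (shuffle and sort), and the reduce calls also run in parallel, one per key partition.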
27. Apache Hadoop Modules
• Hadoop Common: contains libraries and other modules
• HDFS: the Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale
data processing
31. YARN
• A newer processing framework
• High availability
• YARN supports multiple processing models in
addition to MapReduce
• With YARN we can run non-MapReduce jobs
32. Server Types
• OLTP (Online Transaction Processing): data
keeps on changing
• OLAP (Online Analytical Processing)
– Facebook, Google, Twitter, LinkedIn, e-commerce
sites
36. Apache Flume
• Flume is a framework for populating Hadoop
with data.
• Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating,
and moving large amounts of log data.
37. Sqoop
• Apache Sqoop is a connectivity tool designed
for efficiently transferring bulk data between
Apache Hadoop and structured data stores
such as relational databases.
38. Kafka
Kafka® is used for building real-time data
pipelines and streaming apps. It is horizontally
scalable, fault-tolerant, very fast, and runs in
production in thousands of companies.
39. Arrange Data
• Hadoop Distributed File System (HDFS)
• NoSQL: HBase, MongoDB, Cassandra
• NoSQL adds transactional behavior to
data (OLTP behavior)
40. Arrange Data
• HDFS is a distributed file system designed to
run on commodity hardware.
• HDFS: anything you save in HDFS is a file
42. Spark
• In-memory processing
• Live stream processing
• Machine learning
43. Pig
• Initially developed by Yahoo
• A platform for analyzing large data sets,
consisting of a high-level language for expressing
data analysis programs
• Its infrastructure compiles the language into a
sequence of MapReduce programs
45. Twitter
• Twitter moved to Apache Pig for analysis. Now,
– joining data sets,
– grouping them,
– sorting them, and
– retrieving data
become easier and simpler.
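What Pig's grouping and sorting operators accomplish can be sketched in plain Python (the records below are invented; real Pig scripts are written in Pig Latin and compiled to MapReduce jobs):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (user, tweet_count) records, like a small Twitter data set.
records = [("alice", 3), ("bob", 5), ("alice", 2), ("carol", 1), ("bob", 4)]

# GROUP BY user, then SUM per group -- what Pig's GROUP/FOREACH...SUM do.
records.sort(key=itemgetter(0))  # groupby needs sorted input
totals = {user: sum(n for _, n in rows)
          for user, rows in groupby(records, key=itemgetter(0))}

# ORDER BY total, descending -- what Pig's ORDER...BY...DESC does.
ranked = sorted(totals.items(), key=itemgetter(1), reverse=True)
print(ranked)  # [('bob', 9), ('alice', 5), ('carol', 1)]
```

On a cluster, Pig spreads the same group/sum/sort pattern across many machines instead of one in-memory list.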
47. Apache Hive
• Apache Hive is a data warehouse system built on
top of Hadoop and is used for analyzing structured
and semi-structured data.
• Compiles SQL-like queries into MapReduce
programs
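As a rough illustration of the kind of SQL-like aggregation Hive compiles into MapReduce jobs, here is an equivalent query run against an in-memory SQLite table (SQLite is only a local stand-in; Hive's own dialect is HiveQL, and the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("alice", "/home"), ("bob", "/home"), ("alice", "/about"),
])

# Hive would compile a query like this into MapReduce:
# map emits (url, 1), reduce sums the counts per url.
rows = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # [('/about', 1), ('/home', 2)]
```

The point is that the analyst writes a declarative query; Hive decides how to turn it into distributed map and reduce stages.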
48. Story of Hive – From Facebook to Apache
• Challenges at Facebook: exponential growth
of data
• The Hive project was open sourced in August
2008 by Facebook and is freely available as
Apache Hive today.
51. NASA Case Study: Regional Climate
Model Evaluation System (RCMES)
• A MySQL database with 6 billion tuples of the form
(latitude, longitude, time, data point value, height)
• Even after dividing the whole table into smaller subsets,
the system generated huge overhead while processing the data.
53. HBase
• HBase is an open-source, multidimensional,
distributed, scalable NoSQL database
written in Java.
• The Facebook Messaging Platform shifted from
Apache Cassandra to HBase in November
2010.
• Facebook Messenger combines messages,
email, chat, and SMS into a real-time
conversation.
54. Apache Mahout
• A machine-learning library for building scalable
machine-learning algorithms, implemented on
top of MapReduce.
55. Decide
• Data visualization
– Dashboards, graphs, charts
• Enables taking business decisions
• BI tools:
– HUE
– Tableau
– QlikView
– MS Excel
56. HUE
• Hue is an open-source Web interface that
supports Apache Hadoop and its ecosystem