3. BIG DATA
• The data comes from everywhere: sensors used to
gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction records,
and cell phone GPS signals, to name a few. This data
is called Big Data.
• Every day, we create 2.5 quintillion bytes of data (one
quintillion bytes = one billion gigabytes); so much that
90% of the data in the world today has been created in
the last two years alone.
4. IN FACT, IN A MINUTE…
• Email users send more than 204 million messages;
• Mobile Web receives 217 new users;
• Google receives over 2 million search queries;
• YouTube users upload 48 hours of new video;
• Facebook users share 684,000 bits of content;
• Twitter users send more than 100,000 tweets;
• Consumers spend $272,000 on Web shopping;
• Apple receives around 47,000 application downloads;
• Brands receive more than 34,000 Facebook 'likes';
• Tumblr blog owners publish 27,000 new posts;
• Instagram users share 3,600 new photos;
• Flickr users add 3,125 new photos;
• Foursquare users perform 2,000 check-ins;
• WordPress users publish close to 350 new blog posts.
5. Big Data Vectors
• High-volume:
The amount of data
• High-velocity:
The speed at which data is collected, generated, or
processed
• High-variety:
The range of data types, such as audio, video, and images
Big Data = Transactions + Interactions + Observations
6. What is Hadoop?
• HADOOP
High-availability distributed object-oriented platform, or
“Hadoop”, is a software framework that analyzes structured
and unstructured data and distributes applications across
different servers.
• Basic Application of Hadoop
Hadoop is used for maintaining, scaling, error handling,
self-healing, and securing data at large scale. This data can
be structured or unstructured. In other words, when data is
too large, traditional systems are unable to handle it.
8. DIFFERENT COMPONENTS
Data Access Components: PIG & HIVE
Data Storage Components: HBASE
Data Integration Components: APACHE FLUME, SQOOP, CHUKWA
Data Management Components: AMBARI, ZOOKEEPER
Data Serialization Components: THRIFT & AVRO
Data Intelligence Components: APACHE MAHOUT, DRILL
9. What does it do?
• Hadoop implements Google’s MapReduce, using
HDFS
• MapReduce divides applications into many small
blocks of work.
• HDFS creates multiple replicas of data blocks for
reliability, placing them on compute nodes
around the cluster (see the sketch after this slide).
• MapReduce can then process the data where it
is located.
• Hadoop's target is to run on clusters on the order
of 10,000 nodes.
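As a hedged sketch (not part of the original slides), the HDFS Java client API can show where the replicated blocks of a file physically live; this is the placement information MapReduce uses to schedule tasks near the data. The file path here is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical input file; replace with a real HDFS path.
    FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
    // Each BlockLocation lists the datanodes holding one replicated block.
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + loc.getOffset()
          + " -> replicas on: " + String.join(", ", loc.getHosts()));
    }
  }
}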
10. How does MapReduce work?
• The runtime partitions the input and provides it
to different Map instances;
• Map(key, value) → (key′, value′)
• The runtime collects the (key′, value′) pairs and
distributes them to several Reduce functions, so
that each Reduce function gets all the pairs with
the same key′.
• Each Reduce produces a single output file (or none).
• Map and Reduce are user-written functions (a
word-count sketch follows).
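To make the pattern concrete, here is a minimal sketch of user-written Map and Reduce functions, mirroring the canonical Hadoop word-count example. Map emits a (word, 1) pair for every token; the runtime groups the pairs by key, and Reduce sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map(key, value) -> (key', value'): emits (word, 1) for every token in a line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce gets all values for one key' and emits a single (word, total) pair.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

A Job driver class (omitted here for brevity) wires the two together; the packaged jar is then typically launched with the standard command "hadoop jar wordcount.jar WordCount <input> <output>".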
11. HYPERTABLE
What is it?
• Open-source Bigtable clone
• Manages massive sparse tables with timestamped cell
versions (see the toy sketch after this list)
• Single primary key index
What is it not?
• No joins
• No secondary indexes (not yet)
• No transactions (not yet)
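A toy sketch (my own illustration, not Hypertable code) of the Bigtable-style data model the slides describe: every cell is addressed by a row key, a column, and a timestamp, and the single sorted index over row keys makes range scans cheap while ruling out joins and secondary indexes.

import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a sparse, versioned table: row -> column -> timestamp -> value.
// The outer TreeMap is the single primary-key (row) index; there is no other index.
public class ToySparseTable {
  private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
      new TreeMap<>();

  public void put(String row, String column, long timestamp, String value) {
    rows.computeIfAbsent(row, r -> new TreeMap<>())
        .computeIfAbsent(column, c -> new TreeMap<>())
        .put(timestamp, value);
  }

  // Latest version of a cell: the highest timestamp wins.
  public String getLatest(String row, String column) {
    NavigableMap<String, NavigableMap<Long, String>> cols = rows.get(row);
    if (cols == null) return null;
    NavigableMap<Long, String> versions = cols.get(column);
    return (versions == null || versions.isEmpty()) ? null : versions.lastEntry().getValue();
  }

  // Range scan over the primary key, e.g. all rows in ["a", "b").
  public NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> scan(
      String from, String to) {
    return rows.subMap(from, true, to, false);
  }
}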
16. RANGE SERVER
• Manages ranges of table data
• Caches updates in memory (Cell Cache)
• Periodically spills (compacts) cached updates to disk (CellStore)
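A hedged toy sketch of this cache-then-spill write path; it illustrates the pattern, not Hypertable's actual implementation, and the threshold and file naming are made up.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy RangeServer write path: buffer sorted updates in memory,
// spill them to an immutable on-disk "CellStore" when the cache fills up.
public class ToyRangeServer {
  private static final int SPILL_THRESHOLD = 1000;   // made-up limit
  private final TreeMap<String, String> cellCache = new TreeMap<>();
  private int spillCount = 0;

  public void update(String key, String value) throws IOException {
    cellCache.put(key, value);                       // in-memory Cell Cache
    if (cellCache.size() >= SPILL_THRESHOLD) {
      spill();
    }
  }

  // Compaction: write the sorted cache out as one immutable file, then clear it.
  private void spill() throws IOException {
    List<String> lines = new ArrayList<>();
    cellCache.forEach((k, v) -> lines.add(k + "\t" + v));
    Files.write(Path.of("cellstore-" + (spillCount++) + ".tsv"), lines);
    cellCache.clear();
  }
}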
17. PERFORMANCE OPTIMIZATIONS
Block Cache
• Caches CellStore blocks
• Blocks are cached uncompressed
Bloom Filter
• Avoids unnecessary disk access
• Filter by rows or by rows + columns
• Configurable false positive rate (see the sizing sketch
after this list)
Access Groups
• Physically store co-accessed columns together
• Improves performance by minimizing I/O
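For the Bloom filter above, here is a sketch of the standard sizing math behind a "configurable false positive rate" (these are the textbook formulas, not Hypertable-specific code): for n keys and a target rate p, the filter needs m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions.

// Standard Bloom filter sizing: how big must the bit array be, and how many
// hash functions are needed, to hit a desired false positive rate?
public class BloomSizing {
  public static void main(String[] args) {
    long n = 1_000_000;   // expected number of keys (illustrative)
    double p = 0.01;      // target false positive rate (the configurable knob)

    // m = -n * ln(p) / (ln 2)^2 bits; k = (m / n) * ln 2 hash functions
    double m = -n * Math.log(p) / (Math.log(2) * Math.log(2));
    double k = (m / n) * Math.log(2);

    System.out.printf("bits: %.0f (%.2f bits/key), hash functions: %.1f%n",
        m, m / n, k);
    // Roughly 9.59 million bits (about 9.6 bits per key) and ~7 hash functions:
    // one small in-memory probe can rule out most keys without touching disk.
  }
}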
18. ADVANTAGES
• Flexible: easily accesses structured and unstructured
data.
• Scalable: it can store and distribute very large data sets
across hundreds of inexpensive servers that operate in
parallel.
• Efficient: by distributing the data, it can process it in
parallel on the nodes where the data is located.
• Resistant to failure: it automatically maintains
multiple copies of data and automatically redeploys
computing tasks when failures occur.