Learning Objectives - In this module, you will learn what Big Data is, the limitations of existing solutions to the Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of a file write and read.
2. Module 1
• What is Big Data?
• Hadoop Ecosystem Components
• Hadoop Architecture
• Hadoop Storage: HDFS
• Hadoop Processing: MapReduce Framework
• Hadoop Server Roles: NameNode, DataNode, Secondary NameNode
• Anatomy of File Read and Write
3. What is Big Data?
• Walmart handles more than one million customer transactions every
hour.
• Facebook handles 40 billion photos from its user base.
• New York Stock Exchange generates about one TB of new trade data
per day.
• Last.fm hosts approximately 25 million users and takes up about one TB of
storage daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in the last two
years.
4. Three Characteristics of Big Data: The 3 Vs
• Volume (Data Quantity)
• Facebook ingests 500 TB of new data every day.
• A Boeing 737 generates about 240 TB of flight data during a single flight.
• Velocity (Data Speed)
• High-frequency stock-trading algorithms reflect market changes within microseconds.
• Clickstreams capture user behavior at millions of events per second.
• Variety (Data Types)
• Geospatial data, audio and video, unstructured text.
5. The structure of Big Data
• Structured
• CSV, Data stored in RDBMS
• Semi-Structured
• XML, JSON, SGML
• Unstructured
• Video data, Audio Data, Images
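To make the three categories concrete, here is a small Java sketch; the sample records (a transaction row, a JSON snippet, a free-text note) are invented purely for illustration.

public class DataShapes {
    public static void main(String[] args) {
        // Structured: fixed schema, every record has the same columns (CSV export, RDBMS row)
        String structured = "1001,2016-04-01,STORE-17,23.50";
        String[] columns = structured.split(",");   // position tells you the meaning of each field
        System.out.println("amount = " + columns[3]);

        // Semi-structured: self-describing keys/tags, schema may vary per record (JSON, XML)
        String semiStructured =
            "{ \"txnId\": 1001, \"items\": [\"milk\", \"bread\"], \"coupon\": null }";
        System.out.println(semiStructured);

        // Unstructured: free text, audio, video, images - no predefined fields to parse
        String unstructured = "Customer emailed to say the self-checkout froze twice on Friday.";
        System.out.println(unstructured);
    }
}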
6. How Does Big Data Impact IT?
• By 2016, there will be 4.4 million IT jobs in Big Data, with 1.9 million in the US alone.
• India will require a minimum of 1 lakh (100,000) data scientists in the next
couple of years, in addition to data analysts and data managers, to
support the Big Data space.
• The opportunity for Indian service providers lies in offering services
around Big Data implementation and analytics for global
multinationals.
8. What is Hadoop?
• Apache™ Hadoop® is an open source software project that enables
the distributed processing of large data sets across clusters of
commodity servers.
• It is designed to scale up from a single server to thousands of
machines, with a very high degree of fault tolerance.
• Rather than relying on high-end hardware, the resiliency of these
clusters comes from the software’s ability to detect and handle
failures at the application layer.
10. Hadoop Ecosystem
• Pig: A scripting language that simplifies the creation of MapReduce
jobs and excels at exploring and transforming data.
• Hive: Provides SQL-like access to your Big Data (see the query sketch after this list).
• HBase: A Hadoop database; a distributed, column-oriented store that runs on top of HDFS.
• Sqoop: For efficiently transferring bulk data between Hadoop and
relational databases.
• Oozie: A workflow scheduler system to manage Apache Hadoop jobs.
• Flume: For efficiently collecting, aggregating, and moving large
amounts of log data.
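As a concrete example of the SQL-like access that Hive provides, here is a minimal sketch of running a HiveQL query from Java over JDBC. The connection URL (a local HiveServer2 on the default port 10000) and the sales table are assumptions made for illustration; the Hive JDBC driver jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver shipped with Hive
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // HiveQL looks like SQL; classic Hive compiles it into MapReduce jobs under the hood
             ResultSet rs = stmt.executeQuery(
                 "SELECT store, SUM(amount) FROM sales GROUP BY store")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}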
12. Hadoop – Core Components
• HDFS - A file system that spans all the nodes in a Hadoop cluster for
data storage. It links together the file systems on many local nodes to
make them into one big file system. HDFS assumes nodes will fail, so
it achieves reliability by replicating data across multiple nodes.
• MapReduce – The data processing framework that understands and
assigns work to the nodes in a cluster (see the word-count sketch below).
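To show what "assigning work to the nodes" looks like from a developer's point of view, below is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce): the mapper emits (word, 1) pairs, the framework shuffles them by key, and the reducer sums the counts. The input and output paths are taken from the command line and would typically point at HDFS directories.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs on the node holding an input split and emits (word, 1) for every word in a line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: receives all counts for one word after the shuffle and sums them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A typical invocation would be something like: hadoop jar wordcount.jar WordCount /input /output, where /input and /output are hypothetical HDFS paths.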