3. Data is growing at an exponential rate, and traditional tools like
RDBMS are not enough to process it
4. Data is everywhere:
• Flickr (87 million registered members and 3.5 million
photos per day)
• YouTube (4B videos streamed per day)
• Yahoo! Webmap (3 trillion links, 300TB compressed,
5PB disk)
• Facebook collects 500 terabytes of data a day
• Walmart handles more than 1 million customer
transactions every hour
• IDC estimates that by 2020, business transactions on
the internet (business-to-business and
business-to-consumer) will reach 450 billion per day.
5. Data is growing at a rate of 40% a year, reaching nearly 45 ZB by 2020,
according to IDC
1 ZB is equal to 1 billion TB
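The conversion and the growth rate can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) units and reading the 40% figure as compound annual growth; the function names are illustrative and not from any library.

```python
# Decimal (SI) storage units, in bytes. Assumption: the slide uses SI units.
TB = 10**12
ZB = 10**21


def zb_to_tb(zb: float) -> float:
    """Convert zettabytes to terabytes."""
    return zb * ZB / TB


def project(size_zb: float, annual_rate: float, years: int) -> float:
    """Project a data volume forward, assuming compound annual growth."""
    return size_zb * (1 + annual_rate) ** years


if __name__ == "__main__":
    print(zb_to_tb(1))           # terabytes in one zettabyte
    print(project(45, 0.40, 2))  # volume two years after reaching 45 ZB
```

Under these assumptions, 40% a year compounds to a factor of 1.96 over two years, so the volume roughly doubles every two years.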
6. What is Big Data and what is not?
• Order details of an e-commerce site
• All Orders across 1000s of e-commerce sites
• One person’s voter ID information
• Every citizen’s voter ID information dataset
Simple Definition: Big Data is data that is too big to
process on a single machine
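A rough way to see why a single machine is not enough is to estimate how long it takes just to read the data once. The sketch below uses the 500 TB per day figure quoted earlier for Facebook and an assumed sustained read rate of 100 MB/s per machine; that rate and the function name are hypothetical, chosen only for illustration.

```python
def scan_days(dataset_bytes: float, read_bytes_per_sec: float, machines: int = 1) -> float:
    """Days needed to read a dataset once, split evenly across machines."""
    seconds = dataset_bytes / (read_bytes_per_sec * machines)
    return seconds / 86_400  # seconds per day


TB = 10**12
MB = 10**6

# Assumption: 100 MB/s sustained sequential read per machine (hypothetical).
one_machine = scan_days(500 * TB, 100 * MB)
thousand_machines = scan_days(500 * TB, 100 * MB, machines=1000)

print(f"{one_machine:.1f} days on one machine")
print(f"{thousand_machines * 24:.1f} hours on 1000 machines")
```

With these numbers, one machine needs about 58 days to read a single day's worth of data, so it can never catch up. Spread evenly over 1000 machines, the same scan takes about 1.4 hours. The estimate ignores network and coordination overhead.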
9. Types of Data:
• Relational Data (Tables/Transaction/Legacy
Data)
• Unstructured Data – Apache weblogs
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF)
• Streaming Data
10. Data Processing Tasks:
• Aggregation and Statistics - Data warehouse
• Contextual Advertising – Real Time Bidding,
Remarketing
• Indexing, Searching, and Querying - Keyword
based search, Pattern recognition
• Knowledge discovery - Data Mining, Statistical
Modeling
11. Traditional Architecture
• Relational Data is everything
– SQL
– Embedded
– Client-Server Based
• Data Stack
– Web, CDN, Load Balancers, Application, Database
and Storage
12. Traditional Scalability
• Scale-up
– Memory and hardware have limitations
• Scale-out
– Reads
• Cache is everything
– Query Cache
– Memcache
• Pre-fetching, Replication
– Writes
• Redundant Disk Arrays, RAID
• Sharding
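The two scale-out ideas above, caching for reads and sharding for writes, can be sketched in a toy key-value store. This is an illustration under stated assumptions, not a production design: plain dicts stand in for database servers and for a cache such as Memcache, and the class and method names are hypothetical.

```python
import hashlib


class ShardedStore:
    """Toy key-value store: writes are sharded, reads go through a cache."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]  # stand-ins for DB servers
        self.cache = {}                                    # stand-in for a cache tier

    def _shard_for(self, key: str) -> int:
        # Use a stable hash so a key maps to the same shard on every run
        # (Python's built-in hash() is randomized per process for strings).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def write(self, key: str, value: str) -> None:
        self.shards[self._shard_for(key)][key] = value
        # Invalidate so the next read does not return stale data.
        self.cache.pop(key, None)

    def read(self, key: str):
        if key in self.cache:        # cache hit: no shard is touched
            return self.cache[key]
        value = self.shards[self._shard_for(key)].get(key)
        if value is not None:
            self.cache[key] = value  # cache miss: populate for next time
        return value


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.write(f"user:{i}", f"profile-{i}")

print([len(shard) for shard in store.shards])  # rows held by each shard
print(store.read("user:42"))
```

Hashing modulo the shard count is the simplest placement rule. Its known drawback is that changing the number of shards remaps most keys, which is one reason resharding a traditional database is painful.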
13. NoSQL Solution
• Many companies emerged to solve the data problem
• Bigtable: Google started to implement a massively
distributed, scalable system
• Many companies followed, building scale-out
architectures on commodity hardware
• ACID was seen as bad for scaling, so relaxed
consistency models emerged
• Google Bigtable and Amazon Dynamo are
notable examples
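What a relaxed consistency model means in practice can be shown with a toy replicated store. A write is acknowledged after reaching one replica and is propagated to the others later, so a read from another replica may briefly return old data. This is a simplified sketch of eventual consistency with last-write-wins versioning; it is not how Bigtable or Dynamo are implemented, and all names are hypothetical.

```python
class Replica:
    def __init__(self) -> None:
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value) -> None:
        # Last-write-wins: keep the value with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def get(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class EventuallyConsistentStore:
    def __init__(self, num_replicas: int) -> None:
        self.replicas = [Replica() for _ in range(num_replicas)]
        self.pending = []  # undelivered updates: (replica_index, key, version, value)
        self.clock = 0     # simple counter used as the version number

    def write(self, key, value) -> None:
        # Acknowledge after updating one replica; queue the update for the rest.
        self.clock += 1
        self.replicas[0].apply(key, self.clock, value)
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, self.clock, value))

    def read(self, key, replica_index: int):
        return self.replicas[replica_index].get(key)

    def sync(self) -> None:
        # Deliver queued updates; afterwards all replicas agree.
        for i, key, version, value in self.pending:
            self.replicas[i].apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore(num_replicas=3)
store.write("cart", "book")
stale = store.read("cart", replica_index=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("cart", replica_index=2)  # after propagation the replicas agree
print(stale, fresh)
```

The trade-off is that writes return quickly and replicas can live on separate commodity machines, at the cost of reads that may lag behind the latest write until propagation completes.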